Seahorse supports Data Sources of the following types:
Read more about supported file formats.
Reading data from and writing data to JDBC-compatible databases is supported.
This functionality requires placing an appropriate JDBC driver JAR file in the Seahorse shared folder jars.
The file has to be placed there before you start editing a workflow that uses a JDBC connection
(otherwise, you will have to stop the running session and start it again).
Google Sheets Data Source has two parameters that require more detailed description:
In the following sections you will learn how to set up a Google Service Account for the Seahorse instance and how to share the spreadsheets with the Seahorse instance.
1. Obtain the e-mail address of your Google Service Account from the list of Service Accounts.
2. Share your Google Spreadsheet with your Google Service Account using the e-mail address from step 1.
You can use the Google Spreadsheet and your Google Service Account credentials to define a Google Spreadsheet Data Source in Seahorse.
Now it’s ready to use in Seahorse!
CSV
When reading a CSV file, Seahorse infers column types. If a column contains values of multiple types, the narrowest possible type will be chosen, so that all the values can be represented in that type.
Empty cells are treated as null, unless the column type is inferred as String; in that case, they are treated as empty strings.
If the convert to boolean mode is enabled, columns that contain only zeros, ones, or empty values will be inferred as Boolean.
In particular, a column consisting only of empty cells will be inferred as Boolean with null values only.
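The inference rules above can be illustrated with a minimal Python sketch. It is not Seahorse's implementation: the function names `infer_column_type` and `convert_cells`, the type labels, and the reduced set of types (Boolean, Numeric, String) are assumptions made for illustration, and `float()` only stands in for real numeric parsing. Treating an all-empty column as String when the boolean mode is off is also an assumption, since the text does not cover that case.

```python
from typing import List, Optional, Union

Cell = Union[None, bool, float, str]


def _is_number(cell: str) -> bool:
    """Rough numeric check; float() is only a stand-in for real parsing."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def infer_column_type(cells: List[str], convert_to_boolean: bool = False) -> str:
    """Pick the narrowest of Boolean, Numeric, String that fits every non-empty cell."""
    non_empty = [c for c in cells if c != ""]
    # Boolean mode: only zeros, ones, or empty cells qualify.
    # all() over an empty list is True, so an all-empty column is Boolean too.
    if convert_to_boolean and all(c in ("0", "1") for c in non_empty):
        return "Boolean"
    if non_empty and all(_is_number(c) for c in non_empty):
        return "Numeric"
    # Assumption: anything else, including an all-empty column, falls back to String.
    return "String"


def convert_cells(cells: List[str], column_type: str) -> List[Cell]:
    """Convert raw cells; empty cells become None except in String columns."""
    if column_type == "String":
        return list(cells)  # empty cells stay as empty strings
    if column_type == "Boolean":
        return [None if c == "" else c == "1" for c in cells]
    return [None if c == "" else float(c) for c in cells]
```

For example, `infer_column_type(["1", "abc", ""])` yields `"String"`, so the empty cell is kept as an empty string rather than null.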
While reading, Seahorse assumes that each row in the file has the same number of fields. When this condition is not met, the behavior is undefined.
If the file defines column names, they will be used in the DataFrame.
If a column’s name is empty or absent, it will be named unnamed_X,
where X is the smallest non-negative number such that column names are unique.
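One possible reading of this naming rule is sketched below in Python. The helper `fill_column_names` is hypothetical, and the order in which Seahorse assigns generated names is an assumption; the sketch simply walks the columns left to right and picks the smallest unused X each time.

```python
from typing import List, Optional


def fill_column_names(names: List[Optional[str]]) -> List[str]:
    """Replace empty or absent names with unnamed_X, keeping all names unique."""
    used = {n for n in names if n}  # names already defined by the file
    result = []
    for name in names:
        if name:
            result.append(name)
            continue
        # Smallest non-negative X whose unnamed_X is not taken yet.
        x = 0
        while f"unnamed_{x}" in used:
            x += 1
        generated = f"unnamed_{x}"
        used.add(generated)
        result.append(generated)
    return result
```

Under this reading, a header of `["id", "", None, "unnamed_0"]` becomes `["id", "unnamed_1", "unnamed_2", "unnamed_0"]`, because `unnamed_0` is already taken by an existing column.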
You can escape a column separator with a backslash. For example, assuming that comma is the separator, the following line
1,abc,"a,b,c","\"x\"",, z ," z "
will be parsed as:
1.0
abc
a,b,c
"x"

_z_
_z__
where _ denotes a space and the fifth value is an empty string. Note that "\"x\"" is parsed as "x", since \" inside an already quoted value translates to ".
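The quoting and escaping behaviour in this example can be reproduced with Python's standard `csv` module, which is used here only as an illustration and is not the parser Seahorse uses. The dialect settings (`escapechar` set to a backslash, `doublequote` disabled) are assumptions chosen to match the rules described above. The sketch does not reproduce Seahorse's type inference (the first value stays the string `1`) or its whitespace handling.

```python
import csv
import io

# The example line from the text, kept verbatim in a raw string.
line = r'1,abc,"a,b,c","\"x\"",, z ," z "'

reader = csv.reader(
    io.StringIO(line),
    delimiter=",",
    quotechar='"',
    escapechar="\\",    # \" inside a quoted value yields a literal "
    doublequote=False,  # do not treat "" as an escaped quote
)
fields = next(reader)

# A quoted value may contain the separator, and the fifth value is empty.
for index, value in enumerate(fields, start=1):
    print(index, repr(value))
```

The third field comes out as `a,b,c` and the fourth as `"x"`, which matches the parsed values listed above.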
PARQUET
Note that the Parquet format does not allow any of the characters , ;{}()\n\t= in column names.
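A column name can be checked against this restriction before writing. The following Python sketch assumes that `\n` and `\t` in the list above stand for the newline and tab characters and that the space is part of the forbidden set. The helpers `invalid_chars_in` and `sanitize_column_name`, and the choice of `_` as a replacement, are hypothetical and not part of Seahorse.

```python
from typing import Set

# Assumed forbidden set: space, comma, semicolon, braces, parentheses,
# newline, tab, and the equals sign.
INVALID_PARQUET_CHARS: Set[str] = set(" ,;{}()\n\t=")


def invalid_chars_in(name: str) -> Set[str]:
    """Return the forbidden characters that occur in a column name."""
    return set(name) & INVALID_PARQUET_CHARS


def sanitize_column_name(name: str, replacement: str = "_") -> str:
    """Replace every forbidden character with the replacement string."""
    return "".join(replacement if ch in INVALID_PARQUET_CHARS else ch for ch in name)
```

For instance, `sanitize_column_name("total price (USD)")` gives `total_price__USD_`. Renaming the offending columns in the workflow before the write step is one way to apply such a check.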
JSON
Note that the JSON file format does not preserve the order of columns.
When saving a DataFrame, Seahorse converts Timestamp columns to the String type
(values of those columns are converted to their string representations by Apache Spark).
Null values are omitted in JSON. This might result in a schema mismatch if all values in a particular
column are null (that column will be omitted from the output JSON file).
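The effect of omitting nulls can be seen in a small Python sketch that uses only the standard `json` module. It imitates the behaviour described above and is not how Seahorse or Apache Spark writes JSON. The sample rows and the helpers `to_json_lines` and `columns_in` are made up for illustration.

```python
import json
from typing import Any, Dict, List

# Hypothetical rows; the "comment" column contains only nulls.
rows: List[Dict[str, Any]] = [
    {"id": 1, "name": "a", "comment": None},
    {"id": 2, "name": None, "comment": None},
]


def to_json_lines(records: List[Dict[str, Any]]) -> List[str]:
    """Serialize one JSON object per row, leaving out null values."""
    return [
        json.dumps({key: value for key, value in record.items() if value is not None})
        for record in records
    ]


def columns_in(lines: List[str]) -> List[str]:
    """Collect the column names that can be recovered from the JSON lines."""
    columns: List[str] = []
    for line in lines:
        for key in json.loads(line):
            if key not in columns:
                columns.append(key)
    return columns


written = to_json_lines(rows)
recovered = columns_in(written)
```

Here `recovered` contains `id` and `name` but not `comment`, so a reader that expects three columns would see a schema mismatch.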