Split

Splits a DataFrame into two separate DataFrames. Each row of the input DataFrame ends up in exactly one of the result DataFrames, never in both.

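There are two split modes: RANDOM, which divides the rows at random according to a split ratio, and CONDITIONAL, which divides the rows according to a Spark SQL condition.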

The Split operation does not preserve row order.

Since: Seahorse 0.4.0

Input

Port Type Qualifier Description
0 DataFrame The DataFrame to split.

Output

Port Type Qualifier Description
0 DataFrame The first part of the input DataFrame.
1 DataFrame The second part of the input DataFrame.

Parameters

Name Type Description
split mode Single Choice The split mode. Possible values are: RANDOM, CONDITIONAL.
split ratio Numeric Valid only if split mode = RANDOM. A number between 0 and 1 describing how much of the input DataFrame will end up in the first part of the split. Example: split ratio = 0.3 means that the first output DataFrame will contain about 30% of the rows of the input DataFrame, and the second output DataFrame will contain the rest (about 70%) of the rows of the input DataFrame.
seed Numeric Valid only if split mode = RANDOM. An integer between -1073741824 and 1073741823 that is used as a seed for the random number generator. Fixing the value of this parameter makes the results repeatable.
condition Code Snippet Valid only if split mode = CONDITIONAL. The split condition. Rows satisfying the condition are included in the first output DataFrame, and rows not satisfying it are included in the second output DataFrame. It should be a Spark SQL condition (as used in a WHERE clause).
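
The CONDITIONAL mode can be pictured as running the same WHERE condition twice, once as written and once negated. The sketch below illustrates this in plain Python, using the standard-library sqlite3 module as a stand-in for Spark SQL; it is not Seahorse's implementation. The helper conditional_split, the table name df and the sample condition are hypothetical. The description above does not say where rows go when the condition evaluates to NULL; the sketch assumes they go to the second output so that every row lands in exactly one part.

```python
import sqlite3

def conditional_split(rows, condition):
    """Split (city, beds, price) rows by a SQL WHERE-style condition.

    Hypothetical helper: SQLite stands in for Spark SQL here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE df (city TEXT, beds REAL, price REAL)")
    conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)
    # The condition is spliced into the query text, which is acceptable
    # for a sketch but unsafe for untrusted input.
    first = conn.execute(
        f"SELECT * FROM df WHERE ({condition})"
    ).fetchall()
    # Assumption: rows for which the condition is NULL go to the second part.
    second = conn.execute(
        f"SELECT * FROM df WHERE (NOT ({condition})) OR (({condition}) IS NULL)"
    ).fetchall()
    conn.close()
    return first, second

rows = [
    ("CityA", 4.0, 695611.0),
    ("CityC", 2.0, 294691.0),
    ("CityB", 3.0, 430784.0),
    ("CityB", 2.0, 336677.0),
    ("CityA", 3.0, 584639.0),
    ("CityA", 4.0, 579560.0),
]

# Hypothetical condition, written the way it would appear after WHERE.
first, second = conditional_split(rows, "beds >= 3 AND price < 600000")
```

With this sample condition, the first part holds the three rows that have at least 3 beds and a price below 600000, and the second part holds the remaining three rows.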

Example

Parameters

Name Value
split mode RANDOM
split ratio 0.2
seed 0.0

Input

city beds price
CityA 4.0 695611.0
CityC 2.0 294691.0
CityB 3.0 430784.0
CityB 2.0 336677.0
CityA 3.0 584639.0
CityA 4.0 579560.0

Output

Output 0

city beds price
CityA 4.0 695611.0
CityB 2.0 336677.0
CityA 3.0 584639.0

Output 1

city beds price
CityC 2.0 294691.0
CityB 3.0 430784.0
CityA 4.0 579560.0
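
In the example above the split ratio is 0.2, yet Output 0 holds three of the six rows. This is consistent with the ratio being approximate ("about 30%" in the parameter description). One way such behavior can arise is if each row is assigned to the first part independently, with probability equal to the split ratio. The sketch below illustrates that idea in plain Python. It is an assumption about the mechanism, not Seahorse's implementation; random_split is a hypothetical helper, and it is not expected to reproduce the exact outputs shown above.

```python
import random

def random_split(rows, ratio, seed):
    """Assign each row independently to the first part with probability `ratio`.

    Hypothetical helper illustrating an approximate, seeded split.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        # Each row goes to exactly one part.
        if rng.random() < ratio:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = [
    ("CityA", 4.0, 695611.0),
    ("CityC", 2.0, 294691.0),
    ("CityB", 3.0, 430784.0),
    ("CityB", 2.0, 336677.0),
    ("CityA", 3.0, 584639.0),
    ("CityA", 4.0, 579560.0),
]

first, second = random_split(rows, ratio=0.2, seed=0)
```

Under this scheme the size of the first part only approaches ratio times the row count for large inputs; on six rows, any count from zero to six is possible. Calling the function again with the same seed returns the same two parts.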