Handle Missing Values
Finds rows containing empty values and handles them according to the chosen strategy.
Also returns a Transformer that can later be applied
to another DataFrame using a Transform operation.
Since: Seahorse 0.4.0
Input
| Port | Type Qualifier | Description |
| 0 | DataFrame | The DataFrame to process. |
Output
| Port | Type Qualifier | Description |
| 0 | DataFrame | The output DataFrame with processed missing values. |
| 1 | Transformer | A Transformer that applies the operation to other DataFrames using Transform. |
Parameters
| Name | Type | Description |
| columns | MultipleColumnSelector | Columns to process. If a column is selected more than once (e.g. by name and by type), it is included only once. When a column selected by name or by index does not exist, the operation fails at runtime with ColumnsDoNotExistException. |
| strategy | Single Choice | Strategy for handling missing values in the data. Possible values: ["remove row", "remove column", "replace with custom value", "replace with mode"] |
| missing value indicator | Single Choice | When set to "Yes", a missing value indicator column is added for each column in the selected column range. Each generated column contains true if the value was missing and false otherwise. The names of the generated columns are constructed by prepending the given indicator column prefix to the original column name. Possible values: ["Yes", "No"] |
| value | String | Available only if strategy is set to "replace with custom value". Contains the replacement for the missing values. The replacement value should match the type of the selected columns. Boolean values are represented as true or false. Timestamps are represented in yyyy-[m]m-[d]d hh:mm:ss[.f...] format. Example timestamp: 2015-03-30 15:25:00.0. |
| empty column strategy | Single Choice | Available only if strategy is set to "replace with mode". Defines whether to remove or retain columns that contain only empty values. Possible values: ["remove", "retain"] |
| indicator column prefix | String | Available only if missing value indicator is set to "Yes". Defines the prefix for the generated missing value indicator columns. |
| user-defined missing values | Parameters Sequence | The sequence of user-defined missing values. Each provided value is cast to the type of every selected column where possible, so, for example, the value -1 can be applied to all numeric and string columns. |
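The way user-defined missing values and the missing value indicator fit together can be illustrated with a small sketch in plain Python. This is not Seahorse's implementation: the helper names `is_missing` and `add_indicators`, the rows-as-dictionaries representation, and the `missing_` prefix are hypothetical. Whether the indicator also flags user-defined values is likewise an assumption of this sketch.

```python
import math

def is_missing(value, user_defined=()):
    """Return True for None, NaN, or a matching user-defined missing value.

    User-defined values are given as strings and cast to the cell's type.
    Only float and str cells are handled in this sketch.
    """
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    for raw in user_defined:
        if isinstance(value, str) and value == raw:
            return True
        if isinstance(value, float):
            try:
                if float(raw) == value:
                    return True
            except ValueError:
                pass  # this user-defined value is not numeric, skip it
    return False

def add_indicators(rows, columns, prefix, user_defined=()):
    """Return copies of rows with one boolean indicator per selected column."""
    result = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            # Indicator name = prefix prepended to the original column name.
            new_row[prefix + col] = is_missing(row[col], user_defined)
        result.append(new_row)
    return result

# Hypothetical sample data for illustration only.
rows = [
    {"baths": 1.0, "price": 449178.0},
    {"baths": None, "price": 348946.0},
    {"baths": -1.0, "price": float("nan")},
]
flagged = add_indicators(rows, ["baths", "price"], "missing_", ["-1.0"])
```

Here the string "-1.0" is cast to a float before comparison, which mirrors the casting behaviour described for user-defined missing values.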
Available Strategies
| Name | Description |
| remove row | Removes all rows containing at least one missing value in the selected column range. |
| remove column | Removes columns with at least one missing value. Only the columns from the selected column range are affected. |
| replace with custom value | Replaces empty values within the selected column range with the custom value. |
| replace with mode | Replaces empty values within the selected column range with the mode (the most frequently occurring value in a column). |
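The four strategies can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Seahorse's code: rows are modelled as dictionaries, only None and NaN count as missing, and all function names are hypothetical.

```python
import math
from collections import Counter

def is_missing(value):
    # Simplified check: only None and NaN count as missing here.
    return value is None or (isinstance(value, float) and math.isnan(value))

def remove_rows(rows, columns):
    """Drop rows with a missing value in any selected column."""
    return [r for r in rows if not any(is_missing(r[c]) for c in columns)]

def remove_columns(rows, columns):
    """Drop selected columns that contain at least one missing value."""
    to_drop = {c for c in columns if any(is_missing(r[c]) for r in rows)}
    return [{k: v for k, v in r.items() if k not in to_drop} for r in rows]

def replace_with_value(rows, columns, value):
    """Replace missing values in the selected columns with a custom value."""
    return [
        {k: (value if k in columns and is_missing(v) else v) for k, v in r.items()}
        for r in rows
    ]

def replace_with_mode(rows, columns, empty_column_strategy="retain"):
    """Replace missing values with each column's most frequent value."""
    result = [dict(r) for r in rows]
    for c in columns:
        present = [r[c] for r in rows if not is_missing(r[c])]
        if not present:
            # Column holds only empty values: remove it or leave it as-is.
            if empty_column_strategy == "remove":
                for r in result:
                    del r[c]
            continue
        # On ties, Counter returns the value encountered first.
        mode = Counter(present).most_common(1)[0][0]
        for r in result:
            if is_missing(r[c]):
                r[c] = mode
    return result

# Hypothetical sample data for illustration only.
rows = [
    {"beds": 2.0, "baths": 1.0},
    {"beds": None, "baths": 1.0},
    {"beds": 2.0, "baths": None},
    {"beds": 3.0, "baths": 2.0},
]
```

Note that `remove_columns(rows, ["baths"])` keeps `beds` even though it contains a missing value, because only columns in the selected range are affected.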
Example
Parameters
| Name | Value |
| columns | Selected columns: by name: ["baths", "price"]. |
| strategy | remove row |
| missing value indicator | No |
| user-defined missing values | User-defined missing values: ["-1.0"] |
Input
| city | beds | baths | sq_ft | price |
|  | 2.0 | 1.0 | 820.0 | 449178.0 |
| CityC | null | 1.0 | 656.0 | 267975.0 |
| CityA | 2.0 | null | 636.0 | 348946.0 |
| CityA | 2.0 | 1.0 | 736.0 | NaN |
| CityC | 2.0 | -1.0 | 564.0 | 264867.0 |
Output
| city | beds | baths | sq_ft | price |
|  | 2.0 | 1.0 | 820.0 | 449178.0 |
| CityC | null | 1.0 | 656.0 | 267975.0 |
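The example can be traced with a short Python sketch. It is a hand-written illustration rather than Seahorse's code: rows are modelled as tuples, the empty city in the first row is assumed to be an empty string, and the user-defined value "-1.0" is assumed to be already cast to a number. The second row survives even though beds is null, because beds is not among the selected columns.

```python
import math

columns = ["city", "beds", "baths", "sq_ft", "price"]
data = [
    ("", 2.0, 1.0, 820.0, 449178.0),         # empty city modelled as "" (assumption)
    ("CityC", None, 1.0, 656.0, 267975.0),   # null beds, but beds is not selected
    ("CityA", 2.0, None, 636.0, 348946.0),   # null baths -> removed
    ("CityA", 2.0, 1.0, 736.0, float("nan")),  # NaN price -> removed
    ("CityC", 2.0, -1.0, 564.0, 264867.0),   # baths equals user-defined -1.0 -> removed
]
selected = ["baths", "price"]
user_defined = [-1.0]  # "-1.0" cast to the numeric column type

def is_missing(value):
    """None, NaN, and user-defined values all count as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in user_defined

# "remove row": keep only rows with no missing value in the selected columns.
indices = [columns.index(name) for name in selected]
output = [row for row in data if not any(is_missing(row[i]) for i in indices)]
```

The resulting `output` holds the first two input rows, matching the Output table.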