Handle Missing Values

Finds rows containing empty values and handles them according to the chosen strategy.

Also returns a Transformer that can later be applied to another DataFrame using a Transform operation.

Since: Seahorse 0.4.0

Input

Port Type Qualifier Description
0 DataFrame The DataFrame to process.

Output

Port Type Qualifier Description
0 DataFrame The output DataFrame with processed missing values.
1 Transformer A Transformer that applies this operation to other DataFrames using Transform.

Parameters

Name Type Description
columns MultipleColumnSelector Columns to process. If a column is selected more than once (e.g. by name and by type), it is included only once. If a column selected by name or by index does not exist, the operation fails at runtime with ColumnsDoNotExistException.
strategy Single Choice Strategy for handling missing values in the data.
Possible values: ["remove row", "remove column", "replace with custom value", "replace with mode"]
missing value indicator Single Choice When set to "Yes", a missing value indicator is added for each column in the selected column range. Newly generated columns contain true if the value was missing and false otherwise. The names of generated columns are constructed by prepending the given indicator column prefix to the original column name.
Possible values: ["Yes", "No"]
value String Available only if strategy is set to "replace with custom value". The replacement for missing values. The replacement value must match the type of the selected columns. Boolean values are represented as true or false. Timestamps are represented in yyyy-[m]m-[d]d hh:mm:ss[.f...] format. Example timestamp: 2015-03-30 15:25:00.0.
empty column strategy Single Choice Available only if strategy is set to "replace with mode". Defines whether to remove or retain columns that contain only empty values.
Possible values: ["remove", "retain"]
indicator column prefix String Available only if missing value indicator is set to "Yes". It defines the prefix for generated missing value indicator columns.
user-defined missing values Parameters Sequence The sequence of user-defined missing values. Each provided value is cast to the type of every selected column where possible; for example, the value -1 can be applied to all numeric and string columns.
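
How user-defined missing values and the missing value indicator interact can be illustrated outside Seahorse. The Python sketch below is not Seahorse code. The helper names (cast_or_none, is_missing, add_indicators) and the dict-per-row data representation are assumptions made for illustration, and the casting rules of the actual operation may differ. It treats null, NaN and any user-defined value that can be cast to the column's type as missing.

```python
import math


def cast_or_none(raw, column_type):
    """Try to cast a user-defined missing value to a column's type."""
    try:
        return column_type(raw)
    except (TypeError, ValueError):
        return None  # not castable, so the value does not apply to this column


def is_missing(value, column_type, user_defined=()):
    """True for None, NaN, or a user-defined value castable to column_type."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    candidates = (cast_or_none(raw, column_type) for raw in user_defined)
    return any(c is not None and c == value for c in candidates)


def add_indicators(rows, column_types, prefix, user_defined=()):
    """Return copies of rows with one boolean indicator per selected column.

    column_types maps each selected column name to a Python type (assumption:
    this stands in for the DataFrame schema).
    """
    result = []
    for row in rows:
        new_row = dict(row)
        for name, column_type in column_types.items():
            # Indicator name = prefix + original column name.
            new_row[prefix + name] = is_missing(row[name], column_type, user_defined)
        result.append(new_row)
    return result
```

In this sketch the string "-1" applies to both a float column (as -1.0) and a string column (as "-1"), while a value such as "abc" cannot be cast to float and is therefore ignored for numeric columns.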

Available Strategies

Name Description
remove row Removes all rows containing at least one missing value in the selected column range.
remove column Removes columns with at least one missing value. Only the columns from the selected column range are affected.
replace with custom value Replaces empty values within the selected column range with a custom value.
replace with mode Replaces empty values within selected column range with the mode (most frequently occurring value in a column).
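
The four strategies above can be sketched in plain Python. This is an illustrative approximation, not the Seahorse implementation: the function names, the list-of-dicts representation of a DataFrame, and the tie-breaking rule for the mode (the first value encountered wins) are all assumptions. User-defined missing values are left out to keep the sketch short.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def remove_rows(rows, columns):
    """'remove row': drop rows with a missing value in any selected column."""
    return [r for r in rows if not any(is_missing(r[c]) for c in columns)]


def remove_columns(rows, columns):
    """'remove column': drop selected columns with at least one missing value."""
    drop = {c for c in columns if any(is_missing(r[c]) for r in rows)}
    return [{k: v for k, v in r.items() if k not in drop} for r in rows]


def replace_with_value(rows, columns, value):
    """'replace with custom value': fill missing cells in selected columns."""
    return [
        {k: (value if k in columns and is_missing(v) else v) for k, v in r.items()}
        for r in rows
    ]


def replace_with_mode(rows, columns, empty_column_strategy="retain"):
    """'replace with mode': fill missing cells with each column's mode."""
    modes, empty = {}, set()
    for c in columns:
        present = [r[c] for r in rows if not is_missing(r[c])]
        if present:
            modes[c] = Counter(present).most_common(1)[0][0]
        else:
            empty.add(c)  # column holds only missing values, so it has no mode
    result = []
    for r in rows:
        new_row = {}
        for k, v in r.items():
            if k in empty and empty_column_strategy == "remove":
                continue
            new_row[k] = modes[k] if k in modes and is_missing(v) else v
        result.append(new_row)
    return result
```

Columns outside the selected range pass through every function unchanged, even if they contain missing values.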

Example

Parameters

Name Value
columns Selected columns: by name: ["baths", "price"].
strategy remove row
missing value indicator No
user-defined missing values User-defined missing values: ["-1.0"]

Input

city   beds  baths  sq_ft  price
       2.0   1.0    820.0  449178.0
CityC  null  1.0    656.0  267975.0
CityA  2.0   null   636.0  348946.0
CityA  2.0   1.0    736.0  NaN
CityC  2.0   -1.0   564.0  264867.0

Output

city   beds  baths  sq_ft  price
       2.0   1.0    820.0  449178.0
CityC  null  1.0    656.0  267975.0
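
The example can be traced by hand with a short Python sketch. It is only an approximation of the operation's behaviour: rows are modelled as dicts, the blank city in the first row is represented as an empty string (an assumption, since the table does not show its value), and the names SELECTED, USER_DEFINED and is_missing are hypothetical.

```python
import math

SELECTED = ["baths", "price"]   # the "columns" parameter
USER_DEFINED = [-1.0]           # "-1.0" cast to the numeric column type


def is_missing(value):
    """None, NaN and user-defined values all count as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in USER_DEFINED


nan = float("nan")
rows = [
    {"city": "", "beds": 2.0, "baths": 1.0, "sq_ft": 820.0, "price": 449178.0},
    {"city": "CityC", "beds": None, "baths": 1.0, "sq_ft": 656.0, "price": 267975.0},
    {"city": "CityA", "beds": 2.0, "baths": None, "sq_ft": 636.0, "price": 348946.0},
    {"city": "CityA", "beds": 2.0, "baths": 1.0, "sq_ft": 736.0, "price": nan},
    {"city": "CityC", "beds": 2.0, "baths": -1.0, "sq_ft": 564.0, "price": 264867.0},
]

# "remove row": keep only rows with no missing value in the selected columns.
kept = [r for r in rows if not any(is_missing(r[c]) for c in SELECTED)]

for r in kept:
    print(r)
```

Only the first two rows remain. The second row is kept even though beds is null, because beds is outside the selected column range. The last row is removed because -1.0 in baths matches a user-defined missing value.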