Dec 03, 2024

Data Preparation

Before embarking on a regression, the first step is to visualize your dataset and look for outliers and missing data. Outliers should be discarded, using one of the methods explained later. Missing values should be filled in, for example with 0, with the average of the column, or by linear interpolation between the two closest known values, if that makes sense for the data.

In practice, this can be done with pandas' fillna.
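As a minimal sketch of the three filling strategies mentioned above, assuming a hypothetical one-column DataFrame named df with a temperature column (neither name comes from this page):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
df = pd.DataFrame({"temperature": [10.0, np.nan, 14.0, np.nan, 20.0]})

# Option 1: replace missing values with 0
filled_zero = df["temperature"].fillna(0)

# Option 2: replace missing values with the column mean
# (mean() skips NaN values by default)
filled_mean = df["temperature"].fillna(df["temperature"].mean())

# Option 3: linear interpolation between the two closest known values
filled_interp = df["temperature"].interpolate(method="linear")

print(filled_interp.tolist())  # [10.0, 12.0, 14.0, 17.0, 20.0]
```

Interpolation only makes sense when the rows are ordered (for example a time series); for unordered rows, the mean or a constant is usually the safer choice.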

Once the data are cleaned, they must be encoded, which often helps the model make good predictions. Columns with numerical values should be normalized, and the other columns (text, booleans, colors... in short, categories) must be encoded as a number or a vector of numbers. This can be done in various ways. The naive way is to associate an integer with each category.

We can also use "one-hot encoding": if there are n possible categories, each category is encoded as a binary vector of length n containing a single 1. Finally, we can replace each category with a real number equal to the average of the target for that category ("target encoding"), which is a relevant and compact way to do this task. Other approaches exist; they can be found in the category_encoders library.
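To make the three encodings concrete, here is a small sketch written with plain pandas. The color and price columns are hypothetical, and the target encoding shown is the bare per-category mean described above; a library implementation may differ, for example by smoothing the means.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and a numeric target
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "price": [10.0, 20.0, 14.0, 30.0, 22.0, 12.0],
})

# Naive integer encoding: one arbitrary integer per category
codes, categories = pd.factorize(df["color"])

# One-hot encoding: n categories -> n binary columns, exactly one 1 per row
one_hot = pd.get_dummies(df["color"], dtype=int)

# Target encoding: replace each category by the mean target for that category
target_mean = df.groupby("color")["price"].transform("mean")

print(codes.tolist())        # [0, 1, 0, 2, 1, 0]
print(target_mean.tolist())  # [12.0, 21.0, 12.0, 30.0, 21.0, 12.0]
```

In a real workflow, the per-category means should be computed on the training set only and then applied to the test set, so that the target does not leak into the features.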

  • Data normalization can be done by using sklearn's StandardScaler.
  • Categories can be encoded as integers with LabelEncoder, or with OneHotEncoder for a one-hot encoding.
  • Target encoding: if, for example, the category is the day of the year, then rather than creating 365 binary variables we obtain a single variable, whose value is higher on days where the target variable is higher on average. Start by installing the following module:
    pip install category_encoders

then:

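  from category_encoders import TargetEncoder
  # Replace the placeholder names with the categorical columns of X_train
  enc = TargetEncoder(cols=['Name_of_col','Another_name'])
  # Fit the encoder on the training data and encode it in one step
  training_set = enc.fit_transform(X_train, y_train)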

Other ways to encode categorical variables can be found in the same module (category_encoders).
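The scikit-learn classes named in the list above can be tried on a toy example. This is only a sketch: the ages and colors arrays are hypothetical and stand in for one numerical and one categorical column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical toy data
ages = np.array([[20.0], [30.0], [40.0]])   # numerical column, shape (n_samples, 1)
colors = np.array(["red", "blue", "red"])   # categorical column

# Normalization: each column is rescaled to zero mean and unit variance
scaled = StandardScaler().fit_transform(ages)

# Integer labels: classes are sorted, so "blue" -> 0 and "red" -> 1
labels = LabelEncoder().fit_transform(colors)

# One-hot encoding: expects a 2D input and returns a sparse matrix by default
one_hot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

print(labels.tolist())   # [1, 0, 1]
print(one_hot.tolist())  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

As with the target encoder, the scaler and encoders should be fitted on the training set and then reused with transform on the test set. Note that scikit-learn documents LabelEncoder as intended for target labels; for input features, OrdinalEncoder plays the same role.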
