Before starting a regression, the first step is to visualize your dataset and look for outliers and missing data. Outliers should be discarded, using one of the methods we will explain later. Missing values should be filled in, for example with 0, with the average of the column, or by linear interpolation between the two closest known values, if that makes sense for the data.
In practice, this can be done with pandas' fillna (and interpolate for the linear interpolation).
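As a minimal sketch of the three filling strategies mentioned above, assuming a hypothetical column named temperature with two missing values (the data and the column name are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
df = pd.DataFrame({"temperature": [20.0, np.nan, 24.0, np.nan, 30.0]})

# Option 1: replace missing values with 0
filled_zero = df["temperature"].fillna(0)

# Option 2: replace missing values with the column average
# (mean() ignores NaN by default)
filled_mean = df["temperature"].fillna(df["temperature"].mean())

# Option 3: linear interpolation between the closest known values
filled_interp = df["temperature"].interpolate(method="linear")

print(filled_interp.tolist())
```

Interpolation only makes sense when the rows have a natural order (a time series, for example); for unordered rows, the mean or a constant is usually the safer choice.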
Once the data are cleaned, they must be encoded, which often helps the model make good predictions. Columns with numerical values must be normalized, and the other columns (text, booleans, colors... in short, categories) must be encoded as a number or a vector of numbers. This can be done in various ways. The naive way is to associate an integer with each category.
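The two steps above can be sketched as follows. This is only an illustration: the column names size_cm and color are hypothetical, and standardization (zero mean, unit variance) is used here as one common way to normalize, not the only one.

```python
import pandas as pd

# Hypothetical toy data: one numerical column, one categorical column
df = pd.DataFrame({
    "size_cm": [10.0, 20.0, 30.0],
    "color": ["red", "blue", "red"],
})

# Normalize the numerical column: subtract the mean, divide by the
# (population) standard deviation
mean = df["size_cm"].mean()
std = df["size_cm"].std(ddof=0)
df["size_scaled"] = (df["size_cm"] - mean) / std

# Naive integer encoding: each category gets an integer,
# in order of first appearance
codes, categories = pd.factorize(df["color"])
df["color_code"] = codes

print(df)
```

Note that the integers carry an arbitrary order (here red = 0, blue = 1), which some models may wrongly interpret as a ranking. This is why this encoding is called naive.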
We can also use "one-hot encoding": if there are n possible categories, each value is encoded as a binary vector of length n, with a 1 at the position of its category and 0 everywhere else. Finally, we can transform each category into a real number equal to the average of the target for that category ("target encoding"), which is a relevant and compact way to do this task. It is available as TargetEncoder in the category_encoders library:
    from category_encoders import TargetEncoder
    enc = TargetEncoder(cols=['Name_of_col', 'Another_name'])
    training_set = enc.fit_transform(X_train, y_train)
Other ways to encode categorical variables can be found in the same category_encoders module.
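To make the two encodings concrete, here is a minimal sketch using only pandas. The column names color and price are hypothetical, and the target encoding is computed by hand as a plain per-category mean; library implementations such as TargetEncoder typically add smoothing on top of this, so their output will generally differ.

```python
import pandas as pd

# Hypothetical toy data: "color" is the category, "price" is the target
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# One-hot encoding: one binary column per category,
# exactly one 1 per row
one_hot = pd.get_dummies(df["color"], dtype=int)

# Target encoding by hand: replace each category by the
# average of the target over the rows of that category
category_means = df.groupby("color")["price"].mean()
df["color_target"] = df["color"].map(category_means)

print(one_hot)
print(df)
```

With real data, the per-category means should be computed on the training set only and then applied to the test set, otherwise the target leaks into the features.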