Supervised learning consists of associating a label with measurements:
Classification can be performed, for example, with KNN, SVM, or decision trees, while regression can be performed with linear regression, SVM, or random forests of decision trees (see below). Finally, regressions can be univariate or multivariate.
Their quality is measured:
Note that:
Example of regression: determining the distance of vehicles from the brightness of their headlights.
Generally speaking, a "regression model" is any model designed to predict the value of a variable, called the "variable to be explained", as a function of one or more other variables, called "explanatory variables". Various types of regression models exist: linear, logistic, ordered logistic...
Linear regression is the oldest and most common regression model, dating back to the 1750s, and is used when the variable to be explained is quantitative.
It is based on the assumption that the variable to be explained is equal to a linear combination of the explanatory variables, plus unexplained variations called noise (or errors, or residuals). The mathematical formulation of this model is therefore:
$Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_p X_{i,p} + \epsilon_i$
where:
$\beta_0, \beta_1, \dots, \beta_p$ are the parameters of the model, to be determined.
The model can also be written in matrix form:
$Y = X\beta + \epsilon$
where:
In general, the noise $\epsilon$ is assumed to follow a centered normal distribution $N(0,\sigma^2 I)$, in which the variance $\sigma^2$ is to be determined. In this case, maximizing the likelihood of the model is equivalent to minimizing the sum of the squares of the components of $\epsilon$ (also called the sum of squared errors). The least squares method is then used to estimate the model parameters.
>>> from sklearn.linear_model import LinearRegression
>>> lin_reg_model = LinearRegression()
>>> lin_reg_model.fit(X, y)       # X: explanatory variables, y: variable to be explained
>>> lin_reg_model.predict(X_new)
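The least squares method can also be illustrated directly with NumPy: minimizing $\|Y - X\beta\|^2$ yields the closed-form solution $\hat{\beta} = (X^T X)^{-1} X^T Y$ when $X^T X$ is invertible. A minimal sketch, assuming X and y are NumPy arrays as in the snippet above (X1 and beta_hat are illustrative names):

>>> import numpy as np
>>> X1 = np.column_stack([np.ones(len(X)), X])        # prepend a column of ones for the intercept beta_0
>>> beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]  # least squares estimate [beta_0, ..., beta_p]
>>> X1 @ beta_hat                                     # fitted values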
Logistic regression is a regression model that applies when the variable to be explained is binary (sick or healthy individual, living or deceased, etc.).
The main hypothesis of logistic regression is that the state of the variable to be explained $Y$ depends on a continuous, unobserved variable $Y^*$, also called a "latent trait". A linear regression can then be applied to this latent trait:
$Y_i^* = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_p X_{i,p} + \epsilon_i$.
Here $\epsilon_i$ is assumed to follow a standard logistic distribution: an approximation of the normal distribution that has the advantage of a cumulative distribution function with an explicit expression.
The logistic regression assumption is that $Y_i = 0 \Leftrightarrow Y_i^* < 0$ (and therefore $Y_i = 1 \Leftrightarrow Y_i^* \geq 0$). Since the distribution of $\epsilon_i$ is symmetric about 0, it follows that
$P(Y_i = 1) = \Phi(\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_p X_{i,p})$
where $\Phi : x \mapsto \frac{1}{1 + e^{-x}}$ is the cumulative distribution function of the standard logistic distribution. In other words, the probability that $Y_i$ equals 1 grows with $\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_p X_{i,p}$, also called the linear predictor.
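By analogy with the linear regression snippet above, a minimal logistic regression sketch with scikit-learn (assuming X, y and X_new are defined as before, with y containing binary labels):

>>> from sklearn.linear_model import LogisticRegression
>>> log_reg_model = LogisticRegression()
>>> log_reg_model.fit(X, y)             # y: binary labels (0 or 1)
>>> log_reg_model.predict_proba(X_new)  # second column gives P(Y = 1)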
Decision trees are trees in which, at each node, a question is asked so as to split the set of remaining solutions into two disjoint parts, each as large as possible. This process is applied recursively until a single solution remains, following a dichotomy (median-split) principle.
In machine learning, decision trees are built by the algorithm:
The algorithm is all the more effective as it finds, at each node, the parameters that produce the best split.
Remark: the order in which the predictor parameters are selected influences the result.
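A minimal decision tree sketch with scikit-learn, under the same assumptions as the previous snippets (the max_depth value is an arbitrary choice to limit the tree's complexity):

>>> from sklearn.tree import DecisionTreeClassifier
>>> tree_model = DecisionTreeClassifier(max_depth=3)  # arbitrary depth limit
>>> tree_model.fit(X, y)
>>> tree_model.predict(X_new)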
Random forests consist of learning multiple decision trees, each working on subsets of the data that are as independent as possible.
This approach solves several problems of decision trees, such as the impact of the order of the predictor parameters, or their complexity.
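A corresponding random forest sketch with scikit-learn (the number of trees, n_estimators, is an arbitrary choice; each tree is trained on a random subset of the data):

>>> from sklearn.ensemble import RandomForestClassifier
>>> forest_model = RandomForestClassifier(n_estimators=100)  # 100 trees, each on a bootstrap sample
>>> forest_model.fit(X, y)
>>> forest_model.predict(X_new)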
The KNN (k-nearest-neighbours) algorithm searches for the k nearest neighbours (by computing distances) of the data point to be predicted among the known data. It returns the majority class among those neighbours.
Relatively simple, KNN performs no computation during the learning phase. However, it is not suitable for large datasets.
>>> from sklearn.neighbors import KNeighborsRegressor
>>> knn_model = KNeighborsRegressor(n_neighbors=3)
>>> knn_model.fit(X, y)
>>> knn_model.predict(X_new)
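For the classification case described above (majority class among the neighbours), scikit-learn also provides a classifier counterpart; a sketch under the same assumptions:

>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn_clf = KNeighborsClassifier(n_neighbors=3)
>>> knn_clf.fit(X, y)        # y: class labels
>>> knn_clf.predict(X_new)   # majority class among the 3 nearest neighbours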
Support vector machines (SVMs) are an extension of linear regressors, adapted to data whose separation boundaries are more convoluted. These machines can work with data that have a large number of parameters.
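A minimal SVM sketch with scikit-learn (the RBF kernel, scikit-learn's default, is what handles nonlinear separations; X, y and X_new are assumed as before):

>>> from sklearn.svm import SVC
>>> svm_model = SVC(kernel='rbf')  # nonlinear kernel for complex separation boundaries
>>> svm_model.fit(X, y)
>>> svm_model.predict(X_new)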
There are several types of neural networks.
Example of a perceptron with scikit-learn
>>> from sklearn.linear_model import Perceptron
>>> per = Perceptron()
>>> per.fit(X, y)                     # X, y: two np.arrays
>>> y_pred = per.predict([[2, 0.5]])
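For an actual multilayer perceptron, scikit-learn provides MLPClassifier; a minimal sketch (the hidden layer size is an arbitrary assumption):

>>> from sklearn.neural_network import MLPClassifier
>>> mlp = MLPClassifier(hidden_layer_sizes=(10,))  # one hidden layer of 10 neurons (arbitrary)
>>> mlp.fit(X, y)
>>> y_pred = mlp.predict([[2, 0.5]])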