
Machine learning


General overview

Various types of learning

Machine learning can be:

  • Supervised: the training data provided to the algorithm includes the desired solutions (labels). The classic problems of this type are classification and regression.
  • Unsupervised: the training data is not labelled. Typical problems: clustering (partitioning), visualization and dimensionality reduction, and learning association rules.
  • Semi-supervised: only some of the training data is labelled.
  • Reinforcement learning: the learning system (the agent) observes the environment, selects and performs actions, and receives rewards in return.

Humans are good at unsupervised learning, but machines have more trouble with it.

Online learning vs. batch learning

Learning can be done online or in batches.

  • Online: the system can learn incrementally, from an incoming data stream.
  • Batch learning: the system is first trained, then placed in production (without further learning).

For a batch-learning system to take new data into account, a new version of the system must be trained on the entire dataset. This can be done periodically, e.g. once a week, replacing the old system with one that incorporates the updates. Batch learning requires time and computing resources, which are not always available.

Online learning (which in practice is often performed offline) must be able to adapt and evolve quickly and autonomously. It does not require storing the whole dataset: it can operate on a stream, and do out-of-core learning on data too large to fit in memory. It depends on a learning rate, the rate at which it adapts to changing data. With a high rate, the system adapts quickly to new data... and risks forgetting the old data just as quickly.
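As an illustration, a minimal sketch of online learning, assuming scikit-learn's SGDRegressor (the library choice and the synthetic stream are not from the original text): partial_fit updates the model one mini-batch at a time, and eta0 plays the role of the learning rate.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    # Online (incremental) learning: the model is updated one mini-batch at a time,
    # so the full dataset never has to fit in memory.
    rng = np.random.default_rng(0)
    model = SGDRegressor(learning_rate="constant", eta0=0.01)  # eta0 = learning rate

    for _ in range(100):                      # simulate an incoming data stream
        X_batch = rng.uniform(0, 10, size=(32, 1))
        y_batch = 3.0 * X_batch.ravel() + rng.normal(0, 1, size=32)  # noisy y = 3x
        model.partial_fit(X_batch, y_batch)   # incremental update on this mini-batch

    print(model.coef_, model.intercept_)      # slope should approach 3

A higher eta0 would make the model track new batches faster, at the cost of forgetting earlier ones more quickly.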

One of the major difficulties of online learning is that if bad data is fed into the system, its performance gradually degrades. A possible solution: an anomaly-detection algorithm on the input data.

How learning generalizes

Most machine learning tasks involve making predictions: from a number of training examples, the system must be able to generalize to examples it has never seen before.

  • Instance-based learning: the system learns the examples by heart, then uses a similarity measure on new cases (the class of a new case is that of the most similar known case).
  • Model-based learning:
    • a model is selected (for example, individual satisfaction depends roughly linearly on GDP), with parameters θi (e.g. slope and intercept);
    • to choose between models, we define a fitness function, or conversely a cost function (for example, the mean squared error between the linear model and the actual data);
    • training the model then consists in finding the parameters that best fit the model to the data (e.g. a linear regression algorithm; see the sketch after this list).
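To make the model-based steps concrete, here is a minimal sketch, assuming scikit-learn and made-up GDP/satisfaction figures (both are illustrative assumptions): the model is linear, the cost function is the mean squared error, and fitting finds the slope and intercept θ.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: GDP per capita (k$) vs. satisfaction score.
    gdp = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])
    satisfaction = np.array([5.5, 6.0, 6.4, 6.9, 7.3])

    # Model-based learning: assume satisfaction ≈ θ0 + θ1 * gdp,
    # and find the θ that minimises the mean squared error.
    model = LinearRegression()
    model.fit(gdp, satisfaction)

    print("intercept θ0:", model.intercept_)
    print("slope θ1:", model.coef_[0])
    print("prediction for GDP = 45 k$:", model.predict([[45.0]])[0])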

Difficulties of machine learning

Data issues

Insufficient training data

Very different machine learning algorithms, some of them rather simple, give equally good results on complex problems such as natural-language disambiguation, provided there is a very large amount of data: this is the unreasonable effectiveness of data.

In short: data is more important than algorithms for complex problems.

Training data are not representative

There may be sampling noise when the sample is too small, and sampling bias when the sampling method is flawed (possible even with very large samples; see non-response bias).

Poor data quality

Take the time to clean up your data: delete or correct outliers, and handle missing data (drop the variable, ignore the observation, or impute the missing values).
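A minimal sketch of those cleaning options, assuming pandas and a hypothetical dataset (the library, column names and values are illustrative assumptions):

    import numpy as np
    import pandas as pd

    # Hypothetical dataset with an outlier and missing values.
    df = pd.DataFrame({
        "age":    [25, 32, 47, 300, 29],      # 300 is clearly an outlier
        "income": [30000, 45000, np.nan, 52000, np.nan],
    })

    # Delete or correct outliers.
    df = df[df["age"] <= 120]                 # drop implausible ages

    # Handle missing data: the three options mentioned above.
    drop_column = df.drop(columns=["income"])                    # 1) drop the variable
    drop_rows   = df.dropna(subset=["income"])                   # 2) ignore the observations
    imputed     = df.fillna({"income": df["income"].median()})   # 3) impute the missing values

    print(imputed)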

Irrelevant variables

Garbage in, garbage out: the system can only learn if the training data contains enough relevant variables and not too many irrelevant ones. You have to choose a good set of variables (features) to train on:

  • Feature selection: choose, among the available variables, those most useful for training;
  • feature extraction: combine several existing variables to produce a more useful one (see dimensionality-reduction algorithms);
  • creation of new variables.

This is feature engineering.
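A minimal scikit-learn sketch of feature selection and feature extraction (the library and the synthetic data are illustrative assumptions):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.decomposition import PCA

    # Synthetic data: 10 features, only 3 of which are actually informative.
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

    # Feature selection: keep the 3 variables most correlated with the target.
    selector = SelectKBest(score_func=f_regression, k=3)
    X_selected = selector.fit_transform(X, y)
    print("selected columns:", selector.get_support(indices=True))

    # Feature extraction: combine the original variables into 3 new components (PCA).
    pca = PCA(n_components=3)
    X_extracted = pca.fit_transform(X)
    print("explained variance ratio:", pca.explained_variance_ratio_)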

Problems with algorithms

Overfitting the training data

Beware of overfitting: the model may work well on the training data, yet fail to generalize. Think of Lagrange interpolation: the curve passes through all the interpolation points, but between them it can do anything.

Overfitting occurs when the model is too complex relative to the amount of training data and the noise it contains. We can then:

  • Simplify the model: select one with fewer parameters (linear rather than polynomial), reduce the number of attributes in the training data, or impose constraints on the model (regularization);
  • collect more training data;
  • reduce the noise.

In short, a good balance must be found between fitting the data perfectly and keeping the model simple enough to guarantee good generalization. The amount of regularization applied during training (e.g. constraining a linear regression to a small slope) can be controlled by a hyperparameter: a parameter of the learning algorithm (not of the model), which remains constant during training.
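For instance, a minimal sketch of regularized linear regression using ridge regression (the library choice and synthetic data are illustrative assumptions): the hyperparameter alpha controls how strongly large slopes are penalised, and it stays fixed during training.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(20, 1))
    y = 2.0 * X.ravel() + rng.normal(0, 3, size=20)   # noisy linear data

    plain = LinearRegression().fit(X, y)
    # alpha is a hyperparameter of the algorithm, not a parameter of the model:
    # larger alpha = stronger regularization = smaller slope = simpler model.
    regularized = Ridge(alpha=10.0).fit(X, y)

    print("unregularized slope:", plain.coef_[0])
    print("ridge slope:        ", regularized.coef_[0])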

Underfitting the training data

Underfitting is the opposite of overfitting: the model is too simple to discover the underlying structure of the data. We can then:

  • choose a more powerful model, with more parameters;
  • feed better features to the learning algorithm, transforming them if necessary;
  • reduce the constraints on the model, for example by lowering the regularization hyperparameter.

In short, the model must not be too simple or too complex.
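A minimal sketch of the first remedy against underfitting (a more powerful model), using polynomial features; the library and the synthetic quadratic data are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = X.ravel() ** 2 + rng.normal(0, 0.5, size=100)   # quadratic data

    # A straight line underfits this data...
    linear = LinearRegression().fit(X, y)
    # ...while adding polynomial features gives the model enough capacity.
    quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

    print("linear R^2:   ", linear.score(X, y))      # noticeably lower
    print("quadratic R^2:", quadratic.score(X, y))   # close to 1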

Testing and validation

It is necessary to split the data into a training set (e.g. 80%) and a test set (20%).
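A minimal sketch of that split, assuming scikit-learn and one of its bundled datasets (both are illustrative assumptions); the 80/20 ratio comes from the text above:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)

    # 80% for training, 20% held out for estimating the generalization error.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("training score:", model.score(X_train, y_train))
    print("test score:    ", model.score(X_test, y_test))   # estimate of generalization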

A generalization error that is high on the test set while the training error is low indicates overfitting.

  • Two models can be compared through their generalization errors.
  • Once the model is chosen, some regularization is applied to avoid overfitting: the value of the hyperparameter can be determined by a grid search, choosing the value that gives the best results.
  • The generalization error may be much higher on the test set if the hyperparameter was only the best for the training data. To avoid this, a third set of data, called the validation set, is used only to select the hyperparameter.
  • To avoid sacrificing too much training data for validation, cross-validation is used: the training set is divided into complementary random subsets, each in turn serving to validate a model trained on the others.
  • Once the model type and hyperparameters are selected, a final model using these hyperparameters is trained on the entire training set, and the generalization error is measured on the test set (see the sketch below).
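A minimal sketch of hyperparameter selection by grid search with cross-validation, followed by a final evaluation on the test set (the scikit-learn API and the ridge model are illustrative assumptions):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Grid of candidate values for the regularization hyperparameter alpha.
    grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    grid.fit(X_train, y_train)          # 5-fold cross-validation on the training set only

    print("best alpha:", grid.best_params_["alpha"])
    # The model with the chosen hyperparameter is refitted on the whole training set
    # (refit=True by default); the test set is used only once, at the very end.
    print("generalization score on the test set:", grid.score(X_test, y_test))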
