The accuracy of machine learning predictions depends on the data, the algorithm, and the available computational power. In this article, we discuss the main challenges in machine learning and how to work around them.
Insufficient Quantity of Training Data
Present-day machine learning algorithms depend heavily on the quantity of training data: without enough examples, they cannot build a model that predicts reliably. In general, accuracy improves as the quantity of data grows, although the gains tend to shrink as more data is added.
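To make the effect of data quantity concrete, here is a minimal sketch that fits a straight line to noisy synthetic data at several training set sizes and measures how far the learned slope is from the true one. The data generator, the noise level, and the function names are assumptions made for illustration only.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope of a line fitted to (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def average_slope_error(n_samples, n_trials=200, true_slope=3.0, seed=0):
    """Average absolute slope error over many noisy datasets of one size."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.uniform(0, 10) for _ in range(n_samples)]
        # Synthetic target: a line plus Gaussian noise (assumed for the demo).
        ys = [true_slope * x + rng.gauss(0, 2) for x in xs]
        total += abs(fit_slope(xs, ys) - true_slope)
    return total / n_trials

errors = {n: average_slope_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"{n:>5} training examples -> average slope error {err:.4f}")
```

In this toy setup the error keeps falling as the training set grows, but each tenfold increase in data buys a smaller absolute improvement than the previous one.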
Non-representative Training Data
This problem occurs when an ML algorithm is asked to predict a case that the training data set contained nothing about. An ML algorithm works well only if the training data is distributed over the cases it will be asked to handle. Training on a non-representative dataset results in sampling bias, much like an election poll run by a biased media channel: the end result will be wrong, and sometimes exactly the opposite of what you were looking for in the first place.
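The following sketch illustrates sampling bias on a made-up population. The customer groups, their sizes, and the spending values are hypothetical numbers chosen only to make the effect visible.

```python
import random

# Hypothetical population: 800 "urban" customers who spend 1 unit each
# and 200 "rural" customers who spend 5 units each.
population = [("urban", 1.0)] * 800 + [("rural", 5.0)] * 200

def mean_spend(sample):
    """Average spend over a list of (group, spend) rows."""
    return sum(value for _, value in sample) / len(sample)

rng = random.Random(0)

# Representative sample: every customer has the same chance of being picked.
random_sample = rng.sample(population, 200)

# Non-representative sample: only urban customers were surveyed.
biased_sample = [row for row in population if row[0] == "urban"][:200]

true_mean = mean_spend(population)
print(f"true mean:          {true_mean:.2f}")
print(f"random sample mean: {mean_spend(random_sample):.2f}")
print(f"biased sample mean: {mean_spend(biased_sample):.2f}")
```

Any model trained on the biased sample would never see a rural customer, so its estimates for that group would be little more than guesses.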
Erratic data produces erratic learning, which in turn produces erratic predictions. Poor-quality data results from errors, outliers, noise, and so on, all of which make patterns harder to detect.
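One simple way to clean such data is to drop obvious outliers before training. The sketch below uses a modified z-score based on the median absolute deviation; the function name, the sample readings, and the threshold of 3.5 (a common rule of thumb) are assumptions for illustration, and other cleaning strategies may suit your data better.

```python
import statistics

def remove_outliers(values, threshold=3.5):
    """Drop values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)  # median absolute deviation
    if mad == 0:
        # More than half the values are identical; nothing sensible to scale by.
        return list(values)
    return [v for v, d in zip(values, deviations)
            if 0.6745 * d / mad <= threshold]

readings = [10, 11, 9, 10, 12, 95]  # 95 is a likely measurement error
print(remove_outliers(readings))    # prints [10, 11, 9, 10, 12]
```

Whether to discard an outlier or fix it by hand depends on the dataset; an automatic filter like this is only a starting point.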
As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one.
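Both steps can be sketched in a few lines. In the example below, the feature names, the values, and the correlation threshold of 0.8 are all hypothetical; correlation with the target is just one of several possible ways to rank features.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical dataset: the price happens to be 10 times the area.
features = {
    "width":  [1, 2, 3, 4, 5],
    "height": [2, 1, 2, 1, 2],
    "noise":  [1, 3, 2, 3, 1],
}
price = [20, 20, 60, 40, 100]

# Feature extraction: combine two existing features into a more useful one.
features["area"] = [w * h for w, h in zip(features["width"], features["height"])]

def select_features(features, target, min_abs_corr=0.8):
    """Feature selection: keep features strongly correlated with the target."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= min_abs_corr]

selected = select_features(features, price)
print(selected)  # prints ['width', 'area']
```

The extracted "area" feature correlates almost perfectly with the price, while "height" and "noise" fall below the threshold and are dropped.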
Overfitting the Training Data
Overfitting occurs when the chosen algorithm fits a particular dataset so closely that its predictions on other data are way off the mark. For example, if an algorithm builds what amounts to a special-case rule (much like an if/else clause) around a few data points that deviate significantly from the rest, its predictions become erratic.
Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. These patterns occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data. Obviously, such patterns will not generalize to new instances.
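As a small illustration of a model learning the noise, the sketch below uses a 1-nearest-neighbor predictor as a stand-in for a very flexible model (it is not a neural network; it simply memorizes every training point). The synthetic data and all names are assumptions made for the example.

```python
import random

rng = random.Random(1)

def noisy_line(x):
    """Synthetic target: a straight line plus Gaussian noise."""
    return 2.0 * x + rng.gauss(0, 1)

train = [(float(i), noisy_line(i)) for i in range(50)]
test = [(i + 0.5, noisy_line(i + 0.5)) for i in range(50)]

def predict_memorizer(x):
    """1-nearest-neighbor: return the target of the closest training point."""
    _, nearest_y = min(train, key=lambda point: abs(point[0] - x))
    return nearest_y

def mse(model, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_error = mse(predict_memorizer, train)
test_error = mse(predict_memorizer, test)
print(f"training error: {train_error:.3f}")
print(f"test error:     {test_error:.3f}")
```

The memorizer reproduces every training target exactly, noise included, so its training error is zero while its error on unseen points is not.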
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. You want to find the right balance between fitting the data perfectly and keeping the model simple enough to ensure that it will generalize well.
The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system.
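To see how the regularization hyperparameter flattens a model, here is a minimal sketch of a one-feature linear model with an L2 penalty on the slope (ridge regression is used here as one possible form of regularization; the text does not prescribe a specific one). The function name, the data, and the `alpha` values are illustrative assumptions.

```python
def fit_ridge_line(xs, ys, alpha):
    """Fit y = slope * x + intercept, penalizing slope**2 by alpha.

    Minimizes sum((y - intercept - slope * x)**2) + alpha * slope**2.
    The intercept is left unpenalized.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + alpha)  # a larger alpha shrinks the slope toward 0
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # exactly y = 2x + 1

for alpha in (0, 10, 1_000_000):
    slope, intercept = fit_ridge_line(xs, ys, alpha)
    print(f"alpha={alpha:>9}: slope={slope:.5f}, intercept={intercept:.5f}")
```

With `alpha` set to 0 the fit recovers the true slope of 2, with 10 the slope drops to 1, and with one million the line is almost flat at the mean of the targets.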
Underfitting the Training Data
Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.
The main options to fix this problem are:
- Selecting a more powerful model, with more parameters
- Feeding better features to the learning algorithm (feature engineering)
- Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
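The second option can be illustrated with a toy example: a straight line cannot follow a curved relationship, but the same linear algorithm fits perfectly once it is given a better feature. The data and helper names below are assumptions chosen to keep the arithmetic easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def training_mse(inputs, targets):
    """Fit a line to the inputs and return its error on the same data."""
    slope, intercept = fit_line(inputs, targets)
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(inputs, targets)) / len(inputs)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # the true relationship is quadratic

# Underfitting: a line on the raw feature cannot capture the curve.
linear_error = training_mse(xs, ys)

# Better feature: feed x squared to the same linear algorithm.
squared = [x ** 2 for x in xs]
engineered_error = training_mse(squared, ys)

print(f"raw feature x:      training MSE = {linear_error:.2f}")      # 12.00
print(f"engineered feature: training MSE = {engineered_error:.2f}")  # 0.00
```

Note that the underfitting model is inaccurate even on the training examples, which is exactly how underfitting differs from overfitting.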
Testing and Validating
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain — not the best idea.
A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimation of this error. This value tells you how well your model will perform on instances it has never seen before.
If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.
It is common to use 80% of the data for training and hold out 20% for testing.
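A minimal sketch of such a split, assuming the data fits in a Python list and that shuffling it is acceptable (for example, the rows are not ordered in time). The function name and the fixed seed are illustrative choices.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # prints 80 20
```

Keeping the seed fixed matters: if the split changed on every run, your model would eventually see the whole dataset, which defeats the purpose of a held-out test set.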
So evaluating a model is simple enough: just use a test set. Now suppose you are hesitating between two models (say a linear model and a polynomial model): how can you decide? One option is to train both and compare how well they generalize using the test set.
Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is: how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter. Suppose you find the best hyperparameter value that produces a model with the lowest generalization error, say just 5% error.
So you launch this model into production, but unfortunately, it does not perform as well as expected and produces 15% errors. What just happened? The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that set. This means that the model is unlikely to perform as well on new data.
A common solution to this problem is to have a second holdout set called the validation set. You train multiple models with various hyperparameters using the training set, you select the model and hyperparameters that perform best on the validation set, and when you’re happy with your model you run a single final test against the test set to get an estimate of the generalization error.
To avoid “wasting” too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
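The workflow above can be sketched end to end: choose the regularization hyperparameter by cross-validation on the training set only, retrain on the full training set, then touch the test set exactly once. The ridge-style linear model, the synthetic data, the five folds, and the candidate values are all assumptions made for this example.

```python
import random

def fit_ridge_line(xs, ys, alpha):
    """Fit y = slope * x + intercept with an L2 penalty alpha on the slope."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + alpha)
    return slope, mean_y - slope * mean_x

def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def cross_val_error(xs, ys, alpha, k=5):
    """Average validation error over k complementary folds."""
    errors = []
    for fold in range(k):
        # Every k-th example (offset by `fold`) is held out for validation.
        fit_x = [x for i, x in enumerate(xs) if i % k != fold]
        fit_y = [y for i, y in enumerate(ys) if i % k != fold]
        val_x = [x for i, x in enumerate(xs) if i % k == fold]
        val_y = [y for i, y in enumerate(ys) if i % k == fold]
        model = fit_ridge_line(fit_x, fit_y, alpha)
        errors.append(mse(model, val_x, val_y))
    return sum(errors) / k

# Synthetic data, generated in random order, so a plain 80/20 cut is fine.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

# Model selection uses the training set only; the test set stays untouched.
candidate_alphas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_alpha = min(candidate_alphas,
                 key=lambda a: cross_val_error(train_x, train_y, a))

# Final model: retrain on the full training set, then test exactly once.
final_model = fit_ridge_line(train_x, train_y, best_alpha)
test_error = mse(final_model, test_x, test_y)
print(f"best alpha: {best_alpha}, test MSE: {test_error:.3f}")
```

Because the test set played no part in choosing `best_alpha`, the final test error is a fair estimate of how the model should behave on new data.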