This article covers the fundamental concepts and principles that are central to all machine learning problems and are likely to be useful in each of them. A proper understanding of these concepts is necessary to address the central issue in machine learning: “How to generalise well on unseen data by learning from the finite amount of data we have.”
Let’s start with Model Selection & Model Evaluation.
Model Selection is the process of choosing between different learning algorithms for modelling our data. For a classification problem, the choice could be between Logistic Regression, SVM, tree-based algorithms, etc. For a regression problem, a decision also needs to be made about the degree of the polynomial used in the regression model.
Model Evaluation aims to check the generalization ability of our model, i.e. the ability of our model to perform well on an unseen dataset. There are different strategies for evaluating our model, which we will cover in the next part of this article.
In short, model evaluation is the process of checking how well our model explains the data, whereas model selection is the process of deciding how much flexibility we need to describe the data.
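As a rough illustration of model selection, the sketch below picks a polynomial degree by comparing errors on held-out data. The synthetic sine data, the split sizes, the candidate degrees 1 to 9, and the helper names `make_data` and `mse` are all assumptions made for this example; it is only one simple way to compare models, not the evaluation strategy covered in the next part.

```python
import numpy as np

def make_data(n, rng, noise=0.2):
    # Hypothetical data: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(0.0, noise, size=n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the points (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

rng = np.random.default_rng(0)
x_train, y_train = make_data(40, rng)   # data used to fit each candidate
x_val, y_val = make_data(200, rng)      # held-out data used to compare them

val_errors = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    val_errors[degree] = mse(coeffs, x_val, y_val)

# Model selection: keep the degree with the lowest held-out error.
best_degree = min(val_errors, key=val_errors.get)
print(best_degree, val_errors[best_degree])
```

The chosen degree depends on the data and the random seed, so treat the printed value as an outcome of this particular run rather than a general answer.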
Given two hypotheses (here, decision boundaries) that have the same empirical risk (here, training error), a short explanation (here, a boundary with fewer parameters) tends to be more valid than a long explanation.
You can think of Occam’s razor in machine learning as a rule of thumb: whenever you have multiple choices for a decision boundary, choose the simpler one.
But what does choosing a simple model mean?
The simplicity of a model is often understood by defining what the complexity of a model means. Let’s understand both through an analogy.
Assume there are two students in a class, Student A and Student B, who have different ways of learning, as described below:
Student A is interested in grasping the concepts. His high-level picture of each subject is clear, and he is comfortable explaining the concepts in simple terms. But he does not memorise anything, not even a simple formula.
Student B is interested in scoring more. He prefers memorization over understanding, and he solves all the problems from multiple books.
We can say that they have different mental models for learning.
Given their different styles of learning, Student A will be better than Student B at solving new, unseen problems, whereas Student B will be better at scoring because he has memorised most of the problems and formulas. Student A will struggle to score: exams are time-bounded, and he will spend time deriving formulas that he could have memorised.
An ideal student is one whose learning style is somewhere in between Student A & Student B, so that he can score decently and also solve unseen problems.
The same thing happens with machine learning models. If our model is too simple, it hasn’t learned enough patterns to give a good score. We can achieve a good score with a complex model that has memorised all the patterns, but such a model won’t generalise well on unseen data. In machine learning, we call this phenomenon overfitting (we will come to this topic soon).
I hope you now have an idea of how simple and complex models behave. But technically, when can we say a model is simple or complex, and what factors govern its complexity?
Below are some examples in the case of linear regression:
We already have a fair idea of what overfitting means. Overfitting occurs when a model becomes more complex by learning the specifics or noise in the data and, as a result, fails to generalise well on an unseen dataset. In our analogy, Student B is a case of overfitting. Simply put, the more complex a model is, the greater the chance of overfitting.
An extremely overfitted model learns the given dataset perfectly, or we can say it memorises the entire dataset. It gives good accuracy on the training dataset, i.e. on data it has already seen, but it will clearly fail badly on an unseen dataset. An overfitted model is also said to have low bias and high variance. Let’s look at these terms in detail to understand how we can handle overfitting.
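To make the idea of memorisation concrete, here is a minimal sketch that uses a 1-nearest-neighbour predictor as a stand-in for an extremely overfitted model. This choice of model, the noisy sine data, and the names `true_fn` and `one_nn_predict` are assumptions made for illustration. Because the predictor simply stores the training set, its training error is zero, while its error on fresh data is not.

```python
import numpy as np

def true_fn(x):
    # Hypothetical underlying pattern that the data follows.
    return np.sin(np.pi * x)

def one_nn_predict(x_train, y_train, x_query):
    # For every query point, return the label of the closest training point.
    distances = np.abs(x_query[:, None] - x_train[None, :])
    return y_train[distances.argmin(axis=1)]

rng = np.random.default_rng(0)
noise = 0.3
x_train = rng.uniform(-1.0, 1.0, size=30)
y_train = true_fn(x_train) + rng.normal(0.0, noise, size=30)
x_test = rng.uniform(-1.0, 1.0, size=500)
y_test = true_fn(x_test) + rng.normal(0.0, noise, size=500)

# On data it has already seen, the memoriser reproduces every label exactly.
train_mse = float(np.mean((one_nn_predict(x_train, y_train, x_train) - y_train) ** 2))
# On unseen data, it also reproduces the noise it memorised, so the error is positive.
test_mse = float(np.mean((one_nn_predict(x_train, y_train, x_test) - y_test) ** 2))
print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")
```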
What is Bias?
Bias tells us about the ability of our model to predict the values/labels correctly. If our model’s predictions are far from the target, the model has high bias. In other words, the model is too simple to learn the patterns in the given dataset, and hence its predictions are nowhere near the target values/labels. You can also think of bias as the difference between the average prediction made by our model and the actual target.
What is Variance?
Variance here refers to how much the model itself changes in response to changes in the training data. That is, the model needs to change its internal representation to incorporate the patterns in new training data. High variance makes the model unstable and sensitive to small changes in the dataset. This clearly happens when a model becomes more complex, i.e. an overfitted model will have high variance.
An underfitted model will have high bias & an overfitted model will have high variance.
The image below is one of the best visual representations of what bias & variance refer to.
Notice in the image below how bias & variance change with the model’s complexity.
A model that is too simple to capture the patterns has high bias and low variance, while a model that is complex enough to capture them has low bias but higher variance. As the model’s complexity increases, bias goes down and variance goes up. That is why there is a tradeoff between them, and a balance must be struck to achieve the desired generalisation ability. Do note the optimal model complexity in the above image.
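The tradeoff can also be estimated numerically. The sketch below refits a simple model (degree 1) and a complex model (degree 9) on many noisy copies of the same training set, then measures how far the average prediction is from the truth (squared bias) and how much the predictions move between refits (variance). The sine target, the fixed training inputs, the noise level, and the function name `bias_variance` are assumptions made for this illustration.

```python
import numpy as np

def true_fn(x):
    # Hypothetical target function, known only because the data is synthetic.
    return np.sin(np.pi * x)

def bias_variance(degree, n_trials=200, noise=0.2, seed=0):
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, 25)   # fixed inputs; only the noise is redrawn
    x_test = np.linspace(-1.0, 1.0, 101)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        y_train = true_fn(x_train) + rng.normal(0.0, noise, size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average prediction and the true function.
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))
    # Variance: spread of the predictions across the refits.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

simple_bias, simple_var = bias_variance(degree=1)
complex_bias, complex_var = bias_variance(degree=9)
print(f"degree 1: bias^2={simple_bias:.4f}, variance={simple_var:.4f}")
print(f"degree 9: bias^2={complex_bias:.4f}, variance={complex_var:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-9 polynomial the larger variance, which is the pattern described above.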
So far we have seen what Occam's razor is, what simple and complex models are, how a complex model leads to overfitting, and why there must be a tradeoff between bias & variance. But we don’t yet know how to build our model with these things in mind. Let’s look at that next.
For regression problems specifically, there are two ways to reduce the overfitting of a model: regularization and hyperparameter tuning.
Regularization is the process through which we tune the error function by adding an extra regularization term. This term controls the magnitude of coefficients such that they don’t take extreme values.
Here we try to reduce the complexity of the regression function without actually reducing the degree of the underlying polynomial function.
This technique is based on the fact that if the highest order terms in a polynomial equation have very small coefficients, then the function will approximately behave like a polynomial function of a smaller degree.
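A quick numerical check of this fact, using two made-up polynomials: a degree-5 polynomial whose highest-order coefficient is tiny, and the degree-2 polynomial obtained by dropping that term. The coefficients and the interval [-1, 1] are assumptions chosen for illustration; the approximation holds on a bounded range of inputs like this one.

```python
import numpy as np

# Coefficients are listed from the highest power down to the constant term.
degree5 = np.array([1e-4, 0.0, 0.0, -3.0, 2.0, 1.0])  # 1e-4*x^5 - 3x^2 + 2x + 1
degree2 = np.array([-3.0, 2.0, 1.0])                   # -3x^2 + 2x + 1

x = np.linspace(-1.0, 1.0, 201)
# Largest difference between the two curves on the interval.
max_gap = float(np.max(np.abs(np.polyval(degree5, x) - np.polyval(degree2, x))))
print(max_gap)
```

The gap equals 1e-4 times |x|^5, so on this interval it never exceeds about 1e-4, and the two curves are practically indistinguishable.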
How does it regularise the model?
The penalty term in the cost function penalises the model heavily when it becomes complex. Let’s see the basic idea behind L1 (Lasso) and L2 (Ridge) regression, where regularization is done by adding different penalty terms.
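For reference, the two penalised cost functions are commonly written as follows. The notation is an assumption on my part: y_i is the target, ŷ_i the model’s prediction, θ_j the model parameters, and λ the regularization strength. Some texts also scale the error term by 1/n or 1/(2n), which does not change the idea.

```latex
\begin{align*}
J_{\text{ridge}}(\theta) &= \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} \theta_j^{2} \\
J_{\text{lasso}}(\theta) &= \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \theta_j \rvert
\end{align*}
```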
In Ridge regression we modify the cost function by adding λ times the sum of the squares of the parameters, and this modified function is what now needs to be minimised. λ is a hyperparameter that needs to be tuned; intuitively, think of it as a term that controls how much you want to penalise the model. Whenever the model learns large parameters, i.e. it is becoming complex, the added term increases the cost and so discourages the model from learning large parameters. This is how we keep the model from becoming too complex.
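The shrinking effect of λ can be seen in a small sketch. It fits a degree-9 polynomial with the standard closed-form ridge solution for three values of λ and prints the size of the learned parameter vector. The synthetic data, the chosen λ values, and the function name `ridge_fit` are assumptions for illustration. For simplicity the intercept is penalised too, which practical implementations usually avoid.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form minimiser of ||y - X w||^2 + lam * ||w||^2.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=30)
X = np.vander(x, N=10, increasing=True)  # columns: 1, x, x^2, ..., x^9

lambdas = [1e-3, 1e-1, 10.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:g}  ||w||={norm:.3f}")
```

The norm of the parameter vector gets smaller as λ grows, which is the penalty doing its job.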
For Lasso regression, the cost function instead adds the sum of the absolute values of the model parameters. It works similarly, by not allowing the model to learn large parameters.
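Lasso has no closed-form solution in general, so the sketch below minimises its cost with proximal gradient descent (often called ISTA). This is only one of several possible solvers and is chosen here for brevity. The synthetic data, the 1/(2n) scaling of the error term, the λ values, and the names `soft_threshold` and `lasso_ista` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Shrink every entry towards zero by t; entries smaller than t become zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    # Minimises (1 / (2n)) * ||y - X w||^2 + lam * ||w||_1.
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])  # only two features matter
y = X @ true_w + rng.normal(0.0, 0.1, size=50)

# Above this threshold the penalty outweighs any gain from a non-zero parameter.
lam_max = float(np.max(np.abs(X.T @ y)) / X.shape[0])

w_small = lasso_ista(X, y, lam=0.01)
w_mid = lasso_ista(X, y, lam=0.5)
w_big = lasso_ista(X, y, lam=1.01 * lam_max)
for name, w in [("small", w_small), ("mid", w_mid), ("big", w_big)]:
    print(name, np.round(w, 3), "sum |w| =", round(float(np.abs(w).sum()), 3))
```

As λ increases, the sum of absolute parameter values shrinks, and with a large enough λ every parameter is pushed to exactly zero.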
I highly recommend going here for an interactive visualization; the explanations are to the point and will help you get more clarity.
Refer to the links in the “Further Reading” section for more details on Ridge & Lasso regression.
We have seen only one way of controlling overfitting. In the next article, we will start with another approach: “Hyperparameter Tuning”.