What Is Regularization and Why Does It Reduce Overfitting?
Regularization
If you suspect your neural network is overfitting your data, i.e. you have a high variance problem, one of the first things you should try is regularization. The other way to address high variance is to get more training data, which is also quite reliable. But you can’t always get more training data, or it could be expensive to get. Adding regularization will often help to prevent overfitting, or to reduce the errors in your network.
How does regularization work?
Let’s say I’m using logistic regression, so my cost function is defined as:
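In the usual notation, which is assumed here rather than taken from the text (m training examples, prediction ŷ = σ(wᵀx + b), and the cross-entropy loss L), the standard logistic regression cost is typically written as:

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right),
\qquad
\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]
```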
To add regularization to logistic regression, you add a penalty term on w to the cost, scaled by lambda (λ), which is called the regularization parameter.
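One common way to write this, assuming L2 regularization and the usual notation (m training examples, per-example loss L, and n_x input features, none of which are defined in the text itself), is the following. The 1/(2m) scaling of the penalty is a widely used convention, not the only possible one.

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
          + \frac{\lambda}{2m} \lVert w \rVert_2^2,
\qquad
\lVert w \rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^{\top} w
```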
Now why do we regularize just the parameter w? Why don’t you add something here about b, i.e. the bias, as well? In practice you could do this, but I usually omit it. If you look at your parameters, w is usually a pretty high-dimensional parameter vector, especially with a high variance problem. w has a lot of parameters, so you aren’t fitting all of them well, whereas b is just a single real number. So almost all the parameters are in w rather than b, and if you add a regularization term for b as well, in practice it won’t make much of a difference, because b is just one parameter among a very large number of parameters.
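To make the point about b concrete, here is a minimal sketch in plain Python. It assumes an L2 penalty of the form (lam / (2m)) * ||w||^2 added to the cross-entropy cost; the function name `regularized_cost`, the `penalize_bias` flag, and the toy data are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(w, b, X, y, lam, penalize_bias=False):
    """Cross-entropy cost plus an assumed L2 penalty (lam / (2m)) * ||w||^2.

    If penalize_bias is True, b**2 is added to the penalty as well.
    """
    m = len(X)
    data_loss = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        y_hat = sigmoid(z)
        data_loss += -(y_i * math.log(y_hat) + (1 - y_i) * math.log(1 - y_hat))
    data_loss /= m

    penalty = sum(w_j ** 2 for w_j in w)
    if penalize_bias:
        penalty += b ** 2
    return data_loss + lam / (2 * m) * penalty

# Hypothetical toy problem: 1000 weights, one bias, two training examples.
w = [0.5] * 1000
b = 0.5
X = [[0.001] * 1000, [-0.001] * 1000]
y = [1, 0]
lam = 1.0

cost_without_b = regularized_cost(w, b, X, y, lam)
cost_with_b = regularized_cost(w, b, X, y, lam, penalize_bias=True)

# The two costs differ only by lam / (2m) * b**2, one term out of 1001.
print(cost_without_b, cost_with_b, cost_with_b - cost_without_b)
```

With a thousand weights and a single bias of similar magnitude, the bias contributes about one part in a thousand of the penalty, which is why leaving it out rarely changes anything.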
Why does regularization reduce overfitting?
The regularization term penalizes the weight matrices for being too large. One piece of intuition is that if you crank the regularization parameter λ up to be really, really big, training will be incentivized to set the weight matrices W reasonably close to zero. So maybe it sets the weights so close to zero for a lot of hidden units that it basically zeroes out most of the impact of those hidden units. And if that’s the case, this much simplified neural network behaves like a much smaller neural network.
The intuition of completely zeroing out a bunch of hidden units isn’t quite right. What actually happens is that the network still uses all the hidden units, but each of them has a much smaller effect. You still end up with a simpler network, as if you had a smaller one, which is therefore less prone to overfitting.
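The shrinking effect can be seen in a small experiment. The sketch below uses logistic regression (effectively a single unit) as a stand-in for a full network, trains it with plain gradient descent, and assumes an L2 penalty whose gradient is (lam / m) * w. The `train` helper, the toy dataset, the learning rate, and the step count are hypothetical choices made only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lam, lr=0.1, steps=2000):
    """Gradient descent on cross-entropy plus (lam / (2m)) * ||w||^2; b is not penalized."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n, 0.0
        for x_i, y_i in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
            for j in range(n):
                grad_w[j] += err * x_i[j] / m
            grad_b += err / m
        # The L2 penalty adds (lam / m) * w to the gradient ("weight decay").
        w = [wj - lr * (gj + lam / m * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical, linearly separable toy data.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5], [0.5, -0.3], [-0.4, 0.6]]
y = [1, 1, 0, 0, 1, 0]

lams = [0.0, 1.0, 10.0]
weights = [train(X, y, lam)[0] for lam in lams]
norms = [math.sqrt(sum(wj ** 2 for wj in w)) for w in weights]

for lam, w, norm in zip(lams, weights, norms):
    print(f"lambda={lam:5.1f}  w={w}  ||w||={norm:.4f}")
```

As lambda grows, the weights get smaller but none of them becomes exactly zero, which matches the corrected intuition: every input still contributes, just with less effect.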