Great article on the difference between L1 and L2 regularization.

I find this to be a pretty complex topic, but I think that this article explains the differences very intuitively.

http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/



What is the simple intuition?

I am by no means an expert, but here is some basic intuition:

1. L1 regularization (an absolute-value penalty, analogous to a least absolute errors loss) produces sparse solutions, and therefore has built-in feature selection
2. L2 regularization (a squared penalty, analogous to a least squares loss) does not; both penalty terms are sketched in code just below
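
To make those two penalty terms concrete, here is a minimal sketch in Python with numpy (the regularization strength, usually called lambda, is left out so the focus stays on the norms themselves):

import numpy as np

def l1_penalty(weights):
    # L1: sum of the absolute values of the weights
    return np.sum(np.abs(weights))

def l2_penalty(weights):
    # L2: sum of the squared weights
    return np.sum(weights ** 2)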

Suppose you had 6 weights, and your L1 regularization term had a choice between a few large weights with the rest set to zero, or many smaller weights:

L1 = |0| + |0| + |0| + |-5| + |0| + |1.4| = 6.4
OR
L1 = |1.2| + |1.3| + |0.8| + |2.4| + |1.8| + |1.4| = 8.9

In this case, your optimization algorithm would converge towards the sparse weights, because the absolute-value term makes them the cheaper option (6.4 vs. 8.9).

Now let's take a look at the same situation with the L2 norm:

L2 = 0^2 + 0^2 + 0^2 + (-5)^2 + 0^2 + 1.4^2 = 26.96
OR
L2 = 1.2^2 + 1.3^2 + 0.8^2 + 2.4^2 + 1.8^2 + 1.4^2 = 14.73

Now, your optimization algorithm would prefer the smaller but non-sparse weights, since squaring makes the single large weight far more expensive (26.96 vs. 14.73).
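
If you want to check these numbers yourself, a quick numpy snippet (using the same two weight vectors as above) reproduces them:

import numpy as np

sparse = np.array([0, 0, 0, -5, 0, 1.4])
dense = np.array([1.2, 1.3, 0.8, 2.4, 1.8, 1.4])

print(np.sum(np.abs(sparse)), np.sum(np.abs(dense)))  # L1: 6.4 vs 8.9
print(np.sum(sparse ** 2), np.sum(dense ** 2))        # L2: 26.96 vs 14.73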

Thus, you can see how L1 prefers sparse weights, while L2 prefers smaller, non-sparse weights: the squared penalty punishes one large weight much more than many small ones, while the absolute-value penalty rewards driving weights all the way to zero.
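
You can watch the same effect play out in an actual model. Here is a small sketch using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data where only a few features carry signal; the exact numbers will vary, but Lasso typically drives the uninformative coefficients to exactly zero while Ridge only shrinks them:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Toy regression problem: 10 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

print("Lasso:", np.round(lasso.coef_, 2))  # mostly exact zeros
print("Ridge:", np.round(ridge.coef_, 2))  # small but non-zero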


Additional Resources:

Great Quora answers to this question: https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization

Why does the L1 norm produce sparse solutions?: https://medium.com/mlreview/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a

Again, the article from above: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/


