Great article on the difference between L1 and L2 regularization.
I find this to be a pretty complex topic, but I think that this article explains the differences very intuitively. http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

What is the simple intuition? I am by no means an expert, but here is some basic intuition for why:

1. L1 regularization (a penalty on the absolute values of the weights) produces sparse solutions, and therefore has built-in feature selection
2. L2 regularization (a penalty on the squared weights) does not

Suppose you had 6 weights, and your L1 regularization term had a choice between a few sparse weights or many smaller weights:

L1 = |0| + |0| + |0| + |-5| + |0| + |1.4| = 6.4

OR

L1 = |1.2| + |1.3| + |0.8| + |2.4| + |1.8| + |1.4| = 8.9

In this case, your optimization algorithm would converge towards the sparse solution, because its absolute-value penalty is smaller (6.4 < 8.9).

Now let's take a look at the situation with the L2 norm:

L2 = 0^2 + 0^2 + 0^2 + (-5)^2 + 0^2 + 1.4^2 = 26.96

OR

L2 = 1.2^2 + 1.3^2 + 0.8^2 + 2.4^2 + 1.8^2 + 1.4^2 = 14.73

Here the comparison flips: the many-small-weights solution now has the smaller penalty (14.73 < 26.96), because squaring punishes a single large weight like -5 far more than several modest ones. So L2 spreads the weight across all features rather than zeroing some out, which is why it does not perform feature selection.
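If you want to see this effect rather than just take the arithmetic on faith, here is a minimal sketch. It first verifies the penalty sums above with NumPy, then fits scikit-learn's Lasso (L1) and Ridge (L2) on a synthetic problem where only two of six features matter. The data, the true weight vector, and the alpha values are illustrative assumptions on my part, not something from the article.

```python
# Minimal sketch: L1 vs L2 penalties, then sparsity in practice.
# The synthetic data and alpha values below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

sparse_w = np.array([0, 0, 0, -5, 0, 1.4])
dense_w = np.array([1.2, 1.3, 0.8, 2.4, 1.8, 1.4])

# L1 penalty = sum of absolute values; L2 penalty = sum of squares.
print("L1:", np.abs(sparse_w).sum(), "vs", np.abs(dense_w).sum())  # 6.4 vs 8.9
print("L2:", (sparse_w**2).sum(), "vs", (dense_w**2).sum())        # 26.96 vs 14.73

# Synthetic regression where only features 3 and 5 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ sparse_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("Lasso coefs:", np.round(lasso.coef_, 2))  # irrelevant weights driven to exactly 0
print("Ridge coefs:", np.round(ridge.coef_, 2))  # irrelevant weights small but nonzero
```

Running this, you should see the Lasso coefficients for the four irrelevant features land at exactly 0, while the Ridge coefficients for the same features stay small but nonzero, which is precisely the sparse-vs-spread behavior the two penalty sums predict.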