Posts

Showing posts from 2014

Great article on the difference between L1 and L2 regularization.

I find this to be a pretty complex topic, but I think that this article explains the differences very intuitively: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

What is the simple intuition? I am by no means an expert, but here is some basic intuition for why:

1. L1 regularization (least absolute errors) produces sparse solutions, and therefore has built-in feature selection
2. L2 regularization (least squares error) does not

Suppose you had 6 weights, and your L1 regularization term had a choice between a few sparse weights or many smaller weights:

L1 = |0| + |0| + |0| + |-5| + |0| + |1.4| = 6.4
OR
L1 = |1.2| + |1.3| + |.8| + |2.4| + |1.8| + |1.4| = 8.9

In this case, your optimization algorithm would converge toward the few sparse weights, because they give the smaller penalty under the absolute value term. Now let's take a look at the same situation with the L2 norm:

L2 = 0^2 + 0^2 + 0^2 + (-5)^2 + 0^2 + 1.4^2 = 26.96
OR
L2 = 1.2^2 + 1.3^2 + .8^2 + 2.4^2 + 1.8^2 + 1.4^2 = 14.73
...
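The arithmetic above is easy to check yourself. Here is a short Python sketch (the weight vectors are the ones from the example) comparing the L1 and L2 penalties on a sparse versus a dense set of weights:

```python
# Compare L1 and L2 penalties for a sparse vs. a dense weight vector.
sparse = [0, 0, 0, -5, 0, 1.4]
dense = [1.2, 1.3, 0.8, 2.4, 1.8, 1.4]

def l1(weights):
    """Sum of absolute values (L1 penalty)."""
    return sum(abs(w) for w in weights)

def l2(weights):
    """Sum of squares (L2 penalty)."""
    return sum(w ** 2 for w in weights)

print(l1(sparse), l1(dense))  # ~6.4 vs ~8.9  -> L1 prefers the sparse vector
print(l2(sparse), l2(dense))  # ~26.96 vs ~14.73 -> L2 prefers the dense one
```

Notice the penalties flip: L1 is minimized by zeroing out weights, while L2 is minimized by spreading the same magnitude over many small weights, which is why L1 gives you feature selection for free.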

Simple logistic regression model and ROC Curve with R + intuitive explanation of ROC curve

I recently discovered a cool package in R called pROC that is very convenient for making ROC curves. Here is an example of how to implement it in R after creating a model. I've included the entire process, starting from splitting the data into training and test sets, fitting the model, validating by predicting the test set, and finally drawing the ROC curve.

library(sqldf)
library(ggplot2)
library(reshape)
library(pROC)

setwd('/home/willie/data_science/XXXXX')
system('ls')

### generate training and test set using 80/20 split ###
splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/5))  # hold out 1/5 for testing
  trainset <- dataframe[-trainindex, ]
  testset <- dataframe[trainindex, ]
  list(trainset=trainset, testset=testset)
}

splits <- splitdf(mydata, seed = 888)
training_data <- splits$trainset
test_data <- splits$testset

### create a logistic regression model m...
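For intuition about what pROC is drawing: an ROC curve plots the true-positive rate against the false-positive rate as the classification threshold sweeps from high to low. Here is a minimal pure-Python sketch of that sweep, using made-up scores and labels (not output from the model above):

```python
# Sweep thresholds over predicted probabilities and record (FPR, TPR) pairs.
# The labels and scores below are made up for illustration.
labels = [1, 1, 0, 1, 0, 0, 1, 0]                     # true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1]   # model probabilities

def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct threshold, high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / neg, tp / pos))
    return points

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A model that ranks positives above negatives pushes these points toward the top-left corner; a random model traces the diagonal. That's all the curve is.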

Vectorizing images in Python for Machine Learning image classification

I'm doing the Astrozoo competition on Kaggle right now. The goal is to create an algorithm to accurately classify galaxies into different types. The training set is a zip file consisting of over 50,000 images. Before you do anything, the first step is to convert the images into a usable format for analysis, such as resizing each image to 100x100 and converting it into a vector of data points. Here is the code I've pieced together to do that. Running this code will convert all JPEG files in a specified folder into vectors and dump them into a .txt file.

#take all files in one directory and move them into another directory
import os
import glob #allows you to look at file names
import PIL #allows you to manipulate images (for resize)

#main script variables
orig_dir = '/home/luwei/Desktop/Dropbox/Kaggle/astrozoo/testing' #where the images are located
crop_dimensions = (140, 140, 284, 284) #dimensions to crop each image by
pixelsize = 5 #set s...
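Since the full script is cut off above, here is a toy, pure-Python sketch of the core idea on a tiny 4x4 "image": downsample by averaging blocks of pixels, then flatten the result into a feature vector. (A real pipeline would crop and resize with PIL as in the script; the image and block size here are made up.)

```python
# Toy 4x4 grayscale "image" as nested lists; downsample to 2x2 by
# averaging 2x2 blocks, then flatten into a feature vector.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [50, 50, 90, 90],
    [50, 50, 90, 90],
]

def downsample(image, factor):
    """Average factor-by-factor pixel blocks and return a flat vector."""
    size = len(image) // factor
    out = []
    for row in range(size):
        for col in range(size):
            block = [image[row * factor + r][col * factor + c]
                     for r in range(factor) for c in range(factor)]
            out.append(sum(block) / len(block))
    return out

vector = downsample(image, 2)
print(vector)  # [10.0, 200.0, 50.0, 90.0]
```

Each image becomes one fixed-length row of numbers, which is exactly the format a classifier expects: 50,000 images in, a 50,000-row feature matrix out.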