luwei likes data science

Posts

Showing posts from 2013

Prime Number Algorithms with R.

By Luwei March 22, 2013

I've recently been read up on prime number theory, which has inspired me to investigate some prime number algorithms myself. First of all, the most obvious way to solve for prime numbers is the brute force method. I.e if you are evaluating whether a number (n) is prime, check if it can be factorized by integers from 2 to n-1. If not, then it is a prime. Needless to say, this is the most inefficient method of finding prime numbers. A more efficient algorithm for finding primes is the Sieve of Eratosthenes. The way that it works is very intuitive. You basically only evaluate primality on numbers less than the square root of n, which allows you to rule out all multiples of those prime numbers. After you have done this, what you are left with is only the prime numbers up to the number n. So why only up to the square root of n? How can you be sure that integers above the square root of n are prime numbers if you don't check them individu...

7-day rolling average within SQL

By Luwei March 22, 2013

In my opinion, the most straightforward way of doing data analysis is to pull the data from SQL, export into Excel, and do the analysis from there. However, Excel 2003 has a limit of roughly 65,000 rows while Excel 2007 and later have limits of about 1 million rows. This is far from enough relative to the billions of rows of data that many companies work with today. Therefore, it is important to know how to manipulate and aggregate data within SQL before exporting it into Excel. For example, suppose you have a table named "user_sessions" structured like the following. Each row represents a distinct user session on a given day. There can only be one user session per customer per day. But a given customer can have more than one user session over multiple days. user_id date 6581 1/17/2012 3933 2/10/2012 10124 2/23/2012 1721 3/1/2012 11608 2/27/2012 11927 3/11/2012 ...

Using recursive SQL to solve the parent-child problem

By Luwei March 22, 2013

I recently came across an interesting SQL problem: Suppose you are given a table with two columns. The first column contains names of parents, and the second column contains names of their children. For example, you have the following: Parent Child Billy Sarah Sarah Casey Cornelius Billy In this example, you have Cornelius as the father of Billy, who is the father of Sarah, who is the Mother of Casey. Therefore, Cornelius is the great grandfather, or ultimate ancestor of Casey. But how would you determine this if you had a table with thousands or millions of rows in length? Would it be possible to write SQL code that generates the name of the ultimate parent in the first column and the corresponding child in the second column? Intuitively, you would want to find instances where the child column has commonalities with the parent column. For example, if you duplicated the above table and joined parent column...