
Jun 9, 2013

How to quickly read a large dataset in R?


Here and there, I have read about many techniques for importing a large dataset into R.
Plain read.table or read.csv doesn't always work because, as discussed here, R loads the data into memory. Sometimes, when we try to load a big dataset, we get this message:

Warning messages: 
1: Reached total allocation of 8056Mb: see help(memory.size)
2: Reached total allocation of 8056Mb: see help(memory.size) 

Many techniques can be used to load a large dataset; I found some here and there. But there are two techniques that I had never thought of before.
Suppose that we have a large dataset with 10 million rows.

Let's compare the methods for loading the data in R.
- Using read.table

read.csv() performs a lot of analysis of the data it is reading in order to determine the data types. So we can help R by reading the first rows, determining the data type of each column, and then reading the big file while providing the type of each column and/or dropping the columns that are not needed for the analysis.
Example
First we try to read the big data file (10 million rows) directly:
> system.time(df <- read.table(file="bigdf.csv", sep=",", dec="."))
Timing stopped at: 160.85 0.75 161.97 

I let this run for a long while but got no answer, so I stopped it.

With the first technique, we load the first rows, determine the data types, and then run read.table on the full file, indicating the data type of each column.
> system.time(ds <- read.table("bigdf.csv", nrows=100, dec=".", sep=","))
   user  system elapsed 
      0       0       0 
> classes <- sapply(ds, class)
> classes
       V1        V2        V3        V4 
"integer"  "factor"  "factor"  "factor" 

> system.time(ds <- read.table("bigdf.csv", dec=".", sep=",", colClasses=classes))
user  system elapsed 
234     432    128
As we can see, this technique is not very interesting here: it even takes longer.
- Using the sqldf package

> require(sqldf)
> f <- file("bigdf.csv")
> system.time(SQLf <- sqldf("select * from f", dbname = tempfile(),
+                           file.format = list(header = T, row.names = F)))
Loading required package: tcltk
   user  system elapsed 
  53.64    4.17   58.20 

Less than 1 minute to import 10 million rows, and the resulting object has a size of
> print(object.size(SQLf), units="Mb")
267 Mb

- Using the data.table package (fread)
> require(data.table)
Loading required package: data.table
data.table 1.8.8  For help type: help("data.table")
> system.time(DT <- fread("bigdf.csv"))
   user  system elapsed 
 133.11    0.56  133.93 

But DT is a data.table, and a bit of transformation is required before using it as a data.frame, for example with ddply from the plyr package (see the sketch below).
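
For instance, a minimal sketch of that conversion (assuming DT was read with fread as above):

> # convert the data.table to a plain data.frame before handing it to plyr::ddply
> df <- as.data.frame(DT)
> class(df)
[1] "data.frame"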

So, the point is: the sqldf package is very useful to quickly read a large dataset in R. 10 million rows in less than a minute.


Jun 1, 2013

How does logistic regression work?

Discussing with a non-statistician colleague, it seemed that logistic regression is not intuitive. Some basic questions came up, like:
 - Why not use the linear model?
 - What is the logistic function?
 - How can we compute it by hand, step by step, to understand what the glm function is doing?

This post aims to answer those questions; maybe it will help.

Suppose that we have this data : http://www.info.univ-angers.fr/~gh/wstat/pg.dar

 ID TAILLE GROUPE
1 A01    130      0
2 A02    140      0
3 C01    162      0
4 C02    160      1
5 A03    136      0
6 C03    165      1
 
and we want to predict the group according to the height (TAILLE). The problem could just as well be the level of risk according to age, or the customer segment according to transaction amounts, etc.
A quick reminder.
When we compute a linear model (let's assume just one predictor: simple linear regression), we have: E(y) = a0 + a1*x1. Linear regression, like all regressions, focuses on the conditional distribution of Y given X.

The first thing one generally does is to plot groupe = f(taille).
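A minimal sketch of that plot (assuming the data has been read into a data frame called don, the name used further below):

> # 0/1 group plotted against height
> plot(don$TAILLE, don$GROUPE, xlab = "TAILLE", ylab = "GROUPE", pch = 19)
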
The idea of the Generalized Linear Model (logistic regression is a particular case) is to replace E(Y) by something else.
For our example, we are interested in the probability that a person belongs to group 0 or 1.
So, instead of E(y) = a0 + a1*x1, we seek P(Groupe == 1) = a0 + a1*Taille. But a probability lives in the interval [0,1] while the right-hand side can take any real value, so we have to transform the left-hand side using a bijection between [0,1] and the real line. That means we need a "link" function that lets us work on the whole real line.

The most common link function in logistic regression is the logit: logit(p) = log(p/(1-p)). But one can also use the inverse of the normal distribution (probit), the complementary log-log link, and so on.
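
As a quick illustration (just a sketch, using base R's qlogis and plogis), the logit maps probabilities in (0,1) to the whole real line, and its inverse maps back:

> p <- c(0.1, 0.5, 0.9)
> qlogis(p)          # logit(p) = log(p/(1-p)), values anywhere on the real line
[1] -2.197225  0.000000  2.197225
> plogis(qlogis(p))  # the inverse logit maps back to probabilities
[1] 0.1 0.5 0.9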

The method used to fit a logistic regression is maximum likelihood estimation (MLE).
Read this post.
To sum up:
- Suppose that, in the population we are sampling from, each individual has a probability p of being in groupe 1 (and 1 - p of being in groupe 0), where p depends on the individual's height through the model.
- The likelihood is the joint probability of the data: L = Product( p**{Groupe = 1} * (1 - p)**{Groupe = 0} )
(** means power)
In practice, we work with the log-likelihood, which is easier to maximize.



How to interpret the likelihood?

When we try to assign a group to a new individual, it is natural to assign the group that has the highest probability given the height.


Applying MLE to perform the logistic regression can be done with a small function, fit.logis, built around R's optim function.
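
A minimal sketch of such a fit.logis function, assuming it maximizes the log-likelihood with optim and derives the standard errors from the Hessian (the exact code may differ):

# Sketch of fit.logis: maximum likelihood for a simple logistic regression
# P(y = 1) = plogis(a + b*x), fitted with optim.
fit.logis <- function(y, x) {
  negloglik <- function(par) {
    p <- plogis(par[1] + par[2] * x)          # P(y = 1) for each observation
    p <- pmin(pmax(p, 1e-10), 1 - 1e-10)      # guard against log(0)
    -sum(y * log(p) + (1 - y) * log(1 - p))   # minus the log-likelihood
  }
  # minimize the negative log-likelihood; keep the Hessian for standard errors
  opt <- optim(c(0, 0), negloglik, method = "BFGS", hessian = TRUE)
  se <- sqrt(diag(solve(opt$hessian)))        # SEs from the observed information
  data.frame(coef.est = round(opt$par, 3),
             std.err  = round(se, 3),
             row.names = c("a", "b"))
}
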
> Test = fit.logis(y=don$GROUPE,x=don$TAILLE)
> Test
  coef.est std.err
a  -27.190   8.885
b    0.181   0.058

 
We can get the same output using the glm function with the "binomial" family.
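The model can be fitted with a call like this (matching the Call echoed in the output below):

> viaglm <- glm(don$GROUPE ~ don$TAILLE, family = "binomial", data = don)

Then we print the fitted object: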
> viaglm

Call:  glm(formula = don$GROUPE ~ don$TAILLE, family = "binomial", data = don)

Coefficients:
(Intercept)   don$TAILLE  
   -27.2103       0.1812  

Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
Null Deviance:     38.19 
Residual Deviance: 10.89  AIC: 14.89


So, we can see that our optimisation via the optim function is essentially equivalent to the glm function.
Just have a look at the predicted probabilities, for instance:
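
A quick sketch of that comparison (using the coefficients printed above and the fitted viaglm object):

> # predicted probabilities from the hand-made optim fit...
> p.optim <- plogis(-27.190 + 0.181 * don$TAILLE)
> # ...and from glm
> p.glm <- predict(viaglm, type = "response")
> # the two sets of probabilities should be nearly identical
> summary(abs(p.optim - p.glm))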

Maybe this helps to understand how it works!