About Me

Just wasting time sharing knowledge about Big Data and Analytics.

Jul 16, 2013

What are my chances to talk to this girl? Fisher or Bayes

Robert Matthews said: "Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug." He's right.
In a previous life, I wrote a thesis in philosophy, in a specific area: epistemology, also called the theory of knowledge, because it asks what knowledge is, how it can be acquired, and to what extent any given subject or entity can be known.

My thesis dealt with the tradition, since Cournot, of applying mathematical modelling to the social sphere, and more specifically with how climate modelling interacts with interdisciplinary sources of knowledge (mathematics, physics, geography, philosophy).

After reading this: http://www.academia.edu/1075253/Climate_Change_Epistemic_Trust_and_Expert_Trustworthiness
it seems to me that the use of Bayesian statistics is often misunderstood.

Consider two theses:
  • Bayesian statistics is the science that deals with the degree of proof carried by the observations. In other words, Bayesian statistics is a self-contained paradigm providing tools and techniques for all statistical problems.
  • In the classical frequentist view of statistical theory, a statistical procedure is judged by averaging its performance over all possible data sets, whereas the Bayesian approach gives prime importance to how a given procedure performs on the actual data observed in a given situation.

The core of this theory was formalised by Karl Popper with two main principles:
  1. Knowledge cannot start from nothing (from a tabula rasa), nor from observation alone. The advance of knowledge consists mainly in the modification of earlier knowledge. Although we may sometimes advance through a chance observation, for example in archaeology, the significance of the discovery will usually depend upon its power to modify our earlier theories.
  2. Any probability is a degree of belief about something; it is not a property of the thing itself. This means that a scientific model produces data either absolutely or conditionally on a probability, and that we can measure how the data modify our degrees of belief, which is Bayes' principle (see the formula just below).
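In symbols, Bayes' principle is just the rule applied in the computation further down: p(H | y) = p(y | H) · p(H) / p(y), where p(y) is obtained by summing p(y | H) · p(H) over all the hypotheses considered.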
One result of frequentist theory, perhaps the most criticized, is Fisher's p-value.


A problem, for instance, could be to evaluate how important the diploma is for getting a first job.

Hypothesis:

- The diploma has no effect on the first job. In the frequentist approach, we compute the p-value, which can be
interpreted as the probability of observing a difference at least as large as the one seen in the data if our hypothesis is true (a toy example follows).
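As a toy illustration of that interpretation (the counts below are invented, purely to show the mechanics in R), we could compare the proportions of graduates and non-graduates who found a first job:

# invented counts, for illustration only:
# 140 of 200 graduates vs 120 of 200 non-graduates found a first job
prop.test(x = c(140, 120), n = c(200, 200))
# the reported p-value is the probability of a difference at least this large
# between the two proportions if the "no diploma effect" hypothesis were true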

Read more: http://www.answers.com/topic/bayesian-statistics#ixzz2UKZDkq3v

So, let us talk about the p-value with this problem: if you cross paths with a beautiful girl seven times out of ten days at the same place, can you conclude that she is always there, and that tomorrow you will again have a chance to speak with her?
We are going to examine both approaches: Fisher (based on the p-value) and Jeffreys (Bayesian).

  • With Fisher, we have:
H0 : p = 0.5 and H1 : p ≠ 0.5,
where p is "the probability of crossing paths with the girl". If we reject H0 (Fisher), we can conclude that the girl is always there.
The p-value is:


> 2*pbinom(3,10,0.5)   # two-sided: P(X <= 3) = P(X >= 7) when p = 0.5
[1] 0.34375
which is greater than 5% (the arbitrary threshold that everybody likes). Indeed, we cannot conclude... Thanks, Fisher; I can't trust you.
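For reference, the built-in exact binomial test returns the same two-sided p-value:

> binom.test(7, 10, p = 0.5)$p.value
[1] 0.34375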

With Bayes, we take our two hypotheses again and we suppose p(H0) = p(H1) = 0.5: a priori, both are equally likely. With the Bayesian approach, we need to compute the likelihood, which requires a prior distribution; in our example, it can be summed up as "your prior belief about crossing paths with her again and talking to her".


Let's compute this:
> (p <- c(0.5,0.6,0.7))                    # candidate values for the probability of crossing paths
[1] 0.5 0.6 0.7
> (apriori <- c(0.5,0.3,0.2))              # prior weights on each candidate value
[1] 0.5 0.3 0.2
> (vraisemblance <- dbinom(7,10,p))        # likelihood of 7 encounters out of 10 days
[1] 0.1171875 0.2149908 0.2668279
> (loi.jointe <- apriori*vraisemblance)    # joint weights: prior x likelihood
[1] 0.05859375 0.06449725 0.05336559
> (p.y <- sum(loi.jointe))                 # marginal probability of the data
[1] 0.1764566
> (aposteriori <- loi.jointe/p.y)          # posterior weights (Bayes' rule)
[1] 0.3320576 0.3655134 0.3024290
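To make the conclusion explicit, the posterior weight of the "she is usually there" hypothesis (p > 0.5) is the sum of the last two posterior values:

> sum(aposteriori[p > 0.5])
[1] 0.6679424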
 
So, the conditional probability p(H0 | y) is about 33% and p(H1 | y) about 67%.
There are roughly two chances out of three that the girl is usually there; in other words, the odds of crossing paths with her tomorrow look good.
The Bayesian approach is hopeful. I like it; I can take my time with the girl next door.
 
 
 
 

Jul 10, 2013

Linear discriminant analysis or logistic regression

Suppose we have the IRIS zones of Paris (IRIS are the French census sub-districts; population > 100k inhabitants) and we want to be able to classify them according to their socio-demographic characteristics:
  • Population
  • Unemployment rate
  • Students
  • Socio-professional categories (CSP)
  • etc.
Once the IRIS zones are classified, the question is whether this typology can be carried over to another large city (Lyon, for example): we would then need a model that assigns IRIS zones, according to their respective characteristics, to the predefined classes. In this kind of situation, the choice of method generally depends on... not much, as indicated here. Discriminant analysis, logistic regression, or an SVM, to name only these three methods, would all do the job; but are the results identical, or better, which method makes the fewest errors, and why?

We will test two methods here:
  • Discriminant analysis
  • Multinomial logistic regression
The program:
# Step 1: dimension reduction with a PCA, to reduce the number of variables
# Step 2: hierarchical clustering (CAH) on the PCA factors --> definition of the classes
# Step 3: discriminant analysis to find the class-assignment rules
# Step 4: logistic regression to find the same assignment rules
# Step 5: comparison of the two methods on a test sample, measuring the misclassification rate




1. Dimension reduction

# Step 1: dimension reduction
require(ade4)
Vm.acp <- dudi.pca(df=Vm, center=TRUE, scale=TRUE, scannf=FALSE)
# Scree plot of the eigenvalues
screeplot(Vm.acp, type="l", main="Choosing the number of axes")
nbaxes = 3
Vm.acp <- dudi.pca(df=Vm, center=TRUE, scale=TRUE, nf=nbaxes, scannf=FALSE)
s.corcircle(Vm.acp$co, xax=1, yax=3, clabel=0.7, sub="Correlation circle of the variables", possub="topright")
summary(Vm.acp)
New.vm <- Vm.acp$li   # coordinates of the observations on the retained axes
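As a complement to the scree plot, the eigenvalues stored in the dudi object give the cumulative share of variance carried by the axes, which helps justify keeping three of them (a small check, not shown in the original output):

# cumulative share of variance; dudi objects store the eigenvalues in $eig
cumsum(Vm.acp$eig) / sum(Vm.acp$eig)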
2. Hierarchical clustering (CAH) on the PCA factors

# Hierarchical clustering (CAH) on the PCA factors
preclus <- dist(New.vm)^2   # the clustering operates on (squared) distances
cah.vm <- hclust(d=preclus, method="ward")
# Visualising the dendrogram and cutting it into classes
par(mfrow=c(1,1))
nbclasses.fi = 3
groups <- cutree(cah.vm, k=nbclasses.fi)
plot(cah.vm)
rect.hclust(cah.vm, k=nbclasses.fi, border="blue")
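A small portability note: "ward" is the pre-R 3.1.0 name of the method. On recent versions of R, the equivalent call on these squared distances would be:

cah.vm <- hclust(d = preclus, method = "ward.D")   # or "ward.D2" on unsquared distances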
The clustering results

# Attach each IRIS zone to its group
> discrim.vm<-cbind(Vm,cluster=as.factor(groups))
> table(discrim.vm$cluster)

  1   2   3 
180 483 250 
 
3. Discriminant analysis for class assignment
 
> index <- sample(nrow(discrim.vm), nrow(discrim.vm)*.70)
> # Training sample
> appren <- discrim.vm[index, ]
> # Test sample
> test <- discrim.vm[-index, ]
> # Discriminant analysis on all columns except the cluster (column 29);
> # with MASS, a standard equivalent call would be lda(appren[,-29], grouping=appren$cluster)
> vm.disc = f.lda(appren[,-29], groups=appren$cluster)
> # Confusion matrix on the test sample
> pred.vm <- predict(vm.disc, newdata=test)
> # Error rate (note: the formula only counts the confusions between classes 1 and 2)
> Tx_err <- function(y,ypred){
+ mc <- table(y,ypred)
+ error <- 100*(mc[1,2]+mc[2,1])/sum(mc)
+ print(mc)
+ print(paste(round(error,2),"%",sep =""))
+ }
> Tx_err(test$cluster,pred.vm$class)
   ypred
y     1   2   3
  1  47   4   1
  2   3 142   2
  3   0  11  64
[1] "2.55%"
 
A prediction error of around 2.5% with the discriminant analysis
 
4. With a multinomial logistic regression
 
> # With a multinomial logistic regression:
> library(nnet)
> logi=multinom(cluster~., data = appren)
# weights:  90 (58 variable)
initial  value 702.013252 
iter  10 value 113.648468
iter  20 value 75.506364
iter  30 value 69.704088
iter  40 value 65.256664
iter  50 value 62.358589
iter  60 value 61.152375
iter  70 value 60.246843
iter  80 value 59.528363
iter  90 value 58.957304
iter 100 value 58.452446
final  value 58.452446 
stopped after 100 iterations
> pp = predict(logi,newdata=test)
> Tx_err(test$cluster,pp)
   ypred
y     1   2   3
  1  51   0   1
  2   1 140   6
  3   1  12  62
[1] "0.36%"
 
 
Verdict: the multinomial logistic regression does much better than the discriminant analysis.
 
Some authors had already started to address the question. They saw linear discriminant analysis as a particular case of logistic regression (which I do not entirely agree with, since neither of these two methods frees itself from the strong normality assumption, and both rely on second-order moments every time).
A few conclusions that are nevertheless interesting:

1. When the number of parameters to estimate is large, a logistic regression can take about 1.5 times longer to fit than LDA, because of its iterative algorithm.
2. Logistic regression can be more accurate on small samples because, thanks to the reference categories, the number of parameters to estimate is smaller than with LDA (see the rough count below).
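To put rough numbers on those parameter counts (a back-of-the-envelope sketch, assuming 28 predictors, as suggested by the "58 variable" line in the multinom trace above):

p <- 28; K <- 3                # predictors and classes (p is an assumption)
(K - 1) * (p + 1)              # multinomial logit: 58 free coefficients
K * p + p * (p + 1) / 2        # LDA: K mean vectors + one pooled covariance matrix = 490 estimates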
 

GT
 

Jul 7, 2013

ggmap : Interesting toolbox for spatial analysis

ggmap is a new tool that enables this kind of visualisation by combining the spatial information of static maps from Google Maps, OpenStreetMap, Stamen Maps or CloudMade Maps with the layered grammar-of-graphics implementation of ggplot2.

The package was developed by David Kahle and Hadley Wickham, and the latest R Journal (Volume 5/1, June 2013) includes a very interesting paper about it that is well worth reading.

Let's use it to see where we can always buy tobacco at night in Paris.

The data come from data.ratp.fr and list all the dealers authorised by RATP. The file does contain longitude and latitude columns, but those coordinates are in a Lambert projection. So, to show how ggmap works, we can extract the addresses, geocode them with the Google API, and then output a thematic map.
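As an aside, instead of re-geocoding, one could convert the Lambert coordinates already in the file to WGS84 with sp/rgdal. This is only a sketch: it assumes the coordinate columns are named X and Y and that the projection is Lambert II étendu (EPSG:27572), both of which would need to be checked against the file's documentation.

library(sp)
library(rgdal)
pts <- commerces
coordinates(pts) <- ~ X + Y                     # promote the data frame to a SpatialPointsDataFrame
proj4string(pts) <- CRS("+init=epsg:27572")     # assumed source projection (Lambert II étendu)
pts.wgs84 <- spTransform(pts, CRS("+init=epsg:4326"))
head(coordinates(pts.wgs84))                    # lon/lat directly usable by ggmap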

> require(ggmap)
> setwd("C:\\Users\\guibertt\\Desktop\\Geocodage - A faire")
> fic<-"commerces.csv"
> commerces<-read.csv(fic,header=T,sep=";", dec=",")
> head(commerces[,c(1,3:6)])
  DEA_CODE    ADRESSE_LIVRAISON DEA_CODE_POSTAL_LIVRAISON DEA_COMMUNE_LIVRAISON INSEE_LIVRAISON
1   510001       11   R. Mozart                     92230         Gennevilliers           92036
2   510002     81   Bd Voltaire                     92600    Asnières-sur-Seine           92004
3   510003 60   Av. Jean Moulin                     92390 Villeneuve-la-Garenne           92078
4   510004     2   Av. Michelet                     93400            Saint-Ouen           93070
5   510006      31   R. d'Anjou                     92600    Asnières-sur-Seine           92004
6   510007 45   R. Jules Larose                     92230         Gennevilliers           92036

We use the Google API (limited to 2,500 requests per day and per IP address) to geocode our places.
>ad<-as.vector(tabac$adresse2)
>system.time(gc <- geocode(ad,output='latlona',messaging=FALSE))
> head(gc)
       lon      lat                                        address
1 2.267043 48.85066         108 avenue mozart, 75116 paris, france
2 2.297897 48.84552 56 rue de la croix nivert, 75015 paris, france
3 2.276529 48.84276             20 rue cauchy, 75015 paris, france
4 2.293242 48.83890  157 rue de la convention, 75015 paris, france
5 2.285362 48.83406       37 boulevard victor, 75015 paris, france
>tabacp<-cbind(tabac,gc)
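Since that quota is easy to hit when re-running a script, ggmap also provides a helper to check how many geocoding queries are left for the day:

> geocodeQueryCheck()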
But don't forget that ggmap is just ggplot2 underneath, so we could get the same kind of map from OpenStreetMap, for example. We can now plot our places:
png("paris4.png", width=800,height=600)
map <- get_map(location = 'paris',zoom=13,maptype="roadmap",color="color",source="google")
mymap = ggmap(map, darken = c(.3,'white'))
mymap +
  stat_bin2d(
    aes(x = lon, y = lat, colour = typecommerce, fill = typecommerce),
    size = .5, bins = 30, alpha = 1/2,
    data = tabacp)
dev.off()



And finally, with facet_wrap, to split the map by type of shop:
png("paris5.png", width=800,height=600)
map <- get_map(location = 'paris',zoom=13,maptype="roadmap",color="bw",source="google")
mymap = ggmap(map, darken = c(.8,'white'))
mymap +
  stat_bin2d(
    aes(x = lon, y = lat, colour = TCO_LIBELLE, fill = TCO_LIBELLE),
    size = .5, bins = 30, alpha = 1/2,
    data = tabacp) +
  facet_wrap(~ TCO_LIBELLE)
dev.off()