Learning Data Science: How does logistic regression work?

While talking with a colleague who is not a statistician, I realised that logistic regression is not intuitive. Some basic questions came up:
- Why not use the linear model?
- What is the logistic function?
- How can we compute it by hand, step by step, to understand what the glm function is doing?

This post aims to answer those questions; maybe it will help.

Suppose that we have this data : http://www.info.univ-angers.fr/~gh/wstat/pg.dar

	url <- "http://www.info.univ-angers.fr/~gh/wstat/pg.dar"
	don <- read.table(url, header = TRUE)

   ID TAILLE GROUPE
1 A01    130      0
2 A02    140      0
3 C01    162      0
4 C02    160      1
5 A03    136      0
6 C03    165      1

and we want to predict the group (GROUPE) according to the height (TAILLE). The same kind of question arises when modelling the level of risk according to age, or the customer segment according to transaction amounts, etc.
A quick reminder.
When we fit a linear model (let's assume just one predictor: the simple linear model), we have: E(y) = a0 + a1*x1, where a0 is the constant (Cste). Linear regression, like all regressions, focuses on the conditional distribution of Y given X, here through its expectation.

The first thing one generally does is to plot GROUPE = f(TAILLE); we get:

The idea of the generalized linear model (logistic regression is a particular case) is to replace E(Y) by a function of it.
In our example, we are interested in the probability that a person is in group 0 or 1.
So, instead of E(y) = Cste + a1*x1, we seek P(Groupe == 1) = a0 + a1*Taille. But the left-hand side lies in [0,1] while the right-hand side can take any real value. To solve the problem, which is otherwise exactly the same as before, we have to transform the left-hand side using a bijection between the interval (0,1) and the set of real numbers R. That means finding a "link" function that lets us work in R.
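
To see why a plain linear model is awkward here, the sketch below fits lm() to the six rows printed earlier only (an assumption made to keep the example self-contained; the full file has more observations) and predicts at two heights. The heights 130 and 200 are arbitrary illustrative values.

```r
# Six rows copied from the head of the data shown earlier (not the full file)
toy <- data.frame(
  TAILLE = c(130, 140, 162, 160, 136, 165),
  GROUPE = c(0, 0, 0, 1, 0, 1)
)

# Ordinary linear model: E(GROUPE) = a0 + a1 * TAILLE
fit_lm <- lm(GROUPE ~ TAILLE, data = toy)

# Predictions at a small and a large height (illustrative values)
pred <- predict(fit_lm, newdata = data.frame(TAILLE = c(130, 200)))
print(pred)

# A probability must lie in [0, 1]; the straight line does not respect that
outside <- pred < 0 | pred > 1
print(outside)
```

On this toy subset both predictions fall outside [0,1], which is the motivation for transforming the left-hand side.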

The most commonly used link function in logistic regression is the logit: logit(p) = log(p/(1-p)). One can also use the inverse of the normal cumulative distribution function (probit) or the complementary log-log function, log(-log(1-p)).
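
As a small check of the bijection, base R already provides the logit and its inverse as qlogis() and plogis() (the quantile and distribution functions of the logistic distribution). The probabilities below are arbitrary illustrative values.

```r
p <- c(0.01, 0.25, 0.5, 0.75, 0.99)

# Logit: from (0, 1) to the real line
l <- log(p / (1 - p))
print(l)

# Same values from base R's qlogis()
same_logit <- isTRUE(all.equal(l, qlogis(p)))

# Inverse logit: back from the real line to (0, 1)
back <- exp(l) / (1 + exp(l))
round_trip <- isTRUE(all.equal(back, p)) && isTRUE(all.equal(plogis(l), p))
print(c(same_logit, round_trip))
```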

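	png("Log.png", width = 600, height = 700) # Size in pixels
	# Stay strictly inside (0, 1): the link functions are infinite at 0 and 1
	lo <- 0.00001
	hi <- 0.9999
	# Logit link: maps a probability to the real line
	logit <- function(t)
	{
	log(t / (1 - t))
	}
	curve(logit(x), from = lo, to = hi, col = "tomato", lwd = 2)
	curve(qnorm(x), from = lo, to = hi, col = "blue", lwd = 2, add = TRUE) # Probit
	curve(log(-log(1 - x)), from = lo, to = hi, col = "purple", lwd = 2, add = TRUE) # Complementary log-log
	a <- par("usr") # Limits of the plot region, used to place the legend top-left
	legend(a[1], a[4], c("logit", "probit", "cloglog"), col = c("tomato", "blue", "purple"), lty = 1)
	dev.off()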


The method used to fit a logistic regression is maximum likelihood estimation (MLE).
Read this post.
To sum up:
- Suppose that, in the population from which we are sampling, each individual has the same probability p of being in group 1 (and 1 - p of being in group 0)
- The likelihood is the joint probability of the data: L = Product(p ** {Groupe = 1} * (1 - p) ** {Groupe = 0})
Here ** means power, and {Groupe = 1} equals 1 when the individual is in group 1 and 0 otherwise.
In practice, we work with the log-likelihood, which turns the product into a sum.
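
As a minimal sketch of this idea, with a single common probability p, the log-likelihood can be maximised numerically and compared with the sample proportion, which is the known closed-form answer. The six GROUPE values are those printed at the top of the post; using only them is an assumption to keep the example short.

```r
y <- c(0, 0, 0, 1, 0, 1)  # GROUPE for the six rows shown at the top of the post

# Log-likelihood of a common probability p:
# log(p) for each individual in group 1, log(1 - p) for each one in group 0
loglik <- function(p, y) {
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Maximise over p in (0, 1)
opt <- optimize(loglik, interval = c(0.001, 0.999), y = y, maximum = TRUE)
p_hat <- opt$maximum

print(p_hat)    # numerical maximiser
print(mean(y))  # closed form: proportion of individuals in group 1
```

Logistic regression follows the same principle, except that p is allowed to depend on the predictor through the link function.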

How do we interpret the likelihood?

When we try to assign a group to a new ID, it is natural to assign the group that has the highest probability given the height.

Applying MLE to perform the logistic regression is done as follows:

	# Negative log-likelihood of the logistic model (optim minimizes it)
	lik.logit <- function(init, y, x)
	{
	x <- as.matrix(x)
	cste <- rep(1, length(x[, 1]))
	x <- cbind(cste, x) # Matrix of predictors, with a column of 1s for the constant
	d <- init[1:ncol(x)] # Parameter vector: one coefficient per column of x
	xd <- x %*% d # Matrix product: the linear predictor
	sum(y * log(1 + exp(-xd)) + (1 - y) * log(1 + exp(xd)))
	}

	fit.logis <- function(y, x)
	{
	init <- c(0, 1) # Starting values (constant, slope)
	logit.opt <- optim(init, lik.logit, y = y, x = x, hessian = TRUE)
	coef.est <- logit.opt$par
	varcov <- solve(logit.opt$hessian) # Inverse Hessian: approximate variance-covariance matrix
	et.est <- sqrt(diag(varcov)) # Standard errors
	res <- data.frame(cbind(round(coef.est, 3), round(et.est, 3))) # Coefficient estimates + standard errors
	rownames(res) <- letters[1:length(coef.est)]
	colnames(res) <- c("coef.est", "std.err")
	return(res)
	}
	Test <- fit.logis(y = don$GROUPE, x = don$TAILLE)
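
The sum returned by lik.logit is the negative log-likelihood. A short derivation, writing \eta_i for the i-th entry of xd and p_i for the probability that individual i is in group 1 (the symbols \eta_i and p_i are notation introduced here, they do not appear in the code):

```latex
\begin{aligned}
p_i &= \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}},
\qquad 1 - p_i = \frac{1}{1 + e^{\eta_i}}, \\
\log L &= \sum_i \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr], \\
-\log L &= \sum_i \bigl[\, y_i \log\bigl(1 + e^{-\eta_i}\bigr) + (1 - y_i) \log\bigl(1 + e^{\eta_i}\bigr) \,\bigr].
\end{aligned}
```

Minimising -log L with optim is the same as maximising the likelihood, and the inverse of the Hessian at the minimum gives the approximate variance-covariance matrix from which the standard errors are taken.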


> Test = fit.logis(y=don$GROUPE,x=don$TAILLE)
> Test
  coef.est std.err
a  -27.190   8.885
b    0.181   0.058
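
To read these estimates, one can plug them into the inverse logit. The sketch below uses the rounded values printed above; the heights 140 and 160 and the 0.5 cut-off are illustrative assumptions, and plogis() is base R's inverse logit.

```r
a <- -27.190  # estimate of the constant, as printed above
b <- 0.181    # estimate of the TAILLE coefficient, as printed above

# Estimated probability of being in group 1 at a given height
p_group1 <- function(taille) plogis(a + b * taille)

heights <- c(140, 160)
probs <- p_group1(heights)
print(round(probs, 3))

# Assign the more probable group (cut-off 0.5)
assigned <- as.integer(probs > 0.5)
print(assigned)

# Height at which both groups are equally likely: a + b * taille = 0
boundary <- -a / b
print(boundary)
```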

	# We compute the inverse logistic function
	ilogit <- function(l) {
	exp(l) / (1 + exp(l))
	}
	attach(don)
	viaglm <- glm(GROUPE ~ TAILLE, data = don, family = "binomial")
	png("Comparison.png", width = 1280, height = 800)
	plot(TAILLE, GROUPE, pch = 16)
	new <- seq(min(TAILLE), max(TAILLE), by = 1)
	# Predictions with glm
	prevviaglm <- predict(viaglm, data.frame(TAILLE = new), type = 'response')
	lines(new, prevviaglm, col = "salmon1", lwd = 6, lty = 2)
	# Predictions by hand
	lines(new, ilogit(Test$coef.est[1] + new * Test$coef.est[2]), col = "navyblue", lwd = 2)
	legend(.95 * par('usr')[1] + .05 * par('usr')[2], .9,
	c("With predict glm",
	"By hand"),
	col = c("salmon1", "navyblue"), lty = c(2, 1), lwd = c(6, 2))
	title(main = "Logistic regression by hand + with glm()")
	dev.off()


We can get the same output using the glm function with the "binomial" family.

>viaglm

Call:  glm(formula = don$GROUPE ~ don$TAILLE, family = "binomial", data = don)

Coefficients:
(Intercept)   don$TAILLE  
   -27.2103       0.1812  

Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
Null Deviance:     38.19 
Residual Deviance: 10.89  AIC: 14.89

So we can see that our optimisation via the optim function gives nearly the same estimates as the glm function.
Just have a look at the plot saved in Comparison.png, where the two curves should overlap.
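
As a rough numerical check of this equivalence, the sketch below compares the probabilities implied by the two sets of coefficients printed above over a grid of heights. The grid from 130 to 170 is an illustrative assumption, and the coefficients are the rounded printed values, not refitted ones.

```r
coef_optim <- c(-27.190, 0.181)    # from fit.logis (optim), as printed above
coef_glm   <- c(-27.2103, 0.1812)  # from glm, as printed above

taille <- seq(130, 170, by = 1)

# Predicted probability of group 1 under each set of coefficients
p_optim <- plogis(coef_optim[1] + coef_optim[2] * taille)
p_glm   <- plogis(coef_glm[1] + coef_glm[2] * taille)

# Largest disagreement between the two curves over the grid
max_diff <- max(abs(p_optim - p_glm))
print(max_diff)
```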

Maybe this helps to understand how it works!

Jun 1, 2013
