
Dec 30, 2013

Hadoop for the R Data Scientist

I don’t exactly know where to start. But after a really pleasant discussion with one of my ex-colleagues, it seems there are many things around the Hadoop ecosystem and R that should be said for data scientists, meaning someone who does not know much about big data architecture, but who should know the essentials of a simple architecture that allows running better analyses in the best conditions.

For sure, a good understanding of how R can use the Hadoop platform to run better analyses is very important.

A small plan :

- Hadoop  ecosystem

- R and Hadoop

- Launch a big data job from R using Hadoop



Hadoop ecosystem


We use Hadoop when:

- you need agility

- you need to perform analysis on a diversity of data sources

- your architecture needs to evolve over time

- you need to reduce your costs

Hadoop is:

- an ecosystem

- designed for storage and analytical computation

- designed for running analyses in parallel

When we talk about Hadoop, we deal with:

- HDFS (Hadoop Distributed File System): the core of the solution

- MapReduce: uses data from HDFS and executes algorithms based on the map-reduce paradigm

- High-level languages, Pig and Hive: query languages that embed the map-reduce paradigm to solve complex problems

- HBase: the layer on top of HDFS storage to build and manipulate data when random access is needed

Pig or Hive? I don’t know; use whichever you feel comfortable with.

You can form your own opinion by reading this http://stackoverflow.com/questions/3356259/difference-between-pig-and-hive-why-have-both, but it’s important to remember that as a data scientist, it is always better to get structured data. When we have to do this, we should think about which language best suits the job. Both languages run the map-reduce paradigm, and every operation is reduced to map and reduce steps.

Another thing to remind is:

- Hive: HQL is like SQL

- Pig: Pig Latin is a bit like Perl

In my next post, I will focus on Hive and Pig. For now, I just want to point out that Hive and Pig are components of Hadoop that a data scientist should know how to work with.

There’s a good post on the Internet which explains how to install Hadoop and how to connect it with R. In my opinion, the best is: Tutorial R and Hadoop, except the part which explains that we need Homebrew to set up the Hadoop environment.

Interact with Hadoop
In the terminal, we can run basic operations, for example, picking up data from the cluster and copying it locally so it can be loaded into memory for analysis with R:
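The screenshot that originally illustrated this step is gone; as a sketch, the terminal commands could look like this (the paths are illustrative, matching the copyToLocal example used later in this post, and should be adjusted to your install):

```shell
# Illustrative hadoop fs commands (paths are examples, adjust to your setup)
~/hadoop/bin/hadoop fs -ls ~/hadoop/data
~/hadoop/bin/hadoop fs -copyToLocal ~/hadoop/data/mag.csv ~/Documents/Recherches/test2.csv
```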
We can then load this data with an R command.
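A minimal sketch of that loading step (the toy CSV written below is only a stand-in for the real file copied out of HDFS):

```r
# Stand-in for the copied file: write a small CSV, then load it the same
# way we would load ~/Documents/Recherches/test2.csv after -copyToLocal
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(value = c("F", "V", "R"), note = c(4.2, 9.5, 13.8)),
          tmp, row.names = FALSE)
mag <- read.csv(tmp)  # for the real file: read.csv("~/Documents/Recherches/test2.csv")
dim(mag)              # 3 rows, 2 columns
```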

We can also do this in R terminal without dealing with the Mac's terminal
> system("~/hadoop/bin/hadoop fs -copyToLocal ~/hadoop/data/mag.csv ~/Documents/Recherches/test2.csv")

The main hadoop command can be find out there : http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html

A small map reduce job

Map reduce is a programming pattern which aids in the parallel analysis of data.
The name comes from the two parts of the algorithm:

map = identify the subject of the data by key
reduce = group by the identified key and run the analysis
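To build intuition before touching Hadoop, the same two steps can be mimicked in plain R (this toy snippet is mine, with no Hadoop involved): map tags each record with a key, and reduce aggregates the values per key, which is exactly what tapply does.

```r
# Toy illustration of the map/reduce idea, with no Hadoop involved
notes  <- c(4.2, 9.5, 6.3, 9.2, 13.8, 5.2)
keys   <- c("F", "V", "F", "V", "R", "V")  # "map": emit a key per record
by_key <- tapply(notes, keys, mean)        # "reduce": aggregate per key
round(by_key, 2)
# F: 5.25, R: 13.8, V: 7.97
```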

There are many packages to run map reduce jobs in R:

* HadoopStreaming
* Hive
* Rhipe
* RHadoop (with rmr2, rhdfs, rhbase), maintained by Revolution Analytics, which provides good functions to interact with the Hadoop environment

For sure, I am overlooking many other good packages.

Let us introduce how to run a simple map reduce job using RHadoop.
Suppose that we have this data:
> x=sample(rep(c("F","V","R"),10000),size=1000000,replace=T)
> df=data.frame(value=x, note=abs(9*rnorm(1000000)))
> head(df)
  value      note
1     F  4.209874
2     V  9.587087
3     F  6.323354
4     V  9.274668
5     R 13.886767
6     V  5.273159
> dim(df)
[1] 1000000       2

And we want to determine which values have a mean "note" greater than the mean over all values.
If we do this using plain R, we would run:
> meanAll = mean(df$note)
> meanAll
[1] 7.18068
> meanGroup<-aggregate(x=df$note,by=list(df$value),FUN=mean)
> meanGroup
  Group.1        x
1       F 7.170956
2       R 7.189213
3       V 7.181848
> index =meanGroup$x>=meanAll
> index
[1] FALSE  TRUE  TRUE
> meanGroup$Group.1[index]
[1] R V

If we want to do this with map reduce, we will do something like this (note that to.dfs must receive the df data frame built above, and the map function receives the data frame chunk as its value):
library(rmr2)
demo <- to.dfs(df)              # push the data frame to HDFS
monMap <- function(k, v)
{
  keyval(v$value, v$note)       # key = group label, value = note
}
monReduce <- function(k, val)
{
  keyval(k, mean(val))          # mean of the notes for each key
}
job <- mapreduce(input = demo, map = monMap, reduce = monReduce)
from.dfs(job)
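As a sketch, assuming the rmr2 package is installed, the same job can be tested without a cluster using its local backend; from.dfs returns the key/value pairs, comparable to the aggregate() result above.

```r
# Sketch, assuming rmr2 is installed: run the job on the local backend
# (no Hadoop cluster needed) and inspect the result
library(rmr2)
rmr.options(backend = "local")
demo <- to.dfs(df)
job  <- mapreduce(input = demo, map = monMap, reduce = monReduce)
res  <- from.dfs(job)
res$key  # the group labels
res$val  # the per-group means
```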

Some helpful literature around Map Reduce
http://www.asterdata.com/wp_mapreduce_and_the_data_scientist/
https://class.coursera.org/datasci-001/lecture/71
http://www.information-management.com/ad_includes/welcome_imt.html

In my next post, I will talk about Pig and Hive for preparing datasets before machine learning.




