Setting up the RHadoop working place

Here we summarize all steps needed to set up the RHadoop working place:

- running the virtual machine and `Hadoop`;
- running `R`, `RStudio` and `RHadoop`.
Dear online students, with the help of this video, I will guide you through the process of setting up your RHadoop working place. I am assuming that (i) you already have the latest version of VirtualBox and (ii) you have successfully downloaded our mint_hadoop virtual machine, where the RHadoop environment is already set up. We will run RHadoop in 5 steps. First, we find VirtualBox on our computer and run it. In VirtualBox we find our virtual machine called mint_hadoop and run it. If you receive any notification about the keyboard or mouse, we suggest that you ignore it. Second, we log in as hduser with the password "hadoop". We see a welcome screen, which we can close.
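
For readers who prefer a terminal on the host computer, the first step can also be sketched with VirtualBox's `VBoxManage` tool. This is only an alternative to clicking in the VirtualBox window, and it assumes that `VBoxManage` is on the host's PATH and that the machine is registered under the name mint_hadoop. The function name `start_vm` is made up for this illustration.

```shell
#!/bin/sh
# Sketch: start the course virtual machine from a host terminal.
# Assumption: VBoxManage (installed with VirtualBox) is on the PATH.

start_vm() {
    vm="${1:-mint_hadoop}"   # machine name used in the course

    if ! command -v VBoxManage >/dev/null 2>&1; then
        echo "VBoxManage not found; use the VirtualBox window instead"
        return 1
    fi

    # "VBoxManage list vms" prints one line per registered machine: "name" {uuid}
    if ! VBoxManage list vms | grep -q "^\"$vm\" "; then
        echo "no registered virtual machine named $vm"
        return 1
    fi

    # --type gui opens the usual window, where we then log in as hduser
    VBoxManage startvm "$vm" --type gui
}

start_vm mint_hadoop || echo "virtual machine was not started"
```

The check against `VBoxManage list vms` is there so that a mistyped machine name gives a short message instead of a longer VirtualBox error.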
In the third step we run Hadoop. First, we open a terminal window.
We run Hadoop with two commands. The first, `start-dfs.sh`, starts the Hadoop Distributed File System: it launches one namenode and the related datanodes, in our case only one datanode. Next we run `start-yarn.sh` to start the YARN resource manager and node managers, on which MapReduce jobs run. Now Hadoop is running. In the fourth step we run R, which we will use to create and submit MapReduce tasks to Hadoop. We decided to use RStudio, which is a free and open-source integrated development environment for R. We start RStudio from the terminal window by running the script `rstudio`. Note that if this script reports some warnings, they are probably related to missing fonts. We ignore them and just press Enter.
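
The two start commands from the third step can be wrapped in a small script, sketched below. It assumes that `start-dfs.sh` and `start-yarn.sh` are on hduser's PATH, as they appear to be in the video. The function name `start_hadoop` and the final `jps` check are additions for illustration and are not part of the video.

```shell
#!/bin/sh
# Sketch of step 3: start HDFS first, then YARN.
# Assumption: Hadoop's start scripts are on hduser's PATH.

start_hadoop() {
    # Make sure both scripts can be found before starting anything
    for cmd in start-dfs.sh start-yarn.sh; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            return 1
        fi
    done

    start-dfs.sh  || return 1   # namenode and datanode(s) of HDFS
    start-yarn.sh || return 1   # YARN resource manager and node manager

    # Optional check, not part of the video: jps comes with the JDK and
    # lists the running Java processes, so the Hadoop daemons should appear.
    if command -v jps >/dev/null 2>&1; then
        jps
    fi
}

start_hadoop || echo "Hadoop was not started"
```

If everything started, the `jps` listing should include entries such as NameNode, DataNode, ResourceManager and NodeManager. Typing the two commands directly at the prompt, as in the video, has the same effect.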
In the last step we set up RStudio for data analysis with RHadoop. We open a new script file and save it to our local folder. At the beginning, we must set the system environment for Hadoop. The lines shown on screen define the system variables. We copy them into the script file, select them and execute them by pressing Ctrl+Enter. Finally, we load the basic RHadoop libraries. We establish our connection to the Hadoop Distributed File System by loading the library rhdfs. To perform a statistical analysis in R with Hadoop MapReduce we also need to load the library rmr2, where the functions for the map and reduce operations are defined. We close this last step by executing `hdfs.init()`.
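
One way to prepare such a script file from the terminal, before opening it in RStudio, is sketched below. The transcript does not reproduce the system-variable lines, so the two paths are placeholders and must be replaced with the values given in the course. `HADOOP_CMD` and `HADOOP_STREAMING` are the variables that rhdfs and rmr2 look for; the course may set further ones. The file name `rhadoop_setup.R` and the function name `write_rhadoop_setup` are made up for this illustration.

```shell
#!/bin/sh
# Sketch: write a starter R script for the RHadoop session.
# The paths inside the script are PLACEHOLDERS, not the values from the video.

write_rhadoop_setup() {
    cat > "$1" <<'EOF'
# System environment for Hadoop (placeholder paths, replace before use)
Sys.setenv(HADOOP_CMD = "/path/to/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/path/to/hadoop-streaming.jar")

library(rhdfs)   # connection to the Hadoop Distributed File System
library(rmr2)    # map and reduce operations
hdfs.init()      # initialise the connection to HDFS
EOF
}

write_rhadoop_setup rhadoop_setup.R
```

After the paths are corrected, the file can be opened in RStudio and its lines executed with Ctrl+Enter, in the order described above.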
Now RHadoop is ready and we can start writing scripts for the big-data analysis.
When we want to close the RHadoop session, we:

- save all the script files and, if needed, also the workspace;
- close RStudio by clicking the close button;
- stop Hadoop by typing `stop-yarn.sh` and `stop-dfs.sh`;
- close the terminal window and the mint_hadoop virtual machine by clicking the close button.
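
The Hadoop part of the shutdown can be sketched as a small script that runs after RStudio has been closed. It assumes that `stop-yarn.sh` and `stop-dfs.sh` are on hduser's PATH; the function name `stop_hadoop` is made up for this illustration.

```shell
#!/bin/sh
# Sketch: stop Hadoop in the order given above, YARN first and HDFS second.
# Assumption: Hadoop's stop scripts are on hduser's PATH.

stop_hadoop() {
    # Make sure both scripts can be found before stopping anything
    for cmd in stop-yarn.sh stop-dfs.sh; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            return 1
        fi
    done

    stop-yarn.sh   # resource manager and node manager
    stop-dfs.sh    # namenode and datanode(s)
}

stop_hadoop || echo "Hadoop was not stopped"
```

This mirrors the start-up in reverse: YARN, which was started last, is stopped first.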
This article is from the free online course Managing Big Data with R and Hadoop.
