Introduction to Week 3

There are many software tools that can be used for data management and analysis. We have chosen R and RStudio because of our very good experience with them.
R is an open source programming language and software environment for statistical computing. It is widely used among statisticians and data miners for developing statistical software and data analysis.
© PRACE and University of Ljubljana

Introduction

This week we turn our attention to R and RHadoop. We have chosen R for big-data management and analysis since it is widely accepted by the data-science community and has a very active support community. We do not expect proficiency in R, but some experience with this tool will be very useful.

Goals

This week we will:

  • first recall how to perform basic data management in R (e.g., loading, storing or creating the data);
  • repeat or learn how to perform basic data analyses with R (e.g., computing frequencies, mean values and deviations around mean values);
  • repeat or learn how to do basic matrix operations in R (e.g., creating a matrix, summing matrices by rows or columns, multiplication of matrices, etc.);
  • learn how to run RHadoop and how to create, load or store a big data file from/to a distributed file system;
  • work through a few examples of big-data analysis using RHadoop: counting the sizes of the groups, computing the group centroids, computing the largest values in each group, and finding the words with the highest frequencies.
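
As a quick refresher on the kind of operations listed above, here is a minimal sketch in base R. The data frame `df`, its columns and the sample sentence are made up for illustration and are not part of the course material. The last part computes group sizes, centroids, maxima and word frequencies in plain R only, to show what the later RHadoop examples are meant to produce.

```r
# --- Basic data management: create, store and load a small data set ---
df <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),   # hypothetical group labels
  x     = c(1.0, 2.5, 3.0, 4.5, 0.5, 2.0),
  y     = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)
csv_file <- tempfile(fileext = ".csv")       # temporary file, illustration only
write.csv(df, csv_file, row.names = FALSE)   # store
df2 <- read.csv(csv_file, stringsAsFactors = FALSE)  # load back

# --- Basic analysis: frequencies, mean and deviation around the mean ---
group_sizes <- table(df2$group)   # frequency of each group label
mean_x      <- mean(df2$x)        # arithmetic mean of x
sd_x        <- sd(df2$x)          # sample standard deviation of x

# --- Basic matrix operations ---
A <- matrix(1:6, nrow = 2)            # 2 x 3 matrix, filled column by column
B <- matrix(1, nrow = 2, ncol = 3)    # 2 x 3 matrix of ones
S <- A + B                            # element-wise sum
row_sums <- rowSums(A)                # sums over each row
col_sums <- colSums(A)                # sums over each column
P <- A %*% t(A)                       # matrix product, 2 x 2

# --- Group-wise summaries, here in plain R rather than RHadoop ---
centroids <- aggregate(cbind(x, y) ~ group, data = df2, FUN = mean)  # group means
group_max <- tapply(df2$x, df2$group, max)                           # largest x per group

sentence  <- "big data with R and big data with Hadoop"  # made-up text
words     <- strsplit(tolower(sentence), " ")[[1]]
word_freq <- sort(table(words), decreasing = TRUE)        # most frequent words first
```

Printing any of these objects in the RStudio console (for example `centroids` or `word_freq`) shows the result directly.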

Methods and tools

We will give hands-on demonstrations of all the data-management and analysis methods. We expect each course member to have installed the virtual machine and to be running Hadoop and RStudio, with the RHadoop packages rhdfs and rmr2, within this virtual environment. Course members should then copy and paste the examples into RStudio and try to run them. Any feedback and suggestions for improvement are welcome.
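
To check that rmr2 is usable, a first test might look like the following sketch. It counts group sizes with a small MapReduce job. The toy vector `group` is made up, and the sketch assumes the rmr2 "local" backend, which runs the job inside R without a Hadoop cluster; on the course virtual machine the default "hadoop" backend would be used instead. The rmr2 part is skipped if the package is not installed.

```r
# Hypothetical toy data: one group label per observation.
group <- c("a", "b", "a", "c", "b", "a")

# Reference answer computed in plain R, for comparison.
expected_sizes <- table(group)

# The MapReduce part runs only if rmr2 is installed.
rmr2_sizes <- NULL
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")   # assumption: local backend, no cluster needed

  # Write key-value pairs: key = group label, value = 1 per observation.
  input <- to.dfs(keyval(group, rep(1, length(group))))

  result <- mapreduce(
    input  = input,
    map    = function(k, v) keyval(k, v),        # pass the pairs through unchanged
    reduce = function(k, vv) keyval(k, sum(vv))  # add up the ones for each key
  )

  out <- from.dfs(result)                        # read the result back into R
  rmr2_sizes <- setNames(values(out), keys(out)) # named vector: label -> count
}
```

If rmr2 is available, `rmr2_sizes` should agree with `expected_sizes`, although the order of the groups may differ.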

Additional material

If you are completely new to R, we suggest that you read the introductory parts of one of the many good R manuals.

This article is from the free online course Managing Big Data with R and Hadoop.
