Introduction to Week 3

There are many software tools that can be used for data management and analysis. We have chosen R and RStudio because of our very good experience with them.
R is an open source programming language and software environment for statistical computing. It is widely used among statisticians and data miners for developing statistical software and data analysis.
© PRACE and University of Ljubljana

Introduction

This week we turn our attention to R and RHadoop. We have chosen R for big-data management and analysis since it is widely accepted by the data-science community and has a very active support community. We do not expect proficiency in R, but some experience with this tool will be very useful.

Goals

This week we will:

  • first recall how to perform basic data management in R (e.g., loading, storing or creating the data);
  • repeat or learn how to perform basic data analyses with R (e.g., computing frequencies, mean values and deviations around mean values);
  • repeat or learn how to do basic matrix operations in R (e.g., creating a matrix, summing matrices by rows or columns, multiplying matrices);
  • learn how to run RHadoop and create, load or store a big data file from/to a distributed file system;
  • work through a few examples of big-data analysis using RHadoop: counting the sizes of the groups, computing the group centroids, computing the largest values in each group, finding the words with the highest frequencies.
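
As a rough preview of the base-R goals listed above, the following minimal sketch creates, stores and reloads a small data set, computes frequencies, a mean and a standard deviation, and performs a few matrix operations. The data frame `scores`, its columns `group` and `value`, and all other variable names are hypothetical and chosen only for illustration; they are not taken from the course material. The RHadoop goals are not covered here.

```r
# Create a small data frame (hypothetical example data).
scores <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(2, 4, 6, 8, 10, 12)
)

# Store the data as CSV in a temporary file and load it back.
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)
loaded <- read.csv(path)

# Basic analyses: frequencies, mean value and standard deviation.
group_sizes <- table(loaded$group)
value_mean <- mean(loaded$value)
value_sd <- sd(loaded$value)

# Mean value per group (a one-dimensional group centroid).
group_means <- tapply(loaded$value, loaded$group, mean)

# Basic matrix operations.
m <- matrix(1:6, nrow = 2)  # 2 x 3 matrix, filled column by column
row_sums <- rowSums(m)      # sums over each row
col_sums <- colSums(m)      # sums over each column
product <- m %*% t(m)       # 2 x 2 matrix product of m and its transpose
```

Printing any of these objects in the RStudio console (for example `group_sizes` or `product`) shows the result. The same group-wise computations are the ones that will later be expressed with RHadoop on distributed data.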

Methods and tools

We will give hands-on presentations of all the data-management and analysis methods. We expect that each course member has installed the virtual machine and is running Hadoop and RStudio with the RHadoop packages rhdfs and rmr2 within this virtual environment. Course members should therefore copy and paste the examples into RStudio and try to run them. Any feedback and suggestions for improvements are welcome.

Additional material

If you are completely new to R, we suggest that you read the introductory parts of one of the many good R manuals.

This article is from the free online course Managing Big Data with R and Hadoop.

Created by
FutureLearn - Learning For Life
