
## Partnership for Advanced Computing in Europe (PRACE)

R is an open source programming language and software environment for statistical computing. It is widely used among statisticians and data miners for developing statistical software and data analysis.

# Introduction

This week we turn our focus to R and RHadoop. We have chosen R for big data management and analysis because it is widely accepted in the data science community and has a very active support community. We do not expect proficiency in R, but some experience with this tool is very welcome.

# Goals

This week the learner will:

• recall how to perform basic data management in R (e.g., loading, storing or creating data);
• repeat or learn how to perform basic data analysis with R (e.g., computing frequencies, mean values and deviations around mean values);
• repeat or learn how to do basic matrix operations in R (e.g., creating a matrix, summing matrices by rows or columns, multiplying matrices, etc.);
• learn how to run RHadoop and create, load or store a big data file from/to the distributed file system;
• learn how to perform a few important tasks related to big data analysis using RHadoop: finding the words with the highest frequencies, computing group centroids, and computing the largest values in each group.
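To give a feeling for the first three goals, here is a minimal sketch in base R. The toy data frame `scores`, its column names, and the temporary file are our own illustrative assumptions and are not part of the course materials; the RHadoop tasks are not shown here.

```r
# Hypothetical toy data, invented for illustration only
scores <- data.frame(
  group = c("a", "b", "a", "b", "a"),
  value = c(2, 4, 6, 8, 10)
)

# Data management: store the data to a CSV file and load it back
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)
loaded <- read.csv(path)

# Basic analysis: frequencies, mean value and standard deviation
freq <- table(loaded$group)   # counts of rows per group
avg  <- mean(loaded$value)    # arithmetic mean
dev  <- sd(loaded$value)      # sample standard deviation

# Matrix operations: create a 2 x 3 matrix (filled column by column),
# sum it by rows and by columns, and multiply it by its transpose
m          <- matrix(1:6, nrow = 2)
row_totals <- rowSums(m)
col_totals <- colSums(m)
gram       <- m %*% t(m)      # 2 x 2 matrix product

print(freq)
print(c(mean = avg, sd = dev))
print(gram)
```

You can paste this into the RStudio console and inspect each object; the same base R functions (`table`, `mean`, `sd`, `rowSums`, `colSums`, `%*%`) are the ones you will most likely meet again during the week.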

# Methods

We try to give hands-on presentations of all data management and analysis methods. We expect that each participant has installed the virtual machine and is running Hadoop and RStudio within this virtual environment. Participants should therefore copy and paste the examples into their local environment and try to run them. Feedback and suggestions for improvement are welcome.

# Tools

Based on our experience with students, we have realised that R becomes much more popular when it is combined with an appropriate user interface. We have found RStudio to be a very good candidate, so we have included it in our Virtual Machine. All R operations in this MOOC are performed through RStudio.

If you are completely new to R, we suggest that you read the introductory parts of one of the many good R manuals.