## Partnership for Advanced Computing in Europe (PRACE)

R is an open-source programming language and software environment for statistical computing. It is widely used by statisticians and data miners for developing statistical software and for data analysis.

# Introduction

This week we turn our attention to R and RHadoop. We have chosen R for big-data management and analysis since it is widely accepted by the data-science community and has a very active support community. We do not expect proficiency in R, but some experience with this tool will be very useful.

# Goals

This week we will:

• first recall how to perform basic data management in R (e.g., loading, storing or creating the data);
• repeat or learn how to perform basic data analyses with R (e.g., computing frequencies, mean values and deviations around mean values);
• repeat or learn how to do basic matrix operations in R (e.g., creating a matrix, summing matrices by rows or columns, multiplication of matrices, etc.);
• learn how to run RHadoop and create, load or store a big data file from/to a distributed file system;
• work through a few examples of big-data analysis using RHadoop: counting the sizes of the groups, computing the group centroids, computing the largest values in each group, and finding the words with the highest frequencies.
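As a preview of the base-R part of these goals, here is a minimal sketch of the kind of operations we will revisit. The data frame `df`, its columns `group` and `value`, and the matrix `A` are made up for illustration only; they are not the course data sets. The RHadoop steps are left out because they need a running Hadoop installation.

```r
# Toy data; names and values are hypothetical, chosen only for illustration.
df <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(2, 4, 6, 8, 10, 12)
)

# Basic data management: store the data in a CSV file and load it back.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)

# Basic data analysis: frequencies, mean value and standard deviation.
freq <- table(df2$group)   # number of rows in each group
m <- mean(df2$value)       # arithmetic mean of the values
s <- sd(df2$value)         # sample standard deviation around the mean

# Basic matrix operations: create a matrix, sum by rows and columns, multiply.
A <- matrix(1:6, nrow = 2) # 2 x 3 matrix, filled column by column
row_sums <- rowSums(A)
col_sums <- colSums(A)
P <- A %*% t(A)            # 2 x 2 matrix product of A with its transpose

print(freq)
print(c(mean = m, sd = s))
print(P)
```

Note that `%*%` is the matrix product, whereas `*` multiplies matrices element by element.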

# Methods and tools

We will give hands-on demonstrations of all the data-management and analysis methods. We expect that each course member has installed the virtual machine box and is running Hadoop and RStudio with the RHadoop packages rhdfs and rmr2 within this virtual environment. You should therefore copy and paste the examples into RStudio and try to run them. Any feedback and suggestions for improvements are welcome.

If you are completely new to R, we suggest that you read the introductory parts of one of the many good R manuals.