Introduction to Week 3
This week we turn our attention to RHadoop. We have chosen R for big-data management and analysis because it is widely accepted by the data-science community and has a very active support community. We do not expect proficiency in R, but some experience with it will be very useful.
This week we will:
- first recall how to perform basic data management in R (e.g., loading, storing or creating data);
- repeat or learn how to perform basic data analyses with R (e.g., computing frequencies, mean values and deviations around mean values);
- repeat or learn how to do basic matrix operations in R (e.g., creating a matrix, summing matrices by rows or columns, multiplying matrices);
- learn how to run RHadoop and create, load or store a big data file from/to a distributed file system;
- work through a few examples of big-data analysis using RHadoop: counting the sizes of groups, computing group centroids, computing the largest values in each group, and finding the words with the highest frequencies.
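To give a feel for the operations listed above, here is a minimal sketch in plain base R. It does not use RHadoop or a distributed file system; the data frame, its column names (`group`, `x`, `y`) and the sample sentence are made up purely for illustration. The group-wise summaries at the end show, on a tiny in-memory data set, the kind of results the RHadoop examples aim to compute on big data.

```r
# Small made-up data set; names and values are illustrative assumptions.
df <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  x     = c(1, 4, 3, 10, 6, 5),
  y     = c(2, 8, 4, 20, 10, 6)
)

# Basic data management: store the data to a CSV file and load it back.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)

# Basic analysis: frequencies, mean value and standard deviation.
freq   <- table(df$group)
mean_x <- mean(df$x)
sd_x   <- sd(df$x)

# Basic matrix operations.
A <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
row_sums <- rowSums(A)
col_sums <- colSums(A)
AA <- A %*% A                # matrix product, not element-wise multiplication

# Group-wise summaries (in-memory analogues of the big-data examples).
group_sizes <- table(df$group)                                       # size of each group
centroids   <- aggregate(cbind(x, y) ~ group, data = df, FUN = mean) # mean of x and y per group
group_max   <- tapply(df$x, df$group, max)                           # largest x in each group

# Word frequencies in a short made-up sentence, most frequent first.
words     <- strsplit("big data needs big tools", " ")[[1]]
word_freq <- sort(table(words), decreasing = TRUE)
```

Printing any of the objects (for example `centroids` or `word_freq`) in the R console shows the result; the same questions become the subject of the RHadoop examples once the data no longer fits in memory.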
Methods and tools
We will give hands-on demonstrations of all the data-management and analysis methods. We expect each course member to have installed the virtual machine box and to be running rmr2 within this virtual environment. Participants should therefore copy and paste the examples into RStudio and try to run them. Feedback and suggestions for improvement are welcome.
If you are completely new to R, we suggest that you read the introductory parts of one of the many good R manuals.
© PRACE and University of Ljubljana