• Partnership for Advanced Computing in Europe (PRACE)

Managing Big Data with R and Hadoop

Learn how to manage and analyse big data using the R programming language and Hadoop programming framework.

13,356 enrolled on this course

  • Duration

    5 weeks
  • Weekly study

    6 hours

You will learn how to use the RHadoop tools to manage and analyse big data.

This course will give you access to a virtual environment with installations of Hadoop, R and RStudio so that you can get hands-on experience with big data management. Several unique examples from statistical learning, together with the related R code for map-reduce operations, will be available for testing and learning.

Those with basic knowledge of statistical learning and R will better understand the underlying methods and how to run them in parallel using map-reduce functions and Hadoop data storage. At the end of the course you will get access to RHadoop on a supercomputer at the University of Ljubljana.

Nearly every historical period may be said to have had sources of data that were considered big for that time. Books, documents, drawings, maps and paintings are examples of such data. Yet it is only today that we have to deal with really big data. Luckily, more and more data is digital, although it is expressed in different formats. Large-scale scientific instruments, social network platforms, cloud solutions and digital cultural heritage are only a few examples of sources of huge amounts of text, photo, video and audio material that are considered big data.

But the questions related to data have not changed much: how to store and maintain it, how to understand it, and how to learn from it for an improved response in the future. These issues necessarily involve the use of high performance computers. Distributed storage and parallel computing need to be considered to avoid loss of data and to make computations efficient.

Join us and cope with big data using R and RHadoop.

Syllabus

  • Week 1

    Welcome to BIG DATA

    • Welcome to the course!

      Here you will find out what we have in store for you over the next few weeks and how you can use the platform to make the most out of your learning experience.

    • Setting up the software

      We need a playground for the following weeks! For that we need to install a virtual machine running Linux. You need a modest host machine with 15GB of free disk space and 2GB of RAM to start with.

    • Hands on Linux with RHadoop

      There are many pieces inside the machine. Before we can orchestrate them together, we need to check that they work and that we understand how they work.

  • Week 2

    Working with Hadoop

    • Data Management

      Many tools and languages can be used to manage data.

    • Using AWK with Hadoop

      Sometimes users need the AWK language to manipulate files. Here we present an example of how one can use AWK with Hadoop.

    • Map and reduce

      In this section we explain the concept of map and reduce using a simple example: finding the maximal value in a given file.

  • Week 3

    First steps in R and RHadoop

    • Basic data management with R

      Here we present basic R scripts for data management.

    • RHadoop

      In this activity we demonstrate how to use RHadoop to perform map-reduce operations within R.

    • Four Big Data examples: basic data operations with RHadoop

      Here we demonstrate how to perform basic operations on (big) data using RHadoop.

    • Summary

      Here you can validate your understanding of the topics presented so far.

  • Week 4

    Statistical learning with RHadoop: clustering

    • Introduction to statistical learning

      Statistical learning denotes methods aiming to (i) understand the inner structure of the observed data and to (ii) predict missing data. We group those methods into supervised and unsupervised learning.

    • Clustering analysis

      In this part of the MOOC you will learn the central goal of clustering analysis and how to reach it with non-hierarchical clustering using the k-means algorithm. We explain the main steps of k-means and its RHadoop implementation.

    • Summary

      Here you will find a test and the final discussion for week 4.

  • Week 5

    Statistical learning with RHadoop: regression and classification

    • Linear regression

      In this part of the MOOC we explain the main idea of linear regression. After explaining the computational issues and the meaning of the results, we demonstrate how to perform this analysis for big data using RHadoop.

    • Classification

      In this part of the MOOC we study the classification problem and how to approach it with linear discriminant analysis (LDA). We explain the main idea of LDA, its algorithmic implementation and how to perform it with R and RHadoop.

    • Summary

      Here you will find the test for week 5 and the wrap-up discussion.

    • Really big data examples

      Really big data example (by Marcin K) and Big electricity energy data (by Khyati S).
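
To give a flavour of the map and reduce idea from Week 2 (finding the maximal value in a file), here is a minimal sketch in plain Python. It is not the course's RHadoop code and does not use Hadoop: the function names `map_phase`, `reduce_phase` and `run_job`, the chunk size and the sample data are all assumptions made for illustration. Each chunk of lines stands in for a block of a distributed file.

```python
def map_phase(chunk):
    """Map: emit one (key, value) pair holding the local maximum of a chunk."""
    numbers = [float(line) for line in chunk if line.strip()]
    return [("max", max(numbers))] if numbers else []


def reduce_phase(key, values):
    """Reduce: combine all local maxima that share a key into a global maximum."""
    return key, max(values)


def run_job(lines, chunk_size=3):
    """Hypothetical single-process driver imitating the map, shuffle, reduce flow."""
    # Split the input into chunks, as a distributed file system would split a file.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Map every chunk independently; on a cluster these calls would run in parallel.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    # Shuffle: group the emitted values by key.
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Reduce each group to a single result.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["3", "17", "8", "42", "5", "", "23"]  # made-up file contents, one number per line
result = run_job(lines)
print(result)  # {'max': 42.0}
```

Because each chunk only reports its own maximum, very little data has to move between the map and reduce steps, which is what makes this pattern attractive for large files.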

What will you achieve?

By the end of the course, you’ll be able to...

  • Explore the basic functionality of Apache Hadoop and of RHadoop
  • Experiment with how to achieve the performance of modern supercomputing
  • Experiment with regression, clustering and classification using RHadoop
  • Investigate the basic functionality of the Bash terminal window
  • Apply knowledge of statistical learning to instances of data provided by the educators
  • Demonstrate how to do big data management with RHadoop on a real supercomputer provided by the University of Ljubljana

Who is the course for?

This course is designed for people who are interested in data science, computational statistics and machine learning and who have basic experience with them. It will also be useful for advanced undergraduate students and first-year PhD students in data analysis, statistics or bioinformatics who wish to understand how to manage big data with Hadoop using the R programming language.

We expect that learners will also have basic experience with Linux and Bash, and working experience with R and matrix operations. They should also be able to download and run a virtual machine.

What software or tools do you need?

All software needed to actively participate in the course is provided within the virtual machine that learners are expected to download and run on their local machine. No extra software is needed. You will need a modest local machine with 15GB of free disk space and 2GB of RAM.

Who will you learn with?

I am an active researcher in mathematical optimization, a field with many applications in data science in which HPC is an indispensable tool.

Biljana Mileva Boshkoska is an assistant professor in computer science. Her interests include decision support systems, data mining and working with big data.

Leon Kos is a veteran of 25+ years of using the Linux desktop on a daily basis to build digital relationships for research, teaching, and getting the job done by programming.

Who developed the course?

Partnership for Advanced Computing in Europe (PRACE)

The Partnership for Advanced Computing in Europe (PRACE) is an international non-profit association with its seat in Brussels.

Supporters

Supported by the University of Ljubljana.

Learning on FutureLearn

Your learning, your rules

  • Courses are split into weeks, activities, and steps, but you can complete them as quickly or slowly as you like
  • Learn through a mix of bite-sized videos, long- and short-form articles, audio, and practical activities
  • Stay motivated by using the Progress page to keep track of your step completion and assessment scores

Join a global classroom

  • Experience the power of social learning, and get inspired by an international network of learners
  • Share ideas with your peers and course educators on every step of the course
  • Join the conversation by reading, @ing, liking, bookmarking, and replying to comments from others

Map your progress

  • As you work through the course, use notifications and the Progress page to guide your learning
  • Whenever you’re ready, mark each step as complete; you’re in control
  • Complete 90% of course steps and all of the assessments to earn your certificate

You can use the hashtag #FLRhadoop to talk about this course on social media.