Skip main navigation

RHadoop libraries rhdfs, rmr2 and plyrmr

Here we explain the basic commands for working with RHadoop, which are provided in libraries rhdfs, rmr2 and plyrmr.
The core of RHadoop consists of rhdfs, rmr2 and plyrmr libraries
© PRACE and Faculty of information studies in Novo Mesto

There are several RHadoop packages that we have to use in order to be able to connect R and Hadoop. Here we give a dense presentation of rhdfs, rmr2 and plyrmr together with available commands. Note that within this MOOC we demonstrate only few of those commands.

rhdfs

The library package rhdfs provides commands for file manipulation in terms of reading, writing and moving files. Namely, R is a programming language that offers data processing. The data themselves are stored in files in the Hadoop filesystem. The following commands are part of the rhdfs package:

  • File Manipulations
  • hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get

  • File Read/Write
  • hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file

  • Directory
  • hdfs.dircreate, hdfs.mkdir

  • Utility
  • hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists

  • Initialization
  • hdfs.init, hdfs.defaults

    rmr2

    The rmr2 package allows the Hadoop MapReduce facility to be used inside the R environment. The package documentation lists the following commands:

  • The big-data object
  • big.data.object

  • Backend-independent file manipulation
  • dfs.empty

  • Equijoins using map-reduce
  • equijoin

  • Read or write `R` objects from or to the file system
  • from.dfs

  • Important Hadoop settings in relation to rmr2
  • hadoop.settings

  • Create, project or concatenate key-value pairs
  • keyval

  • Create combinations of settings for flexible IO
  • make.input.format

  • MapReduce using Hadoop Streaming
  • mapreduce

  • Function to set and get package options
  • rmr.options

  • Sample large data sets
  • rmr.sample

  • Print a variable’s content
  • rmr.str

  • Functions to split a file over several parts or to merge multiple parts into one
  • scatter

  • Set the status and define and increment counters for a Hadoop job
  • status

  • Create map-and-reduce functions from other functions
  • to.map

    plyrmr

    This package aims to provide a wide palette of predefined operations to cover the basic data-manipulation needs. The documentation of the plyrmr package provides the following list of available commands:

  • data manipulation
  • bind.cols (add new columns), where(select rows), select(select columns), rbind, transmute (all of the above plus summaries)

  • summaries
  • transmute, sample, count.cols, quantile.cols, top.k, bottom.k

  • set operations
  • union, intersect, unique, merge

    transmute appears twice because it is a generalization over transform and summarize that allows us to increase or decrease the number of columns or rows, covering the need for multi-row summaries, flattening of data structures, etc.

    © PRACE and Faculty of information studies in Novo Mesto
    This article is from the free online

    Managing Big Data with R and Hadoop

    Created by
    FutureLearn - Learning For Life

    Reach your personal and professional goals

    Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

    Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

    Start Learning now