Real-Time Processing with HDInsight

In this step, we will look at real-time processing options in Azure HDInsight.

In the previous step, we learned how we can use some of the Apache open-source technologies for batch processing when working in Azure HDInsight. Now let’s look at the options for real-time applications.

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing.

Apache Storm

Apache Storm is a free and open-source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm can be used with any programming language.

Apache Storm is scalable, fault-tolerant, and useful for realtime analytics, online machine learning, continuous computation, distributed Remote Procedure Call (RPC), and extract, transform, load (ETL) workflows.

Apache Storm integrates with commonplace queueing and database technologies. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as required.
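To make the idea of a topology more concrete, here is a minimal sketch in plain Python. It does not use the Apache Storm API; every function name is hypothetical and chosen only for illustration. It assumes a tiny, finite stream of sentences, splits them into words in a first stage, repartitions the words so that the same word always reaches the same task (in Storm terms, a fields grouping), and counts them in a second stage. A real topology would run continuously over an unbounded stream and across many machines.

```python
from collections import Counter, defaultdict
from zlib import crc32


def sentence_source():
    """Stands in for a spout: emits a stream of sentences (finite here so the demo ends)."""
    yield from ["the cat sat", "the dog sat", "the cat ran"]


def split_words(sentences):
    """First processing stage (a bolt in Storm terms): one sentence in, many words out."""
    for sentence in sentences:
        yield from sentence.split()


def partition_by_key(words, num_tasks):
    """Repartition between stages: a stable hash sends each word to a fixed task."""
    partitions = defaultdict(list)
    for word in words:
        partitions[crc32(word.encode()) % num_tasks].append(word)
    return partitions


def count_words(partitions):
    """Second stage: each task counts only the words routed to it."""
    return {task: Counter(words) for task, words in partitions.items()}


def run_topology(num_tasks=2):
    """Wire the stages together and merge the per-task counts for display."""
    per_task = count_words(
        partition_by_key(split_words(sentence_source()), num_tasks)
    )
    totals = Counter()
    for counts in per_task.values():
        totals.update(counts)
    return per_task, totals


per_task, totals = run_topology()
print(totals)
```

Because the routing depends only on the word itself, no word is ever counted by two different tasks, which is why the per-task counts can simply be added together at the end.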

HBase NoSQL datastore

Apache HBase is an open-source, NoSQL database that’s built on Apache Hadoop and modelled after Google BigTable. HBase provides random access and strong consistency for large amounts of data in a schemaless database. The database is organised by column families.

From the user perspective, HBase looks much like a conventional database table: data is stored in rows and columns, and the data within a row is grouped by column family.

Because HBase is schemaless, columns and their data types do not need to be defined before they are used. The open-source code scales linearly to handle petabytes of data on thousands of nodes, and it can rely on the data redundancy, batch processing, and other features provided by distributed applications in the Hadoop environment.
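The data model described above can be illustrated with a small in-memory sketch in plain Python. This is not the HBase client API; `ToyTable` and its methods are hypothetical names used only for illustration. It assumes that column families are fixed when the table is created, while the columns inside a family can be added freely per row, and it stores ordinary Python values where HBase itself stores raw bytes.

```python
from collections import defaultdict


class ToyTable:
    """A toy, in-memory stand-in for a table organised by column families."""

    def __init__(self, column_families):
        # Column families are declared up front; individual columns are not.
        self.column_families = set(column_families)
        # row key -> column family -> column name -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        """Write one cell; any column name is accepted within a known family."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family=None):
        """Random access by row key, optionally narrowed to one column family."""
        row = self.rows.get(row_key, {})
        if family is None:
            return {fam: dict(cols) for fam, cols in row.items()}
        return dict(row.get(family, {}))


table = ToyTable(["profile", "activity"])
table.put("user1", "profile", "name", "Ada")
table.put("user1", "activity", "last_login", "2024-01-01")
# A different row can use a column that no other row has.
table.put("user2", "profile", "nickname", "Bee")

print(table.get("user1", "profile"))
print(table.get("user2"))
```

Note that the two rows do not share the same set of columns, which is the practical meaning of schemaless here: only the column families are agreed in advance.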

Note: If you’d like to know more about open-source technology, take a look at the articles in the See also section below.

In the next activity, you’ll have a chance to practically engage with the concepts discussed in the Processing Big Data CloudSwyft Hands-On Lab. Once you’ve completed the lab, gauge your understanding of the content in the Knowledge Check that follows.

This article is from the free online course Microsoft Future Ready: Fundamentals of Big Data

Created by
FutureLearn - Learning For Life
