Real-Time Processing with HDInsight

In this step, we will look at real-time processing options in Azure HDInsight.

In the previous step, we learned how batch processing can use some of the Apache open-source technologies available in Azure HDInsight. Now let’s look at real-time applications.

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
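To make the idea of an event stream more concrete, here is a minimal in-memory sketch of how a topic can be modelled as a set of append-only partitions that consumers read by offset. It is a simplified illustration only, not the Kafka client API: the names TopicLog, Consumer, append, read and poll are hypothetical, and the CRC32-based partition choice is an assumption made for the example rather than Kafka's actual partitioner.

```python
from collections import defaultdict
from zlib import crc32


class TopicLog:
    """Toy stand-in for a partitioned, append-only topic (hypothetical class)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Events with the same key always land in the same partition,
        # so their relative order is preserved.
        index = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

    def read(self, partition, offset):
        # Reading never removes events; it returns everything from the offset on.
        return self.partitions[partition][offset:]


class Consumer:
    """Tracks one read position (offset) per partition (hypothetical class)."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self):
        events = []
        for partition in range(len(self.log.partitions)):
            batch = self.log.read(partition, self.offsets[partition])
            self.offsets[partition] += len(batch)
            events.extend(batch)
        return events
```

Because the log keeps events after they are read, a second Consumer created later would start from offset 0 and see the full history, which is one reason this model suits both streaming analytics and data integration.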

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing.

Apache Storm

Apache Storm is a free and open-source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm can be used with any programming language.

Apache Storm is scalable, fault-tolerant, and useful for realtime analytics, online machine learning, continuous computation, distributed Remote Procedure Call (RPC), and extract, transform, load (ETL) workflows.

Apache Storm integrates with commonplace queueing and database technologies. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as required.
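The shape of a topology can be sketched in plain Python as a word count: a source stage emits sentences, a processing stage splits them into words, and the stream is repartitioned by word before counting. This is a single-process illustration under stated assumptions, not Storm's API. The function names are hypothetical, the stage names borrow Storm's "spout" and "bolt" terminology, and the CRC32 hash stands in for whatever grouping a real cluster would apply.

```python
from collections import Counter
from zlib import crc32


def sentence_spout(sentences):
    """Source stage: a finite list here, an unbounded stream in practice."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Processing stage: emit one item per word in each sentence."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def group_by_word(stream, num_tasks):
    """Repartition so that the same word always reaches the same counting task."""
    for word in stream:
        yield crc32(word.encode("utf-8")) % num_tasks, word


def run_topology(sentences, num_tasks=2):
    # One Counter per counting task; each task only sees its own partition.
    tasks = [Counter() for _ in range(num_tasks)]
    words = split_bolt(sentence_spout(sentences))
    for task_id, word in group_by_word(words, num_tasks):
        tasks[task_id][word] += 1
    return tasks
```

Repartitioning by word is what lets the counting stage run as several independent tasks: no word is ever split across two counters, so the per-task counts can simply be added together.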

HBase NoSQL datastore

Apache HBase is an open-source NoSQL database that’s built on Apache Hadoop and modelled after Google Bigtable. HBase provides random access and strong consistency for large amounts of data in a schemaless database. The database is organised by column families.

From the user's perspective, HBase is similar to a database. Data is stored in the rows and columns of a table, and data within a row is grouped by column family. Because HBase is schemaless, columns and their data types do not need to be defined before they are used.

The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.
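The data model described above can be pictured as nested maps: row key, then column family, then column, then value. The sketch below is an in-memory illustration only, not the HBase client API. ToyTable, put and get are hypothetical names, and the example assumes that column families are fixed when the table is created while individual columns appear on first write.

```python
from collections import defaultdict


class ToyTable:
    """In-memory picture of rows grouped by column family (hypothetical class)."""

    def __init__(self, column_families):
        # Assumption: column families are declared up front...
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # ...but a column needs no prior definition and exists once written.
        self.rows[row_key][family][column] = value

    def get(self, row_key, family=None):
        row = self.rows.get(row_key, {})
        if family is None:
            return {name: dict(columns) for name, columns in row.items()}
        return dict(row.get(family, {}))


table = ToyTable(["info", "metrics"])
table.put("user1", "info", "name", b"Ada")
table.put("user2", "info", "email", b"ada@example.com")  # a column user1 lacks
table.put("user1", "metrics", "logins", b"3")
```

Here the two rows hold different columns within the same family, and every value is stored as raw bytes, which mirrors the point that neither columns nor data types are declared in advance.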

Note: If you’d like to know more about open-source technology, take a look at the articles in the See also section below.

In the next activity, you’ll have a chance to practically engage with the concepts discussed in the Processing Big Data CloudSwyft Hands-On Lab. Once you’ve completed the lab, gauge your understanding of the content in the Knowledge Check that follows.

This article is from the free online course Microsoft Future Ready: Fundamentals of Big Data.

Created by
FutureLearn - Learning For Life
