Organising Data

Both classification and regression are problems that require a single output, either a label or a prediction. However, you can also use AI and machine learning to analyse and then organise huge sets of data, to find groups and connections that humans might not identify.

Understanding Large Data Sets

The world is full of data – so much so that it can be difficult to know what to do with it. One solution to this problem is knowledge organisation. This involves cataloguing and categorising data to better understand its structure, either to help decision making on future endeavours or to make data navigation more efficient.

The process is called knowledge organisation because it groups similar data together to better show the relationships in the data set (knowledge).

How Data is Organised

Every company will use knowledge organisation differently, because the data they hold will be different. There are, however, two main types of knowledge organisation: association and clustering.

Data Association

Association will find links between data points, specifically between features of data points. A feature is just a single variable in the data, whereas a data point is a collection of features. An association algorithm looking at medical records (data set) might spot that patients (data point) with kidney problems (feature) are also likely to have symptoms of malnourishment (feature); these relationships are called association rules.

The association rules create groups in the data that are linked by common features. In association, one data point can belong to many groups.
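
To make the idea of an association rule more concrete, here is a minimal sketch in Python. The patient records are invented for illustration, and the two measures it calculates, support and confidence, are common ways of scoring a rule that this article does not itself define, so treat them as an assumption about how a rule might be judged.

```python
# Hypothetical patient records: each data point is a set of features.
records = [
    {"kidney problems", "malnourishment"},
    {"kidney problems", "malnourishment", "fatigue"},
    {"kidney problems"},
    {"fatigue"},
    {"malnourishment", "fatigue"},
]


def support(features, data):
    """Fraction of data points that contain every feature in `features`."""
    return sum(features <= point for point in data) / len(data)


def confidence(antecedent, consequent, data):
    """Of the points with `antecedent`, the fraction that also have `consequent`."""
    return support(antecedent | consequent, data) / support(antecedent, data)


# Score the rule "kidney problems -> malnourishment".
rule_support = support({"kidney problems", "malnourishment"}, records)
rule_confidence = confidence({"kidney problems"}, {"malnourishment"}, records)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Notice that the second record contains three features, so it could be counted towards several different rules at once. This is what it means for one data point to belong to many groups.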

Data Clustering

Clustering algorithms will group similar data points together to form “clusters”. These algorithms examine the data points as a whole and will group them based on how similar they are, by comparing multiple features of these points. This splits your data into more understandable chunks. Examining the size of the clusters can tell you a lot about the makeup of your data.

A graph showing two clusters, one consisting of yellow points and one consisting of green points. The green points are spread over the upper left side of the graph. The yellow points are spread over the lower right side of the graph.
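
The article does not name a particular clustering algorithm, so as an illustration here is a small sketch of k-means, one widely used approach. The data points and the choice of starting centres are made up for this example.

```python
import math

# Hypothetical data points, each with two features (x, y).
points = [(1.0, 8.0), (2.0, 9.0), (1.5, 7.5), (8.0, 2.0), (9.0, 1.0), (8.5, 1.5)]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(cluster):
    """Feature-by-feature average of a non-empty list of points."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def k_means(data, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in data]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(data, labels) if label == i]
            # Keep the old centroid if no points were assigned to it.
            new_centroids.append(mean(members) if members else old)
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in data]
    return labels, centroids


# Start from one point in each corner of the data (an arbitrary choice).
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)
```

Each point ends up with exactly one label, and counting how many points share a label gives you the size of each cluster.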

Clusters can either be exclusive or overlapping. In exclusive clusters, a data point can only belong to one cluster. In overlapping systems, data points can belong to two or more clusters. In this case, the algorithm will give percentage values for these points, representing how closely they fit each cluster.

Two overlapping clusters, one coloured green and another coloured yellow. The clusters overlap in the middle and some points are encapsulated by both, showing they belong to both clusters.
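
As a rough sketch of how those percentage values could be produced, the example below weights each cluster by the inverse of the point's distance to the cluster centre. This is a simplification chosen for illustration, not the method of any specific algorithm, and the cluster centres are hypothetical.

```python
import math

# Hypothetical cluster centres, named after the colours in the diagram.
centres = {"green": (2.0, 8.0), "yellow": (8.0, 2.0)}


def memberships(point, centres):
    """Percentage fit to each cluster, weighted by inverse distance to its centre."""
    weights = {}
    for name, centre in centres.items():
        distance = math.dist(point, centre)
        if distance == 0:
            # The point sits exactly on a centre, so it fits that cluster fully.
            return {n: (100.0 if n == name else 0.0) for n in centres}
        weights[name] = 1.0 / distance
    total = sum(weights.values())
    return {name: 100.0 * weight / total for name, weight in weights.items()}


middle = memberships((5.0, 5.0), centres)  # halfway between the two centres
edge = memberships((3.0, 7.0), centres)    # much closer to the green centre
print(middle)
print(edge)
```

A point halfway between the two centres is split evenly, while a point near the green centre gets a much higher green percentage.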

To show you clustering in practice, I am going to use an example from the online video streaming service Netflix.

Data Clustering Example – Netflix

Netflix is an online streaming site, serving TV shows and movies to over 195 million users in 190 countries. It has huge amounts of data about the content it provides and the users who watch it. It also invests heavily in AI and ML to make the most of all that data.

Netflix uses ML in almost all facets of its business, but behind it all is a well-organised knowledge base.

Overlapping Clusters – Taste Communities

Viewers are grouped together in overlapping clusters that Netflix calls taste communities based around the type of content they like to watch.

Netflix uses thousands of labels to mark its content. These include not only the format of the show, such as “TV sitcom”, “film”, and “documentary”, but also more details about the content, including its genre (such as “sci-fi”, “romance” or “action”), the actors involved, and specific plot points. Netflix also has huge amounts of data about the users themselves, from the content that they watch to how long they watch for, and even the type of device they use. The company uses all of this data as inputs for its knowledge organiser algorithm, which then groups users into taste communities with other users who share the same watching habits.
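
Netflix's actual algorithm is not public, so the following is only a toy sketch of the general idea: turn each user's viewing history into a profile of content labels, then place the user in every community whose label appears often enough. The user names, histories, and the 50% threshold are all invented for this example.

```python
from collections import Counter

# Hypothetical viewing histories: each watched title is a list of content labels.
histories = {
    "user_a": [["sci-fi", "film"], ["sci-fi", "TV sitcom"], ["romance", "film"]],
    "user_b": [["documentary"], ["documentary", "film"]],
}


def taste_profile(history):
    """Fraction of a user's watched titles that carry each label."""
    counts = Counter(label for title in history for label in set(title))
    return {label: n / len(history) for label, n in counts.items()}


def communities(history, threshold=0.5):
    """Labels found on at least `threshold` of the titles; a user may get several."""
    profile = taste_profile(history)
    return sorted(label for label, share in profile.items() if share >= threshold)


for user, history in histories.items():
    print(user, communities(history))
```

Because a user can pass the threshold for more than one label, the resulting groups overlap, just like the taste communities described above.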

Netflix has created around 2000 different taste communities, meaning it can target these groups specifically with its content marketing. In some cases, this has led to criticism that Netflix tailors its marketing in a way that misleads viewers about the content of a show.

What do they do with that knowledge?

Netflix takes advantage of the knowledge organisation of its users to personalise their experience using the app.

Recommendations: Netflix estimates that 75 to 80% of its viewing time comes from in-app recommendations. By using your personal taste community affiliations, Netflix is able to show you content that others who share your tastes have enjoyed.

Artwork: Netflix also personalises the artwork it uses for its shows. Candidate images are taken from video frames in the content, and the options are shown to millions of users. Once Netflix knows which artwork generates the most clicks within your taste community, it stops experimenting and uses only the successful thumbnails.

Content creation: Netflix has commissioned content based on its taste communities. If it can see a large community that is interested in romantic comedies and another that loves fantasy movies, it can commission a new show to serve both audiences.
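
The first two of these personalisations can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not Netflix's real system: the members, titles, and click counts are hypothetical, recommendations are ranked by how many community peers watched a title, and artwork is chosen by click rate.

```python
from collections import Counter

# Hypothetical data for one taste community.
members = ["ana", "ben", "cho"]
watched = {
    "ana": {"Title A", "Title B"},
    "ben": {"Title B", "Title C"},
    "cho": {"Title B", "Title C", "Title D"},
}

# Hypothetical artwork experiment: artwork name -> (times shown, times clicked).
artwork_results = {
    "artwork_1": (1000, 40),
    "artwork_2": (1000, 65),
    "artwork_3": (500, 30),
}


def recommend(user, top_n=2):
    """Rank unseen titles by how many community peers have watched them."""
    counts = Counter()
    for peer in members:
        if peer != user:
            counts.update(watched[peer] - watched[user])
    # Sort by count (highest first), then by title so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]


def best_artwork(results):
    """Artwork with the highest share of clicks per time shown."""
    def click_rate(name):
        shown, clicked = results[name]
        return clicked / shown if shown else 0.0
    return max(results, key=click_rate)


recommendations = recommend("ana")
winner = best_artwork(artwork_results)
print(recommendations, winner)
```

Note that the artwork is compared by click rate rather than by raw clicks, so an image that was shown less often is not unfairly penalised.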

Recommendation algorithms

Now, I’d like you to think about other recommendation services.

  • What content recommenders similar to Netflix do you use most often?
  • What features of their content do you think they use to group their users?

This article is from the free online course Introduction to Machine Learning and AI. Created by FutureLearn.
