## Want to keep learning?

This content is taken from the University of Leeds & Institute of Coding's online course, Evidence and Data Collection for Problem Solving. Join the course to learn more.
1.14

# How can datasets be combined?

In the last step we looked at a small extract of the data from Aisha’s restaurant. The data table showed one day’s worth of orders, for 1st January, 2019.

That small extract of data told us what was being ordered, and when. We got an overview of the kinds of dishes on Aisha’s menu, and how much they were selling for. We also got an idea of the sizes of orders that people place, by looking at the quantity of each dish, and the number of rows in each customer order.

But there were no real people in the data set - customers were just identified by a number. We couldn’t tell who ordered each order, where they lived, whether the order was delivered on time, or if there were any complaints.

The data inside a business can be quite complex, and it might take a lot of different data sets and sources to properly ‘model’ a business. For a business like Aisha’s, that might include datasets about customers, orders, dishes, ingredients, inventory, staff, advertising, deliveries and complaints. That data often lives in different systems, so often datasets need to be combined to get a full picture of the business.

Suppose Aisha wanted to know who her most frequent customers are, in order to send them a special offer through the post. We’d need to look at two different data sets: the order data table, and a data table that cross references the customer ID with names, addresses and contact details. In database software this kind of question can be answered with a ‘Join’ - a command that selects records using matching values in two data tables to answer queries across them both. The customer ID is the unique ‘key value’ that allows us to index the sets of data against each other.

It is possible to combine data from different sources if you can find a reliable way of matching values between them. For example, Aisha’s restaurant data includes her stock management spreadsheet for ingredients. She could look at her stock levels, the average daily sales for each dish, and combine that data with the ingredients used in her recipes to predict when she would need to reorder her halloumi so she doesn’t run out.

Aisha could also use external data sources to help her business. The ingredients she buys will have barcodes on them, which are EAN-13 product identification codes. She could use these standard codes to query an online database that contains allergen or nutrition information, and then provide better information about the food she sells for people with allergies or dietary requirements.

Combining data from different datasets can be extremely commercially valuable.

Credit agencies combine data from banks, the Electoral Roll, search histories, the Postcode Address file, and court records around CCJs and insolvency rulings to identify an individual and give them a credit score. Other businesses then pay to access that database, and use the score to assess whether to approve a person for financial services like mobile phone contracts.