This content is taken from Coventry University's online course, Get ready for a Masters in Data Science and AI.

# Representing concepts mathematically

In Week 1, we looked at how to work with numbers, strings and Boolean data types in Python. In this step, we look at other data types commonly used in data science and artificial intelligence, including sets, images, and networks.

A set is simply a collection of elements. An element either is or is not a member of a set.

A digital image could be represented as a grid of pixels, where a colour or greyscale value is given for each pixel in the grid.

A network is a set of nodes and connections between nodes. Examples include a social network (people and the connections between them) and the internet (computers, switches, routers and satellites, and the physical links between them).

## Sets

You might remember from school the mathematical idea of a set of elements and the ideas of subsets, the union of sets and the intersection of sets.

Just as in mathematics, in Python we recognise a set from the curly brackets { }. The elements of a set have no order, ie there is no first element.
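
Two consequences of this are worth seeing in action. The short sketch below uses made-up example values (the names letters, empty_dict and empty_set are only for illustration) to show that a set stores each element once, that order plays no part in comparing sets, and that empty curly brackets in Python actually create a dictionary, so an empty set is written set().

```python
# Duplicates are dropped when a set is built, and order is not significant.
letters = {'a', 'b', 'a', 'c'}
print(len(letters))                 # 3, because 'a' is stored only once
print(letters == {'c', 'b', 'a'})   # True: same elements, order is irrelevant

# Empty curly brackets create a dictionary, not a set.
empty_dict = {}
empty_set = set()
print(type(empty_dict).__name__)    # dict
print(type(empty_set).__name__)     # set
```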

We can test for membership using the operators in and not in, as follows. The result from these operators is a Boolean (True or False).

kitchen = {'sink', 'oven', 'dishwasher'}
print(kitchen)
if 'oven' in kitchen:
    print("Yes, an oven is in the kitchen")
print('toaster' not in kitchen)  # True: there is no toaster in the set


We can add an element to a set and discard an element from a set (if it is present) as follows:

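kitchen.add('fridge')
print(kitchen)
kitchen.discard('dishwasher')  # illustrative choice of element; discard does nothing if it is absent
print(kitchen)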


It is possible to test whether one set is a subset of another set (using issubset), and construct unions and intersections (using union and intersection).

bathroom = {'sink', 'shower', 'toilet'}
print(kitchen.issubset(bathroom))
house = kitchen.union(bathroom)
print(house)
print(kitchen.issubset(house))
mystery = kitchen.intersection(bathroom)
print(mystery)


The style of notation here is common in object-oriented programming, ie ‘variable dot action’ performs the action (eg .union) on the variable (eg kitchen).
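
As a side note, Python's built-in sets also accept operator symbols for the same actions. The sketch below rebuilds the two example sets so that it runs on its own, and compares the 'variable dot action' form with the operator form; the name shared is introduced here purely for illustration.

```python
kitchen = {'sink', 'oven', 'dishwasher'}
bathroom = {'sink', 'shower', 'toilet'}

# Method form: variable dot action
house = kitchen.union(bathroom)
shared = kitchen.intersection(bathroom)

# Operator form of the same actions
print(house == (kitchen | bathroom))    # True: | is union
print(shared == (kitchen & bathroom))   # True: & is intersection
print(kitchen <= house)                 # True: <= is the subset test
```

Which form to use is largely a matter of taste. The method names read more like the mathematical vocabulary, while the operators are more compact.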

## Images

To work with images in Python, we would use a specialist Python library, such as Pillow or OpenCV. For this section, we’re just going to look at how images are represented in a computer as a file in a file system.

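A simple way to represent a black-and-white digital image is as a text file with a 1 for black or a 0 for white for each pixel in the image. This is how the plain PBM format, part of the Netpbm family, represents image files: the first line (P1) identifies the format, lines starting with # are comments, and the next two numbers give the width and height in pixels. The example below is of an 8-by-8 pixel black-and-white image of a smiley face.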

P1
# PBM format example
8 8
0 0 1 1 1 1 0 0
0 1 0 0 0 0 1 0
1 0 1 0 0 1 0 1
1 0 0 0 0 0 0 1
1 0 1 0 0 1 0 1
1 0 0 1 1 0 0 1
0 1 0 0 0 0 1 0
0 0 1 1 1 1 0 0
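
To make the format more concrete, here is a rough sketch that reads the example above from a string and prints it using # for black and . for white. The function name parse_plain_pbm is made up for this illustration, and the sketch assumes the pixel values are separated by whitespace, as they are in the example. For real work, an image library such as Pillow would be the better choice.

```python
PBM_TEXT = """P1
# PBM format example
8 8
0 0 1 1 1 1 0 0
0 1 0 0 0 0 1 0
1 0 1 0 0 1 0 1
1 0 0 0 0 0 0 1
1 0 1 0 0 1 0 1
1 0 0 1 1 0 0 1
0 1 0 0 0 0 1 0
0 0 1 1 1 1 0 0
"""


def parse_plain_pbm(text):
    """Return (width, height, rows) for a whitespace-separated plain PBM string."""
    tokens = []
    for line in text.splitlines():
        line = line.split('#', 1)[0]   # drop anything after a comment marker
        tokens.extend(line.split())
    if tokens[0] != 'P1':
        raise ValueError('not a plain PBM image')
    width, height = int(tokens[1]), int(tokens[2])
    pixels = [int(token) for token in tokens[3:]]
    if len(pixels) != width * height:
        raise ValueError('pixel count does not match the stated size')
    # Cut the flat list of pixels into one list per row of the image.
    rows = [pixels[r * width:(r + 1) * width] for r in range(height)]
    return width, height, rows


width, height, rows = parse_plain_pbm(PBM_TEXT)
for row in rows:
    print(''.join('#' if pixel else '.' for pixel in row))
```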


Extending this idea to greyscale images, we can represent the greyscale value of each pixel as an 8-bit whole number (0 to 255) or perhaps a 16-bit whole number (0 to 65535).
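
The ranges quoted above follow from the number of bits: n bits can store whole numbers from 0 up to 2 to the power n, minus 1. A small sketch of that arithmetic, using a helper name (max_pixel_value) invented for this example:

```python
def max_pixel_value(bits):
    """Largest whole number that fits in the given number of bits."""
    return 2 ** bits - 1


print(max_pixel_value(1))    # 1
print(max_pixel_value(8))    # 255
print(max_pixel_value(16))   # 65535
```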

For colour images, each pixel is represented in the RGB colour model by a red value, a green value, and a blue value. If each of the red, green and blue values is represented by an 8-bit integer (0 to 255), then a single pixel is represented by 24 bits, giving $$256^3=16777216$$ possible colours.

In the world of web design, a colour can be specified using six hexadecimal digits, two for each of red, green and blue. For example, ‘orange’ has RGB values (255,165,0), which is given as FFA500 in hexadecimal (FF for red, A5 for green and 00 for blue).
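
The conversion between the two notations can be sketched in a few lines of Python. The function names rgb_to_hex and hex_to_rgb are invented for this example and are not part of any library.

```python
def rgb_to_hex(red, green, blue):
    """Format three 8-bit channel values as six hexadecimal digits."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError('each channel must be between 0 and 255')
    return f'{red:02X}{green:02X}{blue:02X}'


def hex_to_rgb(code):
    """Split six hexadecimal digits back into (red, green, blue)."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(rgb_to_hex(255, 165, 0))   # FFA500
print(hex_to_rgb('FFA500'))      # (255, 165, 0)
print(256 ** 3)                  # 16777216 possible colours
```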

Be aware that most image formats use data compression to reduce file size. Lossy compression, as used by JPEG, discards some detail and can reduce image quality, whereas lossless compression, as used by PNG, does not.

## Networks

Have you heard of the Six Degrees of Kevin Bacon? This is a challenge to connect the well-known actor Kevin Bacon with any other actor or actress through a chain of connections, where each connection is a movie that two people appeared in together. The idea is to find the shortest possible chain. Shortest path algorithms (such as the Floyd-Warshall algorithm, which finds the shortest paths between all pairs of nodes) calculate shortest paths in such networks.

This kind of connection data is represented mathematically as a network. One way we can represent a network in Python is as an adjacency matrix. A matrix is simply a two-dimensional array (or grid) of numbers arranged into rows and columns. In Python, the NumPy library provides the two-dimensional array data structure and operations.

Continuing with the Six Degrees of Kevin Bacon idea, in an adjacency matrix each actor or actress is assigned one row and the matching column. For example, row 1 and column 1 represent Kevin Bacon, and row 2 and column 2 represent Meryl Streep, who appeared with Kevin Bacon in The River Wild (1994). As a result of this connection, we’d place a 1 in row 1, column 2 of the matrix and, because the connection works in both directions, also in row 2, column 1. If Kevin Bacon and Meryl Streep had not appeared in any movies together, we would put a 0 in those positions.

|  | Column 1 (Kevin Bacon) | Column 2 (Meryl Streep) | Column 3 (Anne Hathaway) |
| --- | --- | --- | --- |
| Row 1 (Kevin Bacon) | 0 | 1 | 0 |
| Row 2 (Meryl Streep) | 1 | 0 | 1 |
| Row 3 (Anne Hathaway) | 0 | 1 | 0 |
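
As a minimal sketch, the three-actor matrix can be written in Python as a list of rows. Plain nested lists are used here so that the example runs without installing anything; the same nested list could be passed to numpy.array to obtain a NumPy two-dimensional array. The helper names appeared_together and co_stars are made up for this illustration.

```python
actors = ['Kevin Bacon', 'Meryl Streep', 'Anne Hathaway']

# Python counts from 0, so row 1 of the table is adjacency[0].
adjacency = [
    [0, 1, 0],   # Kevin Bacon
    [1, 0, 1],   # Meryl Streep
    [0, 1, 0],   # Anne Hathaway
]


def appeared_together(first, second):
    """True if the matrix records a connection between the two actors."""
    return adjacency[actors.index(first)][actors.index(second)] == 1


def co_stars(name):
    """List everyone directly connected to the named actor."""
    row = adjacency[actors.index(name)]
    return [actors[column] for column, linked in enumerate(row) if linked]


# Connections work in both directions, so the matrix equals its mirror image.
size = len(actors)
is_symmetric = all(
    adjacency[r][c] == adjacency[c][r] for r in range(size) for c in range(size)
)

print(appeared_together('Kevin Bacon', 'Meryl Streep'))    # True
print(appeared_together('Kevin Bacon', 'Anne Hathaway'))   # False
print(co_stars('Meryl Streep'))   # ['Kevin Bacon', 'Anne Hathaway']
print(is_symmetric)               # True
```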

This is just one way we can capture connection data mathematically.
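
Returning to the shortest-chain challenge, the sketch below counts the fewest connections between two actors in the three-actor matrix. Note that it uses breadth-first search rather than the Floyd-Warshall algorithm mentioned earlier: breadth-first search is a simpler technique that suits networks where every connection counts as one step. The function name degrees_of_separation is invented for this example.

```python
from collections import deque

actors = ['Kevin Bacon', 'Meryl Streep', 'Anne Hathaway']
adjacency = [
    [0, 1, 0],   # Kevin Bacon
    [1, 0, 1],   # Meryl Streep
    [0, 1, 0],   # Anne Hathaway
]


def degrees_of_separation(start, goal):
    """Fewest connections from start to goal, or None if no chain exists."""
    start_index, goal_index = actors.index(start), actors.index(goal)
    distance = {start_index: 0}      # steps needed to reach each visited actor
    queue = deque([start_index])     # actors whose co-stars are still to be checked
    while queue:
        current = queue.popleft()
        if current == goal_index:
            return distance[current]
        for neighbour, linked in enumerate(adjacency[current]):
            if linked and neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return None


print(degrees_of_separation('Kevin Bacon', 'Meryl Streep'))    # 1
print(degrees_of_separation('Kevin Bacon', 'Anne Hathaway'))   # 2
```

Because the search visits actors in order of increasing distance, the first time it reaches the goal it has found a shortest chain.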

Analysis of this kind of data is of interest to social scientists studying social networks, such as a network of friends on Facebook or a network of co-authors of academic papers.

We have seen that it is possible to capture different kinds of data in a computer using appropriate mathematical ideas. There are many other types of data with interesting names, such as spatiotemporal data (recording where and when something occurs) and genomic data (biological data about genomes). Each type of data brings its own challenges in agreeing a standard way to encode it for storage and processing on a computer system.