# What is Data Classification?

Classification problems involve us supplying data to a computer for it to then allocate it to a label or a class. Suppose that an online retailer wanted a system that could quickly work out whether or not reviews on a product are positive or negative. To solve this problem, the retailer can use a text-classifying algorithm to look for positive and negative comments.
Classification problems involve us supplying data to a computer for it to then allocate it to a label or a class.
Suppose that an online retailer wanted a system that could quickly work out whether or not reviews on a product are positive or negative. To solve this problem, the retailer can use a text-classifying algorithm to look for positive and negative comments.
So that the text classifier can separate out positive and negative reviews, it first needs some training data. The retailer would need to feed in past reviews and label each one as positive or negative. The algorithm will use this training data to create a decision boundary to separate the positive reviews from the negative reviews.
When a user submits a new review, the algorithm will classify it depending on which side of the decision boundary it falls on.
What I have described here is a form of supervised learning as it requires a supervisor (human user) to create classes and to label the training data, so that the algorithm can independently classify new data.

## Types of Data Classification

### Binary Classification

Binary classification involves splitting items into only two classes. The example above was binary classification, as it split the reviews into “positive” and “negative”. Another example of binary classification is spam filtering, as an email is either classified as “spam” or “not spam”.

### Multi-Class Classification

This is a classification algorithm that allows for more than two classes. During the labelling process, each data sample is only assigned to a single label. For example, a recycling centre needs to categorise each item of waste by taking photographs of the waste travelling down a conveyor belt. Rather than categorising an item as recyclable or non-recyclable, a multi-class classification model allows a wider range of classes, such as glass, plastic, paper or cardboard.

### Multi-Label Classification

This can be used for problems where a single data point can have more than one class. For example, a person categorising images of animals can label a picture of a brown bear with multiple labels such as “brown animal”, “furry” and “bear”.
In effect, these systems make multiple binary classification predictions for each piece of data.

## Types of Data Classifier

The examples above show a few ways that different types of data, text and images can be classified. Other classifiers will use these types of data for different purposes or classify other types of data such as sound.

### Image Classifiers

Image classification involves classifying the content of images. As well as the examples already given, common uses include:
• Facial recognition
• Handwriting recognition
• Helping to identify abnormalities on medical images

### Text Classifiers

Text classifiers analyse natural language to identify classes by using longer pieces of text, rather than just picking out common keywords. Common uses include:
• Topic analysis to identify themes or the topics present in text. For example, a chatbot that provides support to customers will need to “classify” the issue that the customer is having, so that it can then provide appropriate support such as directing the customer to the relevant webpage.
• Sentiment analysis (to detect if the language has a positive or negative tone).
• Spam detection.

### Sound Classifiers

Sound classifiers classify groups of similar or identical sounds. Voice recognition is a common example where systems are trained to recognise individual voices or to recognise commands such as in home automation systems. Another example could be to help research scientists identify species of bird present in an area by recording birdsong. BirdNet is a project where you can try this out for yourself on the BirdNet project website.

### Limitations of Data Classification

It’s worth noting that the main limitations of using classification algorithms for machine learning are that the accuracy of the prediction depends on:
• The existence of the appropriate labels. If data is fed into the system that doesn’t fit into one of the categories, then the prediction will always be inaccurate.
• The training data all being correctly labelled. In some instances, this might be a subjective decision (such as in the case of sentiment analysis). The more subjective the training data, the more likely any new data will be incorrectly classified.
• Using a good spread of training data that represents the full range of inputs that the model will have to classify.
In essence, creating lots of high quality training data takes a lot of time and effort.

Pick one of the types of classification (images, sound, or text), and find another example online of a use for these types of algorithms.
• Explain the example as best you can
• What classes do you think it identifies?