Pandas
What is Pandas and what do we need it for?
To use the machine learning tools in scikit-learn you need to provide them with training data in a particular format, known as a features matrix. One Python package that is useful for dealing with data structures of this type is Pandas. In this article we will look more closely at what a features matrix is, and give a quick overview of the DataFrame object in Pandas, which can be used to make and import features matrices for use in machine learning models in scikit-learn.
What is a features matrix?
Put simply, a features matrix is a two-dimensional table with each row containing a single observation or data-point, and each column containing a value representing a particular feature of that data-point. A common example is a table of data in a spreadsheet. Provided each row represents a different observation or entity, and each column records a different feature associated with or measured within that observation, it can be used as a features matrix. In fact spreadsheets of this type are often imported into Python and used in exactly this way.
Alternatively, you may have encountered the comma-separated values (CSV) format, which can be used to store data in the same way. These are text files in which the values in each row are separated by commas, and each new row starts on a new line.
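To make the CSV layout concrete, here is a minimal sketch that reads a small piece of CSV text into a table. The contents of csv_text are invented for illustration, and io.StringIO is used only so that the example does not depend on a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header row, then one observation per line
csv_text = """size,species
1.3,daisy
1.4,daisy
0.8,clover
2.6,daffodil
"""

# io.StringIO lets Pandas read the string as if it were a file
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the header row
```

Each line of text becomes one row of the table, and each comma-separated value lands in its own column.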
Python Pandas and DataFrames
As we will see, features matrices might contain several different data types, all of which need representing in Python in order for us to use them in our analysis. While arrays such as NumPy arrays are computationally fast, they can only contain one data type. On the other hand, while we could construct a features matrix using standard Python list or dictionary objects, which can contain any number of different data types, this is computationally very slow in practice.
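The single-type restriction of NumPy arrays can be seen in a short sketch. The values here are invented for illustration; the point is only that mixing a number and a string forces NumPy to store both as text.

```python
import numpy as np

# Mixing a number and a string: NumPy converts everything to one type
mixed = np.array([1.3, 'daisy'])
print(mixed.dtype)      # a text (Unicode string) dtype
print(repr(mixed[0]))   # the number is now stored as text

# A purely numeric array keeps a numeric type and supports fast maths
numeric = np.array([1.3, 1.4, 0.8, 2.6])
print(numeric.dtype)
print(numeric.mean())
```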
The Pandas package offers a useful solution to this in the form of its DataFrame class. It allows you to construct tabular data with potentially different data types in each column, exactly as we’d want for a features matrix, but with better computational performance than just using Python lists.
If your feature columns have numerical data, you can also easily combine columns via arithmetical or logical operations (e.g. adding or multiplying two columns together) to make new features. This can be particularly useful for feature extraction, which we will discuss more later in the course.
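As a rough sketch of combining columns, the example below builds two new columns from existing ones. The column names and measurements are hypothetical, and the product of length and width is only a crude stand-in for a real derived feature.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration
df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
})

# Arithmetic on whole columns works element by element
df['petal_area'] = df['petal_length'] * df['petal_width']

# Logical operations give a column of True/False values
df['is_large'] = (df['petal_length'] > 4.0) & (df['petal_width'] > 1.0)

print(df)
```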
Pandas Series
While the Pandas class for representing data in a tabular format is the DataFrame, the Pandas class for a one-dimensional list of values (sometimes known as a vector) is the Series. Unlike standard Python lists, a Series is intended to hold a single data type (e.g. string or numeric). As with DataFrames, Pandas provides lots of convenient methods to combine Series to make new Series, or to select elements from a Series using logical operations.
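The following sketch shows a couple of these Series operations. The values are invented for illustration, and the variable names are arbitrary.

```python
import pandas as pd

sizes = pd.Series([1.3, 1.4, 0.8, 2.6])
species = pd.Series(['daisy', 'daisy', 'clover', 'daffodil'])

print(sizes.dtype)  # the single data type shared by every element

# Combine Series element by element to make a new Series
doubled = sizes * 2

# Select elements with a logical condition (a True/False mask)
large = sizes[sizes > 1.0]

# A mask built from one Series can select from another of the same length
daisy_sizes = sizes[species == 'daisy']

print(large)
print(daisy_sizes)
```

Note that the selected elements keep their original row labels, so large still refers to rows 0, 1 and 3.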
You can think of a DataFrame as a set of Series of the same length grouped together to form a table. An example dataset included in scikit-learn that we will return to several times, particularly relevant to plant phenotyping, is the Iris dataset. In this dataset we have several numeric Series giving measurements of petal and sepal dimensions, and a Series containing strings giving the species of individual plants.
To train a machine learning model we might take only the columns containing numeric data to give us a features matrix. The remaining column, giving the species name, can then be extracted as a Series (the target vector) and used, along with the features matrix, to train the model.
How do we make and import DataFrames and Series?
You can make DataFrames from scratch using lists and dictionaries. For example:
import pandas as pd
list1 = [1.3, 1.4, 0.8, 2.6]
list2 = ['daisy', 'daisy', 'clover', 'daffodil']
df = pd.DataFrame(data = {'size':list1, 'species':list2})
print(df)
print()
print(df.dtypes)
If you run the code above, you should see output something like this:
   size   species
0   1.3     daisy
1   1.4     daisy
2   0.8    clover
3   2.6  daffodil

size       float64
species     object
dtype: object
This shows the DataFrame you’ve just made, and the data type of each column. As you can see, Pandas has inferred the types automatically: the decimal numbers are stored as float64, while the strings are stored using the general-purpose object type.
However, a more common method than making DataFrames from scratch, especially for the large datasets you would use for machine learning, is importing data from an external source such as a CSV or spreadsheet file. Let’s say you have a file called ‘my_data.csv’. Then it’s as simple as the following code to import it as a Pandas DataFrame:
import pandas as pd
imp_df = pd.read_csv('my_data.csv')
print(imp_df)
Of course, for this DataFrame to be useful you need to make sure the data in the file is in the correct format, with every row containing the same set of features.
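One way to check an imported table is sketched below. The file contents are hypothetical and are read from a string so that the example runs on its own; the third data row deliberately lacks a size value. Dropping incomplete rows with dropna is just one possible response, not necessarily the right one for your data.

```python
import io
import pandas as pd

# Hypothetical file contents; the third data row has no 'size' value
csv_text = "size,species\n1.3,daisy\n1.4,daisy\n,clover\n2.6,daffodil\n"
imp_df = pd.read_csv(io.StringIO(csv_text))

# Check that each column was read with the type you expect
print(imp_df.dtypes)

# Count the missing values in each column
print(imp_df.isna().sum())

# One option: keep only the rows with no missing values
clean_df = imp_df.dropna()
print(len(clean_df))
```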
Importing the Iris dataset
To import the Iris dataset and convert it to a Pandas DataFrame you can use the following code:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
# Get the data and convert to DataFrame - it's stored in iris.data
iris_df = pd.DataFrame(iris.data)
# Set the column names - these are stored as iris.feature_names
iris_df.columns = iris.feature_names
Then, to display the first few rows of data along with the row and column names, you can use the head method as follows:
print(iris_df.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
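The object returned by load_iris also holds the species information: iris.target contains an integer code for each plant, and iris.target_names contains the matching species names. As a sketch, the codes can be turned into a Series of species names like this (the variable name species is just a choice made for this example):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# The column names can also be set when the DataFrame is created
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Use each integer code in iris.target to look up its name in iris.target_names
species = pd.Series(iris.target_names[iris.target], name='species')

print(iris_df.shape)
print(species.value_counts())
```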
Alternative method
You can import the data directly as a Pandas DataFrame:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
alt_df = iris.frame
print(alt_df.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
Now, when you look at the data, you should see that the DataFrame also contains the target data in the far right-hand column, named ‘target’. Each value is a numerical code (0, 1 or 2) indicating the species of the flower in that row of the table.
To make a copy containing just the features you can use the drop method, with the name(s) of the column(s) you want to drop, in this case just ‘target’:
X = alt_df.drop('target', axis=1)
The keyword argument axis=1 is essential: it tells drop to remove a column. Without it, drop would look for a row labelled ‘target’ instead.
Finally, to separate out the target vector you can just use the column name, ‘target’:
y = alt_df['target']
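As a quick sanity check, the sketch below repeats the split and confirms that X and y have the shapes you would expect. It uses drop(columns='target'), which is an equivalent way of writing drop('target', axis=1).

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
alt_df = iris.frame

# columns='target' is equivalent to passing 'target' with axis=1
X = alt_df.drop(columns='target')
y = alt_df['target']

print(X.shape)  # (number of plants, number of feature columns)
print(y.shape)  # one target value per plant

# drop returns a new DataFrame, so alt_df still has its 'target' column
print('target' in alt_df.columns)
```

The number of rows in X should always match the length of y, since each row of features needs exactly one target value.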
Conclusions
As you can see, there are different ways to import and manipulate your data. Which one you choose depends on the context, as well as your personal preference. The important thing is to know what your data types are, and how to get them in the shape you want.
Try out these examples for yourself, either using a Python shell or making a script. We’ll see how we can use this data to train machine learning models later in the course.
There’s a complete version of the code with all the examples shown here in the GitHub repository: https://github.com/LAR/PhenoDataCampp/tree/main/MachineLearning