
Loading and reading text data in Python

Article discussing loading and reading text data in Python.

Thanks to its simple syntax for working with files and its intuitive data structures, Python has become a go-to language for working with text data. The Pandas library provides several functions for reading tabular data into a DataFrame object.

Some of these functions include:

  1. read_csv()
    + Load delimited data from a file, URL, or file-like object.
    + ‘,’ – the comma is the default delimiter.
  2. read_table()
    + Load delimited data from a file, URL, or file-like object.
    + ‘\t’ – the tab character is the default delimiter.
  3. read_fwf()
    + Read data in fixed-width column format, where there is no delimiter.
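
As a minimal sketch of how the three functions differ, the snippet below reads the same small table in three layouts. The sample strings are hypothetical and are held in memory with io.StringIO so that the snippet runs without any files; they are not part of the course dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory samples standing in for real files.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"          # comma-delimited
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"    # tab-delimited
fwf_text = "a   b   c\n1   2   3\n4   5   6\n"  # fixed-width columns

df_csv = pd.read_csv(io.StringIO(csv_text))    # default delimiter is ','
df_tab = pd.read_table(io.StringIO(tsv_text))  # default delimiter is '\t'
df_fwf = pd.read_fwf(io.StringIO(fwf_text))    # column boundaries are inferred

for name, frame in [("csv", df_csv), ("table", df_tab), ("fwf", df_fwf)]:
    print(name, frame.columns.tolist(), frame.shape)
```

All three calls should produce a DataFrame with columns a, b, and c and two rows; only the layout of the input text differs.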

In this short course, you will most commonly use the read_csv() function.

Functions for CSV files

A CSV (comma-separated values) file is a type of plain text file. Each line of the file is one record, and commas separate the fields (columns) within that record.

For example:

ColumnA_header, ColumnB_header, ColumnC_header

ColumnA_value1, ColumnB_value1, ColumnC_value1

ColumnA_value2, ColumnB_value2, ColumnC_value2
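
The layout above can be parsed directly, as in the following sketch. It assumes a cleaned-up copy of the example held in a string (the variable name sample is hypothetical). Because each comma is followed by a space, the sketch passes skipinitialspace=True so that the space does not become part of the column names and values.

```python
import io
import pandas as pd

# Cleaned-up copy of the example layout, held in memory for illustration.
sample = (
    "ColumnA_header, ColumnB_header, ColumnC_header\n"
    "ColumnA_value1, ColumnB_value1, ColumnC_value1\n"
    "ColumnA_value2, ColumnB_value2, ColumnC_value2\n"
)

# skipinitialspace=True drops the space that follows each comma.
df = pd.read_csv(io.StringIO(sample), skipinitialspace=True)

print(df.columns.tolist())
print(df.shape)  # (2, 3): two data rows, three columns
```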

CSV’s read operation

For the following tasks, first ensure that you have downloaded the provided dataset, which contains the files required for running the code in this article.

Then import both the NumPy and Pandas libraries and enter the code snippet below, which demonstrates the use of the read_csv() function:

Code:

import pandas as pd
import numpy as np
filename = "./dataset/sample_data.txt"
df = pd.read_csv(filename)
df

Output:

Screenshot of the resulting DataFrame in a Jupyter Notebook. The rows are labelled 0 to 4, and the columns are labelled a, b, c, d, and comments.

Based on these results, you can observe the following:

  • The first row of the file has been used as the column names by default.
  • Default integer row labels (row indexes) have been assigned (look at the values 0, 1, 2, 3, 4).
  • The values have been parsed automatically (i.e. we didn’t have to specify the ‘,’ as the delimiter explicitly).
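
These defaults can be checked on a small stand-in, as sketched below. The contents of sample are hypothetical; the values in the real sample_data.txt may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample_data.txt; the real file's values may differ.
sample = (
    "a,b,c,d,comments\n"
    "0,1,2,3,comment1\n"
    "4,5,6,7,comment2\n"
    "8,9,10,11,comment3\n"
)

df = pd.read_csv(io.StringIO(sample))

print(df.columns.tolist())  # column names come from the first line
print(df.index.tolist())    # default integer row labels: [0, 1, 2]
print(df.dtypes)            # numeric columns are parsed as integers
```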

Specifying the column names

Tables, such as those in Excel, generally come with a header row that identifies either the content or the number of each column. In some scenarios, you have to state explicitly whether a table has a header row.

The header parameter controls this behaviour. If you pass header=None, the first row is not treated as the header and is read as a data record instead. In this case, Pandas generates the column names for the DataFrame automatically, as integers starting from 0.

Screenshot of Python reading a CSV file for its first row in CSV format.

Next, we will read this file and instruct the read_csv() function to treat the first row as a data record.

Code:

filename = "./dataset/sample_data_noheader.txt"
df = pd.read_csv(filename, header=None)
df

Output:

Screenshot of Python reading a CSV file for its first row in table format.
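
To see what header=None changes, the sketch below reads the same header-less text twice, once with the default setting and once with header=None. The string no_header is a hypothetical stand-in for sample_data_noheader.txt.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file that has no header row.
no_header = "0,1,2,3,comment1\n4,5,6,7,comment2\n"

# Default behaviour: the first line is consumed as column names.
df_default = pd.read_csv(io.StringIO(no_header))

# header=None: every line is data and the columns are numbered.
df = pd.read_csv(io.StringIO(no_header), header=None)

print(len(df_default))      # 1, because the first record became the header
print(len(df))              # 2
print(df.columns.tolist())  # [0, 1, 2, 3, 4]
```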

Now suppose we need to provide the column names explicitly instead of relying on the automatically generated ones.

In that case, we pass the list of column names to the names parameter of the read_csv() function. When we pass names=[list of column names] without passing header, we don’t have to add header=None, because read_csv() then treats the first row as data.

The code snippet demonstrates this:

Code:

names = ['a', 'b', 'c', 'd', 'comments']
df = pd.read_csv(filename, names=names)
df

Output:

Screenshot of Python reading a CSV file for the names of columns in the file.
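
The following sketch checks that passing names alone gives the same result as passing names together with header=None. It also shows a related case: if a file does have a header row that you want to replace, header=0 can be combined with names. Both sample strings are hypothetical.

```python
import io
import pandas as pd

names = ["a", "b", "c", "d", "comments"]

# Hypothetical samples: one without a header row, one with a header to replace.
no_header = "0,1,2,3,comment1\n4,5,6,7,comment2\n"
with_header = "w,x,y,z,note\n0,1,2,3,comment1\n4,5,6,7,comment2\n"

with_names = pd.read_csv(io.StringIO(no_header), names=names)
explicit = pd.read_csv(io.StringIO(no_header), names=names, header=None)
print(with_names.equals(explicit))  # True

# header=0 discards the existing header line and uses names instead.
replaced = pd.read_csv(io.StringIO(with_header), names=names, header=0)
print(replaced.columns.tolist())
print(len(replaced))  # 2
```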

Specifying the row labels/row index from a column in the data file

Let’s say each row of your data describes a fruit such as mango, apple, or grape, and those names are stored in a column called ‘fruits’. You might then decide to use that column to label the rows. Pandas comes in handy in such scenarios, where you would like to specify the row labels (row indexes) explicitly instead of using the default indexes.

In the current example, assume that we want the comments column to provide the row labels of the DataFrame. This can be achieved by using the index_col parameter of the function and specifying the name of the column to be used as the row label.
The code snippets demonstrate this parameter.

Code:

df = pd.read_csv(filename, names=names, index_col='comments')
df

Output:

Screenshot of Python reading a CSV file for information in its first column.

Code:

df.loc['comment1']

Output:

Screenshot of Python reading a CSV file for information in its second column.

Based on these results, you can observe:

  • the comment column from the input data file has now been used to specify the row labels of the DataFrame
  • that you can access the first row of the DataFrame using the key 'comment1'.
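
A self-contained sketch of the same idea follows, using a hypothetical in-memory sample instead of the dataset file. It also shows that index_col accepts a column position as well as a column name.

```python
import io
import pandas as pd

names = ["a", "b", "c", "d", "comments"]

# Hypothetical stand-in for the header-less data file.
sample = "0,1,2,3,comment1\n4,5,6,7,comment2\n"

# Use the comments column as the row labels.
df = pd.read_csv(io.StringIO(sample), names=names, index_col="comments")

row = df.loc["comment1"]  # label-based access returns the row as a Series
print(row.tolist())       # [0, 1, 2, 3]

# The same column can be selected by its position (4 is the fifth column).
by_pos = pd.read_csv(io.StringIO(sample), names=names, index_col=4)
print(df.equals(by_pos))  # True
```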

Hierarchical indexing

Let’s say you want to use hierarchical indexing. Hierarchical indexing means that instead of a single column (such as the ‘fruit’ column) serving as the index, several columns together form the index (for example, a fruit variety column with the values Tropical and Exotic, followed by the fruit column). The fruit values mango, apple, and grape can then repeat, because each row is identified by the combination of index levels: a tropical mango, apple, or grape, or an exotic mango, apple, or grape.

Screenshot of Notepad file showing sample data hierarchy.

In this particular case, the first two columns together can be considered as the row index. For such a scenario, we would have to pass the list of columns to be considered as a hierarchical index to the read_csv() function, using the index_col parameter.
The code snippets demonstrate using the index_col parameter.

Code:

filename = "./dataset/sample_data_hierarchy.txt"
names = ['I1', 'I2', 'col1', 'col2', 'col3', 'col4', 'comments']
df = pd.read_csv(filename, names=names, index_col=['I1', 'I2'])
df

Output:

Screenshot of Python command to organise sample data set into columns.

Code:

df.loc['A',1]

Output:

col1               0
col2               1
col3               2
col4               3
comments    comment1
Name: (A, 1), dtype: object


Based on these results, you can observe that:

  • we have read the data file and specified the column names by passing the list of column names to the parameter names
  • we have also specified the hierarchical indexing to be used by passing the list of column names to be considered as indexes to the parameter index_col
  • we are accessing the first row of the DataFrame using the hierarchical index 'A',1.
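
The steps above can be sketched without the dataset file as follows. The sample string is hypothetical; only its first row is modelled on the output shown earlier.

```python
import io
import pandas as pd

names = ["I1", "I2", "col1", "col2", "col3", "col4", "comments"]

# Hypothetical stand-in for sample_data_hierarchy.txt.
sample = (
    "A,1,0,1,2,3,comment1\n"
    "A,2,4,5,6,7,comment2\n"
    "B,1,8,9,10,11,comment3\n"
    "B,2,12,13,14,15,comment4\n"
)

# The first two columns together form a two-level row index.
df = pd.read_csv(io.StringIO(sample), names=names, index_col=["I1", "I2"])
print(df.index.nlevels)  # 2

one_row = df.loc[("A", 1)]  # a single row, selected by both levels
print(one_row["comments"])  # comment1

outer = df.loc["A"]         # every row under 'A', now indexed by I2 only
print(outer.index.tolist())  # [1, 2]
```

Selecting with only the outer label returns a smaller DataFrame, while supplying both levels returns a single row as a Series.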

Next, you will engage in a practical activity to load data from a CSV file to a Pandas DataFrame.

This article is from the free online

Introduction to Data Analytics with Python

Created by
FutureLearn - Learning For Life
