Thanks to its simple syntax for working with files and its intuitive data structures, Python has become a go-to language for processing text data. Pandas provides several functions that read tabular data into a DataFrame object.

Some of these functions include:

1. read_csv()
+ Load delimited data from a file, URL, or file-like object.
+ ‘,’ (comma) is the default delimiter.
2. read_table()
+ Load delimited data from a file, URL, or file-like object.
+ ‘\t’ (tab) is the default delimiter.
3. read_fwf()
+ Read data in fixed-width column format, where there is no delimiter.

In this short course, you will most commonly use the read_csv() function.
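
As a rough illustration of how the three functions differ, the sketch below parses the same small table in three layouts. The data is hypothetical and held in memory with io.StringIO, so it does not depend on the course dataset.

```python
import io
import pandas as pd

# Hypothetical sample data in three layouts (not taken from the course dataset).
csv_text = "name,qty\napple,3\nmango,5\n"          # comma-separated
tsv_text = "name\tqty\napple\t3\nmango\t5\n"       # tab-separated
fwf_text = "name   qty\napple    3\nmango    5\n"  # fixed-width columns

df_csv = pd.read_csv(io.StringIO(csv_text))    # comma is the default delimiter
df_tsv = pd.read_table(io.StringIO(tsv_text))  # tab is the default delimiter
df_fwf = pd.read_fwf(io.StringIO(fwf_text))    # column boundaries inferred from the spacing

for frame in (df_csv, df_tsv, df_fwf):
    print(frame)
```

All three calls should yield a DataFrame with the columns name and qty; only the way the column boundaries are found differs.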

## Functions for CSV files

A CSV (comma-separated values) file is a type of plain text file. In a CSV file, each line is one record, and commas separate the fields (column values) within that record.

For example:

ColumnA_value1, ColumnB_value1, ColumnC_value1

ColumnA_value2, ColumnB_value2, ColumnC_value2

Before running the next set of code, ensure that you have downloaded the provided dataset, which contains the required files.

For the following tasks, you first need to import both the NumPy and Pandas libraries. The code snippet then demonstrates the use of the read_csv() function:

Code:

    import pandas as pd
    import numpy as np

    filename = "./dataset/sample_data.txt"
    df = pd.read_csv(filename)
    df

Output:

Based on these results, you can observe the following:

• The first row of the file is used as the column names by default.
• Default row labels (row indexes) are assigned (look at the values 0, 1, 2, 3, 4).
• The values are parsed automatically (i.e. we didn’t have to specify ‘,’ as the delimiter explicitly).
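
These defaults can be checked on a small in-memory example. The sketch below assumes hypothetical data with the columns a, b, c, d and comments; the actual contents of sample_data.txt may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for a small CSV file with a header row.
csv_with_header = "a,b,c,d,comments\n0,1,2,3,comment1\n4,5,6,7,comment2\n"

df_default = pd.read_csv(io.StringIO(csv_with_header))

print(df_default.columns.tolist())  # ['a', 'b', 'c', 'd', 'comments']
print(df_default.index.tolist())    # [0, 1]
print(df_default.dtypes)            # numeric columns are parsed as integers
```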

### Specifying the column names

Tables, such as those in Excel, generally come with a header row that either identifies the content of each column or gives the number of the column. In some scenarios, you may have to specify explicitly whether a header row exists in a table.

The header parameter controls this behaviour. If you pass header=None, the first row is not treated as the header and is read as a data record instead. In this case, the column names for the DataFrame are generated automatically as integers (0, 1, 2, and so on).

Next, we will read a file that has no header row and instruct the read_csv() function to treat the first row as a data record.

Code:

    filename = "./dataset/sample_data_noheader.txt"
    df = pd.read_csv(filename, header=None)
    df

Output:
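
To make the effect of header=None concrete, here is a minimal sketch on hypothetical in-memory data with no header row (the real sample_data_noheader.txt may contain different values). It reads the same text twice, once with the default setting and once with header=None.

```python
import io
import pandas as pd

# Hypothetical data without a header row.
csv_no_header = "0,1,2,3,comment1\n4,5,6,7,comment2\n"

# Default behaviour: the first record is consumed as column names.
df_wrong = pd.read_csv(io.StringIO(csv_no_header))
print(df_wrong.columns.tolist())  # ['0', '1', '2', '3', 'comment1']
print(len(df_wrong))              # 1

# header=None: every line is a data record, columns get integer labels.
df_noheader = pd.read_csv(io.StringIO(csv_no_header), header=None)
print(df_noheader.columns.tolist())  # [0, 1, 2, 3, 4]
print(len(df_noheader))              # 2
```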

In this particular scenario, let’s say we have a requirement to explicitly provide the column names instead of relying on the auto-indexing for column names.

In that case, we use the names parameter and pass the list of column names to the read_csv() function. When we pass names=[list of column names], we don’t have to pass header=None as well, because read_csv() then treats the first row as a data record.

The code snippet demonstrates this:

Code:

    names = ['a', 'b', 'c', 'd', 'comments']
    df = pd.read_csv(filename, names=names)
    df

Output:
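
The claim that names makes header=None unnecessary can be checked with a short sketch. The data below is hypothetical and held in memory; the column names match those used in the snippet above.

```python
import io
import pandas as pd

# Hypothetical data without a header row.
raw_text = "0,1,2,3,comment1\n4,5,6,7,comment2\n"
col_names = ["a", "b", "c", "d", "comments"]

# Passing names alone ...
df_names = pd.read_csv(io.StringIO(raw_text), names=col_names)
# ... should behave the same as passing names together with header=None.
df_names_explicit = pd.read_csv(io.StringIO(raw_text), names=col_names, header=None)

print(df_names.columns.tolist())           # ['a', 'b', 'c', 'd', 'comments']
print(len(df_names))                       # 2
print(df_names.equals(df_names_explicit))  # True
```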

### Specifying the row labels/row index from a column in the data file

Let’s say each row of your data describes one item, such as mango, apple, or grape, and these names are stored in a column called ‘fruits’. You might then prefer to label the rows by fruit name rather than by number. Pandas comes in handy in such scenarios, where you would like to explicitly specify the row labels (row indexes) instead of using the default indexes.

In the current example, assume that we want the comments column to provide the row labels of the DataFrame. This can be achieved by using the index_col parameter of the function and specifying the name of the column to be used as the row label.
The code snippets demonstrate this parameter.

Code:

    df = pd.read_csv(filename, names=names, index_col='comments')
    df

Output:

Code:

df.loc['comment1']

Output:

Based on these results, you can observe:

• the comments column from the input data file has now been used to specify the row labels of the DataFrame
• that you can access the first row of the DataFrame using the key 'comment1'.
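
The same behaviour can be reproduced without the dataset. This is a minimal sketch on hypothetical in-memory data; the values are made up, and only the column names follow the example above.

```python
import io
import pandas as pd

# Hypothetical data without a header row.
sample_text = "0,1,2,3,comment1\n4,5,6,7,comment2\n"
sample_names = ["a", "b", "c", "d", "comments"]

df_indexed = pd.read_csv(
    io.StringIO(sample_text),
    names=sample_names,
    index_col="comments",  # use this column as the row labels
)

print(df_indexed.index.tolist())    # ['comment1', 'comment2']
print(df_indexed.columns.tolist())  # ['a', 'b', 'c', 'd']

# Label-based lookup of a single row returns a Series.
first_row = df_indexed.loc["comment1"]
print(first_row.tolist())           # [0, 1, 2, 3]
```

Note that the comments column no longer appears among the regular columns once it has become the index.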

### Hierarchical indexing

Let’s say you want to use hierarchical indexing. Hierarchical indexing means that instead of a single column (such as the ‘fruit’ column) serving as the index, a hierarchy of columns serves as the index (for example, a fruit variety level with the values Tropical and Exotic, followed by the fruit name). The inner level can still hold the same values, such as mango, apple, and grape, but each value is now qualified by the outer level, giving a tropical mango, apple, and grape or an exotic mango, apple, and grape.

In this particular case, the first two columns together can be considered as the row index. For such a scenario, we would have to pass the list of columns to be considered as a hierarchical index to the read_csv() function, using the index_col parameter.
The code snippets demonstrate using the index_col parameter.

Code:

    filename = "./dataset/sample_data_hierarchy.txt"
    names = ['I1', 'I2', 'col1', 'col2', 'col3', 'col4', 'comments']
    df = pd.read_csv(filename, names=names, index_col=['I1', 'I2'])
    df

Output:

Code:

df.loc['A',1]

Output:
col1 0
col2 1
col3 2
col4 3
Name: (A, 1), dtype: object

Based on these results, you can observe that:

• we have read the data file and specified the column names by passing the list of column names to the names parameter
• we have also specified the hierarchical index by passing the list of columns to be used as index levels to the index_col parameter
• we are accessing the first row of the DataFrame using the hierarchical index 'A',1.
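
A self-contained sketch of the same idea follows. The rows are hypothetical (the real sample_data_hierarchy.txt may hold other values); the column names and the index_col list follow the snippet above.

```python
import io
import pandas as pd

# Hypothetical data: the first two fields form the two index levels.
hier_text = (
    "A,1,0,1,2,3,comment1\n"
    "A,2,4,5,6,7,comment2\n"
    "B,1,8,9,10,11,comment3\n"
)
hier_names = ["I1", "I2", "col1", "col2", "col3", "col4", "comments"]

df_hier = pd.read_csv(
    io.StringIO(hier_text),
    names=hier_names,
    index_col=["I1", "I2"],  # a list of columns creates a hierarchical index
)

print(list(df_hier.index.names))  # ['I1', 'I2']

# Select one row with the full key (outer level, inner level).
row_a1 = df_hier.loc[("A", 1)]
print(row_a1)

# Select every row under the outer label 'A'; the result is indexed by I2.
group_a = df_hier.loc["A"]
print(group_a.index.tolist())     # [1, 2]
```

Passing only the outer label returns all rows in that group, which is one practical reason to use a hierarchical index.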

Next, you will engage in a practical activity to load data from a CSV file to a Pandas DataFrame.