Data ingestion: Transporting data from source to storage

Learn how to ingest data by transporting it from source to storage.

In the video, you learned what data ingestion is, the different sources of data, the two approaches to data ingestion, and the three basic actions in data ingestion.

Data ingestion in the CRISP-DM process

As we learned earlier, CRISP-DM begins with business understanding, which is followed by data understanding. Before you can analyse and understand data, you need to collect it from data sources that relate to the business problem. This initial data collection step is usually called ‘data ingestion’.

Diagram showing the CRISP-DM process, which is made up of six steps. 'Business Understanding' links to 'Data Understanding' with a two-way arrow. 'Data Understanding' is linked to 'Data Preparation' with a one-way arrow. 'Data Preparation' is linked to 'Modeling' with a two-way arrow. 'Modeling' is linked to 'Evaluation' with a one-way arrow. 'Evaluation' is linked to 'Deployment' with a one-way arrow, and also to 'Business Understanding' with a one-way arrow. In the middle of all the steps is 'Data'.

Data ingestion is the process of reading and loading data from various underlying data sources into Python. From there, the data can be processed and transformed as per the requirements of the application. Each kind of data source has its own protocol for transferring data and, as a data analyst, you must be aware of these protocols.

Most of the time, the data is available to us in one of the following formats:

  • Text data (CSV, JSON, Excel, etc.)
  • Web data (HTML, XML)
  • Databases (SQL and NoSQL data)
  • Binary data formats
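
To make the list above concrete, here is a minimal sketch of ingesting three of these kinds of data using only Python's standard library. The sample values, the table name `readings`, and the helper functions `ingest_csv`, `ingest_json` and `ingest_sql` are hypothetical and made up for illustration; they are not part of the course data set. An in-memory SQLite database stands in for a real SQL source.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data, invented for this illustration only.
CSV_TEXT = "city,temperature\nLeeds,14.5\nYork,13.0\n"
JSON_TEXT = '[{"city": "Leeds", "temperature": 14.5}]'


def ingest_csv(text):
    """Read CSV text into a list of dicts (one dict per row)."""
    # csv.DictReader uses the first line as the column names.
    return list(csv.DictReader(io.StringIO(text)))


def ingest_json(text):
    """Parse JSON text into Python lists and dicts."""
    return json.loads(text)


def ingest_sql(connection, query):
    """Run a query against a SQL database and return the rows as dicts."""
    connection.row_factory = sqlite3.Row  # lets each row be read by column name
    return [dict(row) for row in connection.execute(query)]


csv_rows = ingest_csv(CSV_TEXT)
json_rows = ingest_json(JSON_TEXT)

# An in-memory SQLite database stands in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temperature REAL)")
conn.execute("INSERT INTO readings VALUES ('Leeds', 14.5)")
sql_rows = ingest_sql(conn, "SELECT * FROM readings")
conn.close()
```

Note that the CSV reader returns every value as a string (for example `"14.5"`), whereas JSON and SQL preserve numeric types. This is one reason ingested data usually needs further processing before analysis.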

Note: Download the zip file containing the data sets (.csv and .txt files) that you will need in this unit. Be sure to extract the files and save them individually in the same folder as your Jupyter Notebooks. All the data in the zip file is sourced from the seaborn-data repository on GitHub. [1]
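
Once the files are extracted, a quick check like the following can confirm that they sit in the folder Python is working from. This is only a sketch: the helper name `find_data_files` is hypothetical, and it assumes the notebook's working directory is the folder where the notebook is saved, which is normally the case in Jupyter.

```python
from pathlib import Path


def find_data_files(folder=".", suffixes=(".csv", ".txt")):
    """Return the sorted names of files in `folder` with the given suffixes."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


# "." means the current working directory.
print(find_data_files("."))
```

If the printed list is empty, the files were probably extracted to a different folder from the notebook.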

Next, let us learn how to ingest (i.e. load and read) text data in Python.


References

  1. mwaskom. seaborn-data [Internet]. GitHub; 2021 Jun 23. Available from: https://github.com/mwaskom/seaborn-data

This article is from the free online course Introduction to Data Analytics with Python, created by FutureLearn.