
Data ingestion: Transporting data from source to storage

Learn how to ingest data by transporting it from source to storage.

In the video, you learned what data ingestion is, the different sources of data, the two approaches to data ingestion, and the three basic actions in data ingestion.

Data ingestion in the CRISP-DM process

As we learned earlier, one of the first phases of CRISP-DM is data understanding. Before you can analyse and understand data, you need to collect it from data sources that relate to the business problem. This initial phase of data collection is usually called 'data ingestion'.

Diagram showing the CRISP-DM process, which is made up of six steps. 'Business Understanding' links to 'Data Understanding' with a two-way arrow. 'Data Understanding' links to 'Data Preparation' with a one-way arrow. 'Data Preparation' links to 'Modeling' with a two-way arrow. 'Modeling' links to 'Evaluation' with a one-way arrow. 'Evaluation' links to 'Deployment' with a one-way arrow, and also to 'Business Understanding' with a one-way arrow. In the middle of all the steps is 'Data'.

Data ingestion is the process of reading and loading data from various underlying data sources into Python. From there, the data can be processed and transformed as per the requirements of the application. Each kind of data source has its own protocol for transferring data and, as a data analyst, you must be aware of these protocols.

Most of the time, the data is available to us in the following formats:

  • Text data (CSV, JSON, Excel, etc.)
  • Web data (HTML, XML)
  • Databases (SQL and NoSQL)
  • Binary data formats
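
Each of these formats has a corresponding reader in Python. As a minimal sketch of the idea, text data such as CSV can be ingested with the standard library's `csv` module; the sample data below is illustrative and is not taken from the course data sets:

```python
# A minimal sketch of ingesting CSV text data with Python's standard library.
# The column names and values here are made up for illustration.
import csv
import io

csv_text = "name,total_bill,tip\nAlice,16.99,1.01\nBob,10.34,1.66\n"

# csv.DictReader parses each data row into a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0]["name"])        # Alice
print(float(rows[1]["tip"]))  # 1.66
```

In practice, a data analyst would more often use a library such as pandas (e.g. `pandas.read_csv`), which loads the same text into a DataFrame ready for processing and transformation; the later units in this course cover that workflow.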

Note: Download the zip file containing the data sets as .csv and .txt files that you will need in this unit. Be sure to extract the files and save them in the same folder as your Jupyter Notebooks. All the data in the zip file is sourced from GitHub. [1]

Next, let us learn how to ingest (i.e. load and read) text data in Python.

References

  1. mwaskom. seaborn-data [Internet]. GitHub; 2021 Jun 23. Available from: https://github.com/mwaskom/seaborn-data
This article is from the free online course Introduction to Data Analytics with Python, created by FutureLearn.
