
Data Formats

Graeme Malcolm discusses common formats in which data may be found or stored.

In this step, we’ll learn about the common forms and file types in which data is stored.

Common Formats of Data

Unstructured data

Unstructured data has no predefined schema and isn’t organised in a predefined manner. It’s typically text-heavy, but may contain other data such as dates.
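To illustrate, here’s a minimal sketch of pulling dates out of free text with a regular expression. The note text and the assumption that dates appear in YYYY-MM-DD form are hypothetical; real unstructured text rarely follows a single convention, which is exactly what makes it hard to process.

```python
import re

# Hypothetical free-text note; nothing about its layout is guaranteed.
note = (
    "Customer called on 2023-04-12 about a late delivery. "
    "Refund issued 2023-04-15 after a second call."
)

# Assumption: dates are written as YYYY-MM-DD. Other styles
# (12 April 2023, 04/12/23) would need their own patterns.
iso_dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", note)

print(iso_dates)
```

Because there’s no schema, the code has to guess where the useful values are, rather than looking them up by column name.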

Structured data

Delimited text

Delimited data is often found as comma-separated values (CSV) files. Despite the name, the delimiter in these files can be a comma, a space, or a tab. The top row often holds the column headers, simulating the structure of a table in a relational database.
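As a rough sketch of how the delimiter choice plays out, the example below reads the same small table twice with Python’s built-in csv module: once comma-delimited and once tab-delimited. The sample data and the read_delimited helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical sample data: the same table with two different delimiters.
comma_text = "name,joined\nAda,2021-03-01\nLin,2022-11-15\n"
tab_text = "name\tjoined\nAda\t2021-03-01\nLin\t2022-11-15\n"


def read_delimited(text, delimiter):
    """Parse delimited text, treating the first row as column headers."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


comma_rows = read_delimited(comma_text, ",")
tab_rows = read_delimited(tab_text, "\t")

print(comma_rows[0]["name"])
print(comma_rows == tab_rows)
```

Both calls should produce the same rows. Only the delimiter passed to the reader changes, which is why you need to know which one a file uses before you parse it.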

Relational Databases

Relational databases store data in a predefined schema, generally following a column/row format. We’ll go into further detail this week.
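To make the idea of a predefined schema concrete, here’s a small sketch using SQLite through Python’s built-in sqlite3 module. The customers table and its columns are hypothetical, and SQLite is used only because it needs no setup, not because the course prescribes it.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: column names, types and constraints.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, joined TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, joined) VALUES (?, ?, ?)",
    [(1, "Ada", "2021-03-01"), (2, "Lin", "2022-11-15")],
)
rows = conn.execute("SELECT name, joined FROM customers ORDER BY id").fetchall()

# A row that breaks the schema (missing name) is rejected.
try:
    conn.execute(
        "INSERT INTO customers (id, name, joined) VALUES (?, ?, ?)",
        (3, None, "2023-01-01"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```

The database refuses data that doesn’t fit the declared structure, which is the main contrast with the more flexible formats described next.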

Semi-structured data

Semi-structured data is data that doesn’t follow the tabular structure of relational databases or delimited files. Its advantage is that its structure is often more flexible, but that flexibility can also lead to the problem known as garbage in, garbage out. Because the system doesn’t have to enforce as much structure, it might accept more data than is valuable, leaving the overall datastore bloated.

Extensible Markup Language

Extensible Markup Language (XML) files were popularised by web applications that used the SOAP protocol. These files have a hierarchy rather than just rows and columns, and are considered to be at least somewhat human-readable. The structure, which programs can parse, is flexible and can change as needed.
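The hierarchy described above can be sketched with Python’s built-in xml.etree.ElementTree module. The element and attribute names (customers, customer, orders, order, total) are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second customer has no <orders> element,
# which is allowed because the structure is flexible.
xml_text = """
<customers>
  <customer id="1">
    <name>Ada</name>
    <orders>
      <order total="12.50"/>
      <order total="7.25"/>
    </orders>
  </customer>
  <customer id="2">
    <name>Lin</name>
  </customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)

# Walk the hierarchy: each customer may contain zero or more orders.
totals_by_name = {
    customer.findtext("name"): [
        float(order.get("total")) for order in customer.iter("order")
    ]
    for customer in root.findall("customer")
}

print(totals_by_name)
```

Note that the nested orders have no natural place in a single flat table. In a relational database they would usually live in a second table linked to the first.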

JSON

JavaScript Object Notation (JSON) is an open standard that’s natively parsable by languages such as JavaScript, and was developed as an alternative to XML. It’s a lightweight format that also uses human-readable text, and it has become the basis of storage for some modern databases, such as MongoDB, which stores documents in a binary JSON-like format (BSON).
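Here’s a comparable sketch for JSON, using Python’s built-in json module. The records and field names are hypothetical. As is typical of semi-structured data, the two records don’t share exactly the same fields.

```python
import json

# Hypothetical records: only the first customer has an "orders" field.
json_text = """
[
  {"name": "Ada", "joined": "2021-03-01",
   "orders": [{"total": 12.5}, {"total": 7.25}]},
  {"name": "Lin", "joined": "2022-11-15"}
]
"""

customers = json.loads(json_text)

# Missing fields must be handled by the reader, since no schema enforces them.
order_totals = {
    customer["name"]: sum(order["total"] for order in customer.get("orders", []))
    for customer in customers
}

# Writing the data back out and parsing it again gives the same structure.
round_trip_ok = json.loads(json.dumps(customers)) == customers

print(order_totals, round_trip_ok)
```

The parsed result is made of ordinary lists and dictionaries, which is a large part of why JSON is convenient to work with in code.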

File formats

In this step, you learned about text-based formats for data, including delimited text, XML, and JSON, which are commonly used in data processing scenarios. The step also discussed file encoding, and the need to be aware of which encoding is used for your data files.
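As a small illustration of why encoding matters, the sketch below encodes a short string as UTF-8 and then decodes the same bytes two ways. The sample word is arbitrary, and Latin-1 is just one example of a mismatched encoding.

```python
# A word containing one non-ASCII character.
text = "café"

# In UTF-8, "é" takes two bytes, so four characters become five bytes.
utf8_bytes = text.encode("utf-8")

# Decoding with the encoding that was used to write the data works.
decoded_correctly = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding doesn't fail here; it silently
# produces garbled text instead.
decoded_wrongly = utf8_bytes.decode("latin-1")

print(len(utf8_bytes), decoded_correctly, decoded_wrongly)
```

The wrong decoder raises no error in this case, so the only real protection is knowing which encoding a file was written in.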

In addition to the text formats covered in this step, there are some specialised file formats that are optimised for high performance in the distributed storage and processing environments typical of big data scenarios. These include formats such as Avro, ORC and Parquet. To learn more about them, explore the links in the See also section below.

In the next step, we’ll introduce the encoding of data.

Join the discussion

When using CSV files, why do you think there’s an option for tab, comma, or space delimitation?


This article is from the free online

Microsoft Future Ready: Fundamentals of Big Data

Created by
FutureLearn - Learning For Life
