
Data Formats

Graeme Malcolm discusses common formats in which data may be found or stored.

In this step, we’ll learn about the common forms and file types that data can take when stored in a file.

Common Formats of Data

Unstructured data

Unstructured data has no predefined schema and isn’t organised in a predefined manner. It is typically text-heavy, but may also contain other data such as dates.
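As a rough illustration of working with text-heavy data that happens to contain dates, the sketch below pulls ISO-style dates out of a free-text note. The note text and the assumption that dates appear as YYYY-MM-DD are invented for this example; real unstructured data is rarely this tidy.

```python
import re
from datetime import date

# Hypothetical free-text note; there is no schema, only prose.
note = (
    "Customer called on 2023-05-14 about a late parcel. "
    "Refund issued 2023-05-20; follow up next month."
)

# Assumption for this sketch: dates are written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

# Turn each match into a real date object so it can be sorted or compared.
found_dates = [date(int(y), int(m), int(d)) for y, m, d in ISO_DATE.findall(note)]

print(found_dates)
```

Note that "next month" is also date information, but a simple pattern like this one cannot recover it, which is part of what makes unstructured data harder to process.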

Structured data

Delimitation

Delimited data is often found as Comma Separated Value (CSV) files. Within these files, the delimiter could be a comma, a space, or a tab. Often the top row holds the column headers, mirroring the structure of a table in a relational database.
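A minimal sketch of reading delimited text with Python's standard csv module is shown here. The sample rows and the helper name read_rows are made up for illustration; the point is that the same table can be parsed with different delimiters, and that the top row supplies the column names.

```python
import csv
import io

# Hypothetical sample data: the same table written with two delimiters.
comma_text = "name,city,joined\nAda,London,2021-03-04\nLin,Taipei,2022-11-30\n"
tab_text = comma_text.replace(",", "\t")

def read_rows(text, delimiter):
    """Parse delimited text, treating the first row as column headers."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

comma_rows = read_rows(comma_text, ",")
tab_rows = read_rows(tab_text, "\t")

print(comma_rows[0]["city"])    # London
print(comma_rows == tab_rows)   # True: same data, different delimiter
```

Every value comes back as a string, because delimited text carries no type information of its own.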

Relational Databases

Relational databases store data in a predefined schema, generally following a column/row format. We’ll go into further detail this week.
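To make the idea of a predefined schema concrete, here is a small sketch using SQLite through Python's built-in sqlite3 module. The customers table and its columns are hypothetical, and SQLite is just a convenient stand-in for any relational database.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Lin", "Taipei")],
)

# Every row has the same columns, as fixed by the schema.
rows = conn.execute(
    "SELECT id, name, city FROM customers ORDER BY id"
).fetchall()
print(rows)

# A row that breaks the schema (no value for the required name) is rejected.
try:
    conn.execute(
        "INSERT INTO customers (name, city) VALUES (?, ?)", (None, "Paris")
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
conn.close()
```

The schema is declared once, up front, and the database enforces it on every insert.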

Semi-structured data

Semi-structured data is data that doesn’t follow the tabular structure of relational databases or delimited files. The advantage is that its structure is often more flexible, but this flexibility can also lead to the problem known as garbage in, garbage out. Because the system doesn’t have to worry as much about the structure of the data, it might receive more data than is valuable, bloating the overall datastore.

Extensible Markup Language

Extensible Markup Language (XML) files were popularised by web applications that used the SOAP protocol. These files have a hierarchy rather than just rows and columns, and are considered to be at least somewhat human-readable. The structure, which programs can parse, is flexible and can change as needed.
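The hierarchy and flexibility of XML can be sketched with Python's standard xml.etree.ElementTree module. The orders document below is invented for this example: each order nests a variable number of items, and only one order carries an optional note element.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: nested elements rather than rows and columns.
xml_text = """
<orders>
  <order id="1001">
    <customer>Ada</customer>
    <items>
      <item sku="A1" quantity="2"/>
      <item sku="B7" quantity="1"/>
    </items>
  </order>
  <order id="1002">
    <customer>Lin</customer>
    <items>
      <item sku="C3" quantity="5"/>
    </items>
    <note>Gift wrap</note>
  </order>
</orders>
""".strip()

root = ET.fromstring(xml_text)

summary = {}
for order in root.findall("order"):
    # Walk down the hierarchy to every <item> under this order.
    quantities = [int(item.get("quantity")) for item in order.iter("item")]
    summary[order.get("id")] = {
        "customer": order.findtext("customer"),
        "total_items": sum(quantities),
        # findtext returns None when the optional element is absent.
        "note": order.findtext("note"),
    }

print(summary)
```

The second order has an extra element that the first lacks, yet both parse with the same code, which is the kind of flexibility a fixed table layout does not offer.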

JSON

JavaScript Object Notation (JSON) is an open standard that’s natively parsable by languages such as JavaScript, and was developed as an alternative to XML. It’s an efficient format that also uses human-readable text and has become the basis of storage for some modern databases, such as MongoDB, which stores JSON-like documents.
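As a brief sketch, the standard json module in Python parses JSON text directly into lists and dictionaries. The two records below are hypothetical and deliberately differ in shape, and city_of is a helper invented for this example to cope with that difference.

```python
import json

# Hypothetical records: both are valid JSON, but their shapes differ.
json_text = """
[
  {"name": "Ada", "city": "London", "tags": ["admin", "beta"]},
  {"name": "Lin", "address": {"city": "Taipei", "postcode": "100"}}
]
"""

records = json.loads(json_text)

def city_of(record):
    """Return the city whether it is stored at the top level or nested."""
    if "city" in record:
        return record["city"]
    return record.get("address", {}).get("city")

cities = [city_of(record) for record in records]
print(cities)  # ['London', 'Taipei']

# Serialising and parsing again gives back the same data.
round_tripped = json.loads(json.dumps(records))
print(round_tripped == records)  # True
```

Because nothing forces the two records to share a structure, the reading code has to handle the variation itself. That is the trade-off of a semi-structured format.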

File formats

In this step, you learned about text-based formats for data, including delimited text, XML, and JSON, which are commonly used in data processing scenarios. The step also discussed file encoding, and the need to be aware of which encoding system is used for your data files.

In addition to the text formats covered in this step, there are some specialised file formats that are optimised for high performance in the distributed storage and processing environments typical of big data scenarios. These include formats such as Avro, ORC and Parquet. To learn more about them, explore the links in the See also section below.

In the next step, we will introduce the encoding of data.

Join the discussion

When using CSV files, why do you think there’s an option for tab, comma, or space delimitation?


This article is from the free online course Microsoft Future Ready: Fundamentals of Big Data, created by FutureLearn.
