Skip main navigation

Merging Data Sets

Learn more about merging data sets.

Instead of simply concatenating two data sets, sometimes you might need to merge them. What’s the difference?

A concatenation means combining dataframes as additional rows or columns, regardless of the data values. However, suppose you need to integrate two separate dataframes where one contains the names of books and the other contains the authors of those books; then merging is a more powerful and flexible function to rely on. Using this function, combine categories or indices that are common to both dataframes.

Pandas has full-featured, high-performance, in-memory join operations, which are syntactically very similar to relational databases like SQL. Pandas provides a single function, merge(), as the entry point to all the standard database join operations between the dataframe objects or names series objects.

The syntax of merge() is:


pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('x', 'y'), copy=True, indicator=False, validate=None)

Parameters for the Merge() Function

  • left This defines the left-hand side set (data frame or series object).
  • __right__This defines the right-hand side set (data frame or series object).
  • on=None This defines the key (column name or index label) to join on. This key must be in both the right and left sets. The default value of this parameter is None. If this variable is not passed and both left_index and right_index are False, the operation infers the key for the join operation as the column intersection in the two data frames.
  • left_on These are the columns or index labels from the left set (dataframe or series) to use as keys.
  • right_on These are the columns or index labels from the right set (dataframe or series) to use as keys.
  • left_index If this Boolean value is True, use the index (row labels) from the left-hand set as the join keys.
  • right_index If this Boolean value is True, use the index (row labels) from the right-hand set as join keys.
  • how The possible values are ‘left’, ‘right’, ‘outer’, and ‘inner’ (the default). As the name suggests, it relates to left, right, outer, and inner joins.

Refer to this documentation for a detailed list of merge() parameters and their uses.

Go to: Merge, join, concatenate and compare [1].

References

1. Pandas User Guide: Merge, join, concatenate and compare. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

This article is from the free online

Data Wrangling and Ingestion using Python

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education