Skip main navigation

New offer! Get 30% off your first 2 months of Unlimited Monthly. Start your subscription for just £29.99 £19.99. New subscribers only. T&Cs apply

Find out more

Merging Data Sets

Learn more about merging data sets.

Instead of simply concatenating two data sets, sometimes you might need to merge them. What’s the difference?

A concatenation means combining dataframes as additional rows or columns, regardless of the data values. However, suppose you need to integrate two separate dataframes where one contains the names of books and the other contains the authors of those books; then merging is a more powerful and flexible function to rely on. Using this function, combine categories or indices that are common to both dataframes.

Pandas has full-featured, high-performance, in-memory join operations, which are syntactically very similar to relational databases like SQL. Pandas provides a single function, merge(), as the entry point to all the standard database join operations between the dataframe objects or names series objects.

The syntax of merge() is:


pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('x', 'y'), copy=True, indicator=False, validate=None)

Parameters for the Merge() Function

  • left This defines the left-hand side set (data frame or series object).
  • __right__This defines the right-hand side set (data frame or series object).
  • on=None This defines the key (column name or index label) to join on. This key must be in both the right and left sets. The default value of this parameter is None. If this variable is not passed and both left_index and right_index are False, the operation infers the key for the join operation as the column intersection in the two data frames.
  • left_on These are the columns or index labels from the left set (dataframe or series) to use as keys.
  • right_on These are the columns or index labels from the right set (dataframe or series) to use as keys.
  • left_index If this Boolean value is True, use the index (row labels) from the left-hand set as the join keys.
  • right_index If this Boolean value is True, use the index (row labels) from the right-hand set as join keys.
  • how The possible values are ‘left’, ‘right’, ‘outer’, and ‘inner’ (the default). As the name suggests, it relates to left, right, outer, and inner joins.

Refer to this documentation for a detailed list of merge() parameters and their uses.

Go to: Merge, join, concatenate and compare [1].

References

1. Pandas User Guide: Merge, join, concatenate and compare. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

This article is from the free online

Data Wrangling and Ingestion using Python

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now