Why pandas merge on multiple columns gives wrong result (and how to fix it)

Incorrect results in pandas merge on multiple columns usually appear in real-world datasets coming from SQL exports, logs, or APIs, where duplicate values exist in the merge columns. This leads to a Cartesian product, often silently breaking downstream logic.


Quick Answer

Pandas merge on multiple columns gives wrong result when duplicate keys exist, creating a Cartesian product. Fix by removing or aggregating duplicates before merging.

TL;DR

  • Merge on multiple columns multiplies rows when duplicates exist
  • Expected behavior, not a pandas bug
  • Always validate merge cardinality explicitly
  • Aggregate or deduplicate before merging

Problem Example

import pandas as pd

df1 = pd.DataFrame({'id': [1,2], 'id2': [10,20], 'val': [100,200]})
df2 = pd.DataFrame({'id': [1,1,2], 'id2': [10,10,20], 'amt': [30,40,50]})
print(f"df1: {len(df1)} rows, df2: {len(df2)} rows")
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left')
print(f"merged: {len(merged)} rows")
print(merged)
# Output: 3 rows instead of expected 2

Root Cause Analysis

Duplicate values in the merge columns of either DataFrame cause pandas to emit one output row for every matching pair of left and right rows, producing a Cartesian product per key. This behavior is identical to SQL JOIN semantics and often surprises developers. Related factors:

  • Multiple rows with same key in either DataFrame
  • One-to-many relationship not handled
  • No validation on key uniqueness
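The multiplication can be predicted arithmetically: for each key combination, the number of matched output rows is the product of that key's count in each DataFrame. A quick sketch using the example frames from above:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'id2': [10, 20], 'val': [100, 200]})
df2 = pd.DataFrame({'id': [1, 1, 2], 'id2': [10, 10, 20], 'amt': [30, 40, 50]})

keys = ['id', 'id2']
# Count occurrences of each key combination in both frames
left_counts = df1.groupby(keys).size()
right_counts = df2.groupby(keys).size()

# Each key contributes (left count) * (right count) matched rows to the join
expected = (left_counts * right_counts).dropna().astype(int).sum()
print(f"Predicted matched rows: {expected}")  # 1*2 + 1*1 = 3
```

If the predicted count differs from `len(df1)`, you know before merging that rows will multiply.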

How to Detect This Issue

# Check for duplicates in merge columns
dup_count_df1 = df1.duplicated(subset=['id', 'id2']).sum()
dup_count_df2 = df2.duplicated(subset=['id', 'id2']).sum()
print(f'Duplicates in df1: {dup_count_df1}, Duplicates in df2: {dup_count_df2}')

# Show duplicate keys
if dup_count_df1 > 0 or dup_count_df2 > 0:
    print(df1[df1.duplicated(subset=['id', 'id2'], keep=False)])
    print(df2[df2.duplicated(subset=['id', 'id2'], keep=False)])
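You can also inspect what a merge actually did after the fact with `indicator=True`, which adds a `_merge` column tagging each output row's origin; a sketch with the example frames:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'id2': [10, 20], 'val': [100, 200]})
df2 = pd.DataFrame({'id': [1, 1, 2], 'id2': [10, 10, 20], 'amt': [30, 40, 50]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left', indicator=True)
print(merged['_merge'].value_counts())

# Rows per original key reveal the multiplication directly
print(merged.groupby(['id', 'id2']).size())  # key (1, 10) appears twice
```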

Solutions

Solution 1: Remove duplicates before merge

df1_clean = df1.drop_duplicates(subset=['id', 'id2'], keep='first')
df2_clean = df2.drop_duplicates(subset=['id', 'id2'], keep='first')
merged = pd.merge(df1_clean, df2_clean, on=['id', 'id2'], how='left')

Solution 2: Aggregate duplicates

df1_agg = df1.groupby(['id', 'id2'])['val'].sum().reset_index()
df2_agg = df2.groupby(['id', 'id2'])['amt'].sum().reset_index()
merged = pd.merge(df1_agg, df2_agg, on=['id', 'id2'], how='left')

Solution 3: Validate during merge

merged = pd.merge(df1, df2, on=['id', 'id2'], how='left', validate='one_to_one')
# Raises MergeError if duplicates exist
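In practice you usually want to catch the error and fail with a useful message rather than a raw traceback; a minimal sketch using the example frames:

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'id': [1, 2], 'id2': [10, 20], 'val': [100, 200]})
df2 = pd.DataFrame({'id': [1, 1, 2], 'id2': [10, 10, 20], 'amt': [30, 40, 50]})

try:
    merged = pd.merge(df1, df2, on=['id', 'id2'], how='left',
                      validate='one_to_one')
except MergeError as exc:
    # df2 has duplicate (1, 10) keys, so the one-to-one check fails here
    print(f"Merge validation failed: {exc}")
```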

Why validate='one_to_one' Raises MergeError

Using validate='one_to_one' will raise a MergeError when duplicate keys exist in either DataFrame. This is not a bug — it is pandas protecting you from a join that would silently multiply rows. If the relationship is expected to be one-to-many (unique keys on the left, duplicates allowed on the right), use validate='one_to_many'; for the reverse, use validate='many_to_one'. For many-to-many relationships, explicitly aggregate before merging.
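In the example frames, df1's keys are unique and df2's are duplicated, so the left-to-right relationship is one-to-many and the corresponding validation passes:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'id2': [10, 20], 'val': [100, 200]})
df2 = pd.DataFrame({'id': [1, 1, 2], 'id2': [10, 10, 20], 'amt': [30, 40, 50]})

# Keys are unique in df1 but duplicated in df2: a one-to-many relationship.
# validate='one_to_many' accepts this and the merge succeeds -- still
# producing 3 rows, since validation documents intent, it does not dedupe.
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left',
                  validate='one_to_many')
print(len(merged))  # 3
```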

Production-Safe Pattern

# Validate cardinality up front, then guard the row count after the fact
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left', validate='one_to_one')
assert len(merged) == len(df1), 'Merge created unexpected rows'
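For repeated use, the pattern above can be wrapped in a small helper; `safe_merge` below is a hypothetical name for illustration, not part of the pandas API:

```python
import pandas as pd

def safe_merge(left, right, keys, how='left', validate='one_to_one'):
    """Merge with cardinality validation and a row-count guard.

    safe_merge is an illustrative helper, not a pandas function.
    """
    merged = pd.merge(left, right, on=keys, how=how, validate=validate)
    # For a left join, the result should have exactly as many rows as `left`
    if how == 'left' and len(merged) != len(left):
        raise ValueError(
            f"Merge changed row count: {len(left)} -> {len(merged)}"
        )
    return merged

# Usage: passes when both frames have unique keys
a = pd.DataFrame({'id': [1, 2], 'id2': [10, 20], 'val': [100, 200]})
b = pd.DataFrame({'id': [1, 2], 'id2': [10, 20], 'amt': [30, 50]})
result = safe_merge(a, b, keys=['id', 'id2'])
print(len(result))  # 2
```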

Wrong Fixes That Make Things Worse

❌ Dropping duplicates after the merge: This hides the symptom but corrupts your data

❌ Using outer join 'to be safe': This introduces NaN rows and breaks assumptions

❌ Ignoring row count changes: Always assert expected row counts after merge

Common Mistakes to Avoid

  • Not checking for duplicates in merge columns before merge
  • Using merge without understanding cardinality
  • Ignoring the validate parameter

Frequently Asked Questions

Q: Why does pandas merge create more rows than expected?

When duplicates exist in the merge columns, pandas creates a row for each combination (Cartesian product).

Q: Is this a pandas bug?

No. This behavior follows SQL join semantics. Pandas is correctly preserving relational cardinality.

Q: How do I prevent duplicate rows in pandas merge?

Remove duplicates from DataFrames using drop_duplicates() before merging, or use the validate parameter to catch the issue early.

Related Fixes

  • Fix pandas merge raises MergeError
  • Fix pandas left join returns unexpected rows
  • Fix pandas merge using index gives wrong result
  • Fix pandas merge suffixes not working

Next Steps

After fixing this issue:

  1. Validate your merge with the validate parameter
  2. Add unit tests to catch similar issues
  3. Review related merge problems above