Why pandas merge on multiple columns gives wrong result (and how to fix it)
Incorrect results in pandas merge on multiple columns usually appear in real-world datasets coming from SQL exports, logs, or APIs, where duplicate values exist in the merge columns. This leads to a Cartesian product, often silently breaking downstream logic.
Quick Answer
Pandas merge on multiple columns gives wrong result when duplicate keys exist, creating a Cartesian product. Fix by removing or aggregating duplicates before merging.
TL;DR
- Merge on multiple columns multiplies rows when duplicates exist
- Expected behavior, not a pandas bug
- Always validate merge cardinality explicitly
- Aggregate or deduplicate before merging
Problem Example
import pandas as pd
df1 = pd.DataFrame({'id': [1,2], 'id2': [10,20], 'val': [100,200]})
df2 = pd.DataFrame({'id': [1,1,2], 'id2': [10,10,20], 'amt': [30,40,50]})
print(f"df1: {len(df1)} rows, df2: {len(df2)} rows")
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left')
print(f"merged: {len(merged)} rows")
print(merged)
# Output: 4 rows instead of expected 2
Root Cause Analysis
Duplicate values in the merge columns of either DataFrame cause pandas to perform a cross join for each matching combination, creating a Cartesian product. This behavior is identical to SQL JOIN semantics and often surprises developers. Related factors:
- Multiple rows with same key in either DataFrame
- One-to-many relationship not handled
- No validation on key uniqueness
How to Detect This Issue
# Check for duplicates in merge columns
dup_count_df1 = df1.duplicated(subset=['id', 'id2']).sum()
dup_count_df2 = df2.duplicated(subset=['id', 'id2']).sum()
print(f'Duplicates in df1: {dup_count_df1}, Duplicates in df2: {dup_count_df2}')
# Show duplicate keys
if dup_count_df1 > 0 or dup_count_df2 > 0:
print(df1[df1.duplicated(subset=['id', 'id2'], keep=False)])
print(df2[df2.duplicated(subset=['id', 'id2'], keep=False)])
Solutions
Solution 1: Remove duplicates before merge
df1_clean = df1.drop_duplicates(subset=['id', 'id2'], keep='first')
df2_clean = df2.drop_duplicates(subset=['id', 'id2'], keep='first')
merged = pd.merge(df1_clean, df2_clean, on=['id', 'id2'], how='left')
Solution 2: Aggregate duplicates
df1_agg = df1.groupby(['id', 'id2'])['val'].sum().reset_index()
df2_agg = df2.groupby(['id', 'id2'])['amt'].sum().reset_index()
merged = pd.merge(df1_agg, df2_agg, on=['id', 'id2'], how='left')
Solution 3: Validate during merge
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left', validate='one_to_one')
# Raises MergeError if duplicates exist
Why validate Parameter Fails
Using validate='one_to_one' will raise a MergeError when duplicate keys exist in either DataFrame. This is not a bug — it is pandas protecting you from a many-to-one join that would silently multiply rows. If the relationship is expected to be one-to-many, use validate='one_to_many'. For many-to-one use validate='many_to_one'. For many-to-many, explicitly aggregate before merge.
Production-Safe Pattern
merged = pd.merge(df1, df2, on=['id', 'id2'], how='left', validate='one_to_one')
assert len(merged) == len(df1), 'Merge created unexpected rows'
Wrong Fixes That Make Things Worse
❌ Dropping duplicates after the merge: This hides the symptom but corrupts your data
❌ Using outer join ’to be safe’: This introduces NaN rows and breaks assumptions
❌ Ignoring row count changes: Always assert expected row counts after merge
Common Mistakes to Avoid
- Not checking for duplicates in merge columns before merge
- Using merge without understanding cardinality
- Ignoring the validate parameter
Frequently Asked Questions
Q: Why does pandas merge create more rows than expected?
When duplicates exist in the merge columns, pandas creates a row for each combination (Cartesian product).
Q: Is this a pandas bug?
No. This behavior follows SQL join semantics. Pandas is correctly preserving relational cardinality.
Q: How do I prevent duplicate rows in pandas merge?
Remove duplicates from DataFrames using drop_duplicates() before merging, or use the validate parameter to catch the issue early.
Related Issues
→ Fix pandas merge raises MergeError → Fix pandas left join returns unexpected rows → Fix pandas merge using index gives wrong result → Fix pandas merge suffixes not working
Next Steps
After fixing this issue:
- Validate your merge with the
validateparameter - Add unit tests to catch similar issues
- Review related merge problems above