Why pandas drop_duplicates keeps the wrong row (and how to fix it)
Incorrect row retention in pandas drop_duplicates usually appears in real-world datasets from SQL exports or APIs, where several rows share a key but differ in other columns, or arrive in an order you did not expect. The wrong row is then kept, often silently breaking downstream logic.
Quick Answer
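Pandas drop_duplicates keeps the first occurrence of each duplicate by default (keep='first'), compares every column unless subset is given, and ignores the index. If the wrong row survives, sort the DataFrame into the order you need and set subset and keep explicitly.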
TL;DR
- drop_duplicates keeps the first occurrence in row order by default; the index is ignored
- Specify keep='first' or keep='last' to choose which occurrence is retained
- Or use keep=False to drop all duplicates
Problem Example
import pandas as pd
df = pd.DataFrame({'id': [1,2,2,3,3,3], 'val': [10,20,30,40,50,60]})
print(f"Original: {len(df)} rows")
df_drop = df.drop_duplicates()
print(f"After drop_duplicates: {len(df_drop)} rows")
print(df_drop)
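# No rows are removed: all columns are compared and every (id, val) pair is unique.
# With subset='id', the first row for each id (val 10, 20, 40) would be kept.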
Root Cause Analysis
drop_duplicates compares column values, not index labels, so duplicate index values play no role. By default it keeps the first occurrence of each duplicate in the current row order (keep='first') and compares every column. Which row survives therefore depends on row order and on the columns compared, and pandas leaves both choices to the user. Related factors:
- Duplicate rows in the DataFrame
- Default keep='first' parameter
- Not specifying the keep parameter
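A small sketch can make the role of row order and index labels concrete. The data below is hypothetical: two rows hold identical values under different index labels, and two different rows share the label 'a'.

```python
import pandas as pd

# Hypothetical data: rows 'a' and 'b' hold identical values,
# while the last row reuses the index label 'a' with different values.
df_demo = pd.DataFrame(
    {'id': [1, 1, 2], 'val': [10, 10, 20]},
    index=['a', 'b', 'a'],
)

result = df_demo.drop_duplicates()

# Row 'b' repeats the values of the first row, so it is dropped (keep='first').
# The second 'a' row survives even though its index label is repeated.
print(result)
print(result.index.tolist())
```

Only the column values decide what counts as a duplicate here; the repeated index label has no effect.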
How to Detect This Issue
# Check for duplicates in DataFrame
dup_count = df.duplicated().sum()
print(f'Duplicates in DF: {dup_count}')
# Show duplicate rows
if dup_count > 0:
    print(df[df.duplicated(keep=False)])
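Full-row checks can report zero duplicates even when a key column repeats. The sketch below, which assumes 'id' is the column that should be unique, compares the two counts on the article's example data.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})

# Full-row comparison: every (id, val) pair is unique, so nothing is flagged.
full_row_dups = int(df.duplicated().sum())

# Key-only comparison: every row after the first for each id is flagged.
key_dups = int(df.duplicated(subset='id').sum())

# Row counts for the ids that occur more than once.
rows_per_id = df['id'].value_counts()
repeated_ids = rows_per_id[rows_per_id > 1].sort_index()

print(f'Full-row duplicates: {full_row_dups}, duplicates on id: {key_dups}')
print(repeated_ids)
```

If the two counts differ, a plain drop_duplicates() call will leave repeated keys in place.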
Solutions
Solution 1: Specify keep='first' (the default, made explicit)
df_drop_first = df.drop_duplicates(keep='first')
Solution 2: Specify keep=False
df_drop_all = df.drop_duplicates(keep=False)
Solution 3: Use subset parameter
df_drop_subset = df.drop_duplicates(subset='id', keep='first')
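When the row you want is defined by a value rather than by its position, one common approach is to sort first and then deduplicate. The sketch below assumes, purely for illustration, that the row with the largest 'val' per 'id' is the one to keep.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})

# Assumption: the row with the largest 'val' is the one worth keeping.
# Sort so that row comes last within each id, then keep the last occurrence.
best_per_id = (
    df.sort_values('val', kind='stable')
      .drop_duplicates(subset='id', keep='last')
      .sort_index()  # restore the original row order
)
print(best_per_id)
```

Sorting makes the choice of survivor explicit instead of leaving it to whatever order the export or API happened to return.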
How the keep Parameter Behaves
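Using keep='first' retains the first occurrence of each duplicate in row order, and keep='last' retains the last. If you want to drop all duplicates, use keep=False, which removes every row that has a duplicate so that no copy survives. Setting keep explicitly prevents silent retention of a row you did not intend to keep.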
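The three keep options can be compared side by side. This sketch runs each one against the example data, deduplicating on 'id' (an assumption about which column is the key) and collecting the surviving 'val' values.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})

# Surviving 'val' values for each keep option, deduplicating on 'id'.
kept = {
    keep: df.drop_duplicates(subset='id', keep=keep)['val'].tolist()
    for keep in ('first', 'last', False)
}

for keep, vals in kept.items():
    print(f'keep={keep!r}: {vals}')
```

With keep=False only id 1 survives, because it is the only id that never repeats.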
Production-Safe Pattern
df_drop = df.drop_duplicates(keep='first')
# Verify the result: no fully duplicated rows may remain
assert not df_drop.duplicated().any(), 'Duplicate rows remain after drop_duplicates'
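For pipelines, it can help to wrap the call so the checks travel with it. The helper below is a hypothetical sketch, not a pandas API: the name drop_duplicates_checked and its return shape are assumptions made for illustration.

```python
import pandas as pd

def drop_duplicates_checked(df, subset, keep='first'):
    """Hypothetical wrapper: deduplicate on subset and verify the result."""
    if keep not in ('first', 'last'):
        raise ValueError("keep must be 'first' or 'last'")
    out = df.drop_duplicates(subset=subset, keep=keep)
    removed = len(df) - len(out)
    # No combination of the subset columns may repeat after deduplication.
    assert not out.duplicated(subset=subset).any(), 'duplicates remain'
    # Exactly one row must survive for every distinct key in the input.
    n_keys = df.groupby(subset, dropna=False).ngroups
    assert len(out) == n_keys, 'row count does not match number of keys'
    return out, removed

df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})
clean, removed = drop_duplicates_checked(df, subset='id')
print(f'{len(df)} rows in, {len(clean)} rows out, {removed} removed')
```

Returning the number of removed rows makes it easy to log or alert when an upstream source suddenly starts sending more duplicates than usual.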
Wrong Fixes That Make Things Worse
❌ Not using the keep parameter: This leaves the default behavior, which may not be what you want
❌ Using drop_duplicates without checking for duplicates: This wastes computation if there are no duplicates
❌ Not specifying the subset parameter: Every column is then compared, so rows that share a key but differ in any other column are not treated as duplicates and all of them are kept
Common Mistakes to Avoid
- Not specifying the keep parameter
- Relying on the default keep='first' without understanding its implications
- Not checking for duplicates before calling drop_duplicates
Frequently Asked Questions
Q: Why does pandas drop_duplicates keep the wrong row?
By default pandas keeps the first occurrence of each duplicate in the current row order and ignores the index. If that is not the row you want, sort the DataFrame first or change the keep parameter, and use subset to limit which columns are compared.
Q: Is this a pandas bug?
No. This is documented behavior: drop_duplicates keeps the first occurrence unless told otherwise, and pandas leaves it to the user to specify which row to keep.
Q: How do I keep the first occurrence of each duplicate?
Pass keep='first' to drop_duplicates. This is also the default.
Related Issues
→ Fix pandas merge using index gives wrong result
→ Fix pandas pivot_table returns unexpected results
→ Fix pandas merge many to many duplicates rows
→ Fix pandas left join returns unexpected rows
Next Steps
After fixing this issue:
- Validate any downstream merge with the validate parameter of DataFrame.merge
- Add unit tests to catch similar issues
- Review related merge problems above