Why pandas drop_duplicates keeps the wrong row (and how to fix it)

Incorrect row retention with pandas drop_duplicates usually shows up in real-world datasets from SQL exports or APIs, where several rows share a key column (such as id) but differ slightly in other columns, or where you expect the most recent record to survive. Because drop_duplicates compares every column and keeps the first occurrence by default, the row you wanted can be silently discarded or kept, breaking downstream logic.


Quick Answer

Pandas drop_duplicates keeps the "wrong" row when you rely on its defaults: it treats a row as a duplicate only if every column matches, and it retains the first occurrence. Fix it by passing subset= to name the key columns and keep= to choose which occurrence survives.
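A minimal sketch, assuming 'id' is your key column (swap in whatever uniquely identifies a record):

df = df.drop_duplicates(subset='id', keep='first')  # one row per id, earliest occurrence wins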

TL;DR

  • drop_duplicates compares all columns by default, so rows that share an id but differ elsewhere are not removed
  • Pass subset='id' (or your key columns) to deduplicate on a key
  • keep='first' is the default; use keep='last' to retain the latest occurrence, or keep=False to drop every duplicated row

Problem Example

import pandas as pd

df = pd.DataFrame({'id': [1,2,2,3,3,3], 'val': [10,20,30,40,50,60]})
print(f"Original: {len(df)} rows")
df_drop = df.drop_duplicates()  # compares ALL columns by default
print(f"After drop_duplicates: {len(df_drop)} rows")
print(df_drop)
# Still 6 rows: the repeated ids have different 'val' values,
# so no row is a full duplicate and nothing is removed

Root Cause Analysis

drop_duplicates compares every column and keeps the first occurrence of each fully identical row. It never looks at the index, and it does not know which column is your key. So when repeated ids carry different values elsewhere, nothing is removed at all; and when identical rows do exist, the earliest one survives, which may be the stale record. The behavior mirrors SQL's SELECT DISTINCT semantics, where deciding which row should represent a group is left to the user. Related factors (illustrated in the sketch below):

  • Rows that share a key column but differ in other columns
  • Relying on the default keep='first' when the latest record is the one you want
  • Not passing the subset parameter to name the key columns
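Using the example frame from above ('id' and 'val' are its columns), the difference looks like this:

# Default: all columns compared, so nothing is removed
print(df.drop_duplicates())                           # still 6 rows

# Deduplicate on the key only: the first row per id survives
print(df.drop_duplicates(subset='id'))                # val 10, 20, 40

# Keep the last row per id instead
print(df.drop_duplicates(subset='id', keep='last'))   # val 10, 30, 60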

How to Detect This Issue

# Full-row duplicates (every column identical)
dup_count = df.duplicated().sum()
print(f'Fully duplicated rows: {dup_count}')

# Key duplicates (same id, possibly different values elsewhere)
key_dups = df.duplicated(subset='id').sum()
print(f'Duplicate ids: {key_dups}')

# Show every row involved in a key collision
if key_dups > 0:
    print(df[df.duplicated(subset='id', keep=False)])
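If you also want to see how often each key repeats, value_counts gives a quick picture (a small sketch; 'id' is the key column from the example):

counts = df['id'].value_counts()
print(counts[counts > 1])  # only the ids that appear more than once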

Solutions

Solution 1: Choose which occurrence survives with keep

df_drop_first = df.drop_duplicates(keep='first')  # default: the earliest row wins
df_drop_last = df.drop_duplicates(keep='last')    # keep the latest row instead

Solution 2: Specify keep=False

df_drop_all = df.drop_duplicates(keep=False)  # every row that has a duplicate is removed entirely

Solution 3: Use subset parameter

df_drop_subset = df.drop_duplicates(subset='id', keep='first')  # deduplicate on the key column only
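If "the right row" means the most recently updated record, a common pattern is to sort by a timestamp before deduplicating. This sketch assumes a hypothetical 'updated_at' column that is not part of the example frame above:

import pandas as pd

events = pd.DataFrame({
    'id': [1, 1, 2],
    'val': [10, 15, 20],
    'updated_at': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-01-15']),
})

# sort so the newest record comes last, then keep the last row per id
latest = events.sort_values('updated_at').drop_duplicates(subset='id', keep='last')
print(latest)  # id 1 -> val 15, id 2 -> val 20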

Why Relying on the Default Fails

keep='first' silently retains the earliest occurrence of each duplicate, which is often the stale or partial record in exports that append corrections at the end. Use keep='last' when the latest occurrence is the one you trust, or keep=False to drop all conflicting rows and force an explicit resolution. Either choice prevents silently keeping the wrong row.
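A quick illustration with the example frame from above:

# Default keep='first': the earliest row per id wins
print(df.drop_duplicates(subset='id'))               # id 2 -> val 20, id 3 -> val 40

# keep=False: every conflicting id is dropped, only id 1 remains
print(df.drop_duplicates(subset='id', keep=False))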

Production-Safe Pattern

df_drop = df.drop_duplicates(subset='id', keep='first')
assert df_drop['id'].is_unique, 'Deduplication left duplicate ids behind'
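A slightly fuller variant that also records how many rows were dropped (a sketch; the logging setup is an assumption, not part of the original pattern):

import logging

before = len(df)
df_drop = df.drop_duplicates(subset='id', keep='first')
logging.getLogger(__name__).info('drop_duplicates removed %d of %d rows', before - len(df_drop), before)
assert df_drop['id'].is_unique, 'Deduplication left duplicate ids behind'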

Wrong Fixes That Make Things Worse

❌ Relying on the default keep='first' without thinking about it: the earliest row silently wins, which may be the stale record

❌ Calling drop_duplicates without inspecting the duplicates first: you never learn which rows were discarded or why they conflicted

❌ Omitting the subset parameter: all columns are compared, so rows with the same key but slightly different values are not removed at all

Common Mistakes to Avoid

  • Not stating which occurrence you want with the keep parameter
  • Assuming the default keeps the latest record (it keeps the first)
  • Not checking for duplicates before calling drop_duplicates

Frequently Asked Questions

Q: Why does pandas drop_duplicates keep the wrong row?

Because the defaults rarely match expectations: drop_duplicates compares every column, not just your key, and keeps the first occurrence of each identical row. Pass subset= to name the key columns and keep= to choose which occurrence survives.

Q: Is this a pandas bug?

No. Comparing whole rows and keeping one representative mirrors SQL's SELECT DISTINCT semantics; pandas leaves it to you to say which columns define a duplicate (subset) and which occurrence to keep (keep).

Q: How do I keep the first occurrence of each duplicate?

keep='first' is the default, so df.drop_duplicates(subset='id') already retains the first occurrence of each id. Pass keep='first' explicitly if you want that intent visible in the code.

Related Guides

  • Fix pandas merge using index gives wrong result
  • Fix pandas pivot_table returns unexpected results
  • Fix pandas merge many to many duplicates rows
  • Fix pandas left join returns unexpected rows

Next Steps

After fixing this issue:

  1. If you merge on the deduplicated key afterwards, pass validate='one_to_one' (or 'one_to_many') so pandas raises when duplicates slip through
  2. Add unit tests to catch similar issues (see the sketch below)
  3. Review the related guides listed above
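A minimal pytest-style test (a sketch; load_and_dedupe is a hypothetical stand-in for your own pipeline step):

import pandas as pd

def load_and_dedupe() -> pd.DataFrame:
    # hypothetical loader; replace with your own pipeline step
    raw = pd.DataFrame({'id': [1, 2, 2, 3], 'val': [10, 20, 30, 40]})
    return raw.drop_duplicates(subset='id', keep='first')

def test_ids_are_unique_and_first_row_kept():
    df = load_and_dedupe()
    assert df['id'].is_unique
    assert df.loc[df['id'] == 2, 'val'].iloc[0] == 20  # the first row for id 2 survived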