Fix pandas drop_duplicates keeps wrong row

Why pandas drop_duplicates keeps the wrong row (and how to fix it)

Unexpected row retention in pandas drop_duplicates usually appears in real-world datasets from SQL exports or APIs, where several rows share a key but differ in other columns. drop_duplicates compares column values (the index is ignored) and keeps the first occurrence by default, so the row that survives depends on row order. This often breaks downstream logic silently.

Quick Answer

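Pandas drop_duplicates compares column values, not the index, and keeps the first occurrence of each duplicate by default (keep='first'). The wrong row survives when the row you wanted comes later in the DataFrame, and nothing is dropped when rows share a key but differ in another column. Fix it by sorting into the order you want and passing subset and keep explicitly.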

TL;DR

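drop_duplicates keeps the first occurrence by default, which is the wrong row when the one you want comes later
Specify keep='last' to retain the last occurrence, or sort so the row you want comes first
Or use keep=False to drop every row that has a duplicate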

Problem Example

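import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})
print(f"Original: {len(df)} rows")

# No two rows match on every column, so the default call removes nothing
df_drop = df.drop_duplicates()
print(f"After drop_duplicates: {len(df_drop)} rows")

# Deduplicating on 'id' keeps the first row per id (val 10, 20, 40), not the last
df_drop = df.drop_duplicates(subset='id')
print(df_drop)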

Root Cause Analysis

drop_duplicates ignores the index and compares column values, using every column unless subset is given. Among rows that match, pandas retains the first occurrence by default (keep='first'). Pandas cannot know which duplicate is the correct one, so the choice is left to the user through row order and the keep parameter. Related factors:

Duplicate rows in the DataFrame
Default keep='first' parameter
Not specifying the keep parameter
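
To make the root cause concrete, here is a minimal sketch on made-up data (the frame df_demo and its index labels are assumptions for illustration). It shows that repeated index labels play no part in the comparison and that the default call behaves the same as keep='first'.

```python
import pandas as pd

# Hypothetical data: the index label 'a' repeats, and two rows match on every column
df_demo = pd.DataFrame(
    {'id': [1, 2, 2, 3], 'val': [10, 20, 20, 40]},
    index=['a', 'a', 'b', 'c'],
)

# The rows (2, 20) at labels 'a' and 'b' match on every column, so the later one is dropped.
# The two rows labelled 'a' differ in their values, so both are kept.
default_result = df_demo.drop_duplicates()
explicit_first = df_demo.drop_duplicates(keep='first')

print(default_result)
print(default_result.equals(explicit_first))  # True: keep='first' is the default
```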

How to Detect This Issue

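# Count rows that duplicate an earlier row on every column
dup_count = df.duplicated().sum()
print(f'Duplicates in DF: {dup_count}')

# Count rows that repeat an earlier 'id' (key-level duplicates)
key_dup_count = df.duplicated(subset='id').sum()
print(f'Duplicate ids: {key_dup_count}')

# Show every row involved in a duplicated id
if key_dup_count > 0:
    print(df[df.duplicated(subset='id', keep=False)])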
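
A per-key count can also help when deciding which rows need attention. The sketch below assumes the same id/val layout as the problem example; the names df_keys, counts and repeated are illustrative only.

```python
import pandas as pd

df_keys = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})

# Number of rows per id
counts = df_keys.groupby('id').size()

# Keep only the ids that occur more than once
repeated = counts[counts > 1]
print(repeated)
```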

Solutions

Solution 1: Specify keep='first'

# Same result as the default, but explicit; use keep='last' to retain the last occurrence instead
df_drop_first = df.drop_duplicates(keep='first')

Solution 2: Specify keep=False

# Drops every row that has a duplicate, so no copy survives
df_drop_all = df.drop_duplicates(keep=False)

Solution 3: Use subset parameter

# Treat rows with the same 'id' as duplicates and keep the first one
df_drop_subset = df.drop_duplicates(subset='id', keep='first')
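
When the row to keep is defined by another column, such as the most recent record, sorting before deduplicating makes the choice explicit. This is a minimal sketch; the updated_at column and the df_events frame are assumptions, not part of the original example.

```python
import pandas as pd

# Hypothetical data: 'updated_at' is an assumed timestamp column
df_events = pd.DataFrame({
    'id': [1, 2, 2, 3, 3],
    'val': [10, 20, 30, 40, 50],
    'updated_at': pd.to_datetime(
        ['2024-01-01', '2024-01-05', '2024-01-02', '2024-01-03', '2024-01-04']
    ),
})

# Sort oldest to newest, then keep the last (newest) row for each id
latest = (
    df_events.sort_values('updated_at')
    .drop_duplicates(subset='id', keep='last')
    .sort_values('id')
)
print(latest)
```

For id 2 the surviving row is val 20, because it has the later timestamp even though it appears first in the original order.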

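Choosing the keep Parameter

keep='first', the default, retains the first occurrence of each duplicate, and keep='last' retains the last. If you want to drop all duplicates, use keep=False, which removes every row that has a duplicate so that no copy survives. Passing keep explicitly documents which row you intend to retain.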
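
The three keep options can be compared side by side. The sketch below reuses the id/val data from the problem example and deduplicates on id; the names df_keep and kept are illustrative.

```python
import pandas as pd

df_keep = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'val': [10, 20, 30, 40, 50, 60]})

# Collect the surviving 'val' values for each keep option
kept = {
    keep: df_keep.drop_duplicates(subset='id', keep=keep)['val'].tolist()
    for keep in ('first', 'last', False)
}

for keep, vals in kept.items():
    print(f'keep={keep!r}: {vals}')
# keep='first': [10, 20, 40]
# keep='last': [10, 30, 60]
# keep=False: [10]
```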

Production-Safe Pattern

df_drop = df.drop_duplicates(subset='id', keep='first')
assert df_drop['id'].is_unique, 'Duplicate ids remain after drop_duplicates'
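
One way to package this pattern is a small helper that sorts, deduplicates and then verifies the key. The function name dedupe_on_key and its parameters are hypothetical, and the sketch is only one possible shape for such a check.

```python
import pandas as pd

def dedupe_on_key(df, key, order_by=None, keep='first'):
    """Hypothetical helper: deduplicate on `key`, then verify the key is unique."""
    if key not in df.columns:
        raise KeyError(f'Column {key!r} not found')
    if order_by is not None:
        # A stable sort preserves the original order among ties
        df = df.sort_values(order_by, kind='stable')
    result = df.drop_duplicates(subset=key, keep=keep)
    # With keep='first' or 'last' exactly one row per key should remain
    if not result[key].is_unique:
        raise ValueError(f'Duplicate values remain in {key!r}')
    return result

df_raw = pd.DataFrame({'id': [1, 2, 2], 'val': [10, 30, 20]})
df_clean = dedupe_on_key(df_raw, 'id', order_by='val')
print(df_clean)
```

Raising an exception rather than using assert keeps the check active when Python runs with the -O flag, which strips assert statements.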

Wrong Fixes That Make Things Worse

❌ Not using the keep parameter: This leaves the default keep='first', which retains whichever duplicate comes first in the current row order and may not be the row you want

❌ Using drop_duplicates without checking for duplicates: This wastes computation if there are no duplicates

❌ Not specifying the subset parameter: Without subset, rows count as duplicates only when every column matches, so rows that share a key but differ in another column are all kept

Common Mistakes to Avoid

Not specifying the keep parameter
Assuming the default is keep='last' when it is actually keep='first'
Not checking for duplicates before calling drop_duplicates

Frequently Asked Questions

Q: Why does pandas drop_duplicates keep the wrong row?

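drop_duplicates compares column values and retains the first occurrence of each duplicate by default. If the row you wanted appears later in the DataFrame, it is the one that gets dropped. You can change this behavior by sorting first and by specifying the keep and subset parameters.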
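
If the duplicates you care about are in the index rather than in the columns, drop_duplicates will not remove them, because it only looks at column values. A minimal sketch of one alternative uses Index.duplicated to build a mask; the frame df_idx is made up for illustration.

```python
import pandas as pd

# Hypothetical data: the index label 'a' repeats, but all 'val' values differ
df_idx = pd.DataFrame({'val': [10, 20, 30]}, index=['a', 'a', 'b'])

# drop_duplicates sees three distinct rows and removes nothing
unchanged = df_idx.drop_duplicates()
print(len(unchanged))

# Mask out repeated index labels, keeping the first row for each label
by_index = df_idx[~df_idx.index.duplicated(keep='first')]
print(by_index)
```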

Q: Is this a pandas bug?

No. This is documented behavior: keep defaults to 'first'. Pandas leaves it to the user to specify which row to keep.

Q: How do I keep the first occurrence of each duplicate?

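keep='first' is the default, so df.drop_duplicates() already does this. Pass keep='first' explicitly to make the intent clear, and sort the DataFrame beforehand if "first" should mean the earliest row by some column.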

Related Issues

→ Fix pandas merge using index gives wrong result
→ Fix pandas pivot_table returns unexpected results
→ Fix pandas merge many to many duplicates rows
→ Fix pandas left join returns unexpected rows

Next Steps

After fixing this issue:

Check that your key column is unique after deduplicating
Add unit tests to catch similar issues
Review related merge problems above
