Pandas groupby size vs count: cause and resolution
Unexpected row totals from pandas groupby aggregations usually appear in production ETL pipelines that read CSV exports from databases, where the aggregated column contains missing values. In that case .size() counts every row in each group while .count() skips NaNs, so the two silently diverge and break downstream metrics.
# Example showing the issue
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'value': [1, np.nan, 3, np.nan, 5]
})
size = df.groupby('group').size()
count = df.groupby('group')['value'].count()
print(f"size rows: {len(size)}; count rows: {len(count)}")
print('size:')
print(size)
print('count:')
print(count)
# Output shows size A=2, B=3 while count A=1, B=2
NaN values in the aggregated column are excluded by .count() but retained by .size(), so the two aggregations report different totals. This behavior is documented in pandas and mirrors SQL semantics: .count() behaves like COUNT(column), which ignores NULLs, while .size() behaves like COUNT(*). Related factors:
- Missing values in the column used for count
- Expecting .size() to behave like .count()
- Not specifying the column when calling .count(), which returns per‑column counts (see the sketch just below)
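To illustrate the last factor, here is a minimal sketch using the example df above: calling .count() on the whole GroupBy returns per‑column non‑NA counts as a DataFrame rather than a single Series, which is easy to misread as a row total.
# .count() without selecting a column counts non-NA values in every remaining column
per_column = df.groupby('group').count()
print(per_column)
# Expected output on the example df:
#        value
# group
# A          1
# B          2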
To diagnose this in your code:
# Detect columns with missing values that affect count
nan_counts = df.groupby('group')['value'].apply(lambda s: s.isna().sum())
if nan_counts.any():
    print('NaNs affecting count per group:')
    print(nan_counts)
Fixing the Issue
Quick fix (1‑liner):
# Use count if you need non‑NA rows, or size if you need total rows
result = df.groupby('group')['value'].agg(['size', 'count'])
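On the example df above, this returns both metrics side by side in one frame (output shown as a comment for reference; the column order follows the list passed to agg):
print(result)
#        size  count
# group
# A         2      1
# B         3      2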
When you need production‑ready safety, validate the presence of NaNs and decide which metric is appropriate:
import logging
grouped = df.groupby('group')
size = grouped.size()
count = grouped['value'].count()
# Log any discrepancy caused by NaNs
if not size.equals(count):
    diff = size - count
    logging.warning('Size and count differ for groups: %s', diff[diff != 0])
# Choose the correct metric for downstream logic
# Example: use count for average calculations that require non‑NA values
final = pd.DataFrame({
    'total_rows': size,
    'non_nan_rows': count
})
The gotcha here is assuming .count() counts all rows; it only counts non‑missing entries. In our ETL jobs we once trusted .count() for completeness and later discovered missing values skewed our aggregates, forcing us to add explicit NaN checks.
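One way to make such a check explicit is a small guard run before aggregation. This is a sketch, not part of pandas: the function name and threshold below are illustrative.
def assert_acceptable_nan_rate(frame, value_col, max_nan_fraction=0.0):
    """Raise if value_col has more missing values than the pipeline tolerates."""
    nan_fraction = frame[value_col].isna().mean()
    if nan_fraction > max_nan_fraction:
        raise ValueError(
            f"{value_col!r} is {nan_fraction:.1%} missing; "
            "groupby counts will diverge from group sizes"
        )

# On the example df this raises, because 2 of the 5 values are NaN:
# assert_acceptable_nan_rate(df, 'value')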
What Doesn’t Work
❌ Filling NaNs with 0 before .count(): masks missing data and changes semantics (see the sketch after this list)
❌ Dropping rows with NaN before grouping: loses information and can bias results
❌ Switching to .sum() on a boolean mask to emulate count: slower and less readable
Common mistaken assumptions behind these attempts:
- Assuming .count() includes NaN rows
- Using .size() when you need non‑NA counts
- Not specifying a column in .count() and getting an unexpected per‑column DataFrame instead of a single Series
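To see why the first approach changes the meaning of the metric, here is a minimal sketch on the example df: once the NaNs are filled, the zeros are counted as real observations, so .count() becomes indistinguishable from .size() and the missing‑data signal is gone.
# After fillna, .count() counts the filled zeros too
masked = df.fillna({'value': 0}).groupby('group')['value'].count()
print(masked)
# A    2
# B    3   <- identical to .size(); the missing data is now invisible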
When NOT to add these checks
- Exploratory notebooks: When you are quickly inspecting data and the difference is intentional.
- Expected missing values: If you already expect .size() to exceed .count() because the missing values are meaningful.
- Small test datasets: With a few dozen rows you can verify the totals by inspection, so extra validation adds little.
- One‑off scripts: Ad‑hoc data cleaning where strict validation is unnecessary.
Frequently Asked Questions
Q: Why does .size() show more rows than .count() for the same group?
Because .size() counts all rows, while .count() skips NaN values in the selected column.
Understanding the distinction between .size() and .count() is essential for reliable groupwise metrics. By explicitly checking for NaNs and choosing the appropriate aggregation, you prevent silent data quality issues in production pipelines. Incorporate these guards early to keep downstream analytics trustworthy.
Related Issues
→ Fix pandas groupby count includes NaN
→ Why pandas merge duplicates rows after groupby
→ Fix pandas fillna not working on specific columns
→ Fix pandas outer join creates NaN rows