CPython Reference Counting vs. Garbage Collection: A Memory Management Guide
Unexpected memory growth in pandas DataFrames often appears in production pipelines processing large CSV exports or API feeds, where objects linger beyond their useful scope. This is caused by CPython’s combination of reference counting and a cyclic garbage collector that may defer cleanup of reference cycles.
# Example showing the issue
import gc

import pandas as pd

def create_cycle():
    df = pd.DataFrame({'a': [1, 2, 3]})
    # Storing a reference to the DataFrame on itself (via the supported
    # .attrs metadata dict) creates a reference cycle.
    df.attrs['self_ref'] = df
    return df

# gc.get_count() reports the number of tracked objects in each generation
print(f"Before: {gc.get_count()}")
obj = create_cycle()
print(f"After creation: {gc.get_count()}")
del obj  # drops the last external reference; the cycle keeps df alive
print(f"After del reference (no GC): {gc.get_count()}")
collected = gc.collect()
print(f"After gc.collect(): {gc.get_count()}, objects collected: {collected}")
CPython frees most objects immediately via reference counting, but when objects reference each other in a cycle, their counts never drop to zero on their own. The cyclic garbage collector then scans for such unreachable groups and reclaims them, which may happen later than you expect. This behavior follows the CPython memory management design documented in the official Python data model. Related factors:
- Objects holding references to themselves or each other
- Large pandas structures that embed callbacks or closures (see the sketch after this list)
- Disabled or tuned gc thresholds in production settings
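To make the closure case concrete, here is a minimal sketch. The make_frame function and the 'normalize' attrs key are illustrative names, not pandas API: the point is that a lambda stored on the frame captures the frame itself, so only the cyclic collector can reclaim it.
import gc
import weakref

import pandas as pd

def make_frame():
    df = pd.DataFrame({'a': [1, 2, 3]})
    # The lambda closes over the local df; storing it on the frame
    # itself forms a cycle: df -> attrs -> lambda -> closure -> df.
    df.attrs['normalize'] = lambda: df / df.max()
    return df

frame = make_frame()
probe = weakref.ref(frame)  # observe when the DataFrame is actually freed
del frame
print(probe() is None)      # False: reference counting alone cannot free it
gc.collect()
print(probe() is None)      # True: the cyclic GC reclaimed the group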
To diagnose this in your code:
# Detect objects that are part of reference cycles
import gc

# DEBUG_SAVEALL keeps everything the collector finds in gc.garbage
# instead of freeing it, so the cycles can be inspected.
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # force a collection to populate gc.garbage

if gc.garbage:
    print('Objects found in reference cycles:')
    for obj in gc.garbage:
        print(type(obj), getattr(obj, 'shape', 'no shape'))
else:
    print('No reference cycles found')

# Reset debugging so garbage is freed normally again
gc.set_debug(0)
gc.garbage.clear()
Fixing the Issue
Understanding the two mechanisms lets you choose the right tool. If your code never creates cycles (e.g., simple DataFrames passed around without callbacks), reference counting alone is enough, and Python frees memory deterministically as soon as the last reference goes away. When cycles are possible, as with pandas objects that store lambdas or mutually referencing containers, you should either break the cycle manually or let the cyclic GC handle it. Note that custom __del__ methods no longer make cycles uncollectable: since Python 3.4 (PEP 442), the GC can finalize and reclaim such objects.
Manual break (production‑ready):
import gc
import weakref

import pandas as pd

def safe_create():
    df = pd.DataFrame({'a': [1, 2, 3]})
    # Store a *weak* self-reference: it does not keep df alive,
    # so no hard cycle forms and reference counting suffices.
    df.attrs['self_ref'] = weakref.ref(df)
    return df

obj = safe_create()
# No hard cycle: the DataFrame is reclaimed deterministically as soon
# as the last strong reference is dropped.
obj = None
gc.collect()  # optional; nothing cyclic should be left to collect
To get the DataFrame back later, call the stored weak reference; it returns None once the object has been freed.
When you prefer the GC:
import gc

# Defaults are (700, 10, 10); raising the first threshold makes
# young-generation collections less frequent. The value below is only
# a starting point: the right number is workload-specific, so measure
# before committing.
gc.set_threshold(50_000, 10, 10)

# Periodically trigger collection after heavy batch work
gc.collect()
Both approaches avoid silent memory bloat. The first gives deterministic cleanup; the second leans on CPython’s built‑in cycle detector while giving you control over when it runs.
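To verify that either approach actually holds memory steady, here is a minimal measurement sketch using the standard-library tracemalloc module. The process_batch function and the loop count are placeholders for your own workload:
import gc
import tracemalloc

import pandas as pd

def process_batch():  # placeholder for your real batch work
    return pd.DataFrame({'a': range(10_000)}).sum()

tracemalloc.start()
for _ in range(100):
    process_batch()
    gc.collect()  # or rely on tuned thresholds instead

# get_traced_memory() returns (current, peak) allocation sizes in bytes
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()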
What Doesn’t Work
❌ Calling gc.disable() globally: prevents collection of any cycles and can cause memory leaks in long‑running services
❌ Adding df = None after every operation: masks the real issue and adds unnecessary assignments without breaking cycles
❌ Using df.copy(deep=True) to “reset” memory: copies the data but keeps the original cycle alive, doubling memory usage temporarily
- Assuming reference counting alone frees all pandas objects
- Disabling gc without measuring the impact on long‑running jobs
- Using df.apply(lambda x: ...) with a lambda that captures the DataFrame and then keeping that lambda alive (e.g., storing it on the frame), which creates hidden cycles; see the sketch after this list
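A cycle-free alternative is to keep transformation logic in a plain function that receives the frame as an argument (the normalize function below is an illustrative name):
import pandas as pd

def normalize(frame: pd.DataFrame) -> pd.DataFrame:
    # A module-level function references no particular DataFrame, so
    # passing frames through it can never form a reference cycle.
    return frame / frame.max()

df = pd.DataFrame({'a': [1, 2, 3]})
result = normalize(df)  # df is freed by refcounting once unreferenced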
When NOT to optimize
- Small scripts: Under a few megabytes of data, the overhead of manual cycle breaking outweighs any memory gain.
- One‑off analysis: Interactive Jupyter notebooks where you restart the kernel frequently.
- Known one‑to‑many relationships: When pandas objects intentionally reference each other (e.g., parent/child DataFrames) and memory usage is acceptable.
- Disabled GC by design: Some high‑performance services disable the cyclic GC and rely on process restarts to reclaim memory.
Frequently Asked Questions
Q: Can I rely on del to free a pandas DataFrame that participates in a cycle?
No. del only decrements the reference count; the cycle must be broken manually or collected by the cyclic GC.
Q: Does increasing gc thresholds improve performance?
It reduces collection frequency, which can improve throughput, but it may also increase peak memory usage; measure both before committing.
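A quick way to check how often the collector actually runs under your workload, using the documented gc introspection calls:
import gc

print(gc.get_threshold())  # current (threshold0, threshold1, threshold2)

# Per-generation counters since interpreter start: number of
# collections, objects collected, and uncollectable objects.
for gen, stats in enumerate(gc.get_stats()):
    print(f"gen {gen}: {stats}")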
Balancing CPython’s reference counting with its cyclic garbage collector is essential when pandas DataFrames live inside complex pipelines. By either breaking reference cycles explicitly or tuning the GC, you keep memory footprints predictable and avoid surprising growth in production workloads.
Related Issues
→ Why Python GC tunables slow pandas DataFrame processing
→ Why Python objects consume excess memory
→ Why deepcopy on pandas DataFrames causes infinite recursion
→ Why numpy ravel vs flatten affect memory usage