CPython reference counting vs garbage collection: memory management guide

Unexpected memory growth in pandas DataFrames often appears in production pipelines processing large CSV exports or API feeds, where objects linger beyond their useful scope. This is caused by CPython’s combination of reference counting and a cyclic garbage collector that may defer cleanup of reference cycles.

# Example showing the issue
import pandas as pd
import gc

def create_cycle():
    df = pd.DataFrame({'a': [1, 2, 3]})
    df.attrs['self'] = df  # the attrs metadata dict now holds the frame itself: a reference cycle
    return df

print(f"Before: {gc.get_count()}")
obj = create_cycle()
print(f"After creation: {gc.get_count()}")
obj = None  # drop the only external reference
print(f"After dropping reference (no GC yet): {gc.get_count()}")
collected = gc.collect()
print(f"After gc.collect(): {gc.get_count()}, objects collected: {collected}")

CPython frees most objects immediately via reference counting, but when objects reference each other in a cycle the count never drops to zero. The cyclic garbage collector then scans for such groups and reclaims them, which may happen later than expected. This behavior follows the CPython memory management design documented in the official Python data model. Related factors:

  • Objects holding references to themselves or each other
  • Large pandas structures that embed callbacks or closures
  • Disabled or tuned gc thresholds in production settings
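The split between the two mechanisms can be observed directly. A minimal sketch using weakref probes to watch when objects actually die (plain classes stand in for heavier objects; automatic collection is paused only to make the timing deterministic):

```python
import gc
import weakref

class Node:
    pass

gc.disable()  # pause automatic collection so the timing is deterministic

# Non-cyclic object: freed the instant its reference count hits zero
plain = Node()
probe = weakref.ref(plain)
plain = None
print(probe() is None)    # True: reference counting freed it immediately

# Cyclic pair: reference counting alone can never free them
a, b = Node(), Node()
a.other, b.other = b, a   # a <-> b cycle
probe = weakref.ref(a)
a = b = None
print(probe() is None)    # False: the cycle keeps both alive
gc.collect()              # the cyclic collector breaks the cycle
print(probe() is None)    # True
gc.enable()
```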

To diagnose this in your code:

# Inspect objects that participate in reference cycles
import gc
# DEBUG_SAVEALL makes the collector append every object it finds to
# gc.garbage instead of freeing it, so cycle members can be examined
gc.set_debug(gc.DEBUG_SAVEALL)
# Force a collection to populate gc.garbage
gc.collect()
if gc.garbage:
    print('Objects found in reference cycles:')
    for obj in gc.garbage:
        print(type(obj), getattr(obj, 'shape', 'no shape'))
else:
    print('No cyclic garbage found')
gc.set_debug(0)     # turn debugging off again
gc.garbage.clear()  # drop the saved objects so they can be freed

Fixing the Issue

Understanding the two mechanisms lets you choose the right tool. If your code never creates cycles (e.g., simple DataFrames passed around without callbacks), reference counting alone is enough and you can rely on Python to free memory instantly. When cycles are possible—common with pandas objects that store lambda functions, custom __del__ methods, or mutually referencing containers—you should either break the cycle manually or let the cyclic GC handle it.
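A sketch of how a stored callback creates such a hidden cycle, here using the DataFrame's attrs metadata dict to hold a lambda that captures the frame (any attribute or container that points back at the object behaves the same way; collection is paused only to keep the demonstration deterministic):

```python
import gc
import weakref
import pandas as pd

gc.disable()  # keep the demonstration deterministic

def attach_callback():
    df = pd.DataFrame({'a': [1, 2, 3]})
    # df -> attrs -> lambda -> closure cell -> df: a hidden cycle
    df.attrs['on_update'] = lambda: len(df)
    return weakref.ref(df)

ref = attach_callback()
print(ref() is None)   # False: the cycle outlives the function call
gc.collect()           # only the cyclic collector can reclaim it
print(ref() is None)   # True
gc.enable()
```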

Manual break (production‑ready):

import weakref
import gc
import pandas as pd

def safe_create():
    df = pd.DataFrame({'a': [1, 2, 3]})
    # Store a weak reference back to the frame instead of a strong one;
    # df -> weakref -> df does not keep the reference count above zero
    df.attrs['self'] = weakref.ref(df)
    return df

obj = safe_create()
# No strong cycle, so the DataFrame is reclaimed the moment the last
# reference is dropped
obj = None
gc.collect()  # optional; nothing cyclic should remain to collect

When you prefer the GC:

import gc
# Raise the generation-0 threshold so collections run less often during
# heavy batch work (CPython's defaults are 700, 10, 10); measure before
# committing to specific values
gc.set_threshold(50_000, 15, 15)
# Periodically trigger a full collection between batches
gc.collect()

Both approaches avoid silent memory bloat. The first gives deterministic cleanup; the second leans on CPython’s built‑in cycle detector while giving you control over when it runs.

What Doesn’t Work

❌ Calling gc.disable() globally: prevents collection of any cycles and can cause memory leaks in long‑running services

❌ Adding df = None after every operation: masks the real issue and adds unnecessary assignments without breaking cycles

❌ Using df.copy(deep=True) to “reset” memory: copies the data but keeps the original cycle alive, doubling memory usage temporarily

Other common mistakes:

  • Assuming reference counting alone frees all pandas objects
  • Disabling gc without measuring the impact on long‑running jobs
  • Storing a callback on the DataFrame (e.g. a lambda x: ... that captures the DataFrame itself), creating a hidden cycle
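The cost of the first anti‑pattern above, disabling the collector globally, is easy to measure: with gc off, cyclic garbage is never reclaimed, so tracked objects pile up. A small sketch:

```python
import gc

def make_cycle():
    d = {}
    d['self'] = d  # self-referential dict: unreachable but never freed by refcounting

gc.disable()
gc.collect()  # start from a clean slate
before = len(gc.get_objects())
for _ in range(1_000):
    make_cycle()
after = len(gc.get_objects())
print(after - before)  # roughly 1000 leaked dicts while gc is off
gc.enable()
gc.collect()  # re-enabling and collecting reclaims them
```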

When NOT to optimize

  • Small scripts: Under a few megabytes of data, the overhead of manual cycle breaking outweighs any memory gain.
  • One‑off analysis: Interactive Jupyter notebooks where you restart the kernel frequently.
  • Known one‑to‑many relationships: When pandas objects intentionally reference each other (e.g., parent/child DataFrames) and memory usage is acceptable.
  • Disabled GC by design: Some high‑performance services disable the cyclic GC and rely on process restarts to reclaim memory.

Frequently Asked Questions

Q: Can I rely on del to free a pandas DataFrame that participates in a cycle?

No. del only decrements the reference count; the cycle must still be broken manually or reclaimed by the cyclic GC.

Q: Does increasing gc thresholds improve performance?

It reduces collection frequency, which can help speed, but may increase peak memory usage.
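The trade‑off can be tried out safely by saving and restoring the current thresholds (the specific numbers below are illustrative, not a recommendation):

```python
import gc

original = gc.get_threshold()   # typically (700, 10, 10) in CPython
# Raise threshold0 so generation-0 scans run far less often; cycles then
# linger longer, trading higher peak memory for lower GC overhead
gc.set_threshold(50_000, 20, 20)
tuned = gc.get_threshold()
print(tuned)                    # (50000, 20, 20)
gc.set_threshold(*original)     # restore before normal work resumes
```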


Balancing CPython’s reference counting with its cyclic garbage collector is essential when pandas DataFrames live inside complex pipelines. By either breaking reference cycles explicitly or tuning the GC, you keep memory footprints predictable and avoid surprising growth in production workloads.
