Deepcopy vs shallow copy in pandas: recursion pitfall and fix
Infinite recursion during a pandas DataFrame copy typically shows up in production pipelines that clone large nested structures, where a DataFrame holds a reference back to itself through an object column. deepcopy traverses the cycle until it exhausts Python's recursion limit, aborting downstream processing with a RecursionError.
# Example showing the issue
import pandas as pd
import copy

# Create a self‑referencing DataFrame
df = pd.DataFrame({'a': [1, 2]})
df['self'] = [df, df]  # each cell points back to the whole DataFrame
print(f"df rows: {len(df)}")

try:
    df_copy = copy.deepcopy(df)
except RecursionError as e:
    print('RecursionError:', e)
# Output shows a RecursionError caused by the infinite recursion
The DataFrame contains a column that references the DataFrame itself. deepcopy follows every reference it encounters, meets the self‑reference, and starts copying the original DataFrame again, looping until Python's recursion limit is reached. The copy module documented in the Python standard library normally breaks such cycles with a memo dictionary (see the sketch after the list below), but that protection can be sidestepped when copying is routed through a type's custom hooks. Related factors:
- Object columns holding the parent DataFrame
- Cyclic references introduced by custom objects
- No explicit deepcopy handling for pandas objects
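For contrast, copy.deepcopy already handles cycles in plain Python containers: it records every object it copies in a memo dictionary and reuses the finished copy when it meets the same object again. A minimal sketch of that well‑behaved case (a plain list, no pandas involved):
import copy

lst = [1, 2]
lst.append(lst)             # a genuine cycle in a plain list
clone = copy.deepcopy(lst)  # no RecursionError: the memo dict breaks the cycle
print(clone[2] is clone)    # True: the copy points at itself, not the original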
To diagnose this in your code:
# Detect self‑referencing columns in a DataFrame
self_refs = [col for col in df.columns if df[col].apply(lambda x: x is df).any()]
print('Self‑referencing columns:', self_refs)
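That one‑liner only catches cells that are the DataFrame itself. If a custom object holds the back‑reference one level down, a short walk over the garbage collector's reference graph can surface it; has_cycle_back_to is a hypothetical helper written for this sketch, not a pandas or stdlib API:
import gc

def has_cycle_back_to(frame, obj, max_depth=3):
    # Breadth-first walk over obj's referents, looking for frame
    seen, frontier = {id(obj)}, [obj]
    for _ in range(max_depth):
        nxt = []
        for o in frontier:
            for ref in gc.get_referents(o):
                if ref is frame:
                    return True
                if id(ref) not in seen:
                    seen.add(id(ref))
                    nxt.append(ref)
        frontier = nxt
    return False

deep_refs = [c for c in df.columns
             if df[c].apply(lambda x: x is df or has_cycle_back_to(df, x)).any()]
print('Columns with direct or nested back-references:', deep_refs)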
Fixing the Issue
The quickest way to avoid the crash is to use a shallow copy, which does not traverse the object graph:
df_copy = df.copy(deep=False)
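The reason this is cheap is that deep=False shares the underlying buffers instead of duplicating them. A quick way to confirm the sharing, with the caveat that exact behavior may differ once pandas' Copy-on-Write mode is enabled:
import numpy as np

shallow = df.copy(deep=False)
# The numeric column's buffer is shared with the original, not duplicated
print(np.shares_memory(df['a'].to_numpy(), shallow['a'].to_numpy()))  # typically True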
For production code you should guard against self‑references before deep copying:
import copy
import logging

def safe_deepcopy(frame: pd.DataFrame) -> pd.DataFrame:
    # Identify columns that point back to the frame
    self_cols = [c for c in frame.columns if frame[c].apply(lambda x: x is frame).any()]
    if self_cols:
        logging.warning('Dropping self-referencing columns before deepcopy: %s', self_cols)
    # Remove them temporarily
    temp = frame.drop(columns=self_cols)
    # Perform a true deepcopy on the safe subset
    result = copy.deepcopy(temp)
    # Re‑attach empty placeholders to keep the schema intact
    for c in self_cols:
        result[c] = None
    return result

logging.info('Performing safe deepcopy of DataFrame')
df_safe = safe_deepcopy(df)
assert len(df_safe) == len(df), 'Row count mismatch after safe deepcopy'
The production approach logs the problematic columns, strips them before deep copying, and restores the expected schema, so cycles are handled explicitly instead of crashing the copy.
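If the cycle is intentional and should survive the copy, one option is to point the placeholder column at the new frame, mirroring what deepcopy's memo does for plain containers. rebind_self_references is a hypothetical helper built on safe_deepcopy above, not a pandas API:
def rebind_self_references(original: pd.DataFrame, clone: pd.DataFrame) -> pd.DataFrame:
    # Cells that pointed at `original` now point at `clone` instead
    for col in original.columns:
        if original[col].apply(lambda x: x is original).any():
            clone[col] = [clone] * len(clone)
    return clone

df_rebound = rebind_self_references(df, safe_deepcopy(df))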
What Doesn’t Work
❌ Using sys.setrecursionlimit(10000) to avoid the RecursionError: it only postpones the failure, and very high limits can overflow the C stack and crash the interpreter (demonstrated after this list).
❌ Applying df.apply(lambda x: copy.deepcopy(x)): each per-column deepcopy still traverses the self‑reference and raises the same RecursionError.
❌ Dropping the self‑referencing column after deepcopy: the error occurs before you get a chance to drop it.
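To see why raising the limit is no fix, here is a sketch assuming the self‑referencing df from the first example; the copy still fails, just deeper in the stack:
import sys
import copy

sys.setrecursionlimit(10000)  # raises the ceiling, does not remove the cycle
try:
    copy.deepcopy(df)
except RecursionError:
    print('Still fails, only later; very high limits risk crashing the interpreter')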
Common mistakes that lead to this failure:
- Calling copy.deepcopy on a DataFrame with object columns without checking for cycles
- Assuming df.copy() deep-copies everything: it defaults to deep=True, but per the pandas docs it does not recursively copy Python objects stored in object columns (see the sketch after this list)
- Raising sys.setrecursionlimit to silence the symptom instead of removing the cycle
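A quick demonstration of the df.copy() point above: per the pandas documentation, deep=True duplicates the data buffers but does not recursively copy Python objects held in object columns:
obj_df = pd.DataFrame({'o': [{'n': 1}]})
copied = obj_df.copy(deep=True)
# The buffers are new, but the dict inside the object column is shared
print(copied.loc[0, 'o'] is obj_df.loc[0, 'o'])  # True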
When NOT to Add These Safeguards
- Exploratory notebooks: small, throw‑away analyses where a crash costs little.
- One‑off scripts: data migrations run once, where the risk of recursion is negligible.
- Intentional self‑references: cycles that are by design and handled downstream.
- Tiny datasets: with fewer than a few dozen rows, the overhead of extra checks outweighs the benefit.
Frequently Asked Questions
Q: Can I safely deepcopy a pandas DataFrame that contains custom objects?
A: Only if those objects do not hold references back to the DataFrame; otherwise you must break the cycle first, as sketched below.
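One way to break such a cycle is to give the custom object a __deepcopy__ hook that drops the back‑reference; Node here is purely illustrative, not part of any library:
import copy

class Node:
    def __init__(self, payload, parent_frame=None):
        self.payload = payload
        self.parent = parent_frame  # back-reference that would create the cycle

    def __deepcopy__(self, memo):
        clone = Node(None)
        memo[id(self)] = clone  # register first so nested cycles resolve
        clone.payload = copy.deepcopy(self.payload, memo)
        return clone  # clone.parent stays None: the cycle is broken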
Q: Is shallow copy always safe for large DataFrames?
A: A shallow copy shares the underlying data buffers rather than duplicating them, so it is memory‑efficient and avoids recursion, but in‑place modifications can propagate to the original.
Self‑referencing structures are a hidden source of crashes in data pipelines that rely on deepcopy. By detecting and removing cyclic references, or by opting for a shallow copy, you keep your pandas workflows robust and avoid unexpected recursion limits. Remember to validate the copy before downstream processing, as in the sketch below.
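A minimal validation sketch along those lines; validate_copy is a hypothetical helper, and df/df_safe are the frames from the examples above:
def validate_copy(original: pd.DataFrame, clone: pd.DataFrame) -> None:
    assert clone is not original, 'Copy must be a distinct object'
    assert clone.shape == original.shape, 'Shape changed during copy'
    assert list(clone.columns) == list(original.columns), 'Schema changed during copy'

validate_copy(df, df_safe)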
Related Issues
→ Fix pandas SettingWithCopyWarning false positive
→ Why Python GC tunables slow pandas DataFrame processing
→ Why pandas concat uses more memory than append
→ Why CPython ref counting vs GC impacts memory