Open file descriptor leaks: detection and resolution in Python
Unexpected descriptor growth in Python often appears when processing large CSVs with pandas, where a manually opened file object is left dangling. In production ETL pipelines reading millions of rows, each stray handle consumes a descriptor, eventually exhausting the OS limit and breaking downstream tasks.
# Example showing the issue
import os, pandas as pd
def count_fds():
    return len(os.listdir('/proc/self/fd'))
print('fds before:', count_fds())
# Leak: file opened without a context manager
f = open('data.csv')
df = pd.read_csv(f)
# f.close() # <-- omitted on purpose
print('fds after load:', count_fds())
print('DataFrame shape:', df.shape)
A file object created with the built‑in open() stays alive until its close() method runs or the garbage collector finalizes it. While a reference survives, whether kept in a long‑running loop or simply forgotten, the OS descriptor stays allocated. This follows the standard POSIX file‑handle model and often surprises developers who assume pandas will close externally opened handles automatically. Related factors (a short sketch follows the list):
- Missing with‑statement around open()
- Long‑lived objects retaining file references
- GC delay in finalizing objects
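As a minimal illustration of the second factor, here is a hedged sketch (assuming a Linux system with /proc and a handful of hypothetical CSV paths) showing that handles retained in a long‑lived list keep their descriptors allocated until each one is closed explicitly:

import os

def count_fds():
    # Linux-specific: each entry in /proc/self/fd is one open descriptor
    return len(os.listdir('/proc/self/fd'))

paths = ['a.csv', 'b.csv', 'c.csv']   # hypothetical input files
handles = []

print('fds at start:', count_fds())
for p in paths:
    handles.append(open(p))           # references kept alive in the list
print('fds while retained:', count_fds())   # grows by len(paths)

for h in handles:
    h.close()                         # descriptors released only on explicit close
handles.clear()
print('fds after close:', count_fds())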
To diagnose this in your code:
import os, psutil
proc = psutil.Process()
def report_fds(label):
    print(f"{label}: {proc.num_fds()} open descriptors")
report_fds('before')
# code that may leak descriptors
report_fds('after')
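To put those counts in context, you can also read the per‑process descriptor limit from the OS. This is a small sketch using the standard‑library resource module (Unix only, not part of the original snippet); the 80% threshold is an arbitrary example:

import resource
import psutil

proc = psutil.Process()
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open: {proc.num_fds()}  soft limit: {soft}  hard limit: {hard}")
# Alert well before the soft limit is reached
if proc.num_fds() > 0.8 * soft:
    print("descriptor usage is approaching the soft limit")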
Fixing the Issue
The quickest fix is to wrap the file in a context manager so it is closed as soon as pandas finishes reading:
with open('data.csv') as f:
    df = pd.read_csv(f)
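Alternatively, when no custom file object is needed, pass the path directly so pandas opens and closes the handle itself (see the FAQ below):

df = pd.read_csv('data.csv')   # pandas manages and closes its own handle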
For production code you should also validate that no stray descriptors remain, especially in loops:
import logging, psutil
proc = psutil.Process()
# batches: an iterable of CSV file paths to process
for batch in batches:
    before = proc.num_fds()
    with open(batch) as f:
        df = pd.read_csv(f)
    after = proc.num_fds()
    if after > before:
        logging.warning(f"Leaked {after - before} descriptor(s) while processing {batch}")
This pattern guarantees deterministic closure and gives you a safety net that alerts when a leak occurs.
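One way to package that safety net for reuse is a small context manager. The fd_guard helper below is a hypothetical sketch built on contextlib and psutil (the name and structure are illustrative, not an existing API):

import logging
from contextlib import contextmanager

import pandas as pd
import psutil

@contextmanager
def fd_guard(label):
    # Compare descriptor counts around a block of work and log any growth
    proc = psutil.Process()
    before = proc.num_fds()
    try:
        yield
    finally:
        leaked = proc.num_fds() - before
        if leaked > 0:
            logging.warning("Leaked %d descriptor(s) in %s", leaked, label)

# Usage: wrap each batch so any leak is reported with its label
with fd_guard('batch data.csv'):
    with open('data.csv') as f:
        df = pd.read_csv(f)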
What Doesn’t Work
❌ Calling df.close(): pandas DataFrames have no close method, so this only raises AttributeError and leaves the file untouched
❌ Deleting the variable only: del f drops a single reference; if anything else still points at the file object (or it sits in a reference cycle), the OS handle stays open until the garbage collector finalizes it
❌ Using gc.collect() to force closure: collection only finalizes unreachable objects, so a file object that is still referenced keeps its descriptor (see the sketch below)
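A short sketch of why the last two points fail when another reference survives (assuming an existing data.csv): the descriptor stays open because a long‑lived list still holds the file object.

import gc

registry = []                 # some long-lived structure elsewhere in the app

f = open('data.csv')
registry.append(f)            # second reference keeps the object alive

del f                         # drops the local name only
gc.collect()                  # collects unreachable objects; this one is still reachable

print(registry[0].closed)     # False: the OS descriptor is still allocated
registry[0].close()           # only an explicit close releases it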
In practice, the leak usually originates from one of these patterns:
- Opening files with open() and forgetting to call close()
- Relying on garbage collection to release descriptors
- Passing a manually opened file object to pd.read_csv() and never closing the wrapper
When NOT to optimize
- One‑off scripts: Temporary analysis that runs once and exits quickly
- Tiny datasets: Under a few hundred rows, the descriptor count stays far below the limit
- Read‑only notebooks: Interactive exploration where the interpreter will close handles on shutdown
- Controlled environments: When the OS limit is set very high and the workload is known to be safe
Frequently Asked Questions
Q: How can I tell if a leak is happening without external tools?
On Linux, compare the number of entries in /proc/self/fd (e.g., len(os.listdir('/proc/self/fd'))) before and after the suspect code block; if the third‑party psutil package is available, psutil.Process().num_fds() gives the same count.
Q: Does pandas.read_csv close a file passed to it?
No. pandas only closes handles it opens itself (for example, when you pass a path or URL); a file object you open and pass in remains yours to close.
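A quick way to confirm this yourself, assuming a local data.csv:

import pandas as pd

f = open('data.csv')
df = pd.read_csv(f)
print(f.closed)     # False: the caller-owned handle is still open
f.close()

df = pd.read_csv('data.csv')   # here pandas opens and closes the file itself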
File‑descriptor leaks are silent until the process hits the OS limit, at which point I/O operations fail catastrophically. By coupling context managers with a simple fd count check, you get both deterministic cleanup and early warning in production pipelines.
Related Issues
→ Why objgraph misses circular reference leaks
→ Fix subprocess communicate deadlock in Python
→ Why itertools tee can cause memory leak
→ Why Python objects consume excess memory