Open file descriptor leaks: detection and resolution in Python
Unexpected descriptor growth in Python often appears when processing large CSVs with pandas, where a manually opened file object is left dangling. In production ETL pipelines reading millions of rows, each stray handle consumes a descriptor, eventually exhausting the OS limit and breaking downstream tasks.
# Example showing the issue
import os, pandas as pd
def count_fds():
    return len(os.listdir('/proc/self/fd'))
print('fds before:', count_fds())
# Leak: file opened without a context manager
f = open('data.csv')
df = pd.read_csv(f)
# f.close() # <-- omitted on purpose
print('fds after load:', count_fds())
print('DataFrame shape:', df.shape)
A file object created with the built‑in open() stays alive until its close() method runs or the garbage collector finalizes it. While a reference survives, whether kept in a long‑running loop or simply forgotten, the OS descriptor stays allocated. This follows the standard POSIX file‑handle model and often surprises developers who assume pandas will close externally opened handles automatically. Related factors (a short sketch follows the list):
- Missing with‑statement around open()
- Long‑lived objects retaining file references
- GC delay in finalizing objects
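As a minimal illustration of the second factor, here is a hedged sketch (assuming a Linux system with /proc and a handful of hypothetical CSV paths) showing that handles retained in a long‑lived list keep their descriptors allocated until each one is closed explicitly:

import os

def count_fds():
    # Linux-specific: each entry in /proc/self/fd is one open descriptor
    return len(os.listdir('/proc/self/fd'))

paths = ['a.csv', 'b.csv', 'c.csv']   # hypothetical input files
handles = []

print('fds at start:', count_fds())
for p in paths:
    handles.append(open(p))           # references kept alive in the list
print('fds while retained:', count_fds())   # grows by len(paths)

for h in handles:
    h.close()                         # descriptors released only on explicit close
handles.clear()
print('fds after close:', count_fds())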
To diagnose this in your code:
import os, psutil
proc = psutil.Process()
def report_fds(label):
    print(f"{label}: {proc.num_fds()} open descriptors")
report_fds('before')
# code that may leak descriptors
report_fds('after')
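To put those counts in context, you can also read the per‑process descriptor limit from the OS. This is a small sketch using the standard‑library resource module (Unix only, not part of the original snippet); the 80% threshold is an arbitrary example:

import resource
import psutil

proc = psutil.Process()
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open: {proc.num_fds()}  soft limit: {soft}  hard limit: {hard}")
# Alert well before the soft limit is reached
if proc.num_fds() > 0.8 * soft:
    print("descriptor usage is approaching the soft limit")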
Fixing the Issue
The quickest fix is to wrap the file in a context manager so it is closed as soon as pandas finishes reading:
with open('data.csv') as f:
    df = pd.read_csv(f)
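Alternatively, when no custom file object is needed, pass the path directly so pandas opens and closes the handle itself (see the FAQ below):

df = pd.read_csv('data.csv')   # pandas manages and closes its own handle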
For production code you should also validate that no stray descriptors remain, especially in loops:
import logging, psutil
proc = psutil.Process()
# batches: an iterable of CSV file paths to process
for batch in batches:
    before = proc.num_fds()
    with open(batch) as f:
        df = pd.read_csv(f)
    after = proc.num_fds()
    if after > before:
        logging.warning(f"Leaked {after - before} descriptor(s) while processing {batch}")
This pattern guarantees deterministic closure and gives you a safety net that alerts when a leak occurs.
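One way to package that safety net for reuse is a small context manager. The fd_guard helper below is a hypothetical sketch built on contextlib and psutil (the name and structure are illustrative, not an existing API):

import logging
from contextlib import contextmanager

import pandas as pd
import psutil

@contextmanager
def fd_guard(label):
    # Compare descriptor counts around a block of work and log any growth
    proc = psutil.Process()
    before = proc.num_fds()
    try:
        yield
    finally:
        leaked = proc.num_fds() - before
        if leaked > 0:
            logging.warning("Leaked %d descriptor(s) in %s", leaked, label)

# Usage: wrap each batch so any leak is reported with its label
with fd_guard('batch data.csv'):
    with open('data.csv') as f:
        df = pd.read_csv(f)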
What Doesn’t Work
❌ Calling df.close(): pandas DataFrames have no close method, so this only raises AttributeError and leaves the file untouched
❌ Deleting the variable only: del f drops a single reference; if anything else still points at the file object (or it sits in a reference cycle), the OS handle stays open until the garbage collector finalizes it
❌ Using gc.collect() to force closure: collection only finalizes unreachable objects, so a file object that is still referenced keeps its descriptor (see the sketch below)
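A short sketch of why the last two points fail when another reference survives (assuming an existing data.csv): the descriptor stays open because a long‑lived list still holds the file object.

import gc

registry = []                 # some long-lived structure elsewhere in the app

f = open('data.csv')
registry.append(f)            # second reference keeps the object alive

del f                         # drops the local name only
gc.collect()                  # collects unreachable objects; this one is still reachable

print(registry[0].closed)     # False: the OS descriptor is still allocated
registry[0].close()           # only an explicit close releases it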
In practice, the leak usually originates from one of these patterns:
- Opening files with open() and forgetting to call close()
- Relying on garbage collection to release descriptors
- Passing a manually opened file object to pd.read_csv() and never closing the wrapper
When NOT to optimize
- One‑off scripts: Temporary analysis that runs once and exits quickly
- Tiny datasets: Under a few hundred rows, the descriptor count stays far below the limit
- Read‑only notebooks: Interactive exploration where the interpreter will close handles on shutdown
- Controlled environments: When the OS limit is set very high and the workload is known to be safe
Frequently Asked Questions
Q: How can I tell if a leak is happening without external tools?
On Linux, compare the number of entries in /proc/self/fd (e.g., len(os.listdir('/proc/self/fd'))) before and after the suspect code block; if the third‑party psutil package is available, psutil.Process().num_fds() gives the same count.
Q: Does pandas.read_csv close a file passed to it?
No. pandas only closes handles it opens itself (for example, when you pass a path or URL); a file object you open and pass in remains yours to close.
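A quick way to confirm this yourself, assuming a local data.csv:

import pandas as pd

f = open('data.csv')
df = pd.read_csv(f)
print(f.closed)     # False: the caller-owned handle is still open
f.close()

df = pd.read_csv('data.csv')   # here pandas opens and closes the file itself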
File‑descriptor leaks are silent until the process hits the OS limit, at which point I/O operations fail catastrophically. By coupling context managers with a simple fd count check, you get both deterministic cleanup and early warning in production pipelines.
Related Issues
→ Why objgraph misses circular reference leaks
→ Fix subprocess communicate deadlock in Python
→ Why itertools tee can cause memory leak
→ Why Python objects consume excess memory