Python multiprocessing shared memory pattern: duplicate rows in pandas DataFrames
Duplicate rows in a pandas DataFrame often appear in production ETL pipelines that use Python’s multiprocessing.shared_memory to pass large tables between worker processes. When the underlying NumPy buffer is shared without proper synchronization or size validation, a process can read stale data or interpret leftover bytes in the segment as extra rows.
# Example showing the issue
import pandas as pd
import numpy as np
from multiprocessing import Process, shared_memory

def worker(name):
    # Attach to the existing segment and view it as a 3x2 int64 array
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray((3, 2), dtype=np.int64, buffer=shm.buf)
    df = pd.DataFrame(arr, columns=['a', 'b'])
    print(f"Worker sees {len(df)} rows")
    print(df)

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    arr = df.values
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes * 2)  # oversized on purpose
    np.copyto(np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf), arr)
    p1 = Process(target=worker, args=(shm.name,))
    p2 = Process(target=worker, args=(shm.name,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    shm.close(); shm.unlink()
# Each worker reports 3 rows here only because the shape is hard-coded; a consumer
# that infers the row count from the oversized segment would see extra rows instead.
The shared NumPy buffer is attached by multiple processes without any lock or size validation. pandas builds a DataFrame on top of that buffer, so if the shared memory segment is larger than the original array, or if writes interleave, a process can interpret leftover bytes as extra rows and report apparent duplicates. This matches the multiprocessing.shared_memory documentation, which leaves synchronization to the developer. Related factors (the sketch after this list illustrates the oversized‑segment case):
- Oversized shared memory segment
- Concurrent writes without locks
- Implicit view creation without shape checks
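Here is a minimal, single‑process sketch of the oversized‑segment case. The consumer below deliberately infers the row count from shm.size instead of receiving the true shape; the inference step is hypothetical, but it mirrors what happens when shape metadata is not shared alongside the buffer:
import numpy as np
import pandas as pd
from multiprocessing import shared_memory

if __name__ == '__main__':
    src = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    arr = src.values
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes * 2)  # twice the needed size
    np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)[:] = arr

    # A consumer that guesses the row count from the segment size instead of the true shape:
    n_cols, itemsize = 2, np.dtype(np.int64).itemsize
    inferred_rows = shm.size // (n_cols * itemsize)  # 6 instead of 3 (or more, if the
                                                     # platform rounds the size up to a page)
    view = np.ndarray((inferred_rows, n_cols), dtype=np.int64, buffer=shm.buf)
    bad_df = pd.DataFrame(view.copy(), columns=['a', 'b'])
    print(len(bad_df))  # > 3: the real rows plus whatever bytes sit in the slack space

    del view  # drop the NumPy view so the segment can be closed
    shm.close(); shm.unlink()
Whether those extra rows come out as zeros, garbage, or look‑alike duplicates of earlier rows depends on what was previously written to the slack bytes, which is why the symptom is often intermittent.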
To diagnose this in your code (the snippet assumes the producer-side df and shm from the first example):
# Verify that the shared memory size matches the expected NumPy array size
expected = df.values.nbytes
actual = shm.size
if actual != expected:
    # Note: some platforms round the segment up to a multiple of the page size,
    # so a small surplus can appear even when you requested the exact size.
    print(f"Size mismatch: expected {expected} bytes, got {actual} bytes")

# Show how many rows a size-based consumer would infer from the buffer
inferred_rows = actual // (df.shape[1] * df.values.dtype.itemsize)
print(f"Buffer would be interpreted as {inferred_rows} rows")
Fixing the Issue
The quickest way to sidestep the duplicate‑row surprise is to skip the raw NumPy buffer entirely and serialize the DataFrame instead:
import pickle
import multiprocessing as mp
import numpy as np
import pandas as pd

def worker(pipe):
    df = pickle.loads(pipe.recv())
    print(df)
    pipe.close()

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])  # same data as above
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    # Connection.send() already pickles its argument, so the explicit dumps/loads
    # is only there to make the serialization step visible.
    parent_conn.send(pickle.dumps(df))
    p.join()
For production‑ready shared memory, create the segment with the exact size, expose shape/dtype metadata, and guard accesses with a lock:
import numpy as np
import pandas as pd
import multiprocessing as mp
from multiprocessing import shared_memory, Lock

def worker(name, lock, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    # Validate the buffer against the advertised shape before trusting it
    expected = int(np.prod(shape)) * np.dtype(dtype).itemsize
    assert shm.size >= expected, "shared segment is smaller than the advertised shape"
    with lock:
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        df = pd.DataFrame(arr.copy(), columns=['a', 'b'])  # copy so the DataFrame owns its data
        del arr  # drop the view so the segment can be closed cleanly
    print(df)
    shm.close()

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    lock = Lock()
    shape, dtype = df.shape, df.values.dtype
    # Exact-size segment: no slack bytes for a consumer to misread as extra rows
    shm = shared_memory.SharedMemory(create=True, size=df.values.nbytes)
    np.ndarray(shape, dtype=dtype, buffer=shm.buf)[:] = df.values
    p = mp.Process(target=worker, args=(shm.name, lock, shape, dtype))
    p.start(); p.join()
    shm.close(); shm.unlink()
This pattern sizes the segment exactly, shares shape/dtype explicitly, validates the buffer before building the DataFrame, and serializes access with a lock, eliminating hidden row duplication.
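If another process also writes to the segment, it should take the same lock before mutating. A minimal sketch of such a writer, under the same assumptions as the pattern above (same name/lock/shape/dtype arguments):
import numpy as np
from multiprocessing import shared_memory

def writer(name, lock, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    with lock:  # readers holding the lock never observe a half-written update
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr *= 10   # in-place update of the shared data
        del arr     # drop the view before closing the segment
    shm.close()
It would be launched the same way as the reader, e.g. mp.Process(target=writer, args=(shm.name, lock, shape, dtype)).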
What Doesn’t Work
❌ Using .copy() on the DataFrame after attaching the buffer: this copies stale data but does not address the size mismatch, so duplicate rows can still appear.
❌ Increasing the shared memory size to “be safe”: oversizing just masks the root cause and leads to mis‑interpreted extra rows.
❌ Assuming multiprocessing.Queue gives you zero‑copy sharing: the queue pickles the object under the hood, so you pay the serialization overhead anyway and get none of the shared‑memory benefits (see the sketch below).
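A quick way to confirm that a Queue copies rather than shares the DataFrame (a minimal sketch; the mutation in the child is only there to prove that no memory is shared):
import multiprocessing as mp
import numpy as np
import pandas as pd

def consumer(q):
    df = q.get()      # arrives as a pickled copy, not a view of shared memory
    df['a'] += 100    # mutate the child's copy
    print("child:\n", df)

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    q.put(df)         # Queue pickles the DataFrame under the hood
    p.join()
    print("parent:\n", df)  # unchanged: the child worked on its own copy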
Other common mistakes that reintroduce the problem:
- Creating a shared memory segment larger than the original array and forgetting to validate the size.
- Modifying the NumPy buffer in one process without a lock, causing race conditions.
- Building pandas DataFrames directly from the shared buffer without copying, leading to accidental in‑place changes.
When NOT to optimize
- Small datasets: Under a few thousand rows the overhead of serialization is negligible.
- One‑off scripts: Ad‑hoc analyses that run once and are not part of a pipeline.
- Read‑only sharing: If processes only read and never modify the data, the risk of duplication is minimal.
- Prototype stage: Early experimentation where performance is not yet a concern.
Frequently Asked Questions
Q: Do I need a lock if processes only read the shared DataFrame?
No, reads are safe, but any concurrent write still requires synchronization.
Q: Can I share a pandas DataFrame without pickling?
Yes, by exposing the underlying NumPy buffer with exact shape/dtype and using locks.
Sharing large pandas DataFrames via multiprocessing can boost performance, but the convenience of a raw NumPy buffer comes with synchronization pitfalls. By validating buffer dimensions, exposing shape metadata, and protecting writes with locks, you keep the data consistent and avoid silent row duplication in production pipelines.
Related Issues
→ Fix pandas merge many to many duplicates rows
→ Fix numpy concatenate memory allocation issue
→ Why pandas merge duplicates rows after groupby
→ Why buffer protocol speeds up pandas DataFrame I/O