Python multiprocessing shared memory pattern: duplicate rows in pandas DataFrames
Duplicate rows in a pandas DataFrame often appear in production ETL pipelines that use Python’s multiprocessing.shared_memory to pass large tables between worker processes. When the underlying NumPy buffer is shared without proper synchronization or size validation, a process can read stale data or interpret leftover bytes in the segment as extra rows.
# Example showing the issue
import pandas as pd
import numpy as np
from multiprocessing import Process, shared_memory

def worker(name):
    # Attach to the existing segment and view it as a 3x2 int64 array
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray((3, 2), dtype=np.int64, buffer=shm.buf)
    df = pd.DataFrame(arr, columns=['a', 'b'])
    print(f"Worker sees {len(df)} rows")
    print(df)

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    arr = df.values
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes * 2)  # oversized on purpose
    np.copyto(np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf), arr)
    p1 = Process(target=worker, args=(shm.name,))
    p2 = Process(target=worker, args=(shm.name,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    shm.close(); shm.unlink()
# Each worker reports 3 rows here only because the shape is hard-coded; a consumer
# that infers the row count from the oversized segment would see extra rows instead.
The shared NumPy buffer is attached by multiple processes without any lock or size validation. pandas builds a DataFrame on top of that buffer, so if the shared memory segment is larger than the original array, or if writes interleave, a process can interpret leftover bytes as extra rows and report apparent duplicates. This matches the multiprocessing.shared_memory documentation, which leaves synchronization to the developer. Related factors (the sketch after this list illustrates the oversized‑segment case):
- Oversized shared memory segment
- Concurrent writes without locks
- Implicit view creation without shape checks
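Here is a minimal, single‑process sketch of the oversized‑segment case. The consumer below deliberately infers the row count from shm.size instead of receiving the true shape; the inference step is hypothetical, but it mirrors what happens when shape metadata is not shared alongside the buffer:
import numpy as np
import pandas as pd
from multiprocessing import shared_memory

if __name__ == '__main__':
    src = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    arr = src.values
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes * 2)  # twice the needed size
    np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)[:] = arr

    # A consumer that guesses the row count from the segment size instead of the true shape:
    n_cols, itemsize = 2, np.dtype(np.int64).itemsize
    inferred_rows = shm.size // (n_cols * itemsize)  # 6 instead of 3 (or more, if the
                                                     # platform rounds the size up to a page)
    view = np.ndarray((inferred_rows, n_cols), dtype=np.int64, buffer=shm.buf)
    bad_df = pd.DataFrame(view.copy(), columns=['a', 'b'])
    print(len(bad_df))  # > 3: the real rows plus whatever bytes sit in the slack space

    del view  # drop the NumPy view so the segment can be closed
    shm.close(); shm.unlink()
Whether those extra rows come out as zeros, garbage, or look‑alike duplicates of earlier rows depends on what was previously written to the slack bytes, which is why the symptom is often intermittent.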
To diagnose this in your code (the snippet assumes the producer-side df and shm from the first example):
# Verify that the shared memory size matches the expected NumPy array size
expected = df.values.nbytes
actual = shm.size
if actual != expected:
    # Note: some platforms round the segment up to a multiple of the page size,
    # so a small surplus can appear even when you requested the exact size.
    print(f"Size mismatch: expected {expected} bytes, got {actual} bytes")

# Show how many rows a size-based consumer would infer from the buffer
inferred_rows = actual // (df.shape[1] * df.values.dtype.itemsize)
print(f"Buffer would be interpreted as {inferred_rows} rows")
Fixing the Issue
The quickest way to sidestep the duplicate‑row surprise is to skip the raw NumPy buffer entirely and serialize the DataFrame instead:
import pickle
import multiprocessing as mp
import numpy as np
import pandas as pd

def worker(pipe):
    df = pickle.loads(pipe.recv())
    print(df)
    pipe.close()

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])  # same data as above
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    # Connection.send() already pickles its argument, so the explicit dumps/loads
    # is only there to make the serialization step visible.
    parent_conn.send(pickle.dumps(df))
    p.join()
For production‑ready shared memory, create the segment with the exact size, expose shape/dtype metadata, and guard accesses with a lock:
import numpy as np
import pandas as pd
import multiprocessing as mp
from multiprocessing import shared_memory, Lock

def worker(name, lock, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    # Validate the buffer against the advertised shape before trusting it
    expected = int(np.prod(shape)) * np.dtype(dtype).itemsize
    assert shm.size >= expected, "shared segment is smaller than the advertised shape"
    with lock:
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        df = pd.DataFrame(arr.copy(), columns=['a', 'b'])  # copy so the DataFrame owns its data
        del arr  # drop the view so the segment can be closed cleanly
    print(df)
    shm.close()

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    lock = Lock()
    shape, dtype = df.shape, df.values.dtype
    # Exact-size segment: no slack bytes for a consumer to misread as extra rows
    shm = shared_memory.SharedMemory(create=True, size=df.values.nbytes)
    np.ndarray(shape, dtype=dtype, buffer=shm.buf)[:] = df.values
    p = mp.Process(target=worker, args=(shm.name, lock, shape, dtype))
    p.start(); p.join()
    shm.close(); shm.unlink()
This pattern sizes the segment exactly, shares shape/dtype explicitly, validates the buffer before building the DataFrame, and serializes access with a lock, eliminating hidden row duplication.
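If another process also writes to the segment, it should take the same lock before mutating. A minimal sketch of such a writer, under the same assumptions as the pattern above (same name/lock/shape/dtype arguments):
import numpy as np
from multiprocessing import shared_memory

def writer(name, lock, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    with lock:  # readers holding the lock never observe a half-written update
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr *= 10   # in-place update of the shared data
        del arr     # drop the view before closing the segment
    shm.close()
It would be launched the same way as the reader, e.g. mp.Process(target=writer, args=(shm.name, lock, shape, dtype)).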
What Doesn’t Work
❌ Using .copy() on the DataFrame after attaching the buffer: this copies stale data but does not address the size mismatch, so duplicate rows can still appear.
❌ Increasing the shared memory size to “be safe”: oversizing just masks the root cause and leads to mis‑interpreted extra rows.
❌ Assuming multiprocessing.Queue gives you zero‑copy sharing: the queue pickles the object under the hood, so you pay the serialization overhead anyway and get none of the shared‑memory benefits (see the sketch below).
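A quick way to confirm that a Queue copies rather than shares the DataFrame (a minimal sketch; the mutation in the child is only there to prove that no memory is shared):
import multiprocessing as mp
import numpy as np
import pandas as pd

def consumer(q):
    df = q.get()      # arrives as a pickled copy, not a view of shared memory
    df['a'] += 100    # mutate the child's copy
    print("child:\n", df)

if __name__ == '__main__':
    df = pd.DataFrame(np.arange(6, dtype=np.int64).reshape(3, 2), columns=['a', 'b'])
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    q.put(df)         # Queue pickles the DataFrame under the hood
    p.join()
    print("parent:\n", df)  # unchanged: the child worked on its own copy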
Other common mistakes that reintroduce the problem:
- Creating a shared memory segment larger than the original array and forgetting to validate the size.
- Modifying the NumPy buffer in one process without a lock, causing race conditions.
- Building pandas DataFrames directly from the shared buffer without copying, leading to accidental in‑place changes.
When NOT to optimize
- Small datasets: Under a few thousand rows the overhead of serialization is negligible.
- One‑off scripts: Ad‑hoc analyses that run once and are not part of a pipeline.
- Read‑only sharing: If processes only read and never modify the data, the risk of duplication is minimal.
- Prototype stage: Early experimentation where performance is not yet a concern.
Frequently Asked Questions
Q: Do I need a lock if processes only read the shared DataFrame?
No, reads are safe, but any concurrent write still requires synchronization.
Q: Can I share a pandas DataFrame without pickling?
Yes, by exposing the underlying NumPy buffer with exact shape/dtype and using locks.
Sharing large pandas DataFrames via multiprocessing can boost performance, but the convenience of a raw NumPy buffer comes with synchronization pitfalls. By validating buffer dimensions, exposing shape metadata, and protecting writes with locks, you keep the data consistent and avoid silent row duplication in production pipelines.
Related Issues
→ Fix pandas merge many to many duplicates rows
→ Fix numpy concatenate memory allocation issue
→ Why pandas merge duplicates rows after groupby
→ Why buffer protocol speeds up pandas DataFrame I/O