NumPy array vs list memory layout: cause and fix

The script processed a few thousand rows without complaint, but once the input grew to 2 million items the job stalled and memory usage exploded. The pandas DataFrame built from a list of numbers ballooned to 12 GB, while the same DataFrame built from a NumPy array stayed under 2 GB. Only after profiling the objects did we realize the list was the hidden cost.

Here’s what this looks like:

import sys, numpy as np

# WRONG: accumulate Python objects, then copy into NumPy
values = []  # list of Python floats
with open('big.csv') as f:
    for line in f:
        # this loop was lifted from an old prototype
        values.append(float(line.strip()))  # each line becomes its own boxed Python float object on the heap

# later we convert to an array, which makes a second full copy of the data
arr = np.array(values)  # new contiguous buffer; list and array coexist, so memory peaks here
print('list length:', len(values))
# rough estimate of list memory (pointers + objects)
list_mem = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print('list approx. size:', list_mem, 'bytes')
print('array size (contiguous):', arr.nbytes, 'bytes')
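
If you want to see the double allocation directly, tracemalloc reports the peak while both copies are alive. This is a minimal sketch on synthetic numbers rather than the real big.csv; the 2_000_000 count simply mirrors the input size above, and recent NumPy releases report their buffer allocations to tracemalloc, so the peak covers both the list and the array.

import tracemalloc
import numpy as np

tracemalloc.start()
values = [float(i) for i in range(2_000_000)]  # stand-in for the parsed CSV lines
arr = np.array(values)                         # second copy; the list is still alive here
current, peak = tracemalloc.get_traced_memory()
print(f'peak traced memory: {peak / 1e6:.0f} MB')
tracemalloc.stop()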

Check your code:

import sys, numpy as np

# quick sanity check on a small sample
lst = [float(i) for i in range(1000)]
arr = np.array(lst)
print('list size (approx.):', sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst))
print('array nbytes:', arr.nbytes)
# on 64-bit CPython the list needs roughly 32 KB (an 8-byte pointer plus a 24-byte float object per element); the array needs 8 KB

The root cause is that a Python list stores references to individual float objects, each with its own header and heap allocation, while a NumPy array stores raw 64‑bit values in a single contiguous block. On 64-bit CPython that works out to roughly 32 bytes per element for the list (an 8-byte pointer plus a 24-byte float object) against 8 bytes for the array, and the pointer chasing adds cache-miss penalties on top. The NumPy documentation describes ndarrays as homogeneous and densely packed in memory, which is also why arithmetic kernels run faster on them.
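
You can confirm the layout difference without a profiler; the snippet below is a small sanity check, not part of the original pipeline, and just inspects per-element sizes and the array's contiguity metadata.

import sys
import numpy as np

arr = np.ones(1000, dtype=np.float64)
print(sys.getsizeof(1.0))         # one boxed Python float: 24 bytes on 64-bit CPython
print(arr.itemsize)               # raw storage per element: 8 bytes for float64
print(arr.strides)                # (8,): consecutive elements sit 8 bytes apart in one buffer
print(arr.flags['C_CONTIGUOUS'])  # True: the whole array is a single dense block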

Fixing this

Build the array directly, bypassing the intermediate Python objects. For file‑based data, np.fromiter consumes an iterator or generator and fills the contiguous buffer as it reads, so no temporary list is ever built:

import numpy as np, pandas as pd

# direct construction avoids the temporary list; count preallocates the buffer
# (fromiter raises ValueError if the file holds fewer than count lines)
with open('big.csv') as f:
    arr = np.fromiter((float(line) for line in f), dtype=np.float64, count=2_000_000)

# the DataFrame column stays a dense float64 block with no per-element Python objects
# (pandas may still make one more contiguous copy while building the block)
df = pd.DataFrame({'value': arr})
print('DataFrame memory usage:', df.memory_usage(deep=True).sum())

The gotcha is that many legacy scripts still use list comprehensions followed by np.array(). Switching to np.fromiter or np.loadtxt eliminates the double allocation and keeps memory footprints low. In production pipelines we also add a guard that raises if the input size exceeds an expected threshold, preventing accidental fallback to list‑based paths.
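
As a rough sketch of that pattern (the threshold and the helper name here are illustrative, not taken from the original pipeline), a guarded loader could look like this; np.loadtxt parses a single-column numeric file straight into a contiguous float64 array:

import numpy as np

MAX_EXPECTED_ROWS = 5_000_000  # hypothetical ceiling for this pipeline

def load_values(path):
    # parses directly into a contiguous float64 array, no intermediate list
    arr = np.loadtxt(path, dtype=np.float64)
    if arr.size > MAX_EXPECTED_ROWS:
        raise ValueError(f'{path}: {arr.size} values exceed the expected maximum of {MAX_EXPECTED_ROWS}')
    return arr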

After switching to np.fromiter the job completed in 45 seconds and memory stayed under 1.8 GB. The pipeline now runs reliably on the nightly schedule without OOM crashes.


Last verified: 2026-02-05
