Pandas concat vs append memory usage: detection and optimization

Unexpected memory spikes in pandas operations often appear in production pipelines that stitch together large CSV exports or API feeds, where developers grow a DataFrame inside a loop with repeated append calls. The intermediate copies push RAM consumption far above the size of the final result and can crash downstream analysis.

# Example showing the issue
import pandas as pd, numpy as np, sys

def mem(df):
    return df.memory_usage(deep=True).sum()

# Simulate 5 chunks of 200k rows each
chunks = [pd.DataFrame({'a': np.random.rand(200_000), 'b': np.random.randint(0,100,size=200_000)}) for _ in range(5)]

# Repeated append (deprecated in pandas 1.4 and removed in 2.0; this loop needs an older pandas to run)
df_append = pd.DataFrame()
for chunk in chunks:
    df_append = df_append.append(chunk, ignore_index=True)
print('Append memory:', mem(df_append))

# Single concat
df_concat = pd.concat(chunks, ignore_index=True)
print('Concat memory:', mem(df_concat))
# The two final DataFrames have nearly identical footprints; the real cost of the
# append loop is the peak RSS held by intermediate copies while it runs (measured below)

Repeated DataFrame.append in a loop creates a brand‑new DataFrame on every iteration, copying both the accumulated result and the new chunk each time. Those intermediate copies inflate the process’s resident set size even though the final object is no larger than the concat result. A single pd.concat builds the final object once, avoiding the cascade of temporary copies. The cost pattern mirrors repeatedly concatenating Python lists or strings with + inside a loop. How severe the spike gets depends on the factors below; a peak-memory measurement sketch follows the list.

  • Number of iterations
  • Size of each chunk
  • Absence of in‑place extension API
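
The extra cost shows up in peak allocations, not in the size of the final frame. Here is a minimal measurement sketch; it rebuilds a single-column chunk list so the block is self-contained, and it grows the frame with pd.concat inside a loop (the modern equivalent of repeated append) so it also runs on pandas 2.x. It relies on tracemalloc, which traces NumPy buffer allocations on recent NumPy versions; exact numbers will vary by version and platform.

# Compare peak traced memory: growing a frame in a loop vs a single concat
import tracemalloc
import numpy as np
import pandas as pd

chunks = [pd.DataFrame({'a': np.random.rand(200_000)}) for _ in range(5)]

def peak_mb(build):
    tracemalloc.start()
    build(chunks)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

def loop_growth(cs):
    acc = pd.DataFrame()
    for c in cs:
        acc = pd.concat([acc, c], ignore_index=True)   # same copy pattern as .append
    return acc

def single_concat(cs):
    return pd.concat(cs, ignore_index=True)

print(f'Loop peak:   {peak_mb(loop_growth):.1f} MB')
print(f'Single peak: {peak_mb(single_concat):.1f} MB')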

To diagnose this in your code:

# Monitor process RSS while the loop runs
import os
import psutil

process = psutil.Process(os.getpid())
df = pd.DataFrame()   # accumulator being grown the wrong way
for i, chunk in enumerate(chunks, 1):
    # DataFrame.append needs pandas < 2.0; on 2.x the equivalent anti-pattern is pd.concat([df, chunk])
    df = df.append(chunk, ignore_index=True)
    print(f'After iteration {i}: {process.memory_info().rss/1e6:.1f} MB')
# A steady upward trend across iterations signals the issue

Fixing the Issue

The quick, readable fix is to avoid the loop entirely:

df = pd.concat(chunks, ignore_index=True)
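
In the pipeline scenario from the introduction, that means collecting the pieces in a list as you read them and concatenating once at the end. A minimal sketch, with a hypothetical exports/ directory of CSV files:

# Read each export, collect into a list, concatenate once
import glob
import pandas as pd

paths = sorted(glob.glob('exports/*.csv'))   # hypothetical export directory
frames = [pd.read_csv(p) for p in paths]
df = pd.concat(frames, ignore_index=True)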

For production pipelines you may need explicit validation and logging:

import logging
logging.basicConfig(level=logging.INFO)

if not chunks:
    raise ValueError('No chunks were produced upstream')
if len(chunks) > 1:
    logging.info('Concatenating %d chunks', len(chunks))
    df = pd.concat(chunks, ignore_index=True)
else:
    df = chunks[0]

# Verify memory footprint stays within expectations
expected = sum(c.memory_usage(deep=True).sum() for c in chunks)
actual = df.memory_usage(deep=True).sum()
assert actual <= expected * 1.1, f'Unexpected memory growth: {actual}'

This pattern logs the operation, validates its input instead of quietly appending in a loop, and asserts that the final DataFrame does not exceed a reasonable memory budget.
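
If a single export is itself too big to read in one call, the same one-concat principle still applies: pd.read_csv can yield the file in pieces via chunksize, and pd.concat accepts any iterable of DataFrames, so nothing ever gets appended piece by piece. A sketch, assuming a hypothetical big_export.csv:

# Stream a large CSV in pieces, then build the final frame with one concat
import pandas as pd

reader = pd.read_csv('big_export.csv', chunksize=100_000)   # hypothetical file
df = pd.concat(reader, ignore_index=True)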

What Doesn’t Work

❌ Using df.reset_index(inplace=True) after each append: only reindexes, does not reduce copies

❌ Converting each chunk to a NumPy array then back to DataFrame: adds conversion overhead and larger memory peaks

❌ Calling .copy() on every chunk before appending: adds an extra copy of every chunk on top of the copies append already makes

Common mistakes that lead to the spike in the first place (a cleanup sketch follows this list):

  • Appending inside a for‑loop without measuring memory
  • Calling pd.concat inside the loop instead of once
  • Neglecting to drop references to intermediate DataFrames
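
The last point matters because the chunk list and the concatenated result briefly coexist, roughly doubling the footprint until the list is released. A small cleanup sketch, assuming the chunks list from earlier (the allocator may not hand every freed page straight back to the OS):

# Release the chunk list once the concatenated frame exists
df = pd.concat(chunks, ignore_index=True)
del chunks   # the concatenated copy is now the only live reference to that data
# In long-running jobs, gc.collect() can additionally prompt collection of cycles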

When NOT to optimize

  • Tiny datasets: Under a few thousand rows, the overhead is negligible
  • One‑off analysis: Interactive notebooks where speed matters more than RAM
  • Already using list‑of‑chunks: If you already have a list, concat is the natural step
  • Legacy code: Maintaining backward compatibility where append is still supported

Frequently Asked Questions

Q: Does DataFrame.append create a copy of the original DataFrame?

Yes. Each call returns a new DataFrame, copying both the existing data and the new chunk; this copy-per-call behavior is why DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.
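
A quick way to confirm the copy on a current pandas release is to check whether the result shares buffers with the input; pd.concat behaves the same way as the removed append in this respect:

# The concatenated result does not share memory with the input frame
import numpy as np
import pandas as pd

base = pd.DataFrame({'a': np.arange(3)})
out = pd.concat([base, base], ignore_index=True)
print(np.shares_memory(base['a'].to_numpy(), out['a'].to_numpy()))   # False: data was copied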


Memory efficiency in pandas hinges on minimizing temporary DataFrames. By collecting chunks and calling pd.concat once, you keep RAM usage predictable and avoid hidden spikes that can derail production jobs. Keep an eye on the process RSS during any iterative build.
