Pytest parametrize slowdown: cause and fix
We saw the test suite crawl for 12 minutes, CPU spiking to 200%, and the CI log filling with lines like test_my_func[param0-0] PASSED. The test looked fine—just a simple parametrize over a pandas DataFrame—but the run time was absurd. Only after we printed the number of generated test cases did we realize we were spawning thousands of heavy DataFrames.
Here’s what this looks like:
import pytest
import pandas as pd

# spent 30min debugging this
@pytest.mark.parametrize(
    "df, multiplier",
    [
        (pd.DataFrame({"value": range(1_000)}), 2),
        (pd.DataFrame({"value": range(2_000)}), 3),
        (pd.DataFrame({"value": range(3_000)}), 4),
        # ... imagine 50 more combos generated by a script
    ],
)
def test_sum(df, multiplier):
    # every DataFrame in the list above is built eagerly at collection time – heavy!
    result = df["value"].sum() * multiplier
    assert result > 0
Run the collection step alone to count generated tests:
pytest -q --collect-only | grep test_sum | wc -l
If the count is in the thousands, you’re likely over‑parametrizing. You can also add `-vv` and watch the same test name appear repeatedly in the output.
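If you prefer counting in-process instead of piping through grep, pytest’s standard pytest_collection_modifyitems hook can do it from a conftest.py. This is only a minimal sketch; the test_sum filter is specific to this example:
# conftest.py
def pytest_collection_modifyitems(session, config, items):
    # items holds every collected test case, parametrized IDs included
    combos = [item for item in items if item.name.startswith("test_sum")]
    print(f"test_sum generates {len(combos)} cases")  # run with -s if capture hides it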
The root cause is that @pytest.mark.parametrize expands every tuple into a separate test case, and the whole parameter list is evaluated up front at collection time, so each combination builds its own pandas DataFrame before a single test even runs. Constructing a DataFrame is comparatively expensive, and multiplying that cost by dozens (or thousands) of combos explodes both collection and runtime. This isn’t a pytest bug; it’s simply how parametrize works, and stacking several parametrize decorators compounds it, because pytest takes the Cartesian product of the stacked value lists.
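To see the multiplication effect in isolation, here is a tiny, purely hypothetical example: two stacked decorators with three and four values generate 3 × 4 = 12 test cases from only seven listed values.
import pytest

@pytest.mark.parametrize("size", [1_000, 2_000, 3_000])
@pytest.mark.parametrize("multiplier", [2, 3, 4, 5])
def test_combo(size, multiplier):
    # pytest generates every (size, multiplier) pair: 3 * 4 = 12 cases
    assert size * multiplier > 0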
Fixing this
The quick win is to move the heavy DataFrame creation into a fixture that runs once per test module:
@pytest.fixture(scope="module")
def base_df():
    # FIXME: temporary until upstream provides a cached CSV
    return pd.DataFrame({"value": range(5_000)})

@pytest.mark.parametrize("multiplier", [2, 3, 4])
def test_sum(base_df, multiplier):
    result = base_df["value"].sum() * multiplier
    assert result > 0
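A quick sanity check that the frame really is built once: drop a print into the fixture and run pytest -s. With scope="module" the message appears a single time; with the default function scope it appears once per parametrized case. A minimal sketch:
@pytest.fixture(scope="module")  # switch to the default scope to watch it rebuild per case
def base_df():
    print("building base_df")  # visible with: pytest -s
    return pd.DataFrame({"value": range(5_000)})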
That scoping is the real gotcha: leave the fixture at its default function scope and you are back to rebuilding the frame for every parametrized value. For a production‑ready approach, generate the parameter list without embedding DataFrames at all, and let the fixture slice or transform a shared DataFrame:
@pytest.fixture(scope="module")
def shared_df():
    return pd.DataFrame({"value": range(10_000)})

# params are plain integers – the heavy DataFrame is built once in the fixture
@pytest.mark.parametrize("size,multiplier", [(1_000, 2), (2_000, 3), (3_000, 4)])
def test_sum(shared_df, size, multiplier):
    df = shared_df.head(size)  # cheap slice compared to constructing a new frame
    result = df["value"].sum() * multiplier
    assert result > 0
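If you do want pytest’s indirect parametrization, the value has to be routed into a fixture of the same name via request.param. A minimal sketch, with a hypothetical sized_df fixture reusing the shared_df above:
@pytest.fixture
def sized_df(shared_df, request):
    # request.param is the size passed in by indirect parametrization
    return shared_df.head(request.param)

@pytest.mark.parametrize("sized_df,multiplier", [(1_000, 2), (2_000, 3), (3_000, 4)], indirect=["sized_df"])
def test_sum_indirect(sized_df, multiplier):
    assert sized_df["value"].sum() * multiplier > 0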
By decoupling the DataFrame from the param list you keep the number of generated tests low while still exercising the logic under different sizes. If you truly need many combos, consider using pytest-benchmark to profile a single test loop instead of exploding the param matrix.
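If you go the pytest-benchmark route, its benchmark fixture re-runs a callable and reports timing statistics instead of multiplying test cases. A minimal sketch, assuming the plugin is installed and reusing the shared_df fixture above:
def test_sum_benchmark(shared_df, benchmark):
    # benchmark() calls the lambda repeatedly and returns its result
    result = benchmark(lambda: shared_df["value"].sum() * 2)
    assert result > 0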
We discovered this after our nightly CI started missing its 10‑minute threshold. After switching to module‑scoped fixtures the suite finishes in about 45 seconds, the timeout flag is gone, and the pipeline is back within its SLA.
Tested on Python 3.12
Related Issues
→ Why numpy object dtype hurts pandas performance
→ Why Python GC tunables slow pandas DataFrame processing
→ Why Sentry capture of pandas DataFrames hurts performance
→ Fix How cffi vs ctypes impacts performance