timeit vs perf for microbenchmarks: cause and resolution

Unexpected timing variance in pandas DataFrame operations often shows up in production pipelines that process CSV exports or API streams, where developers rely on timeit to gauge performance. The numbers diverge because timeit measures wall-clock time of a snippet inside the running interpreter, while perf observes the whole process from the outside and counts hardware events.

# Example showing the issue
import pandas as pd
import timeit, subprocess, textwrap

# Simple DataFrame operation
df = pd.DataFrame({"a": range(1_000)})

# timeit measures wall‑clock time at the Python level
t = timeit.timeit('df["b"] = df["a"] * 2', globals=globals(), number=500)
print(f"timeit: {t:.6f} seconds")

# perf stat measures CPU cycles, cache misses, etc.
code = textwrap.dedent('''
    import pandas as pd
    df = pd.DataFrame({"a": range(1000)})
    df["b"] = df["a"] * 2
''')
result = subprocess.run(
    ['perf', 'stat', '-r', '3', '-x', ',', 'python', '-c', code],
    capture_output=True, text=True
)
# perf stat writes its counters to stderr, not stdout; grab the cycles line
print('perf output (cycles line):')
print(next(line for line in result.stderr.splitlines()
           if 'cycles' in line and 'stalled' not in line))

# Sample output (perf value illustrative; exact CSV columns vary by perf version)
# timeit: 0.842321 seconds
# perf output (cycles line):
# 12345678,,cycles,...

timeit executes the target code in the same Python interpreter and reports elapsed wall‑clock time, which includes interpreter overhead such as bytecode dispatch (note that timeit disables garbage collection during timing by default, and standard CPython has no JIT warm‑up to account for). perf, on the other hand, attaches to the process at the OS level and counts hardware events such as CPU cycles, cache misses, and branch mispredictions, and it does so for everything the process does, including interpreter start‑up and the pandas import. This follows standard Linux perf semantics (see perf_event_open(2)) and often surprises developers who assume both tools measure the same thing. Related factors (a timing sketch follows the list):

  • timeit includes Python bytecode dispatch cost for every iteration of the snippet
  • perf counts hardware events for the whole process, so interpreter start‑up and imports are part of its numbers even though timeit excludes them
  • each of perf’s repeated runs (-r) pays start‑up cost again, while timeit amortizes warm‑up across its iterations
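
A quick way to see the interpreter-level side in isolation is timeit.repeat, sketched below with illustrative repeat and iteration counts; re-enabling the garbage collector in the setup (timeit turns it off during timing by default) makes interpreter-side variance visible.

# Interpreter-level view: several timed runs of the same snippet
import timeit
import pandas as pd

df = pd.DataFrame({"a": range(1_000)})

# gc.enable() in the setup undoes timeit's default GC suspension
runs = timeit.repeat(
    'df["b"] = df["a"] * 2',
    setup='import gc; gc.enable()',
    globals=globals(),
    repeat=5,
    number=500,
)
print('per-run seconds:', [f'{r:.4f}' for r in runs])
print('best of 5:', f'{min(runs):.4f}')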

To diagnose this in your code:

# In CI, look for a large gap between timeit and perf results
# Example warning printed by a custom check script
import subprocess

def check_microbenchmark():
    # timeit figure pulled from earlier benchmark logs (placeholder value)
    t_timeit = 0.85  # seconds
    # Run perf on a placeholder snippet and parse the cycles counter.
    # perf stat writes its counters to stderr, not stdout.
    perf_out = subprocess.run(
        ['perf', 'stat', '-r', '1', '-x', ',', 'python', '-c', "print('hi')"],
        capture_output=True, text=True,
    ).stderr
    cycles_line = next(line for line in perf_out.splitlines()
                       if 'cycles' in line and 'stalled' not in line)
    cycles = int(cycles_line.split(',')[0])
    # Rough conversion: on a 3 GHz CPU, 1 cycle ≈ 3.33e-10 s – adjust for your host
    t_perf = cycles * 3.33e-10
    if t_perf * 10 < t_timeit:
        print('Warning: timeit is more than 10× the perf-derived time – '
              'interpreter overhead likely dominates')

# Running this script in CI surfaces the mismatch without manual inspection

Fixing the Issue

When you only need a rough sense of how long a snippet of Python takes, timeit is convenient: wrap the snippet in a lambda or use the -m timeit CLI. It gives you a single number you can compare across revisions.
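
Both conveniences, as a minimal sketch with an illustrative pandas snippet (the CLI invocation is shown as a comment):

# Callable form: pass a lambda instead of a source string
import timeit
import pandas as pd

df = pd.DataFrame({"a": range(1_000)})
t = timeit.timeit(lambda: df["a"] * 2, number=500)
print(f"lambda form: {t:.6f} s for 500 iterations")

# CLI form: setup code after -s, the timed statement last
#   python -m timeit -s "import pandas as pd; df = pd.DataFrame({'a': range(1000)})" "df['a'] * 2"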

However, for production‑grade microbenchmarks, especially when the hot path sits in NumPy or pandas C code that may release the GIL, perf is the safer choice. Run it via subprocess or integrate it into a pytest plugin. Example:

import subprocess

def perf_measure(py_code: str) -> float:
    cmd = [
        "perf", "stat", "-r", "5", "-x", ",",
        "python", "-c", py_code,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # perf stat prints its counters to stderr; the first CSV field is the value
    for line in result.stderr.splitlines():
        if 'cycles' in line and 'stalled' not in line:
            cycles = float(line.split(',')[0])
            return cycles * 3.33e-10  # seconds on a 3 GHz CPU – adjust for your host
    raise RuntimeError('perf did not return cycles')

# Use the helper in a benchmark suite
seconds = perf_measure('import pandas as pd; df = pd.DataFrame({"a": range(1000)}); df["b"] = df["a"] * 2')
print(f'perf‑derived time: {seconds:.6f}s')

The gotcha here is that perf reports hardware events, so the numbers depend on the CPU model and its frequency scaling. Pin the CPU frequency (or disable turbo boost), keep the process on a single core, and use several repetitions (-r) for more stable results.
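
One way to reduce that variance, as a sketch: prefix the perf command with taskset to pin the child process to a single core (taskset ships with most Linux distros; the core number 0 used here is arbitrary).

import subprocess

def perf_measure_pinned(py_code: str, core: int = 0) -> str:
    # taskset -c <core> keeps the benchmark on one CPU core, avoiding
    # cross-core migration noise in the hardware counters
    cmd = [
        "taskset", "-c", str(core),
        "perf", "stat", "-r", "5", "-x", ",",
        "python", "-c", py_code,
    ]
    # perf stat writes its counters to stderr
    return subprocess.run(cmd, capture_output=True, text=True).stderr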

For a quick sanity check during development you can still fire off python -m timeit and compare the output to the perf‑derived time. If the gap exceeds an order of magnitude, investigate interpreter‑level costs (e.g., unnecessary object creation, excessive Python loops) before optimizing the C‑level code.

In short: use timeit for pure‑Python loops, switch to perf when the code involves pandas/DataFrame vectorized ops, C extensions, or you need insight into cache behavior.

What Doesn’t Work

❌ Wrapping the pandas operation in a try/except that catches all exceptions and returns a constant timing: it hides genuine performance regressions and corrupts benchmark data.

❌ Switching the join type to outer just to change the timeit figure: this changes how much work the operation does and gives you a completely different measurement.

❌ Calling .copy() on the DataFrame before each run: it adds allocation and copy time to every measurement, so you end up benchmarking the copy rather than the operation you actually care about.

  • Running timeit inside a Jupyter notebook without resetting the kernel, which leaves hidden state and skews results.
  • Using perf with a single run and treating that one measurement as definitive; a single run is noisy, so pass -r for several repetitions.
  • Comparing raw timeit seconds to perf cycles directly without converting cycles to time units (a conversion sketch follows this list).
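
For that last point, a rough conversion sketch that reads the host's reported clock instead of hard-coding 3 GHz. The 'cpu MHz' field exists in /proc/cpuinfo on x86 Linux and reflects the current, possibly scaled, frequency, so treat the result as approximate; the cycle count below is illustrative.

# Convert a perf cycle count to seconds using the host's reported clock
import re
from pathlib import Path

def cycles_to_seconds(cycles: float) -> float:
    cpuinfo = Path('/proc/cpuinfo').read_text()
    mhz = float(re.search(r'cpu MHz\s*:\s*([\d.]+)', cpuinfo).group(1))
    return cycles / (mhz * 1e6)

print(f'{cycles_to_seconds(12_345_678):.6f} s')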

When NOT to optimize

  • One‑off scripts: If the benchmark is a throw‑away script that runs once, the extra setup for perf isn’t worth the effort.
  • Tiny datasets: Under a few thousand rows, wall‑clock time differences are negligible and timeit is sufficient.
  • Non‑Linux environments: perf isn’t available on Windows/macOS without heavy emulation, so stick with timeit.
  • CI time constraints: When the CI pipeline must stay under a minute, the overhead of multiple perf runs may break the build window.

Frequently Asked Questions

Q: Can perf be used on macOS?

Not directly; macOS provides dtrace/sample but they have different semantics, so stick with timeit on that platform.

Q: Do I need root privileges for perf?

On most Linux distros, non‑root users can run perf stat as long as the kernel.perf_event_paranoid sysctl is permissive enough (typically a value of 2 or lower); otherwise run it with sudo or lower the setting.
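
A quick pre-flight check, as a sketch (the threshold of 2 matches common upstream kernel behavior; some distros patch in stricter values):

# Check whether unprivileged perf counting is likely to work on this host
from pathlib import Path

paranoid = int(Path('/proc/sys/kernel/perf_event_paranoid').read_text())
if paranoid > 2:
    print(f'perf_event_paranoid={paranoid}: run perf with sudo or lower the sysctl,')
    print('e.g. sudo sysctl kernel.perf_event_paranoid=1')
else:
    print(f'perf_event_paranoid={paranoid}: unprivileged perf stat should work')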


The key insight is that timeit and perf live in different measurement worlds—one at the interpreter level, the other at the hardware level. Once you align the tool with the layer you care about, the numbers stop fighting each other. We ran into this when a pandas aggregation that looked fast in timeit turned out to thrash the CPU cache, a fact only perf revealed, and the fix saved us milliseconds per batch in production.