http://github.com/python/cpython/pull/145789

gh-142183: Cache one datachunk per tstate to prevent alloc/dealloc thrashing by Yhg1s · Pull Request #145789 · python/cpython · GitHub

gh-142183: Cache one datachunk per tstate to prevent alloc/dealloc thrashing #145789

Open
Yhg1s wants to merge 3 commits into python:main from Yhg1s:cache-datachunk

Conversation

Member

@Yhg1s Yhg1s commented Mar 11, 2026

Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.

Member Author

Yhg1s commented Mar 11, 2026

Just to be clear: this is effectively a freelist of 1, and there's still an easily crafted (but, I would argue, much less likely in practice) case where two or more stack chunks are repeatedly allocated and deallocated. That requires a much larger chain of calls -- or much larger functions -- so it's not as pronounced, but crafting code to hit that exact case isn't hard. It shows a ~15% penalty for being at just the wrong stack depth, compared to 35+% for the single-chunk case.

I considered making the cached chunk a full freelist (which would be easy, since the chunks already form a linked list), but that would mean keeping all of a thread's datastack chunks alive for the thread's entire duration, which might not be a good idea. Caching a single chunk seems like a reasonable compromise.

Here are some benchmark results using the repro I provided in the issue, run on a not particularly quiet machine, so the results are a little noisy. 55 is the stack depth that triggers the bad case; 56 is one level deeper (so slightly more work) and avoids it:

% hyperfine --warmup 3 './base/python repro.py' './fixed/python repro.py'
Benchmark 1: ./base/python repro.py
  Time (mean ± σ):      24.1 ms ±   3.8 ms    [User: 15.9 ms, System: 7.7 ms]
  Range (min … max):    20.9 ms …  41.8 ms    112 runs

Benchmark 2: ./fixed/python repro.py
  Time (mean ± σ):      18.4 ms ±   2.0 ms    [User: 14.8 ms, System: 3.3 ms]
  Range (min … max):    16.4 ms …  28.3 ms    150 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./fixed/python repro.py ran
    1.31 ± 0.25 times faster than ./base/python repro.py
%  hyperfine --warmup 3 './base/python repro.py 55' './base/python repro.py 56'
Benchmark 1: ./base/python repro.py 55
  Time (mean ± σ):      21.6 ms ±   2.4 ms    [User: 14.6 ms, System: 6.7 ms]
  Range (min … max):    19.5 ms …  31.5 ms    128 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./base/python repro.py 56
  Time (mean ± σ):      16.7 ms ±   1.3 ms    [User: 13.7 ms, System: 2.8 ms]
  Range (min … max):    15.3 ms …  25.1 ms    165 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./base/python repro.py 56 ran
    1.30 ± 0.17 times faster than ./base/python repro.py 55
% hyperfine --warmup 3 './fixed/python repro.py 55' './fixed/python repro.py 56'
Benchmark 1: ./fixed/python repro.py 55
  Time (mean ± σ):      17.1 ms ±   1.5 ms    [User: 13.8 ms, System: 3.1 ms]
  Range (min … max):    15.6 ms …  24.6 ms    164 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./fixed/python repro.py 56
  Time (mean ± σ):      17.1 ms ±   2.0 ms    [User: 13.8 ms, System: 3.1 ms]
  Range (min … max):    15.8 ms …  26.4 ms    170 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./fixed/python repro.py 55 ran
    1.00 ± 0.15 times faster than ./fixed/python repro.py 56

Member

@markshannon markshannon left a comment


Looks good. Thanks for fixing this.

Do you want to backport this to 3.14 and maybe 3.13?

Note:
Although this is a solid fix for 3.14 and 3.13, we'll probably want to use a resizable stack for 3.15+ to avoid deopts in the SAI and JIT when operating at the edge of a chunk.

