gh-142183: Cache one datachunk per tstate to prevent alloc/dealloc thrashing #145789
Yhg1s wants to merge 3 commits into python:main from
Conversation
Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.
Just to be clear: this is effectively a freelist of one, and there is still an easily crafted (though, I would argue, much less likely in practice) case where two or more stack chunks are repeatedly allocated and deallocated. That requires a much longer chain of calls, or much larger functions, so the effect is less pronounced, but crafting code to hit that exact case isn't hard. It shows a ~15% penalty for sitting at just the wrong stack depth, compared to 35%+ for the single-chunk case.

I considered making the cached chunk a full freelist (which would be easy, since the chunks already form a linked list), but that would keep all of a thread's datastack chunks alive for the thread's entire lifetime, which might not be a good idea. Caching a single chunk seems like a reasonable compromise.

Here are some benchmark results using the repro I provided in the issue, run on a not particularly quiet machine, so the results are a little noisy. 55 is the stack depth that triggers the bad case; 56 is one level deeper (so slightly more work) and avoids it.
markshannon
left a comment
Looks good. Thanks for fixing this.
Do you want to backport this to 3.14 and maybe 3.13?
Note:
Although this is a solid fix for 3.14 and 3.13, we'll probably want to use a resizable stack for 3.15+ to avoid deopts in the SAI and JIT when operating at the edge of a chunk.
Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.