rec-x64: check CpuRunning between translated blocks #2309

Open
Immersion95 wants to merge 1 commit into flyinghead:master from Immersion95:swapfix

@Immersion95
Contributor

I spent several weeks chasing this bug, and after a lot of dead ends, this seems to be the right fix.

What first looked like a renderer or Delay Frame Swapping issue actually comes from rec-x64 stop granularity in single-threaded mode. When Delay Frame Swapping is enabled, Stop() effectively acts as a frame-boundary signal. rec-x86 observes CpuRunning between translated blocks, but rec-x64 was only checking it at the start of the run loop, so it could keep running until the end of the current timeslice before reacting.

In practice, that means the delayed frame-boundary stop is handled too late on Win64, which causes the stutter / frame pacing issues seen in non-multicore mode.

This change makes rec-x64 re-check CpuRunning between translated blocks, which brings its behavior closer to rec-x86 in this area and fixes the issue in my testing.
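The difference in stop granularity described above can be illustrated with a toy model. This is a minimal sketch, not Flycast code: `Cpu`, `FakeBlock`, `runCoarse` and `runFine` are hypothetical names, and a "translated block" is reduced to something that burns a fixed number of cycles and requests a stop on a chosen execution.

```cpp
#include <cassert>

// Hypothetical stand-ins; none of these names come from the Flycast source.
constexpr int kTimesliceCycles = 448;  // slice length quoted later in this thread

struct Cpu {
    bool running = true;   // plays the role of CpuRunning
    int cycleCounter = 0;  // cycles left in the current timeslice
    long blocksRun = 0;    // translated blocks executed so far
};

// A fake translated block: costs `cycles` and requests a stop (as Stop()
// would) on its `stopAtBlock`-th execution.
struct FakeBlock {
    int cycles;
    long stopAtBlock;
    void operator()(Cpu& cpu) const {
        cpu.cycleCounter -= cycles;
        if (++cpu.blocksRun == stopAtBlock)
            cpu.running = false;
    }
};

// Coarse granularity: the flag is only examined once per timeslice, so a
// stop raised mid-slice is noticed after the slice has drained.
long runCoarse(Cpu& cpu, const FakeBlock& block) {
    assert(block.cycles > 0 && block.stopAtBlock >= 1);
    while (cpu.running) {
        cpu.cycleCounter += kTimesliceCycles;
        while (cpu.cycleCounter > 0)
            block(cpu);
    }
    return cpu.blocksRun;
}

// Block-level granularity: the flag is re-checked after every block.
long runFine(Cpu& cpu, const FakeBlock& block) {
    assert(block.cycles > 0 && block.stopAtBlock >= 1);
    while (cpu.running) {
        cpu.cycleCounter += kTimesliceCycles;
        while (cpu.cycleCounter > 0 && cpu.running)
            block(cpu);
    }
    return cpu.blocksRun;
}
```

With 8-cycle blocks and a stop requested on the third block, `runCoarse` keeps going until the 448-cycle slice is used up, while `runFine` returns right after the block that requested the stop.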

I specifically verified this against the long-standing Win64 + Delay Frame Swapping + non-multicore stutter case, and this is the first change that consistently fixes it for me.

I should also mention that I eventually found the root cause with the help of ChatGPT, which helped me narrow the issue down to rec-x64 stop handling.

This solves #1615 and #2308, and it might also solve some other issues specific to single-threaded mode.

Check CpuRunning after each translated block so Stop() is observed with
block-level granularity, like the x86 dynarec.

This fixes delayed frame-boundary handling on Win64 in single-threaded
mode with Delay Frame Swapping enabled.
@flyinghead
Owner

The thing is that the SH4 time slice is 448 cycles, which is 2.24 µs on real hardware and likely much less in Flycast.
That is several orders of magnitude below the 16.67 ms frame time at 60 Hz, so it's unlikely to be a generic fix that would work on all platforms and configurations: the difference in timing is microscopic.
The change probably also has a significant impact on x64 dynarec performance. That may not be a problem on modern hardware, but the slowdown itself could be enough to fix the frame pacing issue.
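As a rough check of the figures above, assuming the SH4's nominal 200 MHz clock (the clock rate is not stated in the comment, but it is the value for which 448 cycles come out at 2.24 µs):

```math
t_{\text{slice}} = \frac{448}{200 \times 10^{6}\,\text{Hz}} = 2.24\,\mu\text{s},
\qquad
\frac{t_{\text{frame}}}{t_{\text{slice}}} = \frac{16.67\,\text{ms}}{2.24\,\mu\text{s}} \approx 7.4 \times 10^{3}
```

So one 60 Hz frame spans roughly 7,400 time slices on real hardware, a gap of between three and four orders of magnitude.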

@Immersion95
Contributor Author

Immersion95 commented Apr 6, 2026

That’s a fair point :).

My take is that this is not just a slowdown effect, but an actual behavioral difference: rec-x64 was only seeing Stop() at run-loop / timeslice granularity, while rec-x86 already sees it between translated blocks.

In this specific single-threaded + Delay Frame Swapping path, Stop() basically becomes a frame-boundary signal, so that granularity seems to matter even if the slice itself is tiny.

Also, in my testing this is not only about visible stutter. With OpenGL and DX11, delayed frame swap was effectively not being applied correctly in some games, so the expected extra frame of latency reduction was not there, even when things could look subjectively smooth depending on the GPU. I could reproduce that with Street Fighter III: Double Impact, Street Fighter Alpha 3, and Tech Romancer / Kikaioh.

I tested this on a Ryzen 7600 and did not notice any practical performance loss, although that is of course a fairly recent CPU. If there is a cost, I would expect it to be small, maybe around 2-3%, but that is only a guess and should be confirmed with proper benchmarking.

I agree the extra check lives in rec-x64 itself, so any overhead would technically apply more broadly than just single-threaded mode. That said, people chasing maximum performance will usually run multicore anyway, while single-threaded mode is more about timing/accuracy and lower input lag, so I think correctness matters most in this path.

Also, this does not only fix #1615 for me, it fixes #2308 as well, which seems consistent with the same single-threaded stop-handling issue.

If you prefer, I can also make the extra check conditional on single-threaded mode.

@Immersion95
Contributor Author

One extra detail that may help explain why this patch works: rec-x64 seems to be the odd one out here. It was the only dynarec that was not re-checking "CpuRunning" in its dispatcher path between translated blocks. rec-x86, rec-ARM and rec-ARM64 already do, so this change mostly brings rec-x64 back in line with what the other dynarecs are already doing.

@flyinghead
Owner

So I looked a bit more closely at this issue and this game. The problem seems to be that this game starts rendering immediately after swapping the framebuffers, probably in the vblank interrupt handler (although I need to reverse engineer the game to confirm this).
So your PR does fix the issue by stopping the dynarec immediately and before the following render happens.

The problem is that this portion of the dynarec code is probably the hottest code in the emulator: it is executed more than 20 million times per second!!! So every instruction counts, and it's an expensive price to pay to fix a missing load screen in one game when running in single-threaded mode (which most people don't use).
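To put the concern in numbers, here is a back-of-the-envelope estimate. The per-dispatch cost $t$ is a free parameter, not a measurement, and the 1 ns value is purely illustrative:

```math
\text{overhead} = N \cdot t, \qquad N = 2 \times 10^{7}\,\text{s}^{-1}
\quad\Rightarrow\quad
t = 1\,\text{ns}:\;\; 2 \times 10^{7}\,\text{s}^{-1} \times 10^{-9}\,\text{s} = 0.02 = 2\%
```

Under that assumption, each additional nanosecond spent per dispatch costs about 2% of wall-clock time. The real cost of the added check depends on the host CPU and on how well the branch predicts, so it would have to be benchmarked.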

I guess I'll do some reverse engineering on this game to figure out more precisely what it does, and try to find a cheaper solution.

@flyinghead
Owner

One last thing: the x86-32 dynarec shouldn't be used as a reference since it's only used on some Android devices and isn't really optimized.
The most used ones are arm64 and arm32 (and x64, of course). In the x64 dynarec, each code block is a function, and the main loop calls each block one after the other. However, both ARM dynarecs use "block linking", where each block has a direct pointer to the code of the following block (or blocks, in the case of a conditional branch). The dynarec returns to the main loop only after the time slice is over. In other words, they don't check CpuRunning after each block.
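The contrast between the two dispatch styles can be sketched as follows. This is a simplified model under stated assumptions, not the actual dynarec code: `LinkedBlock`, `Stats`, `runDispatched` and `runLinked` are hypothetical names, blocks only consume cycles, and the "link" is a plain pointer to the successor.

```cpp
#include <cassert>

// Hypothetical model of a translated block with a direct link to its successor.
struct LinkedBlock {
    int cycles;               // guest cycles this block consumes
    const LinkedBlock* next;  // successor block (the "block link")
};

struct Stats {
    int blocks = 0;          // blocks executed during the timeslice
    int mainLoopVisits = 0;  // times the main loop had control, i.e. chances to test a stop flag
};

// Dispatcher style (as described for x64): the main loop calls each block
// and regains control after every one of them.
Stats runDispatched(const LinkedBlock* entry, int timeslice) {
    assert(entry != nullptr && timeslice > 0);
    Stats s;
    int counter = timeslice;
    const LinkedBlock* b = entry;
    while (counter > 0) {
        ++s.mainLoopVisits;  // a flag such as CpuRunning could be tested here
        assert(b != nullptr && b->cycles > 0);
        counter -= b->cycles;
        ++s.blocks;
        b = b->next;         // a real dispatcher would look up the next block by guest PC
    }
    return s;
}

// Block-linking style (as described for the ARM dynarecs): the main loop
// enters the chain once, and blocks jump to each other until the slice ends.
Stats runLinked(const LinkedBlock* entry, int timeslice) {
    assert(entry != nullptr && timeslice > 0);
    Stats s;
    s.mainLoopVisits = 1;
    int counter = timeslice;
    for (const LinkedBlock* b = entry; counter > 0; b = b->next) {
        assert(b != nullptr && b->cycles > 0);
        counter -= b->cycles;
        ++s.blocks;
    }
    return s;
}
```

For a two-block loop costing 10 and 6 cycles, both functions execute the same 56 blocks in a 448-cycle slice, but the dispatcher version passes through the main loop 56 times and the linked version only once. That is why a per-block flag check fits naturally in the first style and has no obvious place in the second.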

@Immersion95
Contributor Author

> So I looked a bit more closely at this issue and this game. The problem seems to be that this game starts rendering immediately after swapping the framebuffers, probably in the vblank interrupt handler (although I need to reverse engineer the game to confirm this). So your PR does fix the issue by stopping the dynarec immediately and before the following render happens.
>
> The problem is that this portion of the dynarec code is probably the hottest code in the emulator: it is executed more than 20 million times per second!!! So every instruction counts, and it's an expensive price to pay to fix a missing load screen in one game when running in single-threaded mode (which most people don't use).
>
> I guess I'll do some reverse engineering on this game to figure out more precisely what it does, and try to find a cheaper solution.

Are you referring to PenPen?

Also, do you think your potential fix would handle the delayed frame swap issue in single-threaded mode for games like Street Fighter III: Double Impact and Street Fighter Zero 3 as well, the same way my "brute-force" fix does?

@flyinghead
Owner

Yes, I was referring to PenPen.
As for the fix, since it doesn't exist yet, I can't say what it would fix.
