Graphical corruption and memory page faults on Vega 56/64 under Linux #1792

CodingTwist · 2023-06-01T19:10:52Z

Version information

mc1.19.4-0.4.10+build.24

Expected Behavior

Game renders

Actual Behavior

The game doesn't render; it creates huge artifacts while bringing the GPU to 100% usage.

Reproduction Steps

Launch the game
Join a world and wait a few seconds

Java version

Java 17.0.7 & Java 20.0.1

CPU

Intel i7-8700

GPU

AMD ATI Radeon RX Vega 56/64

Additional information

I am running Arch Linux on kernel 6.3.5-arch1-1 with an AMD GPU.

I was asked to launch the mod together with Fabric API, which had no effect. Vanilla Minecraft runs fine, and OptiFine works.

This is the log from launching the game and then force-killing it once it began lagging:
https://paste.ee/p/yqLZu

The only errors I am getting are in the kernel ring buffer:

[  191.917437] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[  191.920212] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32778, for process java pid 2986 thread java:cs0 pid 3064)
[  191.920233] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080011a86c000 from IH client 0x1b (UTCL2)
[  191.920246] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601030
[  191.920253] amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  191.920259] amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  191.920264] amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  191.920270] amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  191.920274] amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  191.920279] amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
[  201.943945] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered

GPU driver info:

OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX Vega (vega10, LLVM 15.0.7, DRM 3.52, 6.3.5-arch1-1)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.1.1
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.1.1
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.1.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
OpenGL ES profile extensions:

Please just ask if you need more info about my system.


Motschen · 2023-06-04T09:43:50Z

I'm also encountering the same issue, but instead of just crashing the game, it crashes the whole compositor for me, both on Hyprland using Wayland and on KDE using X11.
It seems to be caused by a recent Mesa update, as this only started happening after a system update.

ghost · 2023-06-05T03:11:36Z

Setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland).

Regular-Baf · 2023-06-16T16:56:58Z

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

Pretty sure I'm having the exact same issue on Vega 64 (Mesa 23.1.2 on Fedora 38 Plasma Wayland). Changing Async to Swap does resolve it, as does running Minecraft through Zink. I've had nothing but stability issues with Vega across OpenGL/OpenCL for years, so maybe this is a Mesa or amdgpu issue more than a Sodium issue.

ZtereoHYPE · 2023-06-30T12:00:30Z

I encountered the same issue on a friend's system, and joining a world brought the entire system down to a screen-flickering state. AMD Vega 64, Mesa 23.1.3, Plasma X11. Switching to Swap also seems to fix it.

RedMaster13 · 2023-06-30T16:28:40Z

I'm having the same issue here. AMD Vega 56, Arch Linux. Downgrading to Mesa 23.0.3 fixed my issue.

jellysquid3 · 2023-07-01T16:25:42Z

Hm. I haven't been able to reproduce any of these issues on my system (RX 6900 XT, Mesa 23.1.2, Linux 6.3.8), but it also seems that this problem exclusively affects the Vega 56/64 (which are a known problem child on Linux...)

The problem seems to be related to persistently mapped memory under OpenGL, hence the reason why switching the "Chunk Memory Allocator" strategy to "Swap" fixes the crashes. Both the corruption and hardware page faults would seem to agree with this.

I am going to see if we can bisect where the problem appeared in Mesa, and look into filing a bug. They've been helpful in the past with these things, so I think we have a good chance at fixing this.

To be clear, I don't think there is any bug with Sodium here, rather this is a regression in the Mesa graphics stack.

jellysquid3 · 2023-07-01T16:29:20Z

For the time being, the solutions we've seen solve this problem are:

Using the Zink driver (set the environment variable MESA_LOADER_DRIVER_OVERRIDE=zink for Minecraft; it might not perform well).
Changing the setting at Video Settings > Advanced > Chunk Memory Allocator to "SWAP" (will likely degrade performance severely.)
Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)
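For the Zink workaround, the variable only needs to be present in the environment of the Java process that runs the game. As an illustration, here is a minimal sketch of a hypothetical wrapper (the `ZinkLauncher` class and the launch command are made up for this example) that sets MESA_LOADER_DRIVER_OVERRIDE=zink for a child process only, leaving the rest of the desktop on the normal driver:

```java
import java.util.List;

final class ZinkLauncher {
    // Builds (but does not start) a process whose environment tells Mesa's loader
    // to use the Zink (OpenGL-on-Vulkan) driver. Only the child process sees it.
    static ProcessBuilder withZink(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("MESA_LOADER_DRIVER_OVERRIDE", "zink");
        pb.inheritIO(); // forward the game's stdout/stderr to this console
        return pb;
    }
}
```

Calling `ZinkLauncher.withZink(List.of(...)).start()` with the real launch command would run the game under Zink; the command itself depends on your launcher setup.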

jellysquid3 · 2023-07-20T00:40:25Z

We do not have any way to debug or fix this. The problem seems exclusively limited to the Vega 56/64 (and professional cards of that series) and we do not have any such graphics cards on hand. That said, I'm almost certain this problem has nothing to do with Sodium, as there's no good explanation for what could be going wrong on our side.

The only option here would be to make a bug report to Mesa about this problem. I suspect it would help them a lot if you could provide an API trace.

wingedseahorse · 2023-07-20T02:41:46Z

Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)

This is working for me as well.

electron271 · 2023-08-10T14:31:10Z

* Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)

Working as well

Bettehem · 2023-08-11T02:41:26Z

I'm using the Zink workaround, as downgrading Mesa isn't a viable option for me. It works nicely without shaders, but with shaders Zink's performance isn't very good.

jellysquid3 · 2023-08-14T04:56:30Z

This might be accidentally fixed with Sodium 0.5.1 since we now use a 16-byte alignment on vertex data.
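Assuming "16-byte alignment" means that the start offset of each vertex allocation is rounded up to a multiple of 16 bytes (an interpretation, not something the comment above spells out), the rounding itself can be sketched with a hypothetical helper:

```java
final class Alignment {
    // Rounds `offset` up to the next multiple of `alignment`, which must be a power of two.
    static long alignUp(long offset, long alignment) {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0) {
            throw new IllegalArgumentException("alignment must be a power of two");
        }
        // For a power of two, -alignment is a mask that clears the low bits.
        return (offset + alignment - 1) & -alignment;
    }
}
```

For example, `alignUp(17, 16)` gives 32, while offsets that are already multiples of 16 are left unchanged.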

Regular-Baf · 2023-08-27T06:49:59Z

I've just tested Sodium 0.5.2 and unfortunately the system freeze still occurs.

goeiecool9999 · 2023-10-02T20:29:56Z

Bisected to this commit. Unfortunately it's not cleanly reversible on later versions.

goeiecool9999 · 2023-10-02T21:40:42Z

I have opened an issue on the mesa repo.

BIGFAAT · 2023-10-05T16:36:56Z

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

The option is no longer available in newer versions, forcing Vega users to launch with MESA_LOADER_DRIVER_OVERRIDE=zink.
Please roll it back.

KnownDimension · 2023-11-23T19:19:56Z

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

Option is in newer versions not available anymore, forcing vega user to start with MESA_LOADER_DRIVER_OVERRIDE=zink. Please rollback.

I tried that a couple of weeks ago; the current version of Zink is broken globally on Vega 56 under Linux right now, so that workaround is out the window.

(NixOS, for reference.)

an0nfunc · 2023-11-23T20:27:44Z

Works fine for me on Arch with zink.

jellysquid3 · 2023-11-23T21:00:10Z

Sorry. We are not going to re-implement the option people were using to work around this problem. If it is useful, a technical explanation is provided below for why the option ever existed, and why it was removed.

Technical explanation...

The problem

Normally, Sodium uses asynchronous transfers (buffer copies which are put into the GPU's command stream) and a staging buffer (mapped persistently within host memory) to upload geometry data to the GPU. We heavily rely on this functionality for good performance, and most other games will do something similar.
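To make that upload path concrete, here is a minimal CPU-side model of a staging buffer with queued copies. It is not Sodium's code: `StagingUploader` and `CopyCommand` are hypothetical, a direct `ByteBuffer` stands in for the persistently mapped buffer, and the real OpenGL calls (for example glCopyNamedBufferSubData for the copy and glFenceSync to learn when staging memory may be reused) appear only in comments.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class StagingUploader {
    // One queued transfer: copy `length` bytes from the staging buffer at `srcOffset`
    // into the device-local geometry buffer at `dstOffset`.
    record CopyCommand(int srcOffset, long dstOffset, int length) {}

    private final ByteBuffer staging; // stands in for the persistently mapped buffer
    private final List<CopyCommand> pending = new ArrayList<>();
    private int head = 0;             // next free byte in the staging buffer

    StagingUploader(int capacity) {
        this.staging = ByteBuffer.allocateDirect(capacity);
    }

    // Writes data into the staging buffer and queues a GPU-side copy.
    // Returns false when there is not enough room; the caller must submit and wait first.
    boolean enqueue(byte[] data, long dstOffset) {
        if (data.length > staging.capacity() - head) {
            return false;
        }
        staging.position(head);
        staging.put(data);
        pending.add(new CopyCommand(head, dstOffset, data.length));
        head += data.length;
        return true;
    }

    // Hands the queued copies to the renderer, which would record them into the
    // GPU command stream (in OpenGL 4.5, e.g. with glCopyNamedBufferSubData).
    List<CopyCommand> drain() {
        List<CopyCommand> out = List.copyOf(pending);
        pending.clear();
        return out;
    }

    // Only safe once the GPU has finished the submitted copies
    // (in a real renderer: after a fence created with glFenceSync has signalled).
    void reset() {
        head = 0;
    }
}
```

The key property is that the CPU never writes directly into the geometry buffer the GPU is drawing from; it only writes staging memory and asks the GPU to copy it later, which is the kind of mapped-memory access the thread suspects the driver regression mishandles.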

While OpenGL does have alternative ways to upload data to the GPU (e.g. glBufferSubData), they perform very poorly when updating only certain parts of a buffer, and they require additional memory copies. This is a problem, because we use very large shared buffers for our geometry, and implement a custom memory allocator on top of them.
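A custom allocator on top of one large buffer can be sketched as a first-fit free list over byte offsets. This is a generic technique chosen for illustration, not Sodium's actual allocator; `ArenaAllocator` is a hypothetical name, and it only hands out offsets, leaving the real buffer object elsewhere.

```java
import java.util.Map;
import java.util.TreeMap;

final class ArenaAllocator {
    // Free ranges inside the shared buffer: start offset -> length in bytes.
    private final TreeMap<Long, Long> freeRanges = new TreeMap<>();

    ArenaAllocator(long capacity) {
        freeRanges.put(0L, capacity);
    }

    // First-fit: returns the offset of a free range of at least `size` bytes, or -1.
    long allocate(long size) {
        Map.Entry<Long, Long> fit = null;
        for (Map.Entry<Long, Long> e : freeRanges.entrySet()) {
            if (e.getValue() >= size) {
                fit = e;
                break;
            }
        }
        if (fit == null) {
            return -1;
        }
        long start = fit.getKey();
        long length = fit.getValue();
        freeRanges.remove(start);
        if (length > size) {
            freeRanges.put(start + size, length - size); // keep the unused tail free
        }
        return start;
    }

    // Returns a range to the free list, merging it with adjacent free neighbours.
    void free(long offset, long size) {
        long start = offset;
        long length = size;
        Map.Entry<Long, Long> prev = freeRanges.floorEntry(offset);
        if (prev != null && prev.getKey() + prev.getValue() == offset) {
            start = prev.getKey();
            length += prev.getValue();
            freeRanges.remove(prev.getKey());
        }
        Long nextLength = freeRanges.remove(offset + size);
        if (nextLength != null) {
            length += nextLength;
        }
        freeRanges.put(start, length);
    }

    long largestFree() {
        long best = 0;
        for (long len : freeRanges.values()) {
            best = Math.max(best, len);
        }
        return best;
    }
}
```

Because freed ranges are merged with their neighbours, a region released by one chunk can be reused by the next chunk update without reallocating the whole buffer.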

(As an aside, it's worth mentioning that DirectX 12 and Vulkan only provide you with this option for uploading data to the GPU -- the driver does not hold your hand.)

More importantly: Our memory management strategy in Sodium directly relates to how we can optimize rendering. Using fewer buffer objects means we can switch between resource sets much less frequently, which in turn allows us to pack hundreds of draw commands into a single draw call.
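Once every chunk lives in the same buffer, the visible chunks can be described by parallel arrays of starting vertices and vertex counts, which is the shape OpenGL's glMultiDrawArrays takes. The sketch below (hypothetical `MultiDrawBatch` and `ChunkRange` types, with no actual GL call) only builds those arrays; Sodium's real batching code is likely more involved.

```java
import java.util.List;

final class MultiDrawBatch {
    // Hypothetical description of one chunk's geometry inside the shared vertex buffer.
    record ChunkRange(int firstVertex, int vertexCount) {}

    // Parallel arrays in the layout glMultiDrawArrays(mode, first, count) expects.
    record Batch(int[] first, int[] count) {
        int drawCount() {
            return first.length;
        }
    }

    // Packs all visible chunks into one batch, so a single draw call covers them all
    // instead of one bind-and-draw per chunk.
    static Batch build(List<ChunkRange> visible) {
        int[] first = new int[visible.size()];
        int[] count = new int[visible.size()];
        for (int i = 0; i < visible.size(); i++) {
            first[i] = visible.get(i).firstVertex();
            count[i] = visible.get(i).vertexCount();
        }
        return new Batch(first, count);
    }
}
```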

Why the option ever existed in the first place

To work around the broken support for asynchronous transfers on Apple's M1 hardware, we implemented an alternative approach which we called "swapping" (for disambiguation's sake).

Essentially, that approach involved keeping a copy of all chunk geometry in the CPU's memory, and each time a chunk was updated, we would allocate a new geometry buffer, and re-upload all the chunks into it. Hence the name "swap" -- it was swapping the geometry buffer each time.
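A minimal model of the "swap" strategy described above shows why it scales poorly: every update re-uploads all chunks, not just the changed one. It assumes chunk geometry is held as byte arrays keyed by a hypothetical chunk id, and a heap `ByteBuffer` stands in for the GPU buffer; `SwapGeometryStore` is an illustrative name, not Sodium's class.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

final class SwapGeometryStore {
    private final Map<Long, byte[]> cpuCopies = new LinkedHashMap<>(); // chunk id -> geometry kept on the CPU
    private final Map<Long, Integer> offsets = new LinkedHashMap<>();  // chunk id -> offset in current buffer
    private ByteBuffer gpuBuffer = ByteBuffer.allocate(0);             // stands in for the GPU buffer
    private long bytesUploaded = 0;

    void updateChunk(long chunkId, byte[] geometry) {
        cpuCopies.put(chunkId, geometry.clone());
        rebuild();
    }

    // Allocates a brand-new buffer and copies every chunk into it ("swapping" buffers).
    private void rebuild() {
        int total = 0;
        for (byte[] g : cpuCopies.values()) {
            total += g.length;
        }
        ByteBuffer next = ByteBuffer.allocate(total);
        offsets.clear();
        for (Map.Entry<Long, byte[]> e : cpuCopies.entrySet()) {
            offsets.put(e.getKey(), next.position());
            next.put(e.getValue());
        }
        bytesUploaded += total;
        gpuBuffer = next; // the old buffer would be deleted once the GPU stops using it
    }

    int offsetOf(long chunkId) {
        return offsets.get(chunkId);
    }

    int bufferSize() {
        return gpuBuffer.capacity();
    }

    long totalBytesUploaded() {
        return bytesUploaded;
    }
}
```

With a 100-byte and a 50-byte chunk loaded, updating only the 100-byte chunk re-uploads all 150 bytes; with thousands of chunks loaded, that cost is paid on every block change.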

Obviously, this is a very slow thing to do, and it meant updating chunks (such as when placing or breaking blocks) would cause significant lag, since it needs to constantly re-allocate and transfer huge amounts of memory. Another consequence was that we needed three copies of the geometry data, which doubled the memory requirements of the game.

Why the option was removed

When our hardware support policy changed (to require OpenGL 4.5 support), none of Apple's computers met this requirement any longer, so we dropped support for this workaround. We then took advantage of that to refactor the code for better performance and to fix a number of long-standing issues.

Because of this, I don't think there's any chance we could restore the workaround without undoing a lot of technical changes, and introducing a lot of technical debt back into the project. And I really don't want to implement more workarounds for critical functionality (asynchronous transfers) being plainly broken.

Anyway, there's really not much point in keeping this issue open, because the only remaining actionable item would be to implement more workarounds, which we are not willing to do (see the reasoning above).

The Mesa developers are already aware of this issue and the cause of the regression has been bisected. There is not much else that can be done to help them (at least to my knowledge) other than to provide them with an apitrace file. They have a lot of things to do, and I am not going to push for users to nag them.

electron271 · 2023-12-13T23:46:19Z

Sorry to bother but is there any workaround that does not involve zink or downgrading? Zink heavily impacts shader performance, and downgrading breaks a lot of stuff.

BIGFAAT · 2023-12-14T08:14:51Z

Sadly not, but it looks like someone has been assigned to the bug on the Mesa issue mentioned above, so keep an eye on it there.

wingedseahorse · 2023-12-14T14:47:54Z

Sorry to bother but is there any workaround that does not involve zink or downgrading? Zink heavily impacts shader performance, and downgrading breaks a lot of stuff.

At this point I've had to accept that the best solution is just to switch back to Forge until Mesa resolves this, since downgrading no longer works for me.

electron271 · 2023-12-14T18:04:43Z

Sadly not, but looks like someone got assigned to the bug on the stated MESA issue. So keep a look there.

Hopefully it gets fixed soon

Jaggwagg · 2024-01-12T20:46:15Z

For anyone having trouble loading the Zink driver, this article helped me fix it: https://www.supergoodcode.com/preemptive/

goeiecool9999 · 2024-04-01T14:36:05Z

I am on kernel 6.8.1 and mesa 24.0.4. The issue seems to be gone!

0-x-2-2 · 2024-04-01T22:59:50Z

very nice

pajicadvance · 2024-04-02T03:56:35Z

An issue listed as fixed in the Mesa 24.0.4 release notes has the same crash and the same GPU architecture as this one, so I assume its fix is what resolved this.

CodingTwist added S-needs-triage Status: Needs triage T-bug Type: Bug labels Jun 1, 2023

jellysquid3 added A-drivers Area: Driver compatibility and removed S-needs-triage Status: Needs triage labels Jul 1, 2023

jellysquid3 mentioned this issue Jul 20, 2023

Sodium freezes desktop completely and causes graphical issues in minecraft #1831

Closed

jellysquid3 changed the title ~~[gfxhub0] no-retry page fault. Game unplayable GPU: 100%~~ Graphical corruption and memory page faults on Vega 56/64 under Linux Jul 20, 2023

jellysquid3 added the F-help-wanted Flag: Help wanted label Jul 20, 2023

jellysquid3 added the R-has-workaround Resolution: Has workaround label Jul 22, 2023

jellysquid3 closed this as not planned Nov 23, 2023

jellysquid3 added E-will-not-fix Closed: This will not be worked on and removed R-has-workaround Resolution: Has workaround labels Nov 23, 2023

pajicadvance removed the F-help-wanted Flag: Help wanted label Apr 2, 2024

pajicadvance closed this as completed Apr 2, 2024

MeeniMc mentioned this issue Apr 26, 2024

Closing the game crashes the desktop environment #2433

Closed
