Graphical corruption and memory page faults on Vega 56/64 under Linux #1792

Closed
CodingTwist opened this issue Jun 1, 2023 · 27 comments
Labels
A-drivers (Area: Driver compatibility), E-will-not-fix (Closed: This will not be worked on), T-bug (Type: Bug)

Comments

@CodingTwist

Version information

mc1.19.4-0.4.10+build.24

Expected Behavior

Game renders

Actual Behavior

The game doesn't render correctly. It produces huge artifacts while driving the GPU to 100% utilization.

Reproduction Steps

Launch the game
Join a world and wait a few seconds

Java version

Java 17.0.7 & Java 20.0.1

CPU

Intel i7-8700

GPU

AMD ATI Radeon RX Vega 56/64

Additional information

I am running Arch Linux on 6.3.5-arch1-1 with an AMD GPU.

image

I was asked to launch the mod with Fabric API, which had no effect. Vanilla Minecraft runs fine, and OptiFine works.

This is the log from launching the game and then force-killing it once it began lagging:
https://paste.ee/p/yqLZu

The only sort of error I am getting is in the kernel buffer.

[  191.917437] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[  191.920212] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32778, for process java pid 2986 thread java:cs0 pid 3064)
[  191.920233] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080011a86c000 from IH client 0x1b (UTCL2)
[  191.920246] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601030
[  191.920253] amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  191.920259] amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  191.920264] amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  191.920270] amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  191.920274] amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  191.920279] amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
[  201.943945] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
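
As a reading aid for the log above, here is a minimal sketch that splits the VM_L2_PROTECTION_FAULT_STATUS word into the fields the kernel prints beneath it. The bit positions in `FIELDS` are an assumption, chosen so that the decoded values agree with this log (PERMISSION_FAULTS 0x3, client ID 0x8, RW 0x0, vmid 6); they are not taken from register documentation, and `decode_fault_status` is a hypothetical helper name.

```python
# Assumed bit layout: (shift, width) per field. These positions were picked
# because they reproduce the values printed in the kernel log for 0x00601030.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # the "Faulty UTCL2 client ID" line
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Extract each assumed bitfield from the 32-bit status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00601030)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Under that layout, the fault is a permission fault on a read (RW is 0) issued by the TCP client, with no mapping or page-walker error.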

GPU driver info:

OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX Vega (vega10, LLVM 15.0.7, DRM 3.52, 6.3.5-arch1-1)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.1.1
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.1.1
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.1.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
OpenGL ES profile extensions:

Please just ask if you need more info about my system

@CodingTwist added the S-needs-triage (Status: Needs triage) and T-bug (Type: Bug) labels on Jun 1, 2023
@Motschen

Motschen commented Jun 4, 2023

I'm also encountering the same issue, but instead of just crashing the game, it crashes the whole compositor for me, both on Hyprland using Wayland and on KDE using X11.
Seems to be caused by a recent mesa update, as this just started happening after a system update.

@ghost

ghost commented Jun 5, 2023

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

@Regular-Baf

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

Pretty sure I'm having the exact same issue on Vega 64 (Mesa 23.1.2 on Fedora 38 Plasma Wayland). Changing Async to Swap does resolve it, as does running Minecraft through Zink. I've had nothing but stability issues with Vega across OpenGL/OpenCL for years, so maybe this is a Mesa or amdgpu issue more than a Sodium issue.

@ZtereoHYPE

Encountered the same issue on a friend's system, and joining a world brought the entire system down to a screen-flickering state. AMD Vega 64, Mesa 23.1.3, plasma X11. Switching to swap also seems to fix it.

@RedMaster13

I'm having the same issue here. AMD Vega 56, Arch Linux. Downgrading to mesa 23.0.3 fixed my issue.

@jellysquid3
Member

jellysquid3 commented Jul 1, 2023

Hm. I haven't been able to reproduce any of these issues on my system (RX 6900 XT, Mesa 23.1.2, Linux 6.3.8), but it also seems that this problem exclusively affects the Vega 56/64 (which are a known problem child on Linux...)

The problem seems to be related to persistently mapped memory under OpenGL, hence the reason why switching the "Chunk Memory Allocator" strategy to "Swap" fixes the crashes. Both the corruption and hardware page faults would seem to agree with this.

I am going to see if we can bisect where the problem appeared in Mesa, and look into filing a bug. They've been helpful in the past with these things, so I think we have a good chance at fixing this.

To be clear, I don't think there is any bug with Sodium here, rather this is a regression in the Mesa graphics stack.

@jellysquid3 added the A-drivers (Area: Driver compatibility) label and removed the S-needs-triage (Status: Needs triage) label on Jul 1, 2023
@jellysquid3
Member

For the time being, the solutions we've seen solve this problem are:

  • Using the Zink driver (set the environment variable MESA_LOADER_DRIVER_OVERRIDE=zink for Minecraft, might not perform well.)
  • Changing the setting at Video Settings > Advanced > Chunk Memory Allocator to "SWAP" (will likely degrade performance severely.)
  • Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)
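
For the Zink workaround, the override only needs to reach the game's process, not the whole desktop session. The sketch below shows one way to scope it to a single child process. `zink_env` and `run_with_zink` are hypothetical helper names, and the probe command is a placeholder that echoes the variable back; substitute whatever command your launcher uses to start Minecraft.

```python
import os
import subprocess
import sys


def zink_env(base=None):
    """Return a copy of the environment with Mesa's loader pointed at Zink."""
    env = dict(os.environ if base is None else base)
    env["MESA_LOADER_DRIVER_OVERRIDE"] = "zink"
    return env


def run_with_zink(command):
    """Run `command` (a list of arguments) with the override set for it only."""
    return subprocess.run(command, env=zink_env(), check=False)


if __name__ == "__main__":
    # Placeholder child process: prints the variable to show it was inherited.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ['MESA_LOADER_DRIVER_OVERRIDE'])",
    ]
    run_with_zink(probe)
```

Because the variable is set on a copy of the environment, the compositor and other applications keep using the default driver.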

@jellysquid3 changed the title from "[gfxhub0] no-retry page fault. Game unplayable GPU: 100%" to "Graphical corruption and memory page faults on Vega 56/64 under Linux" on Jul 20, 2023
@jellysquid3 added the F-help-wanted (Flag: Help wanted) label on Jul 20, 2023
@jellysquid3
Member

We do not have any way to debug or fix this. The problem seems exclusively limited to the Vega 56/64 (and professional cards of that series) and we do not have any such graphics cards on hand. That said, I'm almost certain this problem has nothing to do with Sodium, as there's no good explanation for what could be going wrong on our side.

The only option here would be to make a bug report to Mesa about this problem. I suspect it would help them a lot if you could provide an API trace.

@wingedseahorse

  • Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)

This is working for me as well.

@jellysquid3 added the R-has-workaround (Resolution: Has workaround) label on Jul 22, 2023
@electron271

  • Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)

Working as well

@Bettehem

I'm using the Zink workaround as downgrading Mesa isn't a viable option for me. Works nicely without shaders but when using shaders, Zink's performance isn't very good

@jellysquid3
Member

This might be accidentally fixed with Sodium 0.5.1 since we now use a 16-byte alignment on vertex data.

@Regular-Baf

I've just tested Sodium 0.5.2 and unfortunately the system freeze still occurs.

@goeiecool9999

Bisected to this commit. Unfortunately it's not cleanly reversible on later versions.

@goeiecool9999

I have opened an issue on the mesa repo.

@BIGFAAT

BIGFAAT commented Oct 5, 2023

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

This option is no longer available in newer versions, forcing Vega users to launch with MESA_LOADER_DRIVER_OVERRIDE=zink.
Please roll back.

@KnownDimension

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

This option is no longer available in newer versions, forcing Vega users to launch with MESA_LOADER_DRIVER_OVERRIDE=zink. Please roll back.

I tried that a couple of weeks ago. The current version of Zink is broken across the board on Vega 56 under Linux right now, so that workaround is out the window.

(NixOS, for reference.)

@an0nfunc

Works fine for me on Arch with zink.

@jellysquid3
Member

jellysquid3 commented Nov 23, 2023

Sorry. We are not going to re-implement the option people were using to work around this problem. If it is useful, a technical explanation is provided below for why the option ever existed, and why it was removed.

Technical explanation...

The problem

Normally, Sodium uses asynchronous transfers (buffer copies which are put into the GPU's command stream) and a staging buffer (mapped persistently within host memory) to upload geometry data to the GPU. We heavily rely on this functionality for good performance, and most other games will do something similar.

While OpenGL does have alternative ways to upload data to the GPU (such as glBufferSubData), these perform very poorly when updating only certain parts of a buffer, and they require additional memory copies. This is a problem, because we use very large shared buffers for our geometry, and implement a custom memory allocator on top of them.

(As an aside, it's worth mentioning that DirectX 12 and Vulkan only provide you with this option for uploading data to the GPU -- the driver does not hold your hand.)

More importantly: Our memory management strategy in Sodium directly relates to how we can optimize rendering. Using fewer buffer objects means we can switch between resource sets much less frequently, which in turn allows us to pack hundreds of draw commands into a single draw call.
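
To illustrate the idea of a custom allocator on top of one large shared buffer, here is a minimal sketch of a first-fit allocator that hands out offset ranges and merges freed neighbours. It is not Sodium's implementation; the `Arena` class, its method names, and the sizes are all hypothetical and only model the bookkeeping, not any OpenGL calls.

```python
class Arena:
    """First-fit allocator over one big buffer, tracking free (offset, size) spans."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free = [(0, capacity)]  # kept sorted by offset

    def alloc(self, size):
        """Return the offset of a free span of `size` bytes, or None if none fits."""
        for i, (offset, span) in enumerate(self.free):
            if span >= size:
                if span == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, span - size)
                return offset
        return None  # a real allocator would grow or defragment here

    def release(self, offset, size):
        """Return a span to the free list and merge it with adjacent free spans."""
        self.free.append((offset, size))
        self.free.sort()
        merged = [self.free[0]]
        for start, span in self.free[1:]:
            last_start, last_span = merged[-1]
            if last_start + last_span == start:
                merged[-1] = (last_start, last_span + span)
            else:
                merged.append((start, span))
        self.free = merged


arena = Arena(1024)
a = arena.alloc(256)    # first chunk's geometry
b = arena.alloc(512)    # second chunk's geometry
arena.release(a, 256)   # first chunk unloaded
c = arena.alloc(128)    # a new chunk reuses the hole left behind
print(a, b, c)
```

Since every chunk's geometry lives at an offset inside the same buffer, a renderer built this way can describe many chunks with different offsets while keeping a single buffer bound, which is the property the paragraph above relies on.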

Why the option ever existed in the first place

To work around the broken support for asynchronous transfers on Apple's M1 hardware, we implemented an alternative approach which we called "swapping" (for disambiguation's sake.)

Essentially, that approach involved keeping a copy of all chunk geometry in the CPU's memory, and each time a chunk was updated, we would allocate a new geometry buffer, and re-upload all the chunks into it. Hence the name "swap" -- it was swapping the geometry buffer each time.

Obviously, this was a very slow thing to do, and it meant updating chunks (such as when placing or breaking blocks) would cause significant lag, since the game needed to constantly re-allocate and transfer huge amounts of memory. Another consequence was that we needed three copies of the geometry data, which doubled the memory requirements of the game.
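
The cost difference described above can be sketched with plain byte arrays. This is only a model of the data movement, under the assumption that every chunk is the same size and that an updated chunk keeps its size; `swap_upload` and `in_place_upload` are hypothetical names, and no GPU API is involved.

```python
def swap_upload(cpu_copies):
    """'Swap' model: build a brand-new buffer from the CPU copy of every chunk."""
    new_buffer = bytearray()
    offsets = {}
    for chunk_id, data in cpu_copies.items():
        offsets[chunk_id] = len(new_buffer)
        new_buffer += data
    return new_buffer, offsets, len(new_buffer)  # last value: bytes transferred


def in_place_upload(buffer, offset, data):
    """Copy only the changed chunk's bytes into the existing buffer."""
    buffer[offset:offset + len(data)] = data  # assumes the chunk size is unchanged
    return len(data)  # bytes transferred


# 100 chunks of 4096 bytes each, all resident in one shared buffer.
chunks = {i: bytes([i]) * 4096 for i in range(100)}
buffer, offsets, _ = swap_upload(chunks)

chunks[7] = bytes([255]) * 4096  # a single chunk changes
_, _, swap_bytes = swap_upload(chunks)
in_place_bytes = in_place_upload(buffer, offsets[7], chunks[7])
print(swap_bytes, in_place_bytes)
```

In this model a single chunk update moves every resident chunk under the swap approach, but only the changed chunk when the existing buffer is updated in place.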

Why the option was removed

When our hardware support policy changed (to require OpenGL 4.5 support), none of Apple's computers met this requirement any longer, so we dropped support for this workaround. We then took advantage of that to refactor the code for better performance and to fix a number of long-standing issues.

Because of this, I don't think there's any chance we could restore the workaround without undoing a lot of technical changes, and introducing a lot of technical debt back into the project. And I really don't want to implement more workarounds for critical functionality (asynchronous transfers) being plainly broken.


Anyways. There's really not much more point to keeping this issue open, because the only remaining actionable part here would be to implement more workarounds, which we are not willing to do (see above reasoning.)

The Mesa developers are already aware of this issue and the cause of the regression has been bisected. There is not much else that can be done to help them (at least to my knowledge) other than to provide them with an apitrace file. They have a lot of things to do, and I am not going to push for users to nag them.

@jellysquid3 closed this as not planned on Nov 23, 2023
@jellysquid3 added the E-will-not-fix (Closed: This will not be worked on) label and removed the R-has-workaround (Resolution: Has workaround) label on Nov 23, 2023
@electron271

Sorry to bother but is there any workaround that does not involve zink or downgrading? Zink heavily impacts shader performance, and downgrading breaks a lot of stuff.

@BIGFAAT

BIGFAAT commented Dec 14, 2023

Sadly not, but it looks like someone has been assigned to the bug on the Mesa issue mentioned above, so keep an eye on it there.

@wingedseahorse

Sorry to bother but is there any workaround that does not involve zink or downgrading? Zink heavily impacts shader performance, and downgrading breaks a lot of stuff.

At this point I have to accept that the best solution is just to switch back to Forge until Mesa resolves this, since downgrading no longer works for me.

@electron271

Sadly not, but it looks like someone has been assigned to the bug on the Mesa issue mentioned above, so keep an eye on it there.

Hopefully it gets fixed soon

@Jaggwagg

Jaggwagg commented Jan 12, 2024

For anyone experiencing issues with loading Zink drivers, this article helped me fix it https://www.supergoodcode.com/preemptive/.

@goeiecool9999

I am on kernel 6.8.1 and mesa 24.0.4. The issue seems to be gone!

@0-x-2-2
Contributor

0-x-2-2 commented Apr 1, 2024

very nice

@pajicadvance
Member

This issue was listed as fixed in the Mesa 24.0.4 release notes. It involves the same crash and the same GPU architecture as the one reported here, so I assume that fix is what resolved it.
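
For readers checking their own system, the sketch below pulls the Mesa version out of an "OpenGL version string" line like the one in the original report and compares it with the two data points from this thread: 23.0.3 was reported working, and the fix was listed for 24.0.4. Treating everything in between as "possibly affected" is an assumption, since not every release in that range was tested here, and the check is only meaningful on Vega 56/64. The helper names are hypothetical.

```python
import re

FIXED_IN = (24, 0, 4)   # fix listed in the Mesa 24.0.4 release notes (per this thread)
LAST_GOOD = (23, 0, 3)  # reported working after a downgrade (per this thread)


def mesa_version(gl_version_string):
    """Extract (major, minor, patch) from a string such as '4.6 (...) Mesa 23.1.1'."""
    match = re.search(r"Mesa (\d+)\.(\d+)\.(\d+)", gl_version_string)
    return tuple(int(part) for part in match.groups()) if match else None


def classify(gl_version_string):
    """Classify a version string against the two versions named in the thread."""
    version = mesa_version(gl_version_string)
    if version is None:
        return "unknown"
    if version >= FIXED_IN:
        return "fixed"
    if version <= LAST_GOOD:
        return "not reported affected"
    return "possibly affected"  # assumption: releases between the two data points


print(classify("4.6 (Core Profile) Mesa 23.1.1"))
```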
