Neko thread usage causes seg faults during global free #281

Open
tobil4sk opened this issue Apr 4, 2023 · 5 comments · May be fixed by #304

tobil4sk commented Apr 4, 2023

Ever since haxelib was updated to use threads on neko, it has been segfaulting randomly in GitHub Actions, e.g.:

Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with 139 in 1s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Segmentation fault (core dumped)

I haven't been able to reproduce this on any local system. After some troubleshooting, I found that the seg fault occurs after the main function has completed, at some point after this call but before the program exits: https://github.com/HaxeFoundation/neko/blob/master/vm/main.c#L342

I managed to download the core dump and load it, and it says that the seg fault comes from line 46 here:

neko/vm/callback.c

Lines 44 to 48 in 9076cfa

EXTERN value val_callEx( value vthis, value f, value *args, int nargs, value *exc ) {
	neko_vm *vm = NEKO_VM();
	value old_this = vm->vthis;
	value old_env = vm->env;
	value ret = val_null;

I later added a printf here and confirmed that vm is a null pointer when the segfault happens. Perhaps a finaliser is being called after the main function has already finished?
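
To make the failure mode concrete, here is a minimal sketch of how a null "current VM" pointer turns the first field read into a fault, and what a guard would look like. It assumes the current VM is tracked per thread; `demo_vm`, `demo_current_vm`, `demo_attach`, `demo_detach` and `demo_call` are hypothetical names for illustration, not neko's API, and this is not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the VM struct; only the field read first is modelled. */
typedef struct {
	void *vthis;
} demo_vm;

/* Assumption: the current VM is tracked per thread, so a thread with no VM
   attached sees NULL here. */
static _Thread_local demo_vm *demo_current_vm = NULL;

void demo_attach(demo_vm *vm) { demo_current_vm = vm; }
void demo_detach(void) { demo_current_vm = NULL; }

/* Same opening shape as the excerpt above, with a guard before the first read.
   Returns 0 and stores vthis on success, -1 when no VM is attached. */
int demo_call(void **old_this_out) {
	demo_vm *vm = demo_current_vm;
	if (vm == NULL)
		return -1; /* an unguarded vm->vthis read would fault at this point */
	*old_this_out = vm->vthis;
	return 0;
}

/* Usage: the call succeeds only while a VM is attached to this thread. */
void demo_usage(void) {
	demo_vm vm = { .vthis = &vm };
	void *old_this = NULL;
	assert(demo_call(&old_this) == -1);
	demo_attach(&vm);
	assert(demo_call(&old_this) == 0 && old_this == &vm);
	demo_detach();
	assert(demo_call(&old_this) == -1);
}
```

A guard like this would only hide the symptom; the open question is still why the thread is running bytecode without a VM attached.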

Full backtrace
Core was generated by `haxelib git utest https://github.com/haxe-utest/utest master --always'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
    at /src/vm/callback.c:46
46	/src/vm/callback.c: Bad file descriptor.
[Current thread is 1 (LWP 2473)]
(gdb) bt
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
    at /src/vm/callback.c:46
#1  0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8)
    at /src/vm/interp.c:708
#2  0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8)
    at /src/vm/interp.c:1214
#3  0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 <t_null>, f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1,
    exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
#4  0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
#5  0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
#6  0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
#7  0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
#8  0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
#9  0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) bt full
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0)
    at /src/vm/callback.c:46
        vm = 0x0
        old_this = 0x0
        old_env = 0x0
        ret = 0x0
        oldjmp = {{__jmpbuf = {0, 0, 0, 0, 139845314828357, 139847775009936, 7883446016, 16}, __mask_was_saved = -706488560,
            __saved_mask = {__val = {1, 139847723636592, 139847774906397, 1, 139847770864720, 17450007603122798595, 139847750572680,
                139847723636592, 38654705672, 17450007603122798600, 139847750525952, 139847723636592, 139847774911661,
                17450007606711277424, 139847750819840, 139847750572672}}}}
#1  0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8)
    at /src/vm/interp.c:708
        _o = 0x7f30d782a000
        _arg = 0x1
        _f = 0x7f30d8b4d8a0
        acc = 1
        pc = 0x7f30d76efe28
        instructions = {0x7f30d8f170c2 <neko_interp_loop+130>, 0x7f30d8f170dc <neko_interp_loop+156>,
          0x7f30d8f170f5 <neko_interp_loop+181>, 0x7f30d8f1710e <neko_interp_loop+206>, 0x7f30d8f17128 <neko_interp_loop+232>,
          0x7f30d8f17188 <neko_interp_loop+328>, 0x7f30d8f171ab <neko_interp_loop+363>, 0x7f30d8f171c7 <neko_interp_loop+391>,
          0x7f30d8f172d0 <neko_interp_loop+656>, 0x7f30d8f175b4 <neko_interp_loop+1396>, 0x7f30d8f18081 <neko_interp_loop+4161>,
          0x7f30d8f18417 <neko_interp_loop+5079>, 0x7f30d8f18430 <neko_interp_loop+5104>, 0x7f30d8f18453 <neko_interp_loop+5139>,
          0x7f30d8f1846f <neko_interp_loop+5167>, 0x7f30d8f18578 <neko_interp_loop+5432>, 0x7f30d8f18791 <neko_interp_loop+5969>,
          0x7f30d8f18b88 <neko_interp_loop+6984>, 0x7f30d8f18f21 <neko_interp_loop+7905>, 0x7f30d8f18f3e <neko_interp_loop+7934>,
          0x7f30d8f18f9e <neko_interp_loop+8030>, 0x7f30d8f19dc2 <neko_interp_loop+11650>, 0x7f30d8f1a804 <neko_interp_loop+14276>,
          0x7f30d8f1b24f <neko_interp_loop+16911>, 0x7f30d8f1b264 <neko_interp_loop+16932>, 0x7f30d8f1b28e <neko_interp_loop+16974>,
          0x7f30d8f1b2b8 <neko_interp_loop+17016>, 0x7f30d8f1b3c7 <neko_interp_loop+17287>, 0x7f30d8f1b4f6 <neko_interp_loop+17590>,
          0x7f30d8f1b5a2 <neko_interp_loop+17762>, 0x7f30d8f1b716 <neko_interp_loop+18134>, 0x7f30d8f1b847 <neko_interp_loop+18439>,
          0x7f30d8f1b8df <neko_interp_loop+18591>, 0x7f30d8f1b916 <neko_interp_loop+18646>, 0x7f30d8f1b94d <neko_interp_loop+18701>,
          0x7f30d8f1c72d <neko_interp_loop+22253>, 0x7f30d8f1d4d2 <neko_interp_loop+25746>, 0x7f30d8f1e269 <neko_interp_loop+29225>,
          0x7f30d8f1e822 <neko_interp_loop+30690>, 0x7f30d8f1f6d2 <neko_interp_loop+34450>, 0x7f30d8f1f910 <neko_interp_loop+35024>,
          0x7f30d8f1fb4e <neko_interp_loop+35598>, 0x7f30d8f1fd92 <neko_interp_loop+36178>, 0x7f30d8f1ffb8 <neko_interp_loop+36728>,
          0x7f30d8f201de <neko_interp_loop+37278>, 0x7f30d8f20404 <neko_interp_loop+37828>, 0x7f30d8f20487 <neko_interp_loop+37959>,
          0x7f30d8f20603 <neko_interp_loop+38339>, 0x7f30d8f20686 <neko_interp_loop+38470>, 0x7f30d8f204fd <neko_interp_loop+38077>,
          0x7f30d8f20580 <neko_interp_loop+38208>, 0x7f30d8f1b893 <neko_interp_loop+18515>, 0x7f30d8f20709 <neko_interp_loop+38601>,
          0x7f30d8f20743 <neko_interp_loop+38659>, 0x7f30d8f20808 <neko_interp_loop+38856>, 0x7f30d8f20911 <neko_interp_loop+39121>,
          0x7f30d8f20943 <neko_interp_loop+39171>, 0x7f30d8f18fe0 <neko_interp_loop+8096>, 0x7f30d8f17161 <neko_interp_loop+289>,
          0x7f30d8f17174 <neko_interp_loop+308>, 0x7f30d8f179a7 <neko_interp_loop+2407>, 0x7f30d8f17d10 <neko_interp_loop+3280>,
          0x7f30d8f207c1 <neko_interp_loop+38785>, 0x7f30d8f1929a <neko_interp_loop+8794>, 0x7f30d8f20980 <neko_interp_loop+39232>,
          0x7f30d8f1b7a2 <neko_interp_loop+18274>, 0x7f30d8f1713e <neko_interp_loop+254>, 0x7f30d8f2098f <neko_interp_loop+39247>}
        sp = 0x7f30d6eab7a8
        csp = 0x7f30d6eab058
#2  0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8)
    at /src/vm/interp.c:1214
        sp = 0x7f30d6eab768
        csp = 0x7f30d6eab078
        trap = 0x7f30d6eab738
        init_sp = 7
        m = 0x7f30d8b4cea0
        old = {{__jmpbuf = {0, 4064061087093578727, 140727057721422, 140727057721423, 140727057721680, 139847723638720,
              4064061087267642343, 4064050217118686183}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}
#3  0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 <t_null>, f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1,
    exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
        n = 1
        vm = 0x7f30d77e61c0
        old_this = 0x7f30d914f870 <t_null>
        old_env = 0x7f30d914eee0 <empty_array>
        ret = 0x7f30d914f870 <t_null>
        oldjmp = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}
#4  0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
        p = 0x7f30d8b490f0
        exc = 0x0
#5  0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
        lp = 0x7ffd92492990
        p = {init = 0x7f30d7909a1b <thread_init>, main = 0x7f30d7909a99 <thread_loop>, param = 0x7f30d8b490f0, lock = {__data = {
              __lock = 2, __count = 0, __owner = 2429, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
                __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000}\t\000\000\001", '\000' <repeats 26 times>, __align = 2}}
#6  0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#7  0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#8  0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#9  0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

Here is the code in haxelib that uses threads: https://github.com/HaxeFoundation/haxelib/blob/4.1.x/src/haxelib/client/Vcs.hx#L162-L177

tobil4sk added a commit to HaxeFoundation/haxelib that referenced this issue Apr 4, 2023
On Ubuntu, threads can cause seg faults, see:
HaxeFoundation/neko#281
tobil4sk added a commit to tobil4sk/haxec that referenced this issue Apr 4, 2023
Haxelib was causing CI failures in the Ubuntu runners due to a threading
issue with neko:

HaxeFoundation/neko#281
kLabz pushed a commit to HaxeFoundation/haxe that referenced this issue Apr 4, 2023
* Patch haxelib to avoid segmentation faults

Haxelib was causing CI failures in the Ubuntu runners due to a threading
issue with neko:

HaxeFoundation/neko#281

* Update haxelib for run.n fix
tobil4sk added a commit to HaxeFoundation/haxelib that referenced this issue Apr 6, 2023
On Ubuntu, threads can cause seg faults, see:
HaxeFoundation/neko#281
@tobil4sk

We just had a similar crash on Windows, so looks like it's not specific to Linux:

Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with -1073741819 in 3s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]

-1073741819 is equivalent to 0xC0000005, which is STATUS_ACCESS_VIOLATION: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
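
For anyone checking the arithmetic, the two exit codes in this thread can be decoded as follows. This is a small sketch with hypothetical helper names: the Windows code is the same 32 bits read as unsigned, and the 139 from the Linux log follows the usual shell convention of 128 plus the signal number (SIGSEGV is 11 on Linux).

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value.
   The conversion is modulo 2^32, which is well defined in C. */
uint32_t exit_code_to_ntstatus(int32_t code) {
	return (uint32_t)code;
}

/* Shell convention: a status above 128 means "terminated by signal (status - 128)".
   Returns 0 when the status does not encode a signal. */
int shell_status_to_signal(int status) {
	return status > 128 ? status - 128 : 0;
}

/* Usage: the two codes reported in this issue. */
void decode_demo(void) {
	assert(exit_code_to_ntstatus(-1073741819) == 0xC0000005u);
	assert(shell_status_to_signal(139) == 11);
}
```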

@tobil4sk

This sample seems to reproduce the seg fault some of the time, at least on my Windows machine:

function main() {
	final streamsLock = new sys.thread.Lock();

	sys.thread.Thread.create(function() {
		Sys.sleep(0.2);
		streamsLock.release();
	});

	sys.thread.Thread.create(function() {
		Sys.sleep(0.2);
		streamsLock.release();
	});

	streamsLock.wait();
	streamsLock.wait();
}

tobil4sk changed the title from "Neko threads cause seg faults in Ubuntu github actions environment" to "Neko thread usage causes seg faults during global free" on Aug 31, 2024
@tobil4sk

On Windows, the above sample also sometimes causes this popup:

Image

@tobil4sk

Here is a Haxe sample that reproduces the seg fault more reliably:

function main() {
	sys.thread.Thread.create(function() {
		while(true) {
			trace("Hello 1");
		}
	});
	sys.thread.Thread.create(function() {
		while (true) {
			trace("Hello 2");
		}
	});
}

@tobil4sk

tobil4sk commented Feb 14, 2025

> On Windows, the above sample also sometimes causes this popup:

It looks like this happens because the thread is deleted by DllMain:
https://github.com/ivmai/bdwgc/blob/2558568aceaf7fc5cc64cf87e244cbcfd7f9bd53/win32_threads.c#L3009

Somehow this happens at the same time as the GC_gcollect call within neko_gc_major() while neko is shutting down, which also tries to access the same thread to suspend it.
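
As a general illustration of this class of shutdown race, here is a pthread sketch in which worker threads are stopped and joined before the shared global state is released, so nothing can touch it while it is being torn down. This is not neko's or bdwgc's code and not the change in any linked PR; all names (`run_and_shutdown`, `shared_state`, `stop_requested`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define WORKERS 2

static atomic_bool stop_requested;
static atomic_long iterations;
static int *shared_state; /* stands in for runtime-global data freed at exit */

static void *worker(void *arg) {
	(void)arg;
	while (!atomic_load(&stop_requested)) {
		/* Touch the shared state, as a still-running thread would. */
		atomic_fetch_add(&iterations, *shared_state);
	}
	return NULL;
}

/* Starts the workers, then shuts down in a safe order: request stop, join
   every worker, and only then free the shared state. Returns 0 on success. */
int run_and_shutdown(void) {
	pthread_t threads[WORKERS];
	int started = 0;
	int result = 0;

	shared_state = malloc(sizeof *shared_state);
	if (shared_state == NULL)
		return -1;
	*shared_state = 1;
	atomic_store(&stop_requested, false);
	atomic_store(&iterations, 0);

	for (; started < WORKERS; started++) {
		if (pthread_create(&threads[started], NULL, worker, NULL) != 0) {
			result = -1;
			break;
		}
	}

	/* Let at least one worker make progress before shutting down. */
	while (result == 0 && atomic_load(&iterations) == 0)
		sched_yield();

	atomic_store(&stop_requested, true);
	for (int i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	/* No worker is alive any more, so the free cannot race with a read. */
	free(shared_state);
	shared_state = NULL;
	return result;
}

/* Usage: a full start and shutdown cycle completes without error. */
void shutdown_demo(void) {
	assert(run_and_shutdown() == 0);
}
```

Freeing `shared_state` before the joins would recreate the hazard: a worker could still be in the middle of reading it.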

See separate issue: #303
