
Investigate test-worker-fshandles-open-close-on-termination and test-worker-fshandles-error-on-termination crash #43499

Closed
F3n67u opened this issue Jun 20, 2022 · 4 comments · Fixed by #43533
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@F3n67u
Member

F3n67u commented Jun 20, 2022

Test

test-worker-fshandles-open-close-on-termination
test-worker-fshandles-error-on-termination

Platform

test-worker-fshandles-open-close-on-termination on macOS
test-worker-fshandles-error-on-termination on Windows

Console output

test-worker-fshandles-open-close-on-termination:

out/Release/node[53255]: ../src/node_file-inl.h:162:virtual node::fs::FSReqPromise<node::AliasedBufferBase<double, v8::Float64Array, void>>::~FSReqPromise() [AliasedBufferT = node::AliasedBufferBase<double, v8::Float64Array, void>]: Assertion `finished_' failed.
 1: 0x10767eeb5 node::Abort() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 2: 0x10767ecd1 node::Assert(node::AssertionInfo const&) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 3: 0x10767a806 node::fs::FSReqPromise<node::AliasedBufferBase<double, v8::Float64Array, void> >::~FSReqPromise() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 4: 0x107683cc9 node::fs::FSReqAfterScope::~FSReqAfterScope() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 5: 0x1076841fd node::fs::AfterOpenFileHandle(uv_fs_s*) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 6: 0x1081162f8 uv__work_done [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 7: 0x10811a6eb uv__async_io [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 8: 0x10812dd6e uv__io_poll [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
 9: 0x10811abe1 uv_run [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
10: 0x1075aee23 node::SpinEventLoop(node::Environment*) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
11: 0x1077268bf node::worker::Worker::Run() [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
12: 0x10772a302 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) [/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx11-x64/out/Release/node]
13: 0x7fff203e58fc _pthread_start [/usr/lib/system/libsystem_pthread.dylib]
14: 0x7fff203e1443 thread_start [/usr/lib/system/libsystem_pthread.dylib]

test-worker-fshandles-error-on-termination:

Administrator: npm install[8184]: C:\workspace\node-compile-windows\node\src\node_file-inl.h:162: Assertion `finished_' failed.
 1: 00007FF610F69CCF node_api_throw_syntax_error+176223
 2: 00007FF610EF7B66 v8::internal::wasm::WasmCode::safepoint_table_offset+67398
 3: 00007FF610EF7F32 v8::internal::wasm::WasmCode::safepoint_table_offset+68370
 4: 00007FF610EE0A6D v8::base::CPU::has_fpu+39565
 5: 00007FF610EE2029 v8::base::CPU::has_fpu+45129
 6: 00007FF610FCBB27 uv_timer_stop+1207
 7: 00007FF610FC80CB uv_async_send+331
 8: 00007FF610FC785C uv_loop_init+1292
 9: 00007FF610FC79FA uv_run+202
10: 00007FF610F968A5 node::SpinEventLoop+309
11: 00007FF610E2CE40 v8::internal::wasm::SignatureMap::Freeze+35904
12: 00007FF610E28568 v8::internal::wasm::SignatureMap::Freeze+17256
13: 00007FF610FB80CD uv_poll_stop+557
14: 00007FF611F71600 v8::internal::compiler::ToString+145936
15: 00007FFB74404ED0 BaseThreadInitThunk+16
16: 00007FFB75FCE39B RtlUserThreadStart+43

Build links

test-worker-fshandles-open-close-on-termination:

https://ci.nodejs.org/job/node-test-commit-osx/nodes=osx11-x64/45661/testReport/junit/(root)/test/parallel_test_worker_fshandles_open_close_on_termination/

test-worker-fshandles-error-on-termination:

https://ci.nodejs.org/job/node-test-binary-windows-js-suites/15222/RUN_SUBSET=3,nodes=win10-COMPILED_BY-vs2019/testReport/junit/(root)/test/parallel_test_worker_fshandles_error_on_termination/

Additional information

Related pr: #42910

This PR landed on 2022-07-18; it modified src/node_file-inl.h and added test-worker-fshandles-open-close-on-termination and test-worker-fshandles-error-on-termination.

cc @santigimeno @aduh95 @RaisinTen

@F3n67u F3n67u added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Jun 20, 2022
@F3n67u F3n67u changed the title Investigate test-worker-fshandles-open-close-on-termination crash Investigate test-worker-fshandles-open-close-on-termination and test-worker-fshandles-error-on-termination crash Jun 20, 2022
santigimeno added a commit to santigimeno/node that referenced this issue Jun 22, 2022
We are deciding whether to end `fs` promises by checking
`can_call_into_js()`, whereas in the `FSReqPromise` destructor we're
using the `is_stopping()` check. Though this may look semantically
correct, it has issues: both values are modified before termination in
`Environment::ExitEnv()`, and both are atomic, but they are not
synchronized with each other. As a result, when the destructor runs,
`call_into_js` may already be set to `false` while `is_stopping` is
still `false`, causing the crash. Fix this by using the same check
everywhere.

Fixes: nodejs#43499
santigimeno added a commit to santigimeno/node that referenced this issue Jun 22, 2022
santigimeno added a commit to santigimeno/node that referenced this issue Jun 22, 2022
We are deciding whether to end `fs` promises by checking
`can_call_into_js()`, whereas in the `FSReqPromise` destructor we're
using the `is_stopping()` check. Though this may look semantically
correct, it has issues: both values are modified before termination in
`Environment::ExitEnv()`, and both are atomic, but they are not
synchronized with each other. As a result, when the destructor runs,
`call_into_js` may already be set to `false` while `is_stopping` is
still `false`, causing the crash. Fix this by also checking
`can_call_into_js()` in the destructor.

Fixes: nodejs#43499
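
As a rough, self-contained C++ sketch of the window described in this commit message (not Node.js source; the flag and field names `can_call_into_js`, `is_stopping`, and `finished` are only borrowed from the message above, and the bad interleaving is timing-dependent):

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> can_call_into_js{true};
std::atomic<bool> is_stopping{false};

int main() {
  bool finished = false;

  // "Termination" thread: an ExitEnv()-style shutdown flips both flags, but
  // the two stores are separate operations, so another thread can observe
  // the state in between them.
  std::thread terminator([] {
    can_call_into_js = false;
    is_stopping = true;
  });

  // "Event loop" thread (here: main): the fs completion callback bails out
  // early, leaving `finished` false, once it may no longer call into JS.
  if (can_call_into_js) {
    finished = true;  // promise resolved normally
  }

  // Destructor-style check: testing is_stopping here can still read `false`
  // even though can_call_into_js is already `false`; that window is what
  // trips the `finished_' assertion. Re-using can_call_into_js on this side,
  // as the fix does, removes the inconsistency.
  if (!finished && !is_stopping) {
    std::cout << "would abort: `finished_' assertion failed" << std::endl;
  } else {
    std::cout << "ok (the bad interleaving is timing-dependent)" << std::endl;
  }

  terminator.join();
  return 0;
}
```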
@F3n67u F3n67u reopened this Jul 10, 2022
@RaisinTen
Contributor

There seems to be something wrong with the stack trace in https://ci.nodejs.org/job/node-test-binary-windows-js-suites/15541/RUN_SUBSET=3,nodes=win10-COMPILED_BY-vs2019/testReport/junit/(root)/test/parallel_test_worker_fshandles_open_close_on_termination:

01:38:28 not ok 775 parallel/test-worker-fshandles-open-close-on-termination
01:38:28   ---
01:38:28   duration_ms: 3.528
01:38:28   severity: fail
01:38:28   exitcode: 134
01:38:28   stack: |-
01:38:28     C:\Windows\SYSTEM32\cmd.exe[8708]: C:\workspace\node-compile-windows\node\src\node_file-inl.h:162: Assertion `finished_' failed.
01:38:28      1: 00007FF65494964F node_api_throw_syntax_error+176223
01:38:28      2: 00007FF6548D74E6 SSL_get_quiet_shutdown+67398
01:38:28      3: 00007FF6548D78B2 SSL_get_quiet_shutdown+68370
01:38:28      4: 00007FF6548C03ED v8::base::CPU::has_fpu+38813
01:38:28      5: 00007FF6548C19A9 v8::base::CPU::has_fpu+44377
01:38:28      6: 00007FF6549AB4A7 uv_timer_stop+1207
01:38:28      7: 00007FF6549A7A4B uv_async_send+331
01:38:28      8: 00007FF6549A71DC uv_loop_init+1292
01:38:28      9: 00007FF6549A737A uv_run+202
01:38:28     10: 00007FF654976225 node::SpinEventLoop+309
01:38:28     11: 00007FF65480CCC0 v8::internal::interpreter::BytecodeLabel::bind+35904
01:38:28     12: 00007FF6548083E8 v8::internal::interpreter::BytecodeLabel::bind+17256
01:38:28     13: 00007FF654997A4D uv_poll_stop+557
01:38:28     14: 00007FF655951D70 v8::internal::compiler::ToString+145936
01:38:28     15: 00007FFB74404ED0 BaseThreadInitThunk+16
01:38:28     16: 00007FFB75FCE39B RtlUserThreadStart+43
01:38:28   ...

Not sure how node_api_throw_syntax_error got in. The same thing happened with the stack trace in the Windows crash reported in #43084, which is about a very similar test (a crash during Worker termination).

targos pushed a commit that referenced this issue Jul 12, 2022
Fixes: #43499

PR-URL: #43533
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
targos pushed a commit that referenced this issue Jul 20, 2022
targos pushed a commit that referenced this issue Jul 31, 2022
kvakil added a commit to kvakil/node that referenced this issue Aug 2, 2022
These tests seem to time out quite often. I don't know why, but one
possible reason is that they start a lot of threads. Tests in
`test/parallel` are assumed to start only one thread each, so having 11
threads running at a time feels like a lot.

It also seems that these tests fail in a correlated fashion: take a look
at [this reliability report][]. The failures all occur on the same build
machines on the same PRs, which suggests some sort of CPU contention.

[this reliability report]: nodejs/reliability#334

On my Linux machine, decreasing the parallelism and iteration count here
reduces the `user` time from ~11.5 seconds to ~2 seconds, depending on
the test. I have seen these tests take 30-60 seconds on CI (Alpine in
particular).

I went back to the diffs that introduced these changes and verified that
they still failed at least 90% of the time with the reduced iteration
count, which feels sufficient.

Refs: nodejs#43499
Refs: nodejs#43084
nodejs-github-bot pushed a commit that referenced this issue Aug 5, 2022
Refs: #43499
Refs: #43084
PR-URL: #44090
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
danielleadams pushed a commit that referenced this issue Aug 16, 2022
ruyadorno pushed a commit that referenced this issue Aug 23, 2022
targos pushed a commit that referenced this issue Sep 5, 2022
@aduh95 aduh95 reopened this Sep 11, 2022
Fyko pushed a commit to Fyko/node that referenced this issue Sep 15, 2022
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
juanarbol pushed a commit that referenced this issue Oct 10, 2022
juanarbol pushed a commit that referenced this issue Oct 11, 2022
@ywave620
Contributor

ywave620 commented Oct 14, 2022

@aduh95 I realized that you reopened this issue; did you find any recent failures? :)

guangwong pushed a commit to noslate-project/node that referenced this issue Jan 3, 2023
guangwong pushed a commit to noslate-project/node that referenced this issue Jan 3, 2023
@lpinca
Member

lpinca commented Jun 29, 2023

Ping @aduh95

@lpinca lpinca closed this as completed Sep 17, 2024