Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

ipc_transport_structured: Investigate: Msg_out - move ctor impl mystery (past instability). #20

Open
ygoldfeld opened this issue Mar 12, 2024 · 0 comments
Labels
from-akamai-pre-open Issue origin is Akamai, before opening source question Further information is requested

Comments

@ygoldfeld
Copy link
Contributor

Filed by @ygoldfeld pre-open-source:

Description follows. Environment (the only one, with many others tried!) where problem was actually observed:

Linux 5.4.... machine
CMake-built open-source ipc meta-project
clang-17 installed via apt-get = /usr/bin/clang => clang due to PATH
libc++ (LLVM-10) installed (as opposed to using GNU stdc++, which is also installed but not used here) via apt-get
compiler selection:
-DCMAKE_CXX_COMPILER=clang++
compiler flags:
-DCMAKE_CXX_FLAGS_RELEASE='-stdlib=libc++ -I/usr/lib/llvm-10/include/c++/v1 -O3 -DNDEBUG -fno-omit-frame-pointer'
linker flags:
-DCMAKE_EXE_LINKER_FLAGS_RELEASE="-L/usr/lib/llvm-10/lib -lc++ -lc++abi -fno-omit-frame-pointer -O3"
-DCMAKE_BUILD_TYPE=Release
Plus CMake settings:
CFG_ENABLE_TEST_SUITE:BOOL=ON
CFG_NO_LTO:BOOL=OFF <-- Important
Lastly (important):
Problem reproduced without ASAN (-fsanitize=address in CMAKE_*_FLAGS_RELEASE)
Problem not reproduced with it (+ ASAN detects no errors)

Sufficient to build just libipc_unit_test.exec: make -j32 libipcunit_test.exec
then go (from build dir) to test/suite/unit_test
and run the test in question: ./libipc_unit_test.exec --gtest_filter=Shm_session_test.In_process_array

Expected behavior: it passes; no seg-fault. Observed behavior as-is: it passes; no seg-fault.

The thing to investigate:

  • Edit ipc_transport_structured/src/ipc/transport/struc/msg.hpp.
  • Find Msg_out move ctor impl.
  • Replace its ~4 line + long comment impl (where it invokes the move-assignment operator) with 1-line impl:
    = default;
  • Build libipc_unit_test.exec as above again.
  • Run it as above again.

Expected behavior: it passes; no seg-fault. Observed behavior (with the above code change): it seg-faults in the move ctor. (In backtrace seen by gdb libipc_unit_test.exec core it is shown as being within Client_session_impl::mdt_builder() due to inlining.)

(The mdt_builder() code tickling the problem is this:

Master_channel_req_ptr req_ptr(new Master_channel_req{ mdt_root, std::move(req_msg) });

The moving of Msg_out req_msg into the Master_channel_req member => seg-fault.)

--

Now as to WTF this is all about, the comment in the code should explain it.

TEMPLATE_STRUCT_MSG_OUT
CLASS_STRUCT_MSG_OUT::Msg_out(Msg_out&& src) :
  Msg_out()
{
  operator=(std::move(src));

  /* (This comment is by ygoldfel, original author of this code -- here and all of this class template and file.)
   * This implementation is clearly correct (functionally); one can see a similar pattern used in many places in,
   * e.g., STL/Boost.  However the original, and all else being equal (which it's not; read on), my preferred
   * implementation of this move ctor is simply: `= default;`.  In that original state I tested this guy and
   * ran into the following.
   *
   * As of this writing, to test this (and all of Msg_out and much else besides), I've been using the
   * transport_test.exec exercise-mode section; and the unit_test.exec test suite (both in the `ipc` meta-project,
   * though most of the relevant .../test/... code for unit_test.exec is in `ipc_shm_arena_lend` project therein).
   * (Normally we don't speak of test code inside source code proper like this, but read on to see why the
   * exception.)  transport_test has passed in all configurations (build/run envs) without issue.
   * unit_test.exec has passed in at least the following (all Linux) except the *asterisked* one:
   *   - gcc-9, -O0 or: (-O3 + LTO disabled or LTO (with fat-object-generation on) enabled (for all of libflow
   *     and libipc_* and the test code proper)).
   *   - clang-17 + libc++ (LLVM-10) (note: not GNU stdc++), -O0 or: (-O3 + LTO disabled or LTO (-flto=thin) enabled*
   *     (for all of libflow and libipc_* and the test code proper)).
   *
   * The *asterisk* there denotes the one config where a problem was observed.  Namely,
   * Shm_session_test.In_process_array unit_test failed, seg-faulting before the test could complete (no actual
   * test-failed assertions had triggered up to the seg-fault point).  The seg-fault was completely consistent
   * (at least given a particular machine): in session::Client_session_impl::mdt_builder(), just near the end
   * of the method, a Msg_out was being move-cted (presumably using this ctor here) from another Msg_out
   * that had just been constructed a few lines above.  Granted, that source Msg_out was constructed and
   * populated (rather straightforwardly) in another thread, but a synchronization mechanism was used to ensure
   * this happened synchronously before the code that seg-faulted (this move-ctor) would proceed to be called.
   * (I verified quite studiously that nothing untoward was happening there; clearly the source object was
   * created cleanly before signaling the thread waiting to proceed with this move-ct to quit waiting and proceed.)
   * (In addition note that this Msg_out ctor was, as seen in debugger, clearly getting auto-inlined; since the prob
   * occurred with LTO but not without -- though gcc's LTO had no issue, but I digress -- this may be significant.)
   *
   * First I labored to isolate where the problem occurred; there was no corrupted memory or anything strange like
   * that; and it fairly clearly really was in the move ctor (which, again, was just `= default;` at the time).
   * Since the move ctor was auto-generated, it was somewhat challenging to see where the problem happened, though
   * admittedly due to time pressure I did not delve into the move ctors *that* would've invoked: for
   * m_builder, m_body_root, m_hndl_or_null (doing so might be a good idea; but keep reading).
   *
   * At that point, just to see what's up, I "jiggled" the implementation into its current form.  Empirically
   * speaking the problem went away, and everything passed swimmingly with no instability.  Hence I left the
   * work-around in place.  So where does it leave us?
   *
   * 1, in terms of correctness and generated-code quality: As noted, the new impl is clearly correct.  In terms of
   * generated-code quality, it is potentially a few instructions slower than the auto-generated ctor:
   * the three m_* are first initialized to their null states and then move-assigned; plus
   * store_native_handle_or_null() checks m_hndl_or_null for null (which it is, so that `if` no-ops).
   * It's simple/quick stuff, and it might even get optimized away with -O3, but nevertheless skipping to
   * move-ction of the members would be more certain to not do that unneeded zeroing stuff.  Subjectively: it's
   * probably fine (undoubtedly there are far heavier perf concerns than a few extra zeroings).
   *
   * 2, there's the mystery.  I.e., sure, the replacement code is fine and is unlikely to be a perf problem;
   * but why isn't the preferred `= default;` the one in actual use?  Why did it cause the failure, though only
   * in a very particular build/run environment (clang-17, LLVM-10 libc++, with thin-LTO), when similar ones
   * (such as same but without LTO) were fine?  At this stage I honestly do not know and will surely file a
   * ticket to investigate/solve.  Briefly the culprit candidates are:
   *   - Msg_out code itself.  Is there some unseen uninitialized datum?  Some basic assumption being ignored or
   *     C++ rule being broken? ###
   *   - Something in capnp-generated move-ctor (as of this writing capnp version = 1.0.1, from late 2023) or
   *     m_builder move-ctor.
   *   - Something in clang-17 + thin-LTO optimizer improperly reordering instructions.
   *
   * I cannot speculate much about which it is; can only say after a few hours of contemplating possibilities:
   * no candidates for ### (see above) being at fault has jumped out at me.  That said, no run-time sanitizer
   * has run through this code as of this writing (though the code analyzer, Coverity, has not complained);
   * that could help.  Sanitizer or not, generally:
   * given 0-2 days of investigation by an experienced person surely we can narrow this down to a
   * minimal repro case, etc. etc.  So it is solvable: just needs to be done.
   *
   * Until then: (1) this remains a mystery; and (2) there is an acceptable work-around.  (Though the mystery
   * is personally disconcerting to me; e.g., as of this writing, I cannot name another such mystery in this
   * entire code base.) */
} // CLASS_STRUCT_MSG_OUT::Msg_out(Msg_out&&)

Please note: ASAN passes, as of this writing, throughout this unit test and all known tests (unit_test ones or transport_test ones or otherwise). It "feels" like there could be a gremlin from another thread, but further experiments and work seem to suggest otherwise. So one way or another it's a matter of investigating as noted in the above comment.

(Aside: It's true that as of the time of that writing no sanitizer had run from the code. As pointed out here, though, since then it absolutely has -- and passed. So the plot thickens a little.)

That said, there are no known functional or correctness problems in the actual committed code -- merely the mystery of why the change had to be made. Still I don't like such mysteries.

@ygoldfeld ygoldfeld added question Further information is requested from-akamai-pre-open Issue origin is Akamai, before opening source labels Mar 12, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
from-akamai-pre-open Issue origin is Akamai, before opening source question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant