Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Artio Hangs Until Own Heartbeat After Resend Request #545

Open
mmottl opened this issue Feb 5, 2025 · 0 comments
Open

Artio Hangs Until Own Heartbeat After Resend Request #545

mmottl opened this issue Feb 5, 2025 · 0 comments

Comments

@mmottl
Copy link

mmottl commented Feb 5, 2025

Artio Hangs Until Own Heartbeat After Resend Request

Environment

  • Tested Artio versions: 0.162 and 0.164
  • Platform: MacOS 15.3
  • Java: JDK 21

While trying to debug issue #544, we came across a sequence of connections and messages that triggers presumably incorrect, possibly related behavior in Artio.

Reproduction Steps

Here is the logic, implemented in OCaml, we used to trigger this scenario:

OCaml Client Test Code
let connect_order_and_disconnect () =
  printf "----------------- CONNECT, ORDER, AND DISCONNECT -----------------\n";
  protectx (open_connection ()) ~finally:Conn.close ~f:(fun conn ->
      (* Logon *)
      let logon =
        create_logon ~reset_seq_num_flag:Field.yes ~heart_bt_int:30 ()
      in
      Conn.send_msg ~log conn logon;
      Conn.ignore_msg ~log conn;

      (* New order *)
      let order = create_new_order ~cl_ord_id:1 () in
      Conn.send_msg ~log conn order;
      Conn.ignore_msg ~log conn;

      (* Logout *)
      Conn.send_msg ~log conn (create_logout ());
      Conn.ignore_msg ~log conn)

let confuse () =
  printf "------------------------ CONFUSE ------------------------\n";
  protectx (open_connection ()) ~finally:Conn.close ~f:(fun conn ->
      (* Logon *)
      let logon = create_logon ~heart_bt_int:30 () in
      Conn.send_msg ~log conn logon;
      Conn.ignore_msgs ~log conn 1;

      for _i = 1 to 100 do
        let resend_request =
          create_resend_request ~begin_seq_no:2 ~end_seq_no:0 ()
        in
        Conn.send_msg ~log conn resend_request;
        Conn.ignore_msg ~log conn
      done)

let () =
  while true do
    Utils.sleep 1.0;
    connect_order_and_disconnect ();
    Utils.sleep 1.0;
    confuse ()
  done

Here is a description in words:

First connection (connect, order, and disconnect):

  1. We connect to Artio, exchange logons with a sequence number reset and heartbeat interval of 30s.
  2. We send a new order and ignore the reply (order confirmation ER).
  3. We exchange logouts and close the connection.

Note that we keep one second intervals between Artio connections to prevent Artio from sending resend requests.

Second connection (confuse Artio with resend requests):

  1. We connect to Artio and exchange logons with a heartbeat interval of 30s but without sequence number reset.
  2. We send 100 resend requests for all messages starting from the previously received execution report. Note that these are four messages, first the ER, then a sequence-reset-gapfill for the logout, then an ER for the canceled order, and then another gapfill for the new logon. We do, however, only read and ignore one of these messages after each resend request. This seems necessary to trigger the bug. This may relate to buffering.
  3. We close the connection.

We keep iterating these two kinds of connections since the problem is not quite deterministic, but it typically happens already on the second iteration.

Observed Behavior

  • The first connection always behaves exactly as expected.

  • On the first iteration, the second connection will flood Artio with resend requests. Eventually, Artio will emit the following error, which may be necessary for the scenario to unfold:

    Artio Error Log
    uk.co.real_logic.artio.fixgatewayexception: ignore resend request for sessionid=1,connectionid=2301866843017191459 as 5 requests in flight
        at uk.co.real_logic.artio.engine.logger.replayer.onresendrequest(replayer.java:292)
        at uk.co.real_logic.artio.engine.logger.replayer.onfragment(replayer.java:208)
        at io.aeron.image.controlledpoll(image.java:385)
        at io.aeron.subscription.controlledpoll(subscription.java:235)
        at uk.co.real_logic.artio.engine.logger.replayer.dowork(replayer.java:478)
        at org.agrona.concurrent.compositeagent.dowork(compositeagent.java:120)
        at org.agrona.concurrent.agentrunner.dowork(agentrunner.java:304)
        at org.agrona.concurrent.agentrunner.workloop(agentrunner.java:296)
        at org.agrona.concurrent.agentrunner.run(agentrunner.java:162)
        at java.base/java.lang.thread.run(thread.java:1583)

    The Artio FIX replies will cycle over the four requested messages. Here is a partial log for OCaml client CGXSCGXSUS01 and Artio server CGXS:

    FIX Log of First Connection
    ----------------- CONNECT, ORDER, AND DISCONNECT -----------------
    8=FIX.4.2|9=77|35=A|34=1|52=20250204-22:26:49.987|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|141=Y|10=231|
    8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=1|52=20250204-22:26:50.057|98=0|108=30|141=Y|10=211|
    8=FIX.4.2|9=156|35=D|34=2|52=20250204-22:26:50.001|49=CGXSCGXSUS01|56=CGXS|11=ORDER_1|21=1|18=M|110=100|55=BAC|54=1|60=20250204-22:26:50.001|38=100|40=P|44=34.35|59=0|47=A|10=038|
    8=FIX.4.2|9=266|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:26:50.063|6=0|11=ORDER_1|14=0|17=418d904ee936543fc1fd5ce464c03e6b|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.001|110=100|150=0|151=100|10=208|
    8=FIX.4.2|9=59|35=5|34=3|52=20250204-22:26:50.006|49=CGXSCGXSUS01|56=CGXS|10=125|
    8=FIX.4.2|9=59|35=5|49=CGXS|56=CGXSCGXSUS01|34=3|52=20250204-22:26:50.063|10=128|
    ------------------------ CONFUSE ------------------------
    8=FIX.4.2|9=71|35=A|34=4|52=20250204-22:26:51.011|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|10=154|
    8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=5|52=20250204-22:26:51.083|98=0|108=30|141=N|10=204|
    8=FIX.4.2|9=68|35=2|34=5|52=20250204-22:26:51.026|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=251|
    8=FIX.4.2|9=297|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:26:51.046|43=Y|122=20250204-22:26:50.063|6=0|11=ORDER_1|14=0|17=418d904ee936543fc1fd5ce464c03e6b|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.001|110=100|150=0|151=100|10=207|
    8=FIX.4.2|9=68|35=2|34=6|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=000|
    8=FIX.4.2|9=75|35=4|49=CGXS|56=CGXSCGXSUS01|34=3|43=Y|52=20250204-22:26:51.046|123=Y|36=4|10=133|
    8=FIX.4.2|9=68|35=2|34=7|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=001|
    8=FIX.4.2|9=303|35=8|49=CGXS|56=CGXSCGXSUS01|34=4|52=20250204-22:26:51.046|43=Y|122=20250204-22:26:50.666|6=0|11=ORDER_1|14=0|17=6db07b0fc2f51f6ece0e965329a05598|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=4|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.604|110=100|150=4|151=0|20007=6|10=235|
    8=FIX.4.2|9=68|35=2|34=8|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=002|
    8=FIX.4.2|9=75|35=4|49=CGXS|56=CGXSCGXSUS01|34=5|43=Y|52=20250204-22:26:51.046|123=Y|36=6|10=137|
    8=FIX.4.2|9=68|35=2|34=9|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=003|
    8=FIX.4.2|9=297|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:26:51.063|43=Y|122=20250204-22:26:50.063|6=0|11=ORDER_1|14=0|17=418d904ee936543fc1fd5ce464c03e6b|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.001|110=100|150=0|151=100|10=206|
    
    ... repeating the resend requests and cycling over the replies ...
    
  • After the second connection, we restart with "connect, order, and disconnect", which resets the sequence number, i.e. this should behave like a fresh session. We should expect the same behavior as above, but instead we see this:

    FIX Log of Second Connection
    ----------------- CONNECT, ORDER, AND DISCONNECT -----------------
    8=FIX.4.2|9=77|35=A|34=1|52=20250204-22:35:20.932|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|141=Y|10=210|
    8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=1|52=20250204-22:35:20.949|98=0|108=30|141=Y|10=218|
    8=FIX.4.2|9=156|35=D|34=2|52=20250204-22:35:20.947|49=CGXSCGXSUS01|56=CGXS|11=ORDER_1|21=1|18=M|110=100|55=BAC|54=1|60=20250204-22:35:20.947|38=100|40=P|44=34.35|59=0|47=A|10=070|
    8=FIX.4.2|9=266|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:35:20.955|6=0|11=ORDER_1|14=0|17=b944e364cb038ccb173f730795e540c8|18=M|20=0|31=0|32=0|37=540185fc4250b94fc93f28225e291578|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:35:20.947|110=100|150=0|151=100|10=042|
    8=FIX.4.2|9=59|35=5|34=3|52=20250204-22:35:20.955|49=CGXSCGXSUS01|56=CGXS|10=135|
    8=FIX.4.2|9=59|35=5|49=CGXS|56=CGXSCGXSUS01|34=3|52=20250204-22:35:20.956|10=136|
    ------------------------ CONFUSE ------------------------
    8=FIX.4.2|9=71|35=A|34=4|52=20250204-22:35:21.956|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|10=169|
    8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=5|52=20250204-22:35:21.970|98=0|108=30|141=N|10=206|
    8=FIX.4.2|9=68|35=2|34=5|52=20250204-22:35:21.969|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=008|
    
    .......... We block here for exactly 24 seconds until we receive a heartbeat ..........
    
    8=FIX.4.2|9=59|35=0|49=CGXS|56=CGXSCGXSUS01|34=6|52=20250204-22:35:45.970|10=137|
    8=FIX.4.2|9=68|35=2|34=6|52=20250204-22:35:45.970|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=007|
    8=FIX.4.2|9=297|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:35:21.986|43=Y|122=20250204-22:35:20.955|6=0|11=ORDER_1|14=0|17=b944e364cb038ccb173f730795e540c8|18=M|20=0|31=0|32=0|37=540185fc4250b94fc93f28225e291578|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:35:20.947|110=100|150=0|151=100|10=051|
    

    Note that the requested messages for the first resend request only arrive when Artio gets "unstuck" by its own heartbeat after 24 seconds. But the subsequent resend requests were honored immediately just fine again. The 24 seconds gap corresponds to 80% of the heartbeat interval. If we change the heartbeat interval, this 80% ratio will be maintained. E.g. if we set it to 200 seconds, it will take 160 seconds to get unstuck.

Expected Behavior

  • A new connection followed by a successful logon with sequence number reset should clear out any previous issues. It should not be the case that a perfectly normal resend request after that point would apparently only get back to normal operation after an Artio heartbeat.

Notes

Here are some general observations:

  • The bug trigger is quite brittle in the sense that small modifications to the logic may fail to cause the issue.
  • It seems of particular interest that heartbeats sent by Artio get it back into a responsive state.
  • The bug may require "slow" reading of the incoming messages by the client. Not reading them at all did not trigger the issue, nor did reading all four resent messages immediately after each request. This may be due to unrelated reasons, but might also indicate issues related to buffering.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant