You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Artio Hangs Until Own Heartbeat After Resend Request
Environment
Tested Artio versions: 0.162 and 0.164
Platform: MacOS 15.3
Java: JDK 21
While trying to debug issue #544, we came across a sequence of connections and messages that triggers presumably incorrect, possibly related behavior in Artio.
Reproduction Steps
Here is the logic, implemented in OCaml, we used to trigger this scenario:
First connection (connect, order, and disconnect):
We connect to Artio, exchange logons with a sequence number reset and heartbeat interval of 30s.
We send a new order and ignore the reply (order confirmation ER).
We exchange logouts and close the connection.
Note that we keep one second intervals between Artio connections to prevent Artio from sending resend requests.
Second connection (confuse Artio with resend requests):
We connect to Artio and exchange logons with a heartbeat interval of 30s but without sequence number reset.
We send 100 resend requests for all messages starting from the previously received execution report. Note that these are four messages, first the ER, then a sequence-reset-gapfill for the logout, then an ER for the canceled order, and then another gapfill for the new logon. We do, however, only read and ignore one of these messages after each resend request. This seems necessary to trigger the bug. This may relate to buffering.
We close the connection.
We keep iterating these two kinds of connections since the problem is not quite deterministic, but it typically happens already on the second iteration.
Observed Behavior
The first connection always behaves exactly as expected.
On the first iteration, the second connection will flood Artio with resend requests. Eventually, Artio will emit the following error, which may be necessary for the scenario to unfold:
The Artio FIX replies will cycle over the four requested messages. Here is a partial log for OCaml client CGXSCGXSUS01 and Artio server CGXS:
FIX Log of First Connection
----------------- CONNECT, ORDER, AND DISCONNECT -----------------
8=FIX.4.2|9=77|35=A|34=1|52=20250204-22:26:49.987|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|141=Y|10=231|
8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=1|52=20250204-22:26:50.057|98=0|108=30|141=Y|10=211|
8=FIX.4.2|9=156|35=D|34=2|52=20250204-22:26:50.001|49=CGXSCGXSUS01|56=CGXS|11=ORDER_1|21=1|18=M|110=100|55=BAC|54=1|60=20250204-22:26:50.001|38=100|40=P|44=34.35|59=0|47=A|10=038|
8=FIX.4.2|9=266|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:26:50.063|6=0|11=ORDER_1|14=0|17=418d904ee936543fc1fd5ce464c03e6b|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.001|110=100|150=0|151=100|10=208|
8=FIX.4.2|9=59|35=5|34=3|52=20250204-22:26:50.006|49=CGXSCGXSUS01|56=CGXS|10=125|
8=FIX.4.2|9=59|35=5|49=CGXS|56=CGXSCGXSUS01|34=3|52=20250204-22:26:50.063|10=128|
------------------------ CONFUSE ------------------------
8=FIX.4.2|9=71|35=A|34=4|52=20250204-22:26:51.011|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|10=154|
8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=5|52=20250204-22:26:51.083|98=0|108=30|141=N|10=204|
8=FIX.4.2|9=68|35=2|34=5|52=20250204-22:26:51.026|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=251|
8=FIX.4.2|9=297|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:26:51.046|43=Y|122=20250204-22:26:50.063|6=0|11=ORDER_1|14=0|17=418d904ee936543fc1fd5ce464c03e6b|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.001|110=100|150=0|151=100|10=207|
8=FIX.4.2|9=68|35=2|34=6|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=000|
8=FIX.4.2|9=75|35=4|49=CGXS|56=CGXSCGXSUS01|34=3|43=Y|52=20250204-22:26:51.046|123=Y|36=4|10=133|
8=FIX.4.2|9=68|35=2|34=7|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=001|
8=FIX.4.2|9=303|35=8|49=CGXS|56=CGXSCGXSUS01|34=4|52=20250204-22:26:51.046|43=Y|122=20250204-22:26:50.666|6=0|11=ORDER_1|14=0|17=6db07b0fc2f51f6ece0e965329a05598|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=4|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.604|110=100|150=4|151=0|20007=6|10=235|
8=FIX.4.2|9=68|35=2|34=8|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=002|
8=FIX.4.2|9=75|35=4|49=CGXS|56=CGXSCGXSUS01|34=5|43=Y|52=20250204-22:26:51.046|123=Y|36=6|10=137|
8=FIX.4.2|9=68|35=2|34=9|52=20250204-22:26:51.048|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=003|
8=FIX.4.2|9=297|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:26:51.063|43=Y|122=20250204-22:26:50.063|6=0|11=ORDER_1|14=0|17=418d904ee936543fc1fd5ce464c03e6b|18=M|20=0|31=0|32=0|37=5518a5f609268001c1a5e9d95c9f8b30|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:26:50.001|110=100|150=0|151=100|10=206|
... repeating the resend requests and cycling over the replies ...
After the second connection, we restart with "connect, order, and disconnect", which resets the sequence number, i.e. this should behave like a fresh session. We should expect the same behavior as above, but instead we see this:
FIX Log of Second Connection
----------------- CONNECT, ORDER, AND DISCONNECT -----------------
8=FIX.4.2|9=77|35=A|34=1|52=20250204-22:35:20.932|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|141=Y|10=210|
8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=1|52=20250204-22:35:20.949|98=0|108=30|141=Y|10=218|
8=FIX.4.2|9=156|35=D|34=2|52=20250204-22:35:20.947|49=CGXSCGXSUS01|56=CGXS|11=ORDER_1|21=1|18=M|110=100|55=BAC|54=1|60=20250204-22:35:20.947|38=100|40=P|44=34.35|59=0|47=A|10=070|
8=FIX.4.2|9=266|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:35:20.955|6=0|11=ORDER_1|14=0|17=b944e364cb038ccb173f730795e540c8|18=M|20=0|31=0|32=0|37=540185fc4250b94fc93f28225e291578|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:35:20.947|110=100|150=0|151=100|10=042|
8=FIX.4.2|9=59|35=5|34=3|52=20250204-22:35:20.955|49=CGXSCGXSUS01|56=CGXS|10=135|
8=FIX.4.2|9=59|35=5|49=CGXS|56=CGXSCGXSUS01|34=3|52=20250204-22:35:20.956|10=136|
------------------------ CONFUSE ------------------------
8=FIX.4.2|9=71|35=A|34=4|52=20250204-22:35:21.956|49=CGXSCGXSUS01|56=CGXS|98=0|108=30|10=169|
8=FIX.4.2|9=77|35=A|49=CGXS|56=CGXSCGXSUS01|34=5|52=20250204-22:35:21.970|98=0|108=30|141=N|10=206|
8=FIX.4.2|9=68|35=2|34=5|52=20250204-22:35:21.969|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=008|
.......... We block here for exactly 24 seconds until we receive a heartbeat ..........
8=FIX.4.2|9=59|35=0|49=CGXS|56=CGXSCGXSUS01|34=6|52=20250204-22:35:45.970|10=137|
8=FIX.4.2|9=68|35=2|34=6|52=20250204-22:35:45.970|49=CGXSCGXSUS01|56=CGXS|7=2|16=0|10=007|
8=FIX.4.2|9=297|35=8|49=CGXS|56=CGXSCGXSUS01|34=2|52=20250204-22:35:21.986|43=Y|122=20250204-22:35:20.955|6=0|11=ORDER_1|14=0|17=b944e364cb038ccb173f730795e540c8|18=M|20=0|31=0|32=0|37=540185fc4250b94fc93f28225e291578|38=100|39=0|40=P|44=34.35|47=A|54=1|55=BAC|59=0|60=20250204-22:35:20.947|110=100|150=0|151=100|10=051|
Note that the requested messages for the first resend request only arrive when Artio gets "unstuck" by its own heartbeat after 24 seconds. But the subsequent resend requests were honored immediately just fine again. The 24 seconds gap corresponds to 80% of the heartbeat interval. If we change the heartbeat interval, this 80% ratio will be maintained. E.g. if we set it to 200 seconds, it will take 160 seconds to get unstuck.
Expected Behavior
A new connection followed by a successful logon with sequence number reset should clear out any previous issues. It should not be the case that a perfectly normal resend request after that point would apparently only get back to normal operation after an Artio heartbeat.
Notes
Here are some general observations:
The bug trigger is quite brittle in the sense that small modifications to the logic may fail to cause the issue.
It seems of particular interest that heartbeats sent by Artio get it back into a responsive state.
The bug may require "slow" reading of the incoming messages by the client. Not reading them at all did not trigger the issue, nor did reading all four resent messages immediately after each request. This may be due to unrelated reasons, but might also indicate issues related to buffering.
The text was updated successfully, but these errors were encountered:
Artio Hangs Until Own Heartbeat After Resend Request
Environment
While trying to debug issue #544, we came across a sequence of connections and messages that triggers presumably incorrect, possibly related behavior in Artio.
Reproduction Steps
Here is the logic, implemented in OCaml, we used to trigger this scenario:
OCaml Client Test Code
Here is a description in words:
First connection (connect, order, and disconnect):
Note that we keep one second intervals between Artio connections to prevent Artio from sending resend requests.
Second connection (confuse Artio with resend requests):
We keep iterating these two kinds of connections since the problem is not quite deterministic, but it typically happens already on the second iteration.
Observed Behavior
The first connection always behaves exactly as expected.
On the first iteration, the second connection will flood Artio with resend requests. Eventually, Artio will emit the following error, which may be necessary for the scenario to unfold:
Artio Error Log
The Artio FIX replies will cycle over the four requested messages. Here is a partial log for OCaml client
CGXSCGXSUS01
and Artio serverCGXS
:FIX Log of First Connection
After the second connection, we restart with "connect, order, and disconnect", which resets the sequence number, i.e. this should behave like a fresh session. We should expect the same behavior as above, but instead we see this:
FIX Log of Second Connection
Note that the requested messages for the first resend request only arrive when Artio gets "unstuck" by its own heartbeat after 24 seconds. But the subsequent resend requests were honored immediately just fine again. The 24 seconds gap corresponds to 80% of the heartbeat interval. If we change the heartbeat interval, this 80% ratio will be maintained. E.g. if we set it to 200 seconds, it will take 160 seconds to get unstuck.
Expected Behavior
Notes
Here are some general observations:
The text was updated successfully, but these errors were encountered: