Monitoring and Debugging

Martin Thompson edited this page Apr 5, 2019 · 78 revisions

Debugging and monitoring distributed systems can be a challenge. Aeron has been designed in an open fashion so that much of its internal state can be observed during operation. By taking an open approach we hope to simplify this challenge. If you have suggestions for improvements then we would love to hear them. We try to accommodate suggestions that will benefit the Aeron user community provided they do not violate the Design Principles. Suggestions can be posted to the Issues or discussed in the Gitter room. Some of the tools described below can be found in the samples module. These tools tend to be quite simple and can be easily customised.

Scripts are also available in the samples module to make using these tools a little easier.

  1. Errors
  2. System and Position Counters
  3. Log Inspection
  4. Debug Logging
  5. Loss Reporting

Errors

Rather than writing to log files, which can fill disks when an issue is chronic, Aeron records errors to a section of its CnC (Command and Control) file as distinct errors, each with a count of observations plus the time of first and last observation. This means that when the same error is experienced many times only the count and the latest observation timestamp are updated. In the unlikely event that the distinct error log in the CnC file fills up, further errors are sent to STDERR. The amount of space required for distinct errors is typically not very large and can be configured with the following system property to the Media Driver. Remember only distinct errors are recorded in full.

aeron.error.buffer.length=<default is 1MB>

If errors are present in the distinct error log when the driver starts then they will be copied to a file named <timestamp>-error.log in text format. The log is cleared down on each invocation of the Media Driver.
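The aggregation described above can be illustrated with a small sketch. This is not Aeron's implementation: the class name DistinctErrorLogSketch, the in-memory map, and the rule that two errors are "the same" when their type and message are equal are all assumptions made for illustration. The real log lives in a fixed-length section of the CnC file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the aggregation idea only, not Aeron's distinct error log.
final class DistinctErrorLogSketch
{
    // One entry per distinct error; repeats only touch count and lastTimestampMs.
    static final class Observation
    {
        final long firstTimestampMs;
        long lastTimestampMs;
        int count;

        Observation(final long timestampMs)
        {
            firstTimestampMs = timestampMs;
            lastTimestampMs = timestampMs;
            count = 1;
        }
    }

    private final Map<String, Observation> observations = new LinkedHashMap<>();

    // Assumption: errors are considered identical when type and message are equal.
    void record(final Throwable error, final long nowMs)
    {
        final String key = error.getClass().getName() + ": " + error.getMessage();
        final Observation existing = observations.get(key);
        if (existing == null)
        {
            observations.put(key, new Observation(nowMs));
        }
        else
        {
            existing.count++;
            existing.lastTimestampMs = nowMs;
        }
    }

    Observation lookup(final String key)
    {
        return observations.get(key);
    }

    int distinctCount()
    {
        return observations.size();
    }

    public static void main(final String[] args)
    {
        final DistinctErrorLogSketch log = new DistinctErrorLogSketch();
        log.record(new IllegalStateException("channel closed"), 1_000L);
        log.record(new IllegalStateException("channel closed"), 2_000L);
        log.record(new IllegalArgumentException("bad stream id"), 3_000L);

        // Two distinct entries are printed, the first with a count of 2.
        log.observations.forEach((key, o) -> System.out.println(
            o.count + " observations, first=" + o.firstTimestampMs +
            ", last=" + o.lastTimestampMs + ", " + key));
    }
}
```

Because repeats update an existing entry in place, the space used grows with the number of distinct errors and not with how often each one occurs.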

The command line tool ErrorStat can be used to read the error log at any time during Media Driver operation, or even after the Media Driver has been shut down, provided the CnC file still exists.

$ java -cp aeron-samples/build/libs/samples.jar \
  [-Daeron.dir=<path to aeron directory>] \
  io.aeron.samples.ErrorStat

Counters

Aeron tracks a lot of its internal state as AtomicCounters in a memory mapped file. This file can be read by any external process with no significant impact on performance. These counters are divided into two groups.

  1. System Counters: Counters of significant events observed in the system such as errors, counts, rates, and hints that further investigation is warranted.
  2. Stream Counters: Counters for tracking and limiting progress on byte streams of messages.
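To illustrate why an external process can observe counters held in a memory mapped file, here is a minimal sketch using only the JDK. The layout (counter N at byte offset N * 8), the file name, and the class MappedCounterSketch are hypothetical and do not reflect Aeron's actual file format. The two mappings stand in for a writer process and an observer process.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a counter shared through a memory mapped file.
final class MappedCounterSketch
{
    // Assumed layout for illustration: counter N lives at byte offset N * 8.
    static final int COUNTER_LENGTH = Long.BYTES;
    static final int COUNTER_COUNT = 4;

    static MappedByteBuffer map(final Path file, final boolean writable) throws IOException
    {
        try (FileChannel channel = writable ?
            FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE) :
            FileChannel.open(file, StandardOpenOption.READ))
        {
            final FileChannel.MapMode mode = writable ?
                FileChannel.MapMode.READ_WRITE : FileChannel.MapMode.READ_ONLY;
            // The mapping stays valid after the channel is closed.
            return channel.map(mode, 0, (long)COUNTER_COUNT * COUNTER_LENGTH);
        }
    }

    // Writes a value through one mapping and reads it back through a second, read-only one.
    static long writeThenObserve(final long value)
    {
        try
        {
            final Path file = Files.createTempFile("counters-sketch", ".dat");
            file.toFile().deleteOnExit();

            final int counterId = 2;
            final MappedByteBuffer writer = map(file, true);   // extends the file to full length
            final MappedByteBuffer observer = map(file, false);

            // A real cross-process reader needs ordered or volatile access,
            // which this single-threaded sketch omits.
            writer.putLong(counterId * COUNTER_LENGTH, value);
            return observer.getLong(counterId * COUNTER_LENGTH);
        }
        catch (final IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(final String[] args)
    {
        System.out.println("observed counter value: " + writeThenObserve(42L));
    }
}
```

The observer only reads pages that the writer has already updated, so observation adds no work to the writer's path.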

Counters can be inspected for any Media Driver with the AeronStat command line tool:

$ java -cp aeron-samples/build/libs/samples.jar \
  [-Daeron.dir=<path to aeron directory>] \
  io.aeron.samples.AeronStat [filter options]

AeronStat will run continuously and output the counters once per second. The default is to output all counters if no filter criteria are provided.

# All system counters
$ java -cp aeron-samples/build/libs/samples.jar \
  io.aeron.samples.AeronStat type=0

# For just the count of errors
$ java -cp aeron-samples/build/libs/samples.jar \
  io.aeron.samples.AeronStat type=0 identity=15

The filter criteria options are:

  1. type: This is the type id of the counter; 0 is system counters; 1-4 are some of the stream counters.
  2. identity: The key which identifies the counter within its type scope.
  3. session: Session id to be used with position counters.
  4. stream: Stream id to be used with position counters.
  5. channel: Channel to be used with position counters.

The filter criteria are regular expressions, which can be useful for selecting a range of streams.

# All position counters
$ java -cp aeron-samples/build/libs/samples.jar \
  io.aeron.samples.AeronStat type=[1-4]

# The counters for a specific stream
$ java -cp aeron-samples/build/libs/samples.jar \
  io.aeron.samples.AeronStat type=[1-4] \
  session=123456 stream=10 'channel=aeron:udp\?endpoint=localhost:40123'

Note: Remember to escape special characters in the filter criteria, e.g. the "?" character in the channel filter.
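The need for escaping follows from ordinary regular expression rules, which can be checked with java.util.regex. In the sketch below the accepts helper and the choice of a full match are assumptions for illustration. How AeronStat applies the pattern internally is not stated here, but the unescaped channel pattern fails to match the literal channel either way, because "p?" means "an optional p".

```java
import java.util.regex.Pattern;

// Hypothetical sketch of how regular expression filters behave.
final class FilterEscapingSketch
{
    // Assumption: a value is accepted when the whole value matches the filter.
    static boolean accepts(final String filter, final String value)
    {
        return Pattern.compile(filter).matcher(value).matches();
    }

    public static void main(final String[] args)
    {
        final String channel = "aeron:udp?endpoint=localhost:40123";

        // A character class selects a range of type ids.
        System.out.println(accepts("[1-4]", "3")); // true
        System.out.println(accepts("[1-4]", "0")); // false

        // Unescaped '?' is a quantifier, so the literal channel is not matched.
        System.out.println(accepts("aeron:udp?endpoint=localhost:40123", channel)); // false

        // Escaping '?' makes it a literal character.
        System.out.println(accepts("aeron:udp\\?endpoint=localhost:40123", channel)); // true

        // Pattern.quote escapes every special character in one go.
        System.out.println(accepts(Pattern.quote(channel), channel)); // true
    }
}
```

On the command line the pattern should also be quoted for the shell, as in the example above, so that characters such as "?" and "[" are not expanded before the tool sees them.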

To get a rolled up view of streams, with the associated counters for each stream on one line, use the StreamStat tool:

$ java -cp aeron-samples/build/libs/samples.jar \
  [-Daeron.dir=<path to aeron directory>] \
  io.aeron.samples.StreamStat

Log Inspection

It is possible to inspect the contents of either a publication log buffer on the producer side of a connection, or the rebuilt image on the consumer side, with the LogInspector command line tool. This tool takes two arguments. The first is the filename of the log to be inspected. The second, optional, argument is a limit on the number of bytes to be dumped for the body of each message in the log. Additionally, the following system properties can be provided to customise the output:

  • aeron.log.inspector.data.format: configures the output format for the body of each message which defaults to hex. The other supported value is ascii which can be useful when you know the message contents are strings.
  • aeron.log.inspector.skipDefaultHeader: is a boolean flag indicating if the default header output should be skipped, defaults to false. Valid values are true and false.
  • aeron.log.inspector.scanOverZeroes: is a boolean flag indicating if the inspector should skip over runs of zeros in the file, defaults to false. Useful for scanning a log that joined late or experienced loss.

# Dump the contents of a log with up to 200 bytes of each message in hex.
$ java -cp aeron-samples/build/libs/samples.jar \
  io.aeron.samples.LogInspector \
  /dev/shm/aeron-mjpt777/publications/<filename>.logbuffer 200 > dump.txt

# Dump the contents of a log with up to 50 bytes of each message in ASCII.
$ java -cp aeron-samples/build/libs/samples.jar \
  -Daeron.log.inspector.data.format=ascii \
  io.aeron.samples.LogInspector \
  /dev/shm/aeron-mjpt777/images/<filename>.logbuffer 50 > dump.txt

This also works for archive segment files:

# Dump the contents of a segment file with up to 50 bytes of each message in ASCII.
$ java -cp aeron-samples/build/libs/samples.jar \
  -Daeron.log.inspector.data.format=ascii \
  io.aeron.samples.archive.SegmentInspector \
  <segment-filename>.rec 50 > dump.txt

Debug Logging

Aeron does not take the common approach of littering a code base with logging statements to aid debugging. Instead logging statements are dynamically woven into a running system via a Java Agent. This allows for byte code weaving to dynamically add logging statements into the code where required. An example Java agent for adding logging can be found in the aeron-agent module.

The event logging agent can be added to the Media Driver on start up as follows, or as in this script:

$ java -cp aeron-samples/build/libs/samples.jar \
  -javaagent:aeron-agent/build/libs/aeron-agent-<version>-all.jar \
  -Daeron.event.log=admin,FRAME_IN \
  io.aeron.driver.MediaDriver

The agent is configured with the following system properties:

  • aeron.event.log.filename: System property for the file to which the log is appended. If not set then STDOUT will be used. Logging to file is significantly more efficient than logging to STDOUT.
  • aeron.event.buffer.length: System property for length of the in-memory buffer used between the capture pointcut and the log reader. Defaults to 2MB.
  • aeron.event.log.reader.classname: System property for the log reader class which consumes the event buffer. Defaults to io.aeron.agent.EventLogReaderAgent.
  • aeron.event.log: System property containing a comma separated list of driver event codes. See below for some details.
  • aeron.event.archive.log: System property containing a comma separated list of archive event codes.
  • aeron.event.cluster.log: System property containing a comma separated list of cluster event codes.

The aeron.event.log system property takes a comma separated list of event codes to be logged, to STDOUT unless aeron.event.log.filename is set. The possible values are defined in enums that implement EventCode, which include DriverEventCode, ArchiveEventCode, and ClusterEventCode. Two groups are defined as shorthand: all for all possible events, and admin for administration events. An inclusive set is made from the comma separated list. For individual events either the enum name or the id can be used, e.g. FRAME_IN and 1 map to the same event code.
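The way an inclusive set is built from such a list can be sketched as follows. This is not the agent's code: the enum DemoEventCode is a hypothetical stand-in for a real EventCode enum, only FRAME_IN = 1 is taken from the text above, and the other names, ids, the admin flags, and the decision to reject unknown tokens are assumptions for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of turning a comma separated property into a set of event codes.
final class EventCodeSelectionSketch
{
    // Stand-in enum; only FRAME_IN = 1 comes from the documentation, the rest is invented.
    enum DemoEventCode
    {
        FRAME_IN(1, false),
        FRAME_OUT(2, false),
        ADD_PUBLICATION(3, true),
        REMOVE_PUBLICATION(4, true);

        final int id;
        final boolean admin;

        DemoEventCode(final int id, final boolean admin)
        {
            this.id = id;
            this.admin = admin;
        }
    }

    // Builds the union of every group, name, or id in the list.
    static Set<DemoEventCode> parse(final String property)
    {
        final Set<DemoEventCode> enabled = EnumSet.noneOf(DemoEventCode.class);
        if (property == null || property.trim().isEmpty())
        {
            return enabled;
        }

        for (final String raw : property.split(","))
        {
            final String token = raw.trim();
            if ("all".equals(token))
            {
                enabled.addAll(EnumSet.allOf(DemoEventCode.class));
            }
            else if ("admin".equals(token))
            {
                for (final DemoEventCode code : DemoEventCode.values())
                {
                    if (code.admin)
                    {
                        enabled.add(code);
                    }
                }
            }
            else
            {
                enabled.add(lookup(token));
            }
        }

        return enabled;
    }

    // Accepts either the enum name or the numeric id; unknown tokens are rejected (an assumption).
    static DemoEventCode lookup(final String token)
    {
        for (final DemoEventCode code : DemoEventCode.values())
        {
            if (code.name().equals(token) || Integer.toString(code.id).equals(token))
            {
                return code;
            }
        }

        throw new IllegalArgumentException("unknown event code: " + token);
    }

    public static void main(final String[] args)
    {
        System.out.println(parse("admin,FRAME_IN")); // [FRAME_IN, ADD_PUBLICATION, REMOVE_PUBLICATION]
        System.out.println(parse("1").equals(parse("FRAME_IN"))); // true
    }
}
```

Using a set means that overlapping entries, such as a group plus an event the group already contains, have no additional effect.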

Loss Reporting

When loss is detected at the receiver side it is logged to the Loss Report (loss-report.dat) in the Aeron driver directory. The report contains one aggregate entry per stream, recording the number of times loss was observed, total bytes lost, time of first observation, time of last observation, and the details for the stream. This report can be read with the LossStat tool or by parsing the file, as the format is published. The LossStat tool will output to STDOUT in a CSV format for storage and later analysis.

$ java -cp aeron-samples/build/libs/samples.jar \
  io.aeron.samples.LossStat