Error Handling in Aeron C Code

Michael Barker edited this page Nov 29, 2023 · 7 revisions

In order to make it straightforward to trace the source of errors and to track context information when an error occurs, Aeron has a specific approach to handling and reporting errors. An error stack is produced so that it is possible to trace the path through the code that led to the error. This is similar to the stack trace that would be produced by an exception in Java, but has some important differences. Firstly, it needs to be manually constructed, i.e. each step in the stack needs to be set by code. Secondly, it allows additional context information to be captured at each level in the stack. Being able to append user data to the error stack means context information can be included in the error without having to pass it up or down the call tree just for that purpose.

Within the file aeron_err.h two macros are defined, AERON_SET_ERR and AERON_APPEND_ERR. One of these should be used whenever an error is encountered. The following rules determine which one to use.

AERON_SET_ERR

The AERON_SET_ERR macro should be used whenever an error is first detected. The two most common cases are when a function validates its input and the validation fails, and when an error is detected in a call to an external library, e.g. the C standard library. The format of the AERON_SET_ERR macro is:

AERON_SET_ERR(errcode, fmt, ...)

The first parameter is an error code. This should be a standard POSIX error code, or more specifically a code that can be resolved to an error message via a call to strerror (and similar functions). Note that Windows has some support for errno and strerror, so if you are making a Windows call that sets errno, then use this macro. However, some Windows API calls use GetLastError() and WSAGetLastError(), in which case you must use AERON_SET_ERR_WIN, see below for more details. The remaining parameters are a format string and its arguments, which are used to add context information to the error message. Note that the format string is always required, so if you have a single string message that you want to include, you should still use a "%s" format string and pass the message as its argument.
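The requirement that an error code resolves through strerror can be tried out with the C standard library alone. The following is a minimal sketch: describe_errcode is a hypothetical helper, not part of Aeron, and the text returned by strerror depends on the platform.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Aeron): writes "(code) description" into buffer.
 * Returns the snprintf result, i.e. the number of characters the full text needs,
 * or a negative value on an output error.
 */
static int describe_errcode(int errcode, char *buffer, size_t length)
{
    assert(NULL != buffer && length > 0);

    /* strerror maps a POSIX error code such as EINVAL to a descriptive string. */
    return snprintf(buffer, length, "(%d) %s", errcode, strerror(errcode));
}
```

Calling describe_errcode(EINVAL, buffer, sizeof(buffer)) fills the buffer with the code and its description. A value that strerror cannot resolve typically yields only a generic message, which is why codes obtained from GetLastError() or WSAGetLastError() need the separate Windows macro.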

Validation Example

In this case we are checking that the configuration of an endpoint is valid for a publication. Because the error is detected within this function and hasn't come from a failed call to another Aeron function, we use AERON_SET_ERR. We select EINVAL as the error code as it is the closest matching error for the problem we have encountered.

static inline int aeron_driver_conductor_validate_endpoint_for_publication(aeron_udp_channel_t *udp_channel)
{
    if (!aeron_udp_channel_is_multi_destination(udp_channel) &&
        udp_channel->has_explicit_endpoint &&
        aeron_is_wildcard_port(&udp_channel->remote_data))
    {
        AERON_SET_ERR(
            EINVAL,
            "%s has port=0 for publication: channel=%.*s",
            AERON_UDP_CHANNEL_ENDPOINT_KEY,
            (int)udp_channel->uri_length,
            udp_channel->original_uri);
        return -1;
    }

    return 0;
}

External API Example

In this example we are calling the Linux library function sendmmsg, which returns -1 on failure and sets errno to the specific error. Because this error originated outside of the Aeron code base, in a library function, we use AERON_SET_ERR with errno as the error code.

    int num_sent = sendmmsg(transport->fd, msg, msg_i, 0);
    if (num_sent < 0)
    {
        if (EAGAIN == errno || EWOULDBLOCK == errno || ECONNREFUSED == errno || EINTR == errno)
        {
            return 0;
        }
        else
        {
            char addr[AERON_NETUTIL_FORMATTED_MAX_LENGTH];
            aeron_format_source_identity(addr, sizeof(addr), address);
            AERON_SET_ERR(errno, "%s: address=%s (protocol_family=%i)", "failed to sendmmsg", addr, address->ss_family);
            return -1;
        }
    }
    else
    {
        for (int i = 0; i < num_sent; i++)
        {
            *bytes_sent += msg[i].msg_len;
        }

        return num_sent;
    }

AERON_SET_ERR_WIN

When working on Windows, not all errors can be resolved using errno and strerror. Many of the Windows-specific APIs, e.g. networking, use different functions to report error codes and retrieve error descriptions. If the particular API call requires getting the error code via GetLastError() or WSAGetLastError(), then you should use this macro to record the error. This macro ensures that the Windows function FormatMessage is used instead of strerror to retrieve the descriptive form of the error.

int aeron_bind(aeron_socket_t fd, struct sockaddr *address, socklen_t address_length)
{
    if (SOCKET_ERROR == bind(fd, address, address_length))
    {
        char addr_str[AERON_NETUTIL_FORMATTED_MAX_LENGTH];
        aeron_format_source_identity(addr_str, sizeof(addr_str), (struct sockaddr_storage *)address);
        AERON_SET_ERR_WIN(WSAGetLastError(), "failed to bind to address: %s", addr_str);

        return -1;
    }

    return 0;
}

AERON_APPEND_ERR

The AERON_APPEND_ERR macro should be used whenever an error has already occurred and the calling code wants to add additional context and trace information to the error stack. Typically you should assume that if a function prefixed with aeron_ returns an error result (generally -1), then AERON_SET_ERR will have been called. This should be true for the vast majority of the Aeron code base. Any place where this is not done properly should be considered a bug and be addressed. The value of AERON_APPEND_ERR is twofold. Firstly, having not just an error message from the source of the error but also a trace of the path through the code can be invaluable, especially if there are multiple paths through the code to the place where the error occurred. Secondly, it prevents the anti-pattern of having to pass context down the stack to lower level calls just for the purpose of having it included in the error message.

In the following example we are calling aeron_setsockopt, which is a wrapper around the system-supplied setsockopt. If that function fails, it will call AERON_SET_ERR internally; however, that function does not understand the parameters passed in. In the context of aeron_setsockopt (shown below for completeness) the value is just an opaque void pointer, and the function can't do anything with it in order to include it in the error message. However, the caller knows that this is specifically the multicast interface index and can log it as such, without having to pass that information down to aeron_setsockopt to be captured in the error.

...
    if (aeron_setsockopt(
        transport->fd, IPPROTO_IPV6, IPV6_MULTICAST_IF, &params->multicast_if_index, sizeof(params->multicast_if_index)) < 0)
    {
        AERON_APPEND_ERR("failed to set IPPROTO_IPV6/IPV6_MULTICAST_IF option to: %u", params->multicast_if_index);
        goto error;
    }
...

int aeron_setsockopt(aeron_socket_t fd, int level, int optname, const void *optval, socklen_t optlen)
{
    if (setsockopt(fd, level, optname, optval, optlen) < 0)
    {
        AERON_SET_ERR(errno, "setsockopt(fd=%d,...)", fd);
        return -1;
    }

    return 0;
}

Another common pattern with AERON_APPEND_ERR is to simply append an empty string. This is perfectly fine, as all we are doing here is ensuring that we capture the location in the call stack. The recommended practice, whenever an error is returned to a caller and there is nothing else to add, is just to put in AERON_APPEND_ERR("%s", ""). This still adds value as it includes that location in the error trace even if there is no obvious context information that needs to be added.

    aeron_counter_t *counter;
    int64_t *counter_addr = aeron_counters_reader_addr(&conductor->counters_reader, response->counter_id);

    if (aeron_counter_create(
        &counter,
        conductor,
        response->correlation_id,
        response->counter_id,
        counter_addr) < 0)
    {
        AERON_APPEND_ERR("%s", "");
        return -1;
    }

Error Stack

When this works, the resulting message will end up being formatted in a manner similar to the following:

ERROR - (22) Invalid argument
[aeron_driver_conductor_validate_channel_buffer_length, aeron_driver_conductor.c:345] so-sndbuf=65536 does not match existing value of 131072: existingChannel=aeron:udp?endpoint=localhost:9999|so-sndbuf=131072 channel=aeron:udp?endpoint=localhost:9999|so-sndbuf=65536
[aeron_driver_conductor_get_or_add_send_channel_endpoint, aeron_driver_conductor.c:2219] 

The first line will be the system error code with the associated text for that error. In this case it is 22 (EINVAL) with the text "Invalid argument". Then we have the error trace, with the top trace line indicating where the error occurred and the subsequent lines showing the path taken to get there. Each line lists the C function, filename and line number.
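To make the shape of this trace concrete, here is a simplified model of a manually constructed error stack. It is a sketch under stated assumptions, not Aeron's implementation: DEMO_SET_ERR, DEMO_APPEND_ERR, demo_err_add and the two example functions are hypothetical stand-ins for the macros in aeron_err.h and their callers.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in state; Aeron's real storage may differ. */
static int demo_errcode = 0;
static char demo_errmsg[1024] = "";

/* Appends one "[function, file:line] message" line to the trace. */
static void demo_err_add(const char *function, const char *file, int line, const char *fmt, ...)
{
    assert(NULL != fmt);

    size_t used = strlen(demo_errmsg);
    size_t remaining = sizeof(demo_errmsg) - used;

    int n = snprintf(demo_errmsg + used, remaining, "[%s, %s:%d] ", function, file, line);
    if (n < 0 || (size_t)n >= remaining)
    {
        return; /* Buffer full: keep whatever fitted. */
    }
    used += (size_t)n;
    remaining -= (size_t)n;

    va_list args;
    va_start(args, fmt);
    n = vsnprintf(demo_errmsg + used, remaining, fmt, args);
    va_end(args);
    if (n < 0 || (size_t)n >= remaining)
    {
        return;
    }
    used += (size_t)n;
    remaining -= (size_t)n;

    snprintf(demo_errmsg + used, remaining, "\n");
}

/* "Set" records the code and starts a new trace at the current location. */
#define DEMO_SET_ERR(errcode, fmt, ...) \
    do \
    { \
        demo_errcode = (errcode); \
        demo_errmsg[0] = '\0'; \
        demo_err_add(__func__, __FILE__, __LINE__, fmt, __VA_ARGS__); \
    } \
    while (0)

/* "Append" only adds the caller's location and context to the existing trace. */
#define DEMO_APPEND_ERR(fmt, ...) demo_err_add(__func__, __FILE__, __LINE__, fmt, __VA_ARGS__)

static int demo_validate_port(int port)
{
    if (port <= 0)
    {
        DEMO_SET_ERR(EINVAL, "invalid port=%d", port); /* Error first detected here. */
        return -1;
    }

    return 0;
}

static int demo_open_endpoint(int port)
{
    if (demo_validate_port(port) < 0)
    {
        DEMO_APPEND_ERR("%s", ""); /* No extra context, but the location is recorded. */
        return -1;
    }

    return 0;
}
```

After demo_open_endpoint(0) fails, demo_errmsg holds a line for demo_validate_port followed by a line for demo_open_endpoint, mirroring the two trace lines shown above. Because these stand-in macros forward `fmt, __VA_ARGS__`, at least one argument must follow the format string, which is one plausible reason for the "%s", "" idiom; whether Aeron's macros are defined the same way is an assumption here.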

Calling AERON_APPEND_ERR without calling AERON_SET_ERR

This can sometimes happen when a function assumes that a failed call has already set an error when it has not. In this case the error will simply not be reported. When we reach a code entry point, e.g. the ..._do_work method of an agent, we check any error code that may be set. If an error has occurred, it is logged and the error stack is reset. If AERON_SET_ERR has not been called, then the error code may not be set correctly, and the error may not be recorded or returned correctly to the user via a message.
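The failure mode described here can be illustrated with a small model. This is only a sketch of the behaviour as the text describes it: every sketch_ name is hypothetical, and Aeron's actual entry point checks may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in state; not Aeron's implementation. */
static int sketch_errcode = 0;
static char sketch_errmsg[256] = "";

/* Models "set": records the code and starts a fresh message. */
static void sketch_set_err(int errcode, const char *message)
{
    assert(NULL != message);
    sketch_errcode = errcode;
    snprintf(sketch_errmsg, sizeof(sketch_errmsg), "%s\n", message);
}

/* Models "append": adds a line of context but never touches the code. */
static void sketch_append_err(const char *message)
{
    assert(NULL != message);
    size_t used = strlen(sketch_errmsg);
    snprintf(sketch_errmsg + used, sizeof(sketch_errmsg) - used, "%s\n", message);
}

static void sketch_err_clear(void)
{
    sketch_errcode = 0;
    sketch_errmsg[0] = '\0';
}

/*
 * Models the check at a code entry point.
 * Returns 1 if an error was logged and reset, 0 if no error code was set.
 */
static int sketch_report_error(FILE *log)
{
    assert(NULL != log);

    if (0 == sketch_errcode)
    {
        return 0; /* Nothing was "set", so any appended text goes unnoticed. */
    }

    fprintf(log, "ERROR - (%d) %s\n%s", sketch_errcode, strerror(sketch_errcode), sketch_errmsg);
    sketch_err_clear();

    return 1;
}
```

In this model, calling sketch_append_err on its own leaves sketch_errcode at 0, so sketch_report_error returns 0 and the failure is silently dropped. Calling sketch_set_err first and then appending produces a report and resets the state.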