RFC: Evolve host.id into a default, always present attribute (rethinking machine-id) #581

Open
christos68k opened this issue Dec 4, 2023 · 10 comments
@christos68k
Member

christos68k commented Dec 4, 2023

Similarly to #311, I'd like to first describe some issues with the existing host.id semantics around machine-id usage and propose some alternatives.

According to machine-id(5):

It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network. If a stable unique identifier that is tied to the machine is needed for some application, the machine ID or any part of it must not be used directly. Instead the machine ID should be hashed with a cryptographic, keyed hash function, using a fixed, application-specific key.

This goes against OpenTelemetry guidelines which dictate using the value of /etc/machine-id (or /var/lib/dbus/machine-id) verbatim. Besides following the recommendation in the manpage (keyed hash), another alternative is to use UUIDv5 (SHA1), similarly to #312.
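A minimal sketch of the keyed-hash approach the manpage recommends, using HMAC-SHA256 from the Python standard library. The key `APP_KEY` and both function names are hypothetical; OpenTelemetry does not define a key or an algorithm for this.

```python
import hashlib
import hmac

# Hypothetical application-specific key. machine-id(5) asks for a fixed,
# per-application key; no such key is defined by OpenTelemetry.
APP_KEY = b"example-otel-host-id-key"


def read_machine_id(paths=("/etc/machine-id", "/var/lib/dbus/machine-id")):
    """Return the first non-empty machine ID found, or None."""
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as f:
                value = f.read().strip()
        except OSError:
            continue  # file missing or unreadable: try the next location
        if value:
            return value
    return None


def hashed_machine_id(machine_id: str, key: bytes = APP_KEY) -> str:
    """Return HMAC-SHA256(key, machine_id) as a hex string."""
    normalized = machine_id.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()
```

The same machine ID always maps to the same digest for a given key, so the result stays usable for correlation, while a different key yields an unrelated value.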

A secondary issue is that OpenTelemetry guidelines do not specify a fallback for host.id in cases where the machine-id is missing. For example, this is quite common in containerized environments if the Docker volume mount does not exist. In such cases, UUIDv4 can be used to generate a value that may be cached by the client application and reused (for as long as it makes sense given a context that's specific to the application).
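The fallback could look roughly like the sketch below: generate a UUIDv4 once and reuse it from a cache file. The function name and the cache location are assumptions chosen for illustration.

```python
import os
import uuid


def cached_fallback_id(cache_path: str) -> str:
    """Return the ID stored at cache_path, generating a UUIDv4 if absent."""
    try:
        with open(cache_path, encoding="utf-8") as f:
            cached = f.read().strip()
        if cached:
            return cached
    except OSError:
        pass  # no cache yet (or unreadable): fall through and create one

    new_id = str(uuid.uuid4())
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(new_id + "\n")
    return new_id
```

The cache only helps if `cache_path` survives restarts, which is not guaranteed on a container's ephemeral filesystem.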

@mx-psi
Member

mx-psi commented Dec 7, 2023

cc @svrnm @sumo-drosiek @mwear (since you have worked on either spec or implementation of host.id on non-containerized systems)

@svrnm
Member

svrnm commented Dec 7, 2023

I remember reading the machine-id man file, but missed that section about confidentiality. I don't see a reason given for that?

If I understand the UUIDv5 definition correctly, it is equivalent to a "keyed hash", so doing something similar to #312 appears to be a logical solution. The question is what the "namespace" should be (assuming the name is the machine-id). I guess a fixed value is best, since the goal here is to have a "unique identifier that is tied to the machine".
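As a sketch of the fixed-namespace idea, assuming the machine-id is the UUIDv5 name: `HOST_ID_NAMESPACE` below is a made-up value derived from an example name, not something defined by OpenTelemetry or by #312.

```python
import uuid

# Hypothetical fixed namespace, derived from an example name for illustration.
HOST_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "host-id.example.invalid")


def host_id_from_machine_id(machine_id: str) -> str:
    """Map a machine ID to a UUIDv5 under the fixed namespace."""
    name = machine_id.strip().lower()
    return str(uuid.uuid5(HOST_ID_NAMESPACE, name))
```

UUIDv5 is SHA-1 over the namespace bytes followed by the name, so the namespace acts as a fixed prefix and not as a secret key. Whether that satisfies the manpage's "keyed hash" wording is a judgment call.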

A secondary issue is that OpenTelemetry guidelines do not specify a fallback for host.id in cases where the machine-id is missing. For example, this is quite common in containerized environments if the Docker volume mount does not exist. In such cases, UUIDv4 can be used to generate a value that may be cached by the client application and reused (for as long as it makes sense given a context that's specific to the application).

This is a topic that drove me crazy. So, to get started: my perspective is that host.id is simply not defined within a container that does not provide its own machine-id, and it's probably best not to have it at all. That's also why the spec says "For non-containerized systems, this should be the machine-id".

In the case of a container, the container.id is what you want to set. But there is no reliable way to obtain it from within the container, see containerd/containerd#8185

@mx-psi
Member

mx-psi commented Dec 7, 2023

We discussed this on the System Semantic Conventions WG, let me try to summarize what I said (plus some last-minute thoughts).

My main points were:

  • Having a unique identifier of a host (let's assume a well-defined concept of a host, even if that's its own can of worms :) ) is a very useful thing to have for correlation and infrastructure monitoring generally. I have worked on this extensively at Datadog, where (roughly as happens in OTel) we rely on multiple sources and have multiple implementations. At Datadog, we typically do not use machine-id but rather use other sources that can be more meaningful to end-users or can be more easily retrieved (e.g. operating system hostname, EC2 instance id, Azure VM id, Kubernetes node and cluster name...).
  • Despite their usefulness, there are times when there is no meaningful unique identifier. This is the case for containerized systems or certain virtualized environments: container IDs are hard, /etc/machine-id may be empty (as happens on many container base images), or you may not have access to the hostname, or it may be a random one (also often the case on containers!). This has been an issue for the OTel project in the past (see "[processor/resourcedetection] system detector sets host.id to an empty value on containerized setups", opentelemetry-collector-contrib#24230).
  • Generating a random identifier when you cannot get one is fragile, in that restarts of the monitoring application (e.g. the Collector) or of the container churn out new IDs. This can be a huge issue because of cardinality explosion, and it is also a problem we have had on OTel (see "[resourcedetectionprocessor]: add host.id to system detector", opentelemetry-collector-contrib#18618 (comment), for a user report; see "Additional context" in the previously mentioned issue for even more details on how that happened). It's also hard (impossible?) to reliably tell that you are in a container, so you cannot (to my knowledge!) generate the ID and store it in a file only when you know it's going to be robust across restarts.
  • If this is really a security concern, we shouldn't be adding it in the OTel SDKs or the OTel Collector. We should instead hash it or apply some other mitigation. How to do this is somewhat arbitrary, so we should look at what other monitoring solutions do here and follow their approach to hashing.

@svrnm
Member

svrnm commented Dec 8, 2023

Based on your summary, the subject of this issue (Evolve host.id into a default, always present attribute) seems not to be something that can be accomplished (at least from within the OpenTelemetry community alone)?

A few extra comments:

Having a unique identifier of a host [...] is a very useful thing to have for correlation and infrastructure monitoring generally.

💯 -- this is really a BIG concern, coming up again and again. In a large environment, when something goes wrong but only one (or a subset) of the instances within a service is affected, it's crucial to know which one(s). As you outlined above, this is already not trivial for a long-lived instance (bare metal, VMs, ...) and gets even more complicated with ephemeral instances (containers, ...). That's also why I raised containerd/containerd#8185, which has a history of similar issues attached to it (see opencontainers/runtime-spec#1105). @mx-psi if there is any value to it, I can eventually give a rundown of this to the System Semantic Conventions WG.

It's also hard (impossible?) to reliably tell that you are in a container [...]

To call this out: even if the issue above with containers gets solved somehow, it will very likely stay optional, as it's in the nature of being in a container to not know about being in a container (by default).

@christos68k
Member Author

christos68k commented Dec 8, 2023

Given that we have no guidelines for containerized environments, does it make sense to add a containerized section (or generify the non-containerized one) and specify that the machine-id should be used if it is mapped inside the container? This is quite common through the volume mount and even works on Docker/macOS even though /etc/machine-id doesn't exist on the host:

macos$ cat /etc/machine-id
cat: /etc/machine-id: No such file or directory

macos$ docker run --rm -ti alpine /bin/sh
/ # cat /etc/machine-id
cat: can't open '/etc/machine-id': No such file or directory

macos$ docker run -v /etc/machine-id:/etc/machine-id --rm -ti alpine /bin/sh
/ # cat /etc/machine-id
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Does it make sense to further specify that in cases where machine-id is missing or empty, host.id may be populated with a 'stable' low-cardinality value and leave the implementation up to the SDK/user? An example of such a value could be a MAC address, possibly hashed with another id such as the hostname.
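One way such a value could be computed, as a sketch only: hash the MAC address together with the hostname. The function name and the `|` separator are arbitrary choices made here. Note that `uuid.getnode()` falls back to a random 48-bit number with the multicast bit set when it cannot find a hardware address, so the sketch rejects that case instead of returning an unstable value.

```python
import hashlib
import socket
import uuid


def mac_based_host_id(node=None, hostname=None):
    """Return SHA-256(mac|hostname) as hex, or None if no real MAC is found."""
    node = uuid.getnode() if node is None else node
    # Multicast bit (lowest bit of the first octet) set means getnode()
    # could not find a hardware address and returned a random number.
    if (node >> 40) & 0x01:
        return None
    hostname = socket.gethostname() if hostname is None else hostname
    material = f"{node:012x}|{hostname}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

Inside a container both inputs usually describe the container's own network namespace, not the underlying host, so the value is only as stable as those two inputs.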

@svrnm
Member

svrnm commented Dec 11, 2023

Given that we have no guidelines for containerized environments, does it make sense to add a containerized section (or generify the non-containerized one)

Under the assumption that a container is identified by container.id, what is the value of having host.id as well? I am also wondering if host.id in that case is even well defined: a container is running within a container engine, that is running on a "host system", that may have a host.id itself. If I now want to say "container with ID x is running on host with ID y", what is the right host.id?

I think the implicit definition (so far), was that host.id is not set within a container.

@christos68k
Member Author

christos68k commented Apr 3, 2024

Resurrecting this thread:

Under the assumption that a container is identified by container.id what is the value of having host.id as well?

These are different attributes with different lifetimes and different semantics. The value of host.id stems from it changing less often over time than container.id, which allows for stable and meaningful correlation at the host level.

I am also wondering if host.id in that case is even well defined: a container is running within a container engine, that is running on a "host system", that may have a host.id itself. If I now want to say "container with ID x is running on host with ID y", what is the right host.id?

Going by the proposed updates in #576, in priority order:

  1. instance_id assigned by cloud provider (if cloud)
  2. machine-id (if mapped inside the container)
  3. Something else that we might or might not want to make explicit (e.g. low-cardinality computed value based on host attributes such as the MAC address)
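The priority order above can be sketched as a chain of detectors tried in sequence. The detector functions in the usage note are placeholders, not existing OpenTelemetry APIs.

```python
from typing import Callable, Optional, Sequence

# A detector returns a host ID string, or None/"" when it has nothing to offer.
Detector = Callable[[], Optional[str]]


def resolve_host_id(detectors: Sequence[Detector]) -> Optional[str]:
    """Return the first non-empty value, trying detectors in priority order."""
    for detect in detectors:
        value = detect()
        if value:  # skip None and empty strings
            return value
    return None  # caller should leave host.id unset, not set it to ""
```

A caller would pass something like `[cloud_instance_id, mapped_machine_id, computed_fallback]` (all hypothetical names), matching items 1 to 3.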

I think the implicit definition (so far), was that host.id is not set within a container.

To give you an example, our universal profiling product comes with deployment instructions that map machine-id inside the container, and host.id is populated from it. This enables stable correlation across thousands of deployed agents that would otherwise not be possible. Enabling this volume mount is not uncommon and even Docker on macOS supports it.

To recap:

  1. I think we are all in agreement regarding hashing the machine-id value and not using it verbatim (e.g. see Define a common algorithm for service.instance.id #312)
  2. Can we agree that there is value in populating host.id inside containers, if a low-cardinality and stable value (e.g. machine-id) is available? That is to both encourage clients that can (or already) do this, but also to ensure that this behavior is not breaking the spec.

@svrnm
Member

svrnm commented Apr 8, 2024

Can we agree that there is value in populating host.id inside containers

Yes, it is valuable, if and only if host.id is the id of the container host.

(e.g. machine-id) is available

Very often machine-id is not available from within the container, and mounting it is also not possible/allowed (e.g. in managed environments). For the same reasons, it may not be desirable to expose it verbatim into the container. Additionally, the container may itself create a machine-id on initialization, and you need to distinguish that case (maybe you can check from within the container whether that file is mounted or not?).
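On Linux, one possible way to make that check is to look for /etc/machine-id among the mount points listed in /proc/self/mountinfo, where a bind-mounted file appears as its own entry. This is a sketch with hypothetical function names. A match shows only that something is mounted at that path, not that the source is the host's real machine-id.

```python
def is_mount_point(path: str, mountinfo_text: str) -> bool:
    """Return True if path appears as a mount point in mountinfo content."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Field 5 (index 4) of each mountinfo line is the mount point.
        if len(fields) > 4 and fields[4] == path:
            return True
    return False


def machine_id_is_mounted(mountinfo_path: str = "/proc/self/mountinfo") -> bool:
    """Check whether /etc/machine-id is a separate mount in this process's view."""
    try:
        with open(mountinfo_path, encoding="utf-8") as f:
            return is_mount_point("/etc/machine-id", f.read())
    except OSError:
        return False  # not Linux, or /proc is unavailable
```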

@mx-psi
Member

mx-psi commented Apr 8, 2024

Can we agree that there is value in populating host.id inside containers, if a low-cardinality and stable value (e.g. machine-id) is available? That is to both encourage clients that can (or already) do this, but also to ensure that this behavior is not breaking the spec.

I think there is value, but the discussion feels a bit theoretical to me. Other than the user passing it explicitly via an environment variable or some other sort of convention, I can't think of a way to reliably retrieve this value on a container and not fall prey to issues like open-telemetry/opentelemetry-collector-contrib#18618 (comment)
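For the explicit route, the standard OTEL_RESOURCE_ATTRIBUTES environment variable already carries comma-separated key=value pairs, so an operator can set `host.id` there. OpenTelemetry SDKs generally read this variable themselves. The sketch below, with a hypothetical function name, only illustrates the mechanism.

```python
import os
from urllib.parse import unquote


def host_id_from_env(environ=None):
    """Return host.id from OTEL_RESOURCE_ATTRIBUTES, or None if not set."""
    environ = os.environ if environ is None else environ
    raw = environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip() == "host.id":
            # Values may be percent-encoded; treat an empty value as unset.
            return unquote(value.strip()) or None
    return None
```

For example, a deployment could export `OTEL_RESOURCE_ATTRIBUTES=host.id=node-42` from whatever source the operator trusts.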

@svrnm
Member

svrnm commented Apr 9, 2024

I can't think of a way to reliably retrieve this value on a container

Unfortunately, I haven't had the time to follow up on this, but I still think that our best option is working with container projects to get a standardized (even if optional) way of making a container and its container host identifiable from within the container; see containerd/containerd#8185

florianl added a commit to elastic/apm that referenced this issue Dec 12, 2024
…ptional

`host.id` is not a well-defined or uniquely defined attribute, see open-telemetry/semantic-conventions#581 for example. In particular, in containerized environments profiling agents see a different `host.id` than APM agents, which makes it harder to correlate information.
To be able to correlate profiling and APM information, `container.id` was identified as the best fit for the use case, as both profiling and APM agents already collect and send out `container.id` with their respective data. For non-containerized environments `host.id` can still be used, and in such use cases profiling agents and APM agents should have the same understanding of `host.id`.

For backwards compatibility reasons just make the argument for `host-id` in the registration message optional.