Document pre-requisites for running Elastic Agent in unprivileged mode #4705

Closed
12 tasks done
ycombinator opened this issue May 8, 2024 · 16 comments · Fixed by elastic/ingest-docs#1087

Comments

@ycombinator
Contributor

ycombinator commented May 8, 2024

Background

Traditionally, privileged users (e.g. root on Linux) run Elastic Agent on a host. However, with #3598, #4362, #4264, and other follow-up PRs, it is now possible to run Elastic Agent with an unprivileged user.

Problem statement

Running Agent as an unprivileged user has consequences. Not only does the Agent itself run as an unprivileged user, but so do the process components it orchestrates, e.g. the various Beats. Consequently, any integrations being handled by such components, e.g. system, might not have the necessary access on the host to collect all the data they can when running as a privileged user. The result is that users do not see data they might be expecting in these integrations' dashboards. Some examples of this situation are:

Similarly, users might encounter other issues related to installing or running Elastic Agent in unprivileged mode. Some examples of this situation are:

Definition of done

Let's use this issue to collect any pre-requisites a user must perform to install and run Elastic Agent in unprivileged mode, as well as any other gotchas they might run into when using the system integration with an Elastic Agent running in unprivileged mode.

For each pre-requisite let's capture the following information:

  1. What steps does the user need to take as a prerequisite to running Elastic Agent in unprivileged mode?
  2. What would the impact be if these prerequisite steps were not taken? Or, put differently, what functionality is enabled as a result of taking these prerequisite steps?
  3. What symptoms (e.g. errors) will the user observe and where if these prerequisite steps were not taken?

MacOS


Linux


Windows

@ycombinator ycombinator added documentation Improvements or additions to documentation Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 8, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ycombinator
Contributor Author

cc: @kilfoyle

@kilfoyle
Contributor

Thanks for opening this @ycombinator! I like the organization. Once we have all the pre-requisites info I can add a table into the docs mapping each function to the pre-requisite(s) associated with using it in unprivileged mode.

I'm thinking that we can also have a troubleshooting section with something like:

  • a list of possible error messages
  • for each message, a brief explanation of what it indicates: i.e., that the prerequisite for running Function X in unprivileged mode isn't satisfied
  • a link back to the "function to pre-requisites" table mentioned above

And thanks @kaanyalti for taking this one on!

@ycombinator
Contributor Author

In #4125 (comment), @kilfoyle said:

@kaanyalti I think the "pre-requisites and gotchas" could go in tables like these, but we can update the format once the list becomes more clear.

@kaanyalti
Contributor

kaanyalti commented Jun 7, 2024

Mac tests

  • agent version 8.14.0
  • Production ESS 8.14
  • Create agent policy with system integration
  • Enroll in fleet in unprivileged mode
  • logs
    Image
[elastic_agent.filebeat][error] Harvester could not be started on new file: /var/log/system.log, Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: Failed opening /var/log/system.log: open /var/log/system.log: permission denied

Found this open issue related to these logs: elastic/beats#39733

Note

Giving read permission to the elastic-agent group for the /var/log/system.log file fixes this error.
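
One way to confirm that a fix like this took effect is to inspect the file's group and mode bits. The Python sketch below is only an illustration and is not part of Elastic Agent. The helper names `group_can_read` and `check_log_readable` are made up for this example, and the group name `elastic-agent` is taken from the note above. It looks only at the classic POSIX group-read bit, so access granted through ACLs or through the "other" bits is not detected.

```python
import os
import stat


def group_can_read(mode: int, file_gid: int, group_gid: int) -> bool:
    """Return True if the file belongs to `group_gid` and its group-read bit is set.

    Only classic POSIX mode bits are considered; ACLs are ignored.
    """
    return file_gid == group_gid and bool(mode & stat.S_IRGRP)


def check_log_readable(path: str, group_name: str = "elastic-agent") -> bool:
    """Check whether `group_name` has group-read access to `path` (Unix only)."""
    import grp  # Unix-only module, imported lazily so the pure helper works anywhere

    st = os.stat(path)
    # getgrnam raises KeyError if the group does not exist on this host.
    group_gid = grp.getgrnam(group_name).gr_gid
    return group_can_read(st.st_mode, st.st_gid, group_gid)
```

For example, `check_log_readable("/var/log/system.log")` would be expected to return `True` once the group has been given read access through the file's group ownership and mode.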

Dashboards:

[Logs System] Syslog dashboard

Privileged

Image

Unprivileged

Image

Note

Giving read permission to the elastic-agent group for the /var/log/system.log file fixes the discrepancy between the two dashboards

[Metrics System] Host overview

Privileged

Image

Unprivileged

Image

Agent doesn't seem to have all the processes listed in the CPU and memory usage lists. It looks like only the processes run by the elastic-agent-user user are shown in those lists. To validate this, I ran a CPU stress test twice, once as elastic-agent-user and once as the logged-in user. I used the stress package to run the tests.

Logged in user

Ran stress -c 2 -t 60 and observed the host overview dashboard. As can be seen in the image below, CPU usage went up; however, the process is not listed.

Image

elastic-agent-user

Ran sudo -u elastic-agent-user stress -c 2 -t 60 and observed a similar increase in CPU usage. This time the process is listed.

Image

This confirms that only the processes run by the elastic-agent-user user are shown in the CPU and memory usage lists.

Important

Please note that the cpu usage panel showed changes in the usage regardless of the user running the test.

[Elastic Agent] Agent Info

Privileged

Image

Unprivileged

Image

Note

Giving read permission to the elastic-agent group for the /var/log/system.log file fixes the discrepancy between the two dashboards
Related: #4675 (comment)

[Elastic Agent] Integrations

Privileged

Image

Unprivileged

Image

Note

Giving read permission to the elastic-agent group for the /var/log/system.log file fixes the discrepancy between the two dashboards

Linux tests

  • agent version 8.14.0
  • Production ESS 8.14
  • Use the same policy used for mac testing
  • Install agent in fleet mode, unprivileged
  • check the integration
  • log
    • Again the same harvester error
[elastic_agent.filebeat][error] Harvester could not be started on new file: /var/log/auth.log.1, Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: Failed opening /var/log/auth.log.1: open /var/log/auth.log.1: permission denied

Note

Giving read permission to the elastic-agent group for the /var/log/syslog and /var/log/auth.log files fixes this error.

  • metrics
    • different failure compared to what we ran into for mac
     [elastic_agent.metricbeat][error] error getting filesystem usage for /run/user/1000/gvfs: error in Statfs syscall: permission denied
    

We can't really give elastic-agent-user read access to files in the /run/user/1000/ directory
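
To see which mount points produce this error for a given account, the same kind of call can be attempted from that account. The following is a minimal Python sketch, assuming a Unix host. `unreadable_mounts` is a hypothetical helper, and `os.statvfs` stands in for the Statfs syscall named in the log message (a closely related call, not necessarily the identical one).

```python
import os
from typing import Callable, Iterable, List, Optional


def unreadable_mounts(
    paths: Iterable[str],
    statvfs: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Return the paths for which the filesystem stat call is denied."""
    if statvfs is None:
        statvfs = os.statvfs  # Unix only
    denied = []
    for path in paths:
        try:
            statvfs(path)
        except PermissionError:
            denied.append(path)
        except OSError:
            # Other failures (e.g. a path that no longer exists) are not
            # permission problems, so they are skipped here.
            continue
    return denied
```

Running something like `unreadable_mounts(["/", "/run/user/1000/gvfs"])` as elastic-agent-user (for instance via `sudo -u elastic-agent-user`) should list the paths that user cannot stat.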

Dashboards:

[Logs System] Syslog dashboard

Privileged

Image

Unprivileged

Image

Giving read permission to the elastic-agent group for the /var/log/syslog and /var/log/auth.log files fixes the discrepancy between the two dashboards

[Elastic Agent] Agent Info

Privileged

Image

Unprivileged

Image

Giving read permission to the elastic-agent group for the /var/log/syslog and /var/log/auth.log files fixes the discrepancy between the two dashboards, although not completely. There are still errors because unprivileged elastic agent tries to access files in /run/user/1000

[Elastic Agent] Integrations

Privileged

Image

Unprivileged

Image

Giving read permission to the elastic-agent group for the /var/log/syslog and /var/log/auth.log files fixes the discrepancy between the two dashboards

@VihasMakwana
Contributor

VihasMakwana commented Jun 20, 2024

Windows

Without sufficient permissions, I faced the following errors and almost nothing was showing up on dashboards (as expected).

  • failed to open Windows Event Log channel "Security": Access is denied
    • Fixed this by adding user to Event Log Users group
  • cannot open new key in the registry in order to enable the performance counters: Access is denied
    • Fixed by updating permissions for HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PartMgr registry.
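
The errors and fixes collected so far in this thread could be arranged as a lookup from log message to prerequisite, along the lines of the troubleshooting section proposed earlier. This is a rough Python sketch, not an existing tool. The `KNOWN_ERRORS` table and the `diagnose` function are illustrative names, and the substrings are taken from the log lines quoted in the comments above.

```python
from typing import Optional

# (substring to look for in a log line, remediation reported in this thread)
KNOWN_ERRORS = [
    (
        "Harvester could not be started on new file",
        "Give the elastic-agent group read permission on the log file named in the error.",
    ),
    (
        "error in Statfs syscall: permission denied",
        "No fix identified in this thread for paths under /run/user/1000/.",
    ),
    (
        'failed to open Windows Event Log channel "Security"',
        "Add the user to the Event Log Users group.",
    ),
    (
        "cannot open new key in the registry in order to enable the performance counters",
        r"Update permissions on HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PartMgr.",
    ),
]


def diagnose(log_line: str) -> Optional[str]:
    """Return the remediation for the first known error found in `log_line`, if any."""
    for needle, remedy in KNOWN_ERRORS:
        if needle in log_line:
            return remedy
    return None
```

A plain substring match keeps the table easy to extend as more errors are reported; anything not in the table simply returns `None`.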

@jlind23
Contributor

jlind23 commented Jun 20, 2024

@ycombinator @pierrehilbert now that @VihasMakwana and @kaanyalti tested all the different combinations (unless I am missing something) what would be the next step here? Should @VihasMakwana start writing down all the settings that need to be changed when running unprivileged?

@VihasMakwana
Contributor

Windows

Dashboards after giving sufficient permissions as per this comment

  • [Elastic Agent] Overview

 Unprivileged 
Screenshot 2024-06-20 at 7 33 51 PM
 
 Privileged
 Screenshot 2024-06-20 at 5 29 25 PM 
  
  • [Elastic Agent] Agent metrics

 Unprivileged
Screenshot 2024-06-20 at 8 06 13 PM
 
 Privileged 
Screenshot 2024-06-20 at 5 30 40 PM 
Screenshot 2024-06-20 at 5 30 50 PM
  • [Elastic Agent] Input Metrics

 Unprivileged
Screenshot 2024-06-20 at 8 07 24 PM
 Privileged 
Screenshot 2024-06-20 at 5 31 46 PM
 
  • [Elastic Agent] Filestream Input Metrics

 Unprivileged
Screenshot 2024-06-20 at 8 10 43 PM
 
 Privileged 
Screenshot 2024-06-20 at 5 32 34 PM 
  • [Elastic Agent] Info

 Unprivileged
Screenshot 2024-06-20 at 8 13 27 PM
 
 Privileged
Screenshot 2024-06-20 at 5 35 00 PM
 
  • System Overview

 Unprivileged
Screenshot 2024-06-20 at 8 14 12 PM
 
 Privileged
Screenshot 2024-06-20 at 5 35 33 PM
 
  • Host Overview

 Unprivileged
Screenshot 2024-06-20 at 5 36 05 PM
Screenshot 2024-06-20 at 8 20 28 PM
Screenshot 2024-06-20 at 8 20 39 PM

 

Note

Here we can see that elastic-agent-user doesn't have access to all processes. I had a VS Code window running.
It doesn't show up in unprivileged mode, but it does show up with privileged access.

 Privileged
Screenshot 2024-06-20 at 8 15 13 PM
Screenshot 2024-06-20 at 5 36 18 PM
Screenshot 2024-06-20 at 5 36 36 PM

 

@VihasMakwana
Contributor

VihasMakwana commented Jun 20, 2024

@ycombinator @blakerouse @pierrehilbert @cmacknz

There's one particular error showing up after giving all necessary privileges.

  • Could not return any performance counter values for \\.\C: .Error: Access is denied.
  • This is triggered here when it tries to open a volume for a later DeviceIoControl call.

It tries to open a volume (not the filesystem) to fetch performance counters. As per this Microsoft doc,

Direct access to the disk or to a volume is restricted.

We can fix this error by giving administrative privileges to our unprivileged user. But isn't it the very thing we're trying to avoid?

How should we tackle this?

cc: @jlind23

@VihasMakwana
Contributor

VihasMakwana commented Jun 20, 2024

NOTE
This error is not fatal; it only affects the "Disk Usage" section in the "Host Overview" dashboard.
The system.diskio data stream is the only thing missing.

Screenshot 2024-06-20 at 8 54 32 PM Screenshot 2024-06-20 at 8 54 43 PM

@ycombinator
Contributor Author

@ycombinator @pierrehilbert now that @VihasMakwana and @kaanyalti tested all the different combinations (unless I am missing something) what would be the next step here? Should @VihasMakwana start writing down all the settings that need to be changed when running unprivileged?

Once @VihasMakwana is done working through all the scenarios (== all checkboxes in the Definition of Done are checked), we will end up with a bunch of comments in this issue going over the scenarios and what changes had to be done to make it work in unprivileged mode. At that point, we can close this issue.

@kilfoyle is already aware of this issue and is going to port over the comments into proper documentation.

@jlind23
Contributor

jlind23 commented Jun 20, 2024

I would be happy to get @ycombinator and @cmacknz opinion on #4705 (comment) but worst case it can be documented as a known limitation.

@ycombinator
Contributor Author

I would be happy to get @ycombinator and @cmacknz opinion on #4705 (comment) but worst case it can be documented as a known limitation.

There's one particular error showing up after giving all necessary privileges.
Could not return any performance counter values for \\.\C: .Error: Access is denied.

I assume this happens when you use the system integration on Windows? Is it for any specific dataset with the system integration?

Where are you seeing this error — in the Agent logs? Or somewhere else? Does it occur once or frequently? Also, are there any other, more visible, symptoms as a result, e.g. an empty dashboard or some message in the Fleet UI?

Where I'm going with these questions is: maybe if Agent knows it's running in unprivileged mode, it could perhaps not even try to access the volume rather than emitting the error (especially if we're emitting this error frequently). I realize the code for accessing the volume is buried quite deep so this would mean passing down the necessary information to that level.
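
As a rough illustration of that idea (and not how Agent or Metricbeat is actually structured), the check could look something like the Python sketch below. Every name here is hypothetical, including `collect_diskio`, `read_volume_counters`, and the `unprivileged` flag. The real code is in Go and would need the privilege information passed down to the level where the volume is opened.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("diskio-sketch")
_skip_logged = False  # remembers whether the skip message was already emitted


def collect_diskio(
    unprivileged: bool,
    read_volume_counters: Callable[[], dict],
) -> Optional[dict]:
    """Skip the volume read when unprivileged, logging the reason only once."""
    global _skip_logged
    if unprivileged:
        if not _skip_logged:
            logger.info("skipping disk I/O counters: volume access needs privileged mode")
            _skip_logged = True
        return None
    return read_volume_counters()
```

The point of the sketch is that the volume is never opened in unprivileged mode, so the repeated "Access is denied" error would be replaced by a single informational message.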

We can fix this error by giving administrative privileges to our unprivileged user. But isn't it the very thing we're trying to avoid?

Indeed. In this case, the "fix" is to run Elastic Agent in privileged mode. So I would definitely document the symptoms and mention that the observed behavior is expected in unprivileged mode.

@VihasMakwana
Contributor

I assume this happens when you use the system integration on Windows? Is it for any specific dataset with the system integration?

This specifically happens for system.diskio dataset.

Where are you seeing this error — in the Agent logs? Or somewhere else? Does it occur once or frequently? Also, are there any other, more visible, symptoms as a result, e.g. an empty dashboard or some message in the Fleet UI?

Yes, this is in agent logs. It occurs frequently and the frequency depends on the period config.
There is an empty section in "Host Overview" dashboard here.

Where I'm going with these questions is: maybe if Agent knows it's running in unprivileged mode, it could perhaps not even try to access the volume rather than emitting the error (especially if we're emitting this error frequently). I realize the code for accessing the volume is buried quite deep so this would mean passing down the necessary information to that level.

I'm doing research on this part. I'll open a separate issue to track this scenario.

@kilfoyle
Contributor

Thanks @kaanyalti and @VihasMakwana for the super clear guidance around these limitations!

@ycombinator I've tried to capture everything as part of the Add steps and details for running 'unprivileged' Elastic Agent PR. A preview of the limitations is available in the Agent and dashboard behaviors in unprivileged mode section of the new page.

Please let me know whatever may need fixing up. :-)
