This repository has been archived by the owner on Feb 3, 2023. It is now read-only.

Host Monitoring Not Enabling #365

Open
cidermark opened this issue Oct 6, 2021 · 18 comments

@cidermark

Hi Guilhem,

I'm not sure if this is a bug or a misconfiguration, but I'm trying to enable monitoring on a large number of hosts. Of the 112 hosts, 67 remain enabled but the other 45 revert to 'disabled' after the 5-minute refresh.

Is there some kind of log that I can look at to help diagnose the issue?

Cheers,
Mark.

@guilhemmarchand guilhemmarchand self-assigned this Oct 6, 2021
@guilhemmarchand guilhemmarchand added the question Further information is requested label Oct 6, 2021
@guilhemmarchand
Owner

Hi Mark,

Hum right, there are a few conditions where a host can be disabled automatically:

  1. trackme_auto_disablement_period

There is a macro, which you can check out in the UI "TrackMe Manage and configure", the default definition says the following:

relative_time(now(), "-45d")

By default, this means that if a given data source has not received any data for longer than this period (45 days), the entity is automatically disabled.
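
To make this first condition concrete, here is a minimal Python sketch of the check, run outside of Splunk. It only illustrates the rule described above; the function name and the sample dates are made up for the example, and this is not TrackMe's actual code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 45 days mirrors the default macro definition relative_time(now(), "-45d").
AUTO_DISABLEMENT_PERIOD = timedelta(days=45)


def would_auto_disable(last_seen: datetime, now: datetime) -> bool:
    """Return True when the last data seen is older than the cutoff."""
    cutoff = now - AUTO_DISABLEMENT_PERIOD
    return last_seen < cutoff


# Hypothetical sample dates, for illustration only.
now = datetime(2021, 10, 6, tzinfo=timezone.utc)
recent = datetime(2021, 10, 5, tzinfo=timezone.utc)  # data seen the day before
stale = datetime(2021, 8, 1, tzinfo=timezone.utc)    # no data for about two months

print(would_auto_disable(recent, now))  # False
print(would_auto_disable(stale, now))   # True
```

If the macro has been changed in "TrackMe Manage and configure", the period in the sketch would need to be adapted accordingly.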

  2. custom REST call action

One could set up a custom alert action using the TrackMe REST API:

https://trackme.readthedocs.io/en/latest/userguide.html#alerts-tracking-trackme-alert-actions

Basically, someone could have set up an action to disable the host automatically.

Note that the same thing could be achieved from the outside using a REST call.

In both cases, this would get tagged on the audit collection and the audit changes.

  3. custom report of yours

One could also have custom logic, such as a scheduled report, that updates the KVstore collection records directly.

In any case:

  • If TrackMe did it, there should be traces in the flipping status for that host. What does it look like?

  • Likewise, when the entity gets disabled by an action coming from TrackMe, this is logged in the audit changes UI.

Let me know if that makes sense

@cidermark
Author

Hi Guilhem,
Thanks for the rapid response.

  1. I haven't changed this value from the installation default.
  2. I've not really done anything with the REST API
  3. I'm the only one looking/using/configuring TrackMe and I haven't created any custom reports - only enabled one of the default Alerts.

I had a look at the 'Flipping' status and the ones that keep getting reset look very different from the others:
image

(not sure if that image is viewable or not)

Cheers,
Mark

@guilhemmarchand
Owner

Yes the screenshot is visible.

Hum, this looks weird. It seems to indicate that this host is being discovered over and over again.

Some questions then:

I would recommend being restrictive enough on the data hosts to start in good conditions. Data host tracking tends to pick up too much crap data, and then it's hard to have a good view.

So, I recommend generally to:

  • Add a few indexes to the allow list in data host monitoring to start with: indexes you have qualified as real indexes containing endpoint-related data (avoid, for example, proxy data, where the host value is in fact not one of your endpoints)

  • After you have added a first index to the allow list, reset the data host collection

  • From there on, you will not need to reset the collection again

What does the record look like?

| inputlookup trackme_host_monitoring | eval keyid=_key
| search data_host="xxx"

Can you try to delete the host from the UI, then run the tracker a few times to see how it behaves?

One possibility is that you have lots of crap in there: a very large number of hosts, each with a very large number of sourcetypes, etc.
For data hosts you need to be restrictive and properly qualify what to include.

@guilhemmarchand
Owner

@cidermark
Author

cidermark commented Oct 6, 2021

Hi there,
The hosts that I'm looking at here are all Splunk infrastructure (in this case, they are HFs) so, I guess, it's the _internal index that is being monitored. There don't appear to be any issues with the _internal logs.

  1. I've attached the output from the search. I also included 3 other hosts that are correctly being monitored. It's the last one in the list that has the problem.

  2. As suggested, I deleted the host and have run the short term and long term trackers several times. Unfortunately, the host hasn't been rediscovered :(

host.monitoring.results.csv

Mark

@guilhemmarchand
Owner

@cidermark

When you deleted the host, did you use permanent deletion or temporary deletion?
If it was permanent, the host won't come back on its own.

You can check your action in the audit change tab

As well:

  • Do you have anything in allow list for data hosts?
  • If you do, does it include the _internal index?

Guilhem

@cidermark
Author

@guilhemmarchand

I did a temporary deletion and ran both short and longterm trackers several times with the same results. I did then try a permanent delete :(

With regards to the allow/block lists - all are at the default installation settings. I haven't added or removed anything from those.

Mark.

@guilhemmarchand
Owner

Hi @cidermark

When you delete a host through the UI, this creates a deletion record in the audit changes, for example:

| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL"

image

To allow the host to be re-created, you can update this record, for example:

| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL" AND key="615ecfddeb20813a9e41894f"
| eval change_type="delete temporary"
| outputlookup append=t key_field=key trackme_audit_changes

Then, when running the tracker the host can be re-created if the data allow it.

Now, if your host still is not created, you can start from this search:

| savedsearch "TrackMe - Data hosts abstract root tracker"

| search data_host="EVENTGEN.RETAIL"

And check what is going on, you can expand the search and go step by step to understand why it wouldn't be created.
This savedsearch is called by both trackers

@cidermark
Author

Hi @guilhemmarchand

I followed the advice to re-add the server and it's back in the list.

I re-enabled monitoring but, sadly, it still reverts back to not monitored after 5 minutes :(

Mark.

@guilhemmarchand
Owner

@cidermark

Right, ok so now that it's back in the collection let's continue.
You are basically saying that the field data_monitored_state reverts to disabled on its own. Hum, there must be a reason.
Let me think about it and share a few searches to troubleshoot

@guilhemmarchand
Owner

@cidermark

In my previous message I was showing this:

| savedsearch "TrackMe - Data hosts abstract root tracker"

| search data_host="EVENTGEN.RETAIL"

Adapt this to your own case, then run this command over the last 4 hours for instance, and expand the search.

You will get quite a large search; some parts of the code deal with data_monitored_state:

image

image

While comparing these with yours, do you see anything special?

  1. local config

Can you please check in:

/opt/splunk/etc/apps/trackme/local/

And check any local config files you have, especially savedsearches.conf and macros.conf. Is there anything in there?

@guilhemmarchand
Owner

Hi @cidermark

Let me know if you have any update ;-)

@cidermark
Author

Hi @guilhemmarchand ,

I'll get on to this as soon as I can but I'm away from my computer this week. Hopefully I'll be able to take a look tomorrow.

Mark

@guilhemmarchand
Owner

No problem @cidermark just wanna make sure we don't leave that out.
If you keep struggling on this one then we could have some live chat and check this together.

Guilhem

@cidermark
Author

Hi @guilhemmarchand
Sorry it took me a while to get back to you - this is my first day back!
I ran the (slightly modified) search as requested and noticed that the servers that are not staying enabled have quite a number of missing fields compared to the ones that are enabled, e.g. data_host_alerting_policy, data_previous_host_state, enable_behaviour_analytic, priority - 18 fields in all.

image
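
For anyone wanting to reproduce this kind of comparison offline, a minimal Python sketch is shown here. It assumes two records exported from the search results as dictionaries; the host names and field values are hypothetical, and only the field names quoted in this thread are real.

```python
def missing_fields(healthy: dict, problem: dict) -> list:
    """List fields that have a value in the healthy record but not in the problem one."""
    return sorted(
        name
        for name, value in healthy.items()
        if value not in (None, "") and problem.get(name) in (None, "")
    )


# Hypothetical records: host names and values are invented for illustration.
healthy = {
    "data_host": "hf-good",
    "data_monitored_state": "enabled",
    "data_host_alerting_policy": "global_policy",
    "priority": "medium",
}
problem = {
    "data_host": "hf-bad",
    "data_monitored_state": "disabled",
    "priority": "",
}

print(missing_fields(healthy, problem))  # ['data_host_alerting_policy', 'priority']
```

With a CSV export of the results, the two dictionaries could be read with `csv.DictReader` from the standard library instead of being typed by hand.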

I couldn't find anything especially notable in the local directory - just a modified macros.conf and savedsearches.conf

Does this give any better insight as to what the problem may be?

Again, thanks for your help,
Mark

@cidermark
Author

Hi @guilhemmarchand - any thoughts on my response?

Cheers,
Mark.

@guilhemmarchand
Owner

Hi @cidermark

Thanks for the reminder ;-)
Yes, but as such it doesn't really allow me to understand the issue.

One potential root cause, I think, is the search breaking because of a far too large number of sourcetypes for some number of hosts.

This can happen with bad practices such as dynamic sourcetyping. Can you run the following?

This search shows the records with the largest sourcetype summaries in the collection:

| inputlookup trackme_host_monitoring | eval keyid=_key
| eval len=len(data_host_st_summary)
| sort limit=0 - len
| table data_host, data_host_st_summary, *

The same thing can be checked against the data itself:

| tstats count as data_eventcount where sourcetype=* host=* host!="" `trackme_tstats_main_filter` ( ( `trackme_get_idx_whitelist(trackme_data_host_monitoring_whitelist_index, data_index)` `apply_data_host_blacklists_data_retrieve` ) OR `trackme_tstats_main_filter_for_host` ) by index, sourcetype, host 
| stats dc(sourcetype) as dcount, values(sourcetype) by host
| sort 0 - dcount

What we want to find out is which hosts have a seriously large number of sourcetypes; these should be excluded from the host tracking.
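
The idea behind the distinct count per host can also be sketched in Python, for example against an export of the tstats results. This is only an illustration: the rows below are hypothetical sample data and the function name is invented for the example.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the (index, sourcetype, host) output of the tstats search.
rows = [
    {"index": "_internal", "sourcetype": "splunkd", "host": "hf01"},
    {"index": "_internal", "sourcetype": "splunkd_access", "host": "hf01"},
    {"index": "main", "sourcetype": "app:log:0001", "host": "noisy01"},
    {"index": "main", "sourcetype": "app:log:0002", "host": "noisy01"},
    {"index": "main", "sourcetype": "app:log:0003", "host": "noisy01"},
    {"index": "main", "sourcetype": "app:log:0003", "host": "noisy01"},
]


def sourcetypes_per_host(rows):
    """Return (host, distinct sourcetype count) pairs, largest count first."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["host"]].add(row["sourcetype"])
    return sorted(
        ((host, len(sourcetypes)) for host, sourcetypes in seen.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


ranking = sourcetypes_per_host(rows)
print(ranking)  # [('noisy01', 3), ('hf01', 2)]
```

Hosts at the top of such a ranking with a very high count are the candidates to exclude from the host tracking.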

Let me know

@guilhemmarchand
Owner

@cidermark

Thinking about it, the easiest might be that we have a look together. I believe you have some form of exception here, and I am sure there's a reason.

My email is: guilhem.marchand@gmail.com
You can ping me on Splunk community Slack too then we can meet when convenient.

Guilhem
