operator-friendly quorum health reporting #4523

Open
marta-lokhova opened this issue Oct 25, 2024 · 6 comments

@marta-lokhova
Contributor

We should signal to operators vital information about the health of their quorum set as well as transitive quorum in general. Currently, we have info in quorum, quorum intersection checker warnings, plus in each consensus round we report missing/delayed nodes. Some ideas on how this can be improved from an operator point of view (specifically, it should be easy to understand that an action is needed):

  • Aggregate "missing" information over time and warn that the nodes might be down (in a human-friendly form, using node names instead of pubkeys if possible). Indicate a quorum set change might be needed.
  • Better track changes in the transitive quorum, display the diff in a readable form. It would be nice to identify and flag changes in Tier1 specifically, however this might be more involved as we don't have Tier1 computation logic in core as of right now.
  • As part of this, I think node names are important. We should consider ways to propagate nodeID<>nodeName pairs on the network so the transitive quorum looks sane to operators (this should not be hashed into the ledger; some sort of best-effort mapping would do).
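
For illustration, the first two ideas above could be sketched roughly as follows. This is a minimal Python sketch, not stellar-core code: `MissingNodeTracker`, `diff_transitive_quorum`, the window size, the threshold, and the `names` map (pubkey to human-readable name) are all hypothetical.

```python
from collections import Counter, deque


class MissingNodeTracker:
    """Sliding-window count of how often each node was reported missing."""

    def __init__(self, window=100, threshold=0.8, names=None):
        self.window = window        # number of recent consensus rounds kept
        self.threshold = threshold  # fraction of rounds missed before warning
        self.names = names or {}    # hypothetical pubkey -> node name mapping
        self.rounds = deque(maxlen=window)

    def record_round(self, missing_nodes):
        """Store the set of nodes reported missing in one consensus round."""
        self.rounds.append(frozenset(missing_nodes))

    def warnings(self):
        """Return one message per node missing in >= threshold of the window."""
        if len(self.rounds) < self.window:
            return []  # not enough history to say anything yet
        counts = Counter(node for r in self.rounds for node in r)
        messages = []
        for node, count in sorted(counts.items()):
            fraction = count / len(self.rounds)
            if fraction >= self.threshold:
                label = self.names.get(node, node)
                messages.append(
                    f"{label} missing in {fraction:.0%} of the last "
                    f"{len(self.rounds)} rounds; check its status"
                )
        return messages


def diff_transitive_quorum(old, new, names=None):
    """Return the nodes added to and removed from the transitive quorum."""
    names = names or {}
    old, new = set(old), set(new)
    return {
        "added": sorted(names.get(n, n) for n in new - old),
        "removed": sorted(names.get(n, n) for n in old - new),
    }
```

An operator-facing report would call `record_round` once per consensus round and print `warnings()` periodically, falling back to the raw pubkey whenever no name is known.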
@MonsieurNicolas
Contributor

Aggregate "missing" information over time and warn that the nodes might be down (in a human-friendly form, using node names instead of pubkeys if possible). Indicate a quorum set change might be needed.

We have to be careful with this. We don't want to give the wrong impression that removing nodes from a quorum set would increase reliability (as we've already seen happen with archives).

Better track changes in the transitive quorum, display the diff in a readable form. It would be nice to identify and flag changes in Tier1 specifically, however this might be more involved as we don't have Tier1 computation logic in core as of right now.

I thought "tier1" was basically computed by the quorum intersection code?

That being said, as the target audience here is people running watcher nodes, simplified algorithms could go a long way: for example, running "pagerank" over the qsets of the validators in the node config and displaying entries that cross some threshold and are not in the config.
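
A rough sketch of that "pagerank" idea, assuming each validator's qset has already been flattened into a plain list of node IDs (inner sets and thresholds are ignored here). The function names, the damping factor, and the cutoff are hypothetical choices for illustration, not anything that exists in core.

```python
def pagerank(qsets, damping=0.85, iterations=50):
    """Power-iteration PageRank over a graph where each validator points
    to every node listed in its (flattened) quorum set."""
    qsets = {v: set(members) for v, members in qsets.items()}
    nodes = set(qsets) | {m for members in qsets.values() for m in members}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Nodes whose qset is unknown or empty spread their rank evenly.
        dangling = sum(rank[v] for v in nodes if not qsets.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, members in qsets.items():
            if not members:
                continue
            share = damping * rank[v] / len(members)
            for m in members:
                new_rank[m] += share
        rank = new_rank
    return rank


def suggest_unconfigured(qsets, configured, threshold):
    """Nodes ranked at or above threshold that are absent from the config."""
    rank = pagerank(qsets)
    return sorted(v for v, r in rank.items()
                  if r >= threshold and v not in configured)


# Three configured validators all list X, which the local config omits.
qsets = {"A": ["B", "C", "X"], "B": ["A", "C", "X"], "C": ["A", "B", "X"]}
print(suggest_unconfigured(qsets, configured={"A", "B", "C"}, threshold=0.25))
```

Here `X` ends up with the highest rank because every configured validator references it, so it is the one entry flagged for the operator to look at.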

As part of this, I think node names are important. We should consider ways to propagate nodeID<>nodeName pairs on the network so transitive quorum looks sane to operators (it should not be hashed into the ledger, but do some sort of best effort mapping)

There is a SEP for this actually, so we could use ledger information. That being said, I don't know if this will continue to be true (state archival).
It's actually not that bad to reverse-lookup keys in tools like stellarbeat.

@heytdep

heytdep commented Oct 27, 2024

I think this is part of a bigger discussion around making core data easier to access for operators (and thus also users). Everyone should be able to easily access and process data (and theoretically they can). For instance, regarding quorum health, there should be workers that monitor whether validators are arbitrarily including transactions that favor their own activities when the network is congested (e.g. by monitoring the frequency of included txs that are not in the node's mempool). I don't think that pushing data downstream and making aggregations and assumptions about how watcher nodes might want to do this is the correct way.

My view is that watcher nodes should be able to easily work alongside core at a lower level and decide for themselves how they want to process health factors, changes, etc. (This also incentivizes a better research and network economy imo.) I've started working on a stellar-core fork that shares data at runtime with a Rust bridge using shm. Processing the data from the Rust service becomes much easier for an operator, as it runs in parallel (I'm still figuring out how to make this thread safe) and you can plug in a much simpler codebase. I'm curious how the core team feels about this approach (also because I'm planning to push an MVP of this to Mercury testnet quite soon).

On a separate note

It's actually not that bad to reverse lookup keys in tools like stellarbeat

I agree with this.

@MonsieurNicolas
Contributor

@heytdep great to hear you're also interested in better auditing infrastructure. It's an area where we didn't do a whole lot until very recently, so lots of opportunities.

We've actually already run some good experiments by pushing data into the data pipeline for historical analysis (we've already started to use it to catch potential bias from validators). @sydneynotthecity probably has a lot more to say on the topic (and links to issues where discussions can happen).

I'd say that if we're missing data, we should look into ways to push it into those data lakes to make it as easy and efficient as possible for data scientists (or anyone really) to analyze things and find interesting anomalies/trends without having to deal with the dark world of C++ :)

As for this specific issue, I think the scope here is much simpler: without historical data (and related infrastructure), we're trying to give as much feedback as possible to node operators so that they at least know that they're supposed to go look at more advanced tools (stellarbeat or others) and confirm that their configuration is correct.

@heytdep

heytdep commented Oct 29, 2024

@MonsieurNicolas awesome! I'd very much like to learn about the current experiments happening on that end. I do remember Garand mentioning something about making this kind of data more accessible through Hubble.

I think I'll keep working on my experimental "rust bridge" for now, if only so that there isn't just one SDF implementation, and to learn more about core's codebase (and to make the information flow bidirectional, i.e. also feeding instructions to core from Rust, though I'm still unsure about the safety here). It would be awesome to be able to access more vote- and set-related data with the same ease as we currently access the transitions.

@sydneynotthecity

@heytdep this is something we've been thinking about on the data team as well. We are particularly interested in assessing how txset compositions change between validators if validators choose to enable CAP-0042 (and the implications for fees). In general, it would be nice if some of the quorum set and validator data were more easily accessible.

We've added some fields to Hubble to make it easier to audit bias at the validator level. See stellar-etl#249 for some additional context. We now publish the node id + signature that signed each ledger tx set. We discussed adding SCP info and validator messages to our dataset, but ultimately decided that the data was too fine-grained to add without a better understanding of what operators would do with that level of detail.

The aggregate information @marta-lokhova proposes above is something we could include in our ETL pipeline and make publicly available, and we could expand that to publishing full qset details and messages (if needed).

Given the current data published in Hubble, you can actually do a bit of experimentation today. Anything saved in LedgerCloseMeta is easily accessible, and our team built a Validator Analytics project to demonstrate how easy it would be to build some monitoring/tooling around detecting bias in validators. (Do some tier 1 orgs close transactions more often than others? Do some orgs favor certain operation types in their txsets?) This is something that we're thinking of extending further next year. If you think we need to include fine-grained SCP info or have requests for other data you'd like to see in Hubble, please open an issue in stellar-etl!
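
As a rough illustration of the first question, here is a minimal Python sketch that computes each org's share of signed tx sets. It assumes the per-ledger signer node IDs have already been exported into a plain list; `txset_share_by_org` and the `node_to_org` mapping are hypothetical and do not reflect the actual Hubble schema.

```python
from collections import Counter


def txset_share_by_org(signers, node_to_org):
    """Fraction of ledgers whose tx set was signed by each organization.

    signers: one node ID per closed ledger (the node that signed its tx set).
    node_to_org: hypothetical mapping from node ID to organization name.
    """
    counts = Counter(node_to_org.get(node, "unknown") for node in signers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {org: count / total for org, count in counts.items()}


signers = ["n1", "n2", "n1", "n3"]
node_to_org = {"n1": "orgA", "n2": "orgA", "n3": "orgB"}
print(txset_share_by_org(signers, node_to_org))
```

A share well above an org's expected proportion is only a prompt to dig further, since this count says nothing about why a given tx set won nomination.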

@heytdep

heytdep commented Oct 30, 2024

@sydneynotthecity thanks for this. In general I'd say that any data in the LedgerCloseMeta is already well externalized, whether by core, the CDP, or Zephyr. It's great to see this data also coming to Hubble though.

What I was working on is a bit lower level: monitoring that leader validators are not pushing valid txs into the set without publishing them first. But yeah, it's great to have more data externalized about validator bias too. I'll try to squeeze in some time to build a Zephyr version as well to increase awareness.
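
The check itself could look something like the Python sketch below, assuming a watcher already records the tx hashes it saw flooded before each ledger closed along with the txs each leader included. `unseen_tx_ratio` and its input shapes are hypothetical and are not part of core or the bridge.

```python
from collections import defaultdict


def unseen_tx_ratio(ledgers, mempool_seen):
    """Per leader, the fraction of included txs the local node never saw.

    ledgers: iterable of (leader_id, tx_hashes) pairs, one per closed ledger.
    mempool_seen: set of tx hashes the local node observed before close.
    """
    included = defaultdict(int)
    unseen = defaultdict(int)
    for leader, tx_hashes in ledgers:
        for tx in tx_hashes:
            included[leader] += 1
            if tx not in mempool_seen:
                unseen[leader] += 1
    return {leader: unseen[leader] / total
            for leader, total in included.items() if total}


ledgers = [("v1", ["a", "b"]), ("v2", ["c", "d"]), ("v1", ["e", "f"])]
print(unseen_tx_ratio(ledgers, mempool_seen={"a", "b", "c", "e"}))
```

A tx can also go unseen locally because of ordinary propagation delay, so a high ratio is a signal to investigate rather than proof of misbehavior.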
