
Write trin orchestrator logic #89

Merged
merged 5 commits into ethereum:master from trin-client on Sep 7, 2021

Conversation


@njgheorghita njgheorghita commented Sep 2, 2021

This PR is motivated by some questions I had while getting the trin testnet up and running. It's entirely possible that I misunderstood some design aspects, so please ask questions and point out any design flaws here. This PR is meant to provide more of an organizational blueprint for how we will build out trin.

Some questions I had about the current trin architecture...

  • Why is each subnetwork managing its own jsonrpc server?
  • Why is each subnetwork building its own base discv5 table?
  • How are these subnetworks going to communicate with each other (see #48: Whether to use joint or isolated binaries to orchestrate multiple networks)?
  • How / where exactly are we going to handle the various data processing needed to serve certain jsonrpc endpoints (eth_getBalance, eth_call, etc.)?

After some discussion with @carver, we came to the conclusion that there is a need for some orchestrator process to handle coordination.

In this PR, trin is set up to act as an orchestrator for the sub-networks (history, gossip, state).
(Side note, @carver: we discussed building a trin-client crate to handle this, which is how I started, but in the end a separate crate seemed redundant, so I put that logic in the root trin project. I'm open to changing my mind. Note that trin-core still exists for common logic.)

trin:

  • Ultrathin wrapper around the bare minimum of orchestrator logic, which is handed off to the subnetworks as they are initialized (a rough wiring sketch follows this list)
    • Responsibilities definitely include:
      • jsonrpc server/handler
        • Initially, the plan is for the handler to house the logic needed to serve jsonrpc endpoints - requesting various data from subnetworks as needed.
      • Base layer discv5 table
      • Message channels for inter-network communication
      • Initialize selected subnetworks
    • Responsibilities possibly include:
      • Initialize db connection (see open questions below)
      • Handle dbus (see open questions below)
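
A rough wiring sketch of the orchestrator described above. handle_client_queries and the *NetworkEndpoint channel types follow shapes that show up later in this PR; start_discovery, launch_jsonrpc_server, and the exact signatures are illustrative assumptions, not the final API:

// Illustrative sketch only, not the merged code.
use tokio::sync::mpsc;

pub async fn run_orchestrator(trin_config: TrinConfig) {
    // Base layer discv5 table: started once by the orchestrator (placeholder call).
    let discovery = start_discovery(&trin_config).await;

    // Message channels for jsonrpc-handler -> subnetwork data requests.
    let (history_tx, history_rx) = mpsc::unbounded_channel::<HistoryNetworkEndpoint>();
    let (state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();

    // Initialize the selected subnetworks; each receives the receiving end of its
    // channel (and, eventually, a handle to the shared discovery instance).
    let history_handler = trin_history::initialize(history_rx).expect("history init failed");
    let state_handler = trin_state::initialize(state_rx).expect("state init failed");

    // Each subnetwork serves data requests coming from the jsonrpc handler.
    tokio::spawn(history_handler.handle_client_queries());
    tokio::spawn(state_handler.handle_client_queries());

    // One jsonrpc server/handler for the whole client; it proxies endpoint
    // requests to the appropriate subnetwork over the channels above.
    launch_jsonrpc_server(trin_config, discovery, history_tx, state_tx).await;
}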

trin-core: Home for any logic that is shared amongst the sub-networks.

Subnetworks (trin-history, trin-state, trin-gossip)

  • Responsibilities upon initialization:
    • Build their own overlay discv5 table from base discv5
    • Start a handler for serving network data requests from jsonrpc handler

Command line API.

  • cargo run => starts base trin orchestrator and all available networks by default
  • cargo run -- --networks history => starts base trin orchestrator with just history subnetwork
  • cargo run -p trin-history => starts only the history package (i.e. this should be minimally useful: it won't have a jsonrpc server, but it will connect to the history network) (side note: I'm not really sure why anyone would want to run just this binary, so is this even a use case we want to support?)

Poorly drawn picture: [diagram: trin-architecture]

Open questions.

  • How are we handling databases? Does each subnetwork have its own? Is there one shared database? What are the tradeoffs involved with disk i/o?
  • Do we want to upgrade to a dbus-style communication mechanism?

@njgheorghita njgheorghita force-pushed the trin-client branch 2 times, most recently from e0a2dd9 to 72069a9 on September 2, 2021 15:51
@njgheorghita njgheorghita marked this pull request as ready for review September 2, 2021 15:55
@njgheorghita njgheorghita force-pushed the trin-client branch 3 times, most recently from 2107fc8 to 3df43af on September 2, 2021 20:17

@carver carver left a comment


Can you check how much more work it would be to get discovery to only start up once?

Comment on lines +258 to +265
"test_historyNetwork" => PortalEndpoint {
kind: PortalEndpointKind::DummyHistoryNetworkData,
resp: resp_tx,
},
Collaborator

Is this just a temporary thing to test that we can successfully dispatch to both subnets?

If so, maybe add a comment saying something like that, with a TODO to remove it when we have tests in CI that confirm it another way.

}

impl HistoryRequestHandler {
pub async fn process_network_requests(mut self) {
Collaborator

"network" requests makes me think peer to peer. But if I'm reading this right, it's all about serving the json-rpc requests, yeah? Another option is something like:

Suggested change
pub async fn process_network_requests(mut self) {
pub async fn handle_client_queries(mut self) {

Collaborator Author

In my head it was something along the lines of "process these requests (from the jsonrpc handler) for data from your (sub)network". But, I like your suggestion and updated it. The fewer times we use "network" the better.

Comment on lines 132 to 133
self.history_tx.as_ref().unwrap().send(message).unwrap();
let response = match resp_rx.recv().await.unwrap() {
Collaborator

I'm not really clear. When might this panic in any of these 3 unwrap()s? If it's remotely possible, then we need some handling.

(and the same below, of course)

Collaborator Author

Yup, for sure there's a lot of error handling that I just kinda glossed over so far. I'll go back and implement some handling
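
For reference, a minimal sketch of what that handling could look like for this send/recv pair, assuming the surrounding function is changed to return a Result<_, String> so ? can be used (error wording and types here are assumptions, not the eventual implementation):

// Sketch only: surface channel failures instead of unwrap()-panicking.
let history_tx = self
    .history_tx
    .as_ref()
    .ok_or_else(|| "history network is not enabled".to_string())?;
history_tx
    .send(message)
    .map_err(|_| "failed to reach the history network".to_string())?;
let response = resp_rx
    .recv()
    .await
    .ok_or_else(|| "history network closed the response channel".to_string())?;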

@@ -51,13 +62,7 @@ pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
tokio::spawn(async move {
let (mut p2p, events) = PortalnetProtocol::new(portalnet_config).await.unwrap();
Collaborator

Inside PortalnetProtocol it's launching discv5 right now. So currently, it would be launching discovery 3 different times. The direction I would head is to start discovery with trin-core, and pass a handle into both trin-state and trin-history so that neither of them try to start up their own discovery instance.

Collaborator Author
@njgheorghita njgheorghita Sep 2, 2021

> Can you check how much more work it would be to get discovery to only start up once?

Hmm, my understanding might be wrong here? But this is in the "main()" function, which will only be called if somebody chooses to run this binary by itself (cargo run -p trin-state). In which case, wouldn't it want to boot up its own base discv5? I guess there's some more thinking to be done about the "solo binary" story: whether we want to support it at all, and being more explicit about the functionality it would support.

Collaborator Author

When booting trin via cargo run, inside trin/src/lib.rs we only call initialize() for each subnet, not their main().

Collaborator

Ah yes, I had missed that. But... then aren't there other parts of the PortalnetProtocol class that should be run that aren't being run? Like creating PortalnetEvents and running tokio::spawn(events.process_discv5_requests());. That's the code that reacts to the inbound talkreq's, and serves the overlay protocol to peers.

Collaborator Author

Yeah, for sure those are needed - but to keep things simple, I just ignored it for this pr 😅 - tracking it here

@njgheorghita njgheorghita force-pushed the trin-client branch 2 times, most recently from 51e68ac to cb4cbaa on September 2, 2021 21:45
@@ -97,6 +123,54 @@ impl JsonRpcHandler {
.collect();
let _ = cmd.resp.send(Ok(Value::Array(routing_table_info)));
}
DummyHistoryNetworkData => {
Collaborator Author

@carver Any thoughts on this? Imo it's gross to read - but it does handle all the error cases correctly. It wasn't obvious to me how to break it out into a simpler function - but I'm happy to spend some more time cleaning it up.

Collaborator

Yeah, at the very least, I'd probably put all of this in its own method. I think it will be more obvious how to generalize as we add more requests that go to the same subnet. (But it does seem like passing in just a couple of variables could mean being able to put most of this into a single shared method like proxy_query_to_subnet(subnet_tx, msg_kind) -> response.)
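
A rough sketch of such a shared method; the name comes from the suggestion above, while the parameter types and the assumption that the subnet replies with a Result<Value, String> over the response channel are illustrative guesses:

use serde_json::Value;
use tokio::sync::mpsc;

// Hypothetical helper: send one query to a subnet over its channel and await the reply.
async fn proxy_query_to_subnet(
    subnet_tx: &mpsc::UnboundedSender<PortalEndpoint>,
    msg_kind: PortalEndpointKind,
) -> Result<Value, String> {
    let (resp_tx, mut resp_rx) = mpsc::unbounded_channel();
    let message = PortalEndpoint {
        kind: msg_kind,
        resp: resp_tx,
    };
    subnet_tx
        .send(message)
        .map_err(|_| "failed to reach subnet".to_string())?;
    match resp_rx.recv().await {
        Some(response) => response,
        None => Err("subnet closed the response channel".to_string()),
    }
}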

src/lib.rs Outdated
use trin_history::HistoryRequestHandler;
use trin_state::StateRequestHandler;

pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
Collaborator

If this is the top-level orchestrator, would it be better to move this into main.rs and remove this lib.rs file? Based on the motivation, nothing else would need to call the main function of the orchestrator.

Member

I guess the trin root crate is considered a binary right now, so it makes sense to move this into main.rs, but I think we can keep the lib.rs file in case we want to extract some of the code outside of the main file.

Collaborator Author

At the moment, I'd prefer removing lib.rs entirely, under the assumption that if/when it becomes obvious that we want to extract some code from main.rs, we can always create it later.

@jacobkaufmann
Collaborator

As far as the database stuff goes, my assumption is that we would require multiple database instances because we would need to manage the storage capacity for each subnetwork individually. For example, a user may wish to allocate X space to state and Y space to history. I'm not sure this is possible with a single database instance, but I haven't looked into the RocksDB crate enough to be absolutely sure.

@ogenev ogenev left a comment

Hey, thanks for thinking about the orchestration logic. I really like the idea of using one main json-rpc handler as an entry point to different network specific handlers.

My concern about the current json-rpc implementation is that we may need to drop it soon 😄. I'm currently playing with paritytech's json-rpc library and I'm going to implement it in the testing framework branch. I guess next week we will have more clarity on whether we want to switch to this library or not. (It looks promising so far: better error handling, logging, and of course less boilerplate code.)

I guess we should also think about how we are going to orchestrate network-specific discv5/overlay protocol instances, talkreq handlers, and DBs. This probably also includes updating the CLI parser to handle args for more than one network.

src/lib.rs Outdated
let mut state_handler: Option<StateRequestHandler> = None;
let mut history_handler: Option<HistoryRequestHandler> = None;

if trin_config.networks.iter().any(|val| val == "history") {
Member

I think we can combine this and the following iteration on trin_config.networks with something like:

for network in trin_config.networks.iter() {
        match network {....

Collaborator Author

I did like this idea, but in the end I prefer something more along the lines of what @jacobkaufmann suggested here, since it gets rid of initializing empty mutable variables. I wasn't able to devise a way to combine both of your refactors (i.e. a more explicit iteration and no dead variable instantiation), but if you see something I missed, lmk.

#[structopt(
long = "networks",
help = "Comma-separated list of which portal networks to activate",
default_value = "history,state",
Member

I would set this as a constant and use default_value(DEFAULT_NETWORKS)
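
For example, something along these lines (a sketch only: the field type and delimiter handling are assumptions about how the networks arg is parsed, and the diff further down in this conversation ends up using the name DEFAULT_SUBNETWORKS):

// Sketch of the suggested pattern; field shape and use_delimiter are assumptions.
const DEFAULT_NETWORKS: &str = "history,state";

#[structopt(
    long = "networks",
    help = "Comma-separated list of which portal networks to activate",
    default_value(DEFAULT_NETWORKS),
    use_delimiter(true)
)]
networks: Vec<String>,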

Collaborator

I'm not sure whether it would be better to instead use activation flags for each desired network. Of course this creates more CLI arguments, and committing to this pattern would require a new CLI argument for each subnetwork. However, it aligns with the mental model I have for a portal client, where the user may toggle each subnetwork on and off, and it simplifies the code that interprets the values for these arguments.

Collaborator

> it would be better to instead use activation flags for each desired network

I think we can address this in a separate issue if it seems important to you.

Collaborator Author

Yeah, this is an interesting question. Imo it's somewhat out of scope for this pr, since it's a relatively simple change that can be implemented as various use cases arise and justify it. I created an issue here to track this conversation - and just going with DEFAULT_NETWORKS for now.

src/lib.rs Outdated
portalnet_config.bootnode_enrs
);

let (mut p2p, events) = PortalnetProtocol::new(portalnet_config).await.unwrap();
Member

This seems to run only one instance of the overlay network protocol and one talkreq handler. I think we should run one instance per network?

Collaborator

Yeah, this is where a bigger refactor needs to happen, where PortalnetProtocol becomes split into StatePortalProtocol::new(discovery, portalnet_config) and HistoryPortalProtocol::new(discovery, portalnet_config), so that discovery is run once and directly used by each subnet.
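
A sketch of that direction (the constructor names come from the comment above; Discovery and process_overlay_requests are placeholder names for whatever ends up wrapping the shared discv5 instance and its talkreq handling):

// Hypothetical shape, not the merged code: discv5 is started once by the
// orchestrator, and a shared handle is passed into each subnet protocol.
use std::sync::Arc;

let discovery = Arc::new(Discovery::new(portalnet_config.clone()).await.unwrap());

let state_protocol = StatePortalProtocol::new(Arc::clone(&discovery), portalnet_config.clone())
    .await
    .unwrap();
let history_protocol = HistoryPortalProtocol::new(Arc::clone(&discovery), portalnet_config)
    .await
    .unwrap();

// Each subnet registers its own overlay table / talkreq handling on top of the
// one shared discovery instance instead of launching discv5 itself.
tokio::spawn(state_protocol.process_overlay_requests());
tokio::spawn(history_protocol.process_overlay_requests());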

Collaborator Author
@njgheorghita njgheorghita Sep 3, 2021

Yup, this PR is not by any means intended to be conclusive; it's just laying the basic blueprint for the orchestrator. So I'm only running the base discv5 protocol here. There are still a lot of open questions and implementation work to be done regarding what the orchestrator is responsible for and how it initializes each subnet. With that in mind, I tried to keep this PR fairly simple, and created an issue to track the conversation and implementation.

src/lib.rs Outdated
pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Launching trin");

let trin_config = TrinConfig::new();
Member

My understanding is that we want to run the different networks on separate UDP ports. If that is the case, we need to extend the CLI parser (StructOpt) to support command line args for more than one network.

Collaborator

> My understanding is that we want to run the different networks on separate UDP ports.

This doesn't sound right to me. The subnets would all be running off of the same discv5 layer, which is what has opened the UDP port. Then the overlay messages are all happening over talkreq over the shared discv5.

Collaborator Author

Tracking this discussion here

src/lib.rs Outdated
Comment on lines 57 to 64
if trin_config.networks.iter().any(|val| val == "state") {
let (raw_state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
state_tx = Some(raw_state_tx);
state_handler = match trin_state::initialize(state_rx) {
Ok(val) => Some(val),
Err(msg) => panic!("Error while initializing state network: {:?}", msg),
}
}
Collaborator

Perhaps the following is a cleaner way than declaring the mut variables above?

Suggested change
if trin_config.networks.iter().any(|val| val == "state") {
let (raw_state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
state_tx = Some(raw_state_tx);
state_handler = match trin_state::initialize(state_rx) {
Ok(val) => Some(val),
Err(msg) => panic!("Error while initializing state network: {:?}", msg),
}
}
let (state_tx, state_handler) = if trin_config.networks.iter().any(|val| val == "state") {
    let (state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
    let state_handler = match trin_state::initialize(state_rx) {
        Ok(val) => val,
        Err(msg) => panic!("Error while initializing state network: {:?}", msg),
    };
    (Some(state_tx), Some(state_handler))
} else {
    (None, None)
};

Similarly for history.

Collaborator

I think I liked this one even better:

for network in trin_config.networks.iter() {
        match network {....

Collaborator Author

It's very possible I'm missing how to accomplish this, but I wasn't able to find a way to use the iter() strategy @ogenev suggested without instantiating None mut variables in advance. Imo those mut variables are grosser than the clunky iteration handling - but I have no strong feelings towards either approach.
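
For reference, a sketch of the iteration shape being discussed, which is what forces the pre-declared mut Option variables (not code from this PR; HistoryNetworkEndpoint is assumed by symmetry with StateNetworkEndpoint):

// Illustrative only: the match-over-networks shape requires declaring the
// Option holders up front, which is the tradeoff discussed above.
let mut state_tx: Option<mpsc::UnboundedSender<StateNetworkEndpoint>> = None;
let mut state_handler: Option<StateRequestHandler> = None;
let mut history_tx: Option<mpsc::UnboundedSender<HistoryNetworkEndpoint>> = None;
let mut history_handler: Option<HistoryRequestHandler> = None;

for network in trin_config.networks.iter() {
    match network.as_str() {
        "state" => {
            let (tx, rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
            state_tx = Some(tx);
            state_handler = Some(trin_state::initialize(rx).expect("state init failed"));
        }
        "history" => {
            let (tx, rx) = mpsc::unbounded_channel::<HistoryNetworkEndpoint>();
            history_tx = Some(tx);
            history_handler = Some(trin_history::initialize(rx).expect("history init failed"));
        }
        unknown => panic!("Unknown network: {}", unknown),
    }
}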

@mrferris
Collaborator

mrferris commented Sep 3, 2021

On the topic of storage, my thoughts are that we have separate database instances per network, each abstracted by a separate PortalStorage instance (#69) per network.

My main reasoning is simplicity. If we have every network's data combined in one DB instance, we'd need to store metadata on each key to denote which network it belongs to. The PortalStorage struct would need to get more complex in order to keep track of everything for every network at the same time. Or we could have multiple PortalStorage instances interacting with a single database instance, in which case we'd need to manage locking, etc. Debugging storage functionality also gets more complex with everything in one RocksDB.

Is there a benefit to combining the databases that I might be missing? I'm only seeing the extra complexity. That said, I have yet to look into the effects of multiple DBs vs 1 DB on disk i/o, so that might be what I'm missing.


carver commented Sep 3, 2021

> my assumption is that we would require multiple database instances because we would need to manage the storage capacity for each subnetwork individually.

I think most users won't want to manage the storage space of each subnetwork individually (they probably won't even know the internal detail that there are multiple subnetworks). Think of a typical bittorrent user, and how little they probably know about the details of the protocol.

Even if we decide to implement this per-network limit, I'm sure we could find other ways than running multiple DBs. So I think we should at least remain open to the idea of piping everything into a single DB (which I expect to have performance benefits due to I/O serialization).


carver commented Sep 3, 2021

> My concern about the current json-rpc implementation is that we may need to drop it soon 😄. I'm currently playing with paritytech's json-rpc library and I'm going to implement it in the testing framework branch. I guess next week we will have more clarity on whether we want to switch to this library or not.

In theory, that work of swapping json-rpc library should be able to happen after this refactor, no? I would really like to avoid pausing work on this PR.


carver commented Sep 3, 2021

> Is there a benefit to combining the databases that I might be missing? I'm only seeing the extra complexity. That said, I have yet to look into the effects of multiple DBs vs 1 DB on disk i/o, so that might be what I'm missing.

Yup, that's exactly the issue. Some client (Erigon?) had to do a big rewrite to a single DB in order to avoid causing big I/O delays by trying to hammer the disk a bunch in parallel from different threads.

FWIW, I think we're going to have to deal with locking logic anyway, even within a single subnet.

@njgheorghita
Collaborator Author

Moved the ongoing db discussion here since imo it's a bit out of scope for this PR.


njgheorghita commented Sep 3, 2021

Thanks everyone for the feedback! This PR was really intended to remain simple and just lay out the basic blueprint for the orchestrator. I did a lot of hand-waving across a bunch of important questions & implementation details, but the goal of this PR is not to solve them all right now. These questions still deserve some discussion imo, and rather than housing them all inside this PR I've created tracking issues...

Imo all of these issues are implementation details and don't inherently conflict with the orchestrator blueprint. It establishes a common ground & jsonrpc server between the subnets, and is flexible enough to accommodate whatever we decide for the various open questions. But if you feel differently, please say why.


ogenev commented Sep 3, 2021

> In theory, that work of swapping json-rpc library should be able to happen after this refactor, no? I would really like to avoid pausing work on this PR.

Sure, swapping json-rpc shouldn't be a blocker for this PR.

@@ -9,6 +9,7 @@ use structopt::StructOpt;
const DEFAULT_WEB3_IPC_PATH: &str = "/tmp/trin-jsonrpc.ipc";
const DEFAULT_WEB3_HTTP_PORT: &str = "8545";
const DEFAULT_DISCOVERY_PORT: &str = "9000";
const DEFAULT_SUBNETWORKS: &str = "history,gossip";
Collaborator

Suggested change
const DEFAULT_SUBNETWORKS: &str = "history,gossip";
const DEFAULT_SUBNETWORKS: &str = "history,state";

Comment on lines +91 to +96
if let Some(handler) = state_handler {
tokio::spawn(handler.handle_client_queries());
}
if let Some(handler) = history_handler {
tokio::spawn(handler.handle_client_queries());
}
Collaborator

If you feel like pursuing it, the direction I'd take is something like building a vector of subnet handlers and looping through them to spawn. 👌🏻 if you want to merge without trying it though.
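
A sketch of that direction, assuming each handler's handle_client_queries() future is Send and returns ():

// Sketch only: collect whichever subnet handlers were initialized into one
// vector of boxed futures, then spawn them in a single loop.
use std::future::Future;
use std::pin::Pin;

let mut subnet_handlers: Vec<Pin<Box<dyn Future<Output = ()> + Send>>> = vec![];
if let Some(handler) = state_handler {
    subnet_handlers.push(Box::pin(handler.handle_client_queries()));
}
if let Some(handler) = history_handler {
    subnet_handlers.push(Box::pin(handler.handle_client_queries()));
}
for handler in subnet_handlers {
    tokio::spawn(handler);
}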

"trin-state",
"trin-history",
"trin-core"
]
Contributor

What is the motivation for making trin-core a dependency and trin-history and trin-state members of the workspace? Are trin-core, trin-history, and trin-state packages or crates, with the entire directory acting as a workspace?

Collaborator Author

Overall my goal was to allow the CLI API written out in this pr's description. But, it's possible that I didn't choose the best configuration setup here. There's no need to expose trin-core to the user, since it's all internal code. But, I wanted to expose the ability for users to run trin-history and trin-state packages w/o the common trin orchestrator logic, if that's what they want to do for whatever reason. Did you see something I missed in this current setup?

Contributor

Haha no, that makes total sense. I'm just confused by all of Cargo's terminology around crates, workspaces, and packages, but what you mentioned helps make things clearer for me.

@@ -219,6 +219,8 @@ fn handle_request(
"result": "trin 0.0.1-alpha",
})
.to_string()),
"test_historyNetwork" => dispatch_portal_request(obj, portal_tx),
"test_stateNetwork" => dispatch_portal_request(obj, portal_tx),
Contributor

Is there a cleaner way to combine "test_historyNetwork", "test_stateNetwork", and the one below this, since they all call the same function?

Collaborator Author

Yeah, there is for sure. But it's not currently obvious to me what the best way to clean this up is, and I think the optimal refactor will emerge as we add more and more endpoints. I'm partial to leaving it as-is for now, and just leaving a comment that refactoring should be handled soon.
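
For reference, the most direct consolidation is probably an or-pattern on the method name; a sketch against the two arms visible in this diff (the third arm isn't shown above, so it's left out):

// Both endpoints dispatch to the same helper, so they can share a single match arm.
"test_historyNetwork" | "test_stateNetwork" => dispatch_portal_request(obj, portal_tx),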

@njgheorghita njgheorghita merged commit 2a0a580 into ethereum:master Sep 7, 2021
@njgheorghita njgheorghita deleted the trin-client branch September 7, 2021 16:15