
Write trin orchestrator logic #89

Merged
merged 5 commits into ethereum:master from trin-client on Sep 7, 2021

Conversation


@njgheorghita njgheorghita commented Sep 2, 2021

This PR is motivated by some questions I had while getting the trin testnet up and running. It's entirely possible that I misunderstood some design aspects, so please ask questions and point out any design flaws here. This PR is meant to provide more of an organizational blueprint for how we will build out trin.

Some questions I had about the current trin architecture...

  • Why is each subnetwork managing its own jsonrpc server?
  • Why is each subnetwork building its own base discv5 table?
  • How are these subnetworks going to communicate with each other (see #48: Whether to use joint or isolated binaries to orchestrate multiple networks)?
  • How / where exactly are we going to handle the various data processing needed to serve certain jsonrpc endpoints (eth_getBalance, eth_call, etc.)?

After some discussion with @carver, we came to the conclusion that there is a need for some orchestrator process to handle coordination.

In this PR, trin is set up to act as an orchestrator for the sub-networks (history, gossip, state).
(Side note, @carver: we discussed building a trin-client crate to handle this, which is how I started, but in the end a separate crate seemed redundant, so I put that logic in the root trin project. I'm open to changing my mind. Note that trin-core still exists for common logic.)

trin:

  • Ultrathin wrapper around the bare minimum of orchestrator logic, which is handed off to the subnetworks as they are initialized (a rough wiring sketch follows this list)
    • Responsibilities definitely include:
      • jsonrpc server/handler
        • Initially, the plan is for the handler to house the logic needed to serve jsonrpc endpoints - requesting various data from subnetworks as needed.
      • Base layer discv5 table
      • Message channels for inter-network communication
      • Initialize selected subnetworks
    • Responsibilities possibly include:
      • Initialize db connection (see open questions below)
      • Handle dbus (see open questions below)
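
A rough wiring sketch of the orchestrator described above. handle_client_queries and the *NetworkEndpoint channel types follow shapes that show up later in this PR; start_discovery, launch_jsonrpc_server, and the exact signatures are illustrative assumptions, not the final API:

// Illustrative sketch only, not the merged code.
use tokio::sync::mpsc;

pub async fn run_orchestrator(trin_config: TrinConfig) {
    // Base layer discv5 table: started once by the orchestrator (placeholder call).
    let discovery = start_discovery(&trin_config).await;

    // Message channels for jsonrpc-handler -> subnetwork data requests.
    let (history_tx, history_rx) = mpsc::unbounded_channel::<HistoryNetworkEndpoint>();
    let (state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();

    // Initialize the selected subnetworks; each receives the receiving end of its
    // channel (and, eventually, a handle to the shared discovery instance).
    let history_handler = trin_history::initialize(history_rx).expect("history init failed");
    let state_handler = trin_state::initialize(state_rx).expect("state init failed");

    // Each subnetwork serves data requests coming from the jsonrpc handler.
    tokio::spawn(history_handler.handle_client_queries());
    tokio::spawn(state_handler.handle_client_queries());

    // One jsonrpc server/handler for the whole client; it proxies endpoint
    // requests to the appropriate subnetwork over the channels above.
    launch_jsonrpc_server(trin_config, discovery, history_tx, state_tx).await;
}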

trin-core: Home for any logic that is shared amongst the sub-networks.

Subnetworks (trin-history, trin-state, trin-gossip)

  • Responsibilities upon initialization:
    • Build their own overlay discv5 table from base discv5
    • Start a handler for serving network data requests from jsonrpc handler

Command line API.

  • cargo run => starts base trin orchestrator and all available networks by default
  • cargo run -- --networks history => starts base trin orchestrator with just history subnetwork
  • cargo run -p trin-history => starts only the history package (i.e. this should be minimally useful: it won't have a jsonrpc server, but it will connect to the history network) (side note: I'm not really sure why anyone would want to run just this binary, so is this even a use case we want to support?)

Poorly drawn picture: [diagram: trin-architecture]

Open questions.

  • How are we handling databases? Does each subnetwork have its own? Is there one shared database? What are the tradeoffs involved with disk i/o?
  • Do we want to upgrade to a dbus-style communication mechanism?

@njgheorghita njgheorghita force-pushed the trin-client branch 2 times, most recently from e0a2dd9 to 72069a9 on September 2, 2021 15:51
@njgheorghita njgheorghita marked this pull request as ready for review September 2, 2021 15:55
@njgheorghita njgheorghita force-pushed the trin-client branch 3 times, most recently from 2107fc8 to 3df43af on September 2, 2021 20:17

@carver carver left a comment


Can you check how much more work it would be to get discovery to only start up once?

Comment on lines +258 to +265
"test_historyNetwork" => PortalEndpoint {
kind: PortalEndpointKind::DummyHistoryNetworkData,
resp: resp_tx,
},
Collaborator

Is this just a temporary thing to test that we can successfully dispatch to both subnets?

If so, maybe add a comment saying something like that, with a TODO to remove it when we have tests in CI that confirm it another way.

}

impl HistoryRequestHandler {
pub async fn process_network_requests(mut self) {
Collaborator

"network" requests makes me think peer to peer. But if I'm reading this right, it's all about serving the json-rpc requests, yeah? Another option is something like:

Suggested change
pub async fn process_network_requests(mut self) {
pub async fn handle_client_queries(mut self) {

Collaborator Author

In my head it was something along the lines of "process these requests (from the jsonrpc handler) for data from your (sub)network". But, I like your suggestion and updated it. The fewer times we use "network" the better.

Comment on lines 132 to 133
self.history_tx.as_ref().unwrap().send(message).unwrap();
let response = match resp_rx.recv().await.unwrap() {
Collaborator

I'm not really clear. When might this panic in any of these 3 unwrap()s? If it's remotely possible, then we need some handling.

(and the same below, of course)

Collaborator Author

Yup, for sure there's a lot of error handling that I just kinda glossed over so far. I'll go back and implement some handling
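
For reference, a minimal sketch of what that handling could look like for this send/recv pair, assuming the surrounding function is changed to return a Result<_, String> so ? can be used (error wording and types here are assumptions, not the eventual implementation):

// Sketch only: surface channel failures instead of unwrap()-panicking.
let history_tx = self
    .history_tx
    .as_ref()
    .ok_or_else(|| "history network is not enabled".to_string())?;
history_tx
    .send(message)
    .map_err(|_| "failed to reach the history network".to_string())?;
let response = resp_rx
    .recv()
    .await
    .ok_or_else(|| "history network closed the response channel".to_string())?;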

@@ -51,13 +62,7 @@ pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
tokio::spawn(async move {
let (mut p2p, events) = PortalnetProtocol::new(portalnet_config).await.unwrap();
Collaborator

Inside PortalnetProtocol it's launching discv5 right now. So currently, it would be launching discovery 3 different times. The direction I would head is to start discovery with trin-core, and pass a handle into both trin-state and trin-history so that neither of them try to start up their own discovery instance.

Collaborator Author
@njgheorghita njgheorghita Sep 2, 2021

> Can you check how much more work it would be to get discovery to only start up once?

Hmm, my understanding might be wrong here? But this is in the "main()" function, which will only be called if somebody chooses to run this binary by itself (cargo run -p trin-state). In which case, wouldn't it want to boot up its own base discv5? I guess there's some more thinking to be done about the "solo binary" story: whether we want to support it at all, and being more explicit about the functionality it would support.

Collaborator Author

When booting trin via cargo run, inside trin/src/lib.rs we only call initialize() for each subnet, not their main().

Collaborator

Ah yes, I had missed that. But... then aren't there other parts of the PortalnetProtocol class that should be run that aren't being run? Like creating PortalnetEvents and running tokio::spawn(events.process_discv5_requests());. That's the code that reacts to the inbound talkreq's, and serves the overlay protocol to peers.

Collaborator Author

Yeah, for sure those are needed - but to keep things simple, I just ignored it for this pr 😅 - tracking it here

@njgheorghita njgheorghita force-pushed the trin-client branch 2 times, most recently from 51e68ac to cb4cbaa on September 2, 2021 21:45
@@ -97,6 +123,54 @@ impl JsonRpcHandler {
.collect();
let _ = cmd.resp.send(Ok(Value::Array(routing_table_info)));
}
DummyHistoryNetworkData => {
Collaborator Author

@carver Any thoughts on this? Imo it's gross to read - but it does handle all the error cases correctly. It wasn't obvious to me how to break it out into a simpler function - but I'm happy to spend some more time cleaning it up.

Collaborator

Yeah, at the very least, I'd probably put all of this in its own method. I think it will be more obvious how to generalize as we add more requests that go to the same subnet. (But it does seem like passing in just a couple of variables could mean being able to put most of this into a single shared method like proxy_query_to_subnet(subnet_tx, msg_kind) -> response.)
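
A rough sketch of such a shared method; the name comes from the suggestion above, while the parameter types and the assumption that the subnet replies with a Result<Value, String> over the response channel are illustrative guesses:

use serde_json::Value;
use tokio::sync::mpsc;

// Hypothetical helper: send one query to a subnet over its channel and await the reply.
async fn proxy_query_to_subnet(
    subnet_tx: &mpsc::UnboundedSender<PortalEndpoint>,
    msg_kind: PortalEndpointKind,
) -> Result<Value, String> {
    let (resp_tx, mut resp_rx) = mpsc::unbounded_channel();
    let message = PortalEndpoint {
        kind: msg_kind,
        resp: resp_tx,
    };
    subnet_tx
        .send(message)
        .map_err(|_| "failed to reach subnet".to_string())?;
    match resp_rx.recv().await {
        Some(response) => response,
        None => Err("subnet closed the response channel".to_string()),
    }
}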

src/lib.rs Outdated
use trin_history::HistoryRequestHandler;
use trin_state::StateRequestHandler;

pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
Collaborator

If this is the top-level orchestrator, would it be better to move this into main.rs and remove this lib.rs file? Based on the motivation, nothing else would need to call the main function of the orchestrator.

Member

I guess the trin root crate is considered a binary right now, so it makes sense to move this into main.rs, but I think we can keep the lib.rs file in case we want to extract some of the code outside of the main file.

Collaborator Author

At the moment, I'd prefer removing lib.rs entirely, under the assumption that if/when it becomes obvious that we want to extract some code from main.rs, we can always create it later.

@jacobkaufmann
Collaborator

As far as the database stuff goes, my assumption is that we would require multiple database instances because we would need to manage the storage capacity for each subnetwork individually. For example, a user may wish to allocate X space to state and Y space to history. I'm not sure this is possible with a single database instance, but I haven't looked into the RocksDB crate enough to be absolutely sure.

@ogenev ogenev left a comment

Hey, thanks for thinking about the orchestration logic. I really like the idea of using one main json-rpc handler as an entry point to different network specific handlers.

My concern about the current json-rpc implementation is that we may need to drop it soon 😄. I'm currently playing with paritytech's json-rpc library and I'm going to implement it in the testing framework branch. I guess next week we will have more clarity on whether we want to switch to this library or not. (It looks promising so far: better error handling, logging, and of course less boilerplate code.)

I guess we should also think about how we are going to orchestrate network-specific discv5/overlay protocol instances, talkreq handlers, and DBs. This probably also includes updating the CLI parser to handle args for more than one network.

src/lib.rs Outdated
let mut state_handler: Option<StateRequestHandler> = None;
let mut history_handler: Option<HistoryRequestHandler> = None;

if trin_config.networks.iter().any(|val| val == "history") {
Member

I think we can combine this and the following iteration on trin_config.networks with something like:

for network in trin_config.networks.iter() {
        match network {....

Collaborator Author

I did like this idea, but in the end I prefer something more along the lines of what @jacobkaufmann suggested here, since it gets rid of initializing empty mutable variables. I wasn't able to devise a way to combine both of your refactors (i.e. a more explicit iteration and no dead variable instantiation), but if you see something I missed, lmk.

#[structopt(
long = "networks",
help = "Comma-separated list of which portal networks to activate",
default_value = "history,state",
Member

I would set this as a constant and use default_value(DEFAULT_NETWORKS)
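
For example, something along these lines (a sketch only: the field type and delimiter handling are assumptions about how the networks arg is parsed, and the diff further down in this conversation ends up using the name DEFAULT_SUBNETWORKS):

// Sketch of the suggested pattern; field shape and use_delimiter are assumptions.
const DEFAULT_NETWORKS: &str = "history,state";

#[structopt(
    long = "networks",
    help = "Comma-separated list of which portal networks to activate",
    default_value(DEFAULT_NETWORKS),
    use_delimiter(true)
)]
networks: Vec<String>,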

Collaborator

I'm not sure whether it would be better to instead use activation flags for each desired network. Of course this creates more CLI arguments, and committing to this pattern would require a new CLI argument for each subnetwork. However, it aligns with the mental model I have for a portal client, where the user may toggle each subnetwork on and off, and it simplifies the code that interprets the values for these arguments.

Collaborator

> it would be better to instead use activation flags for each desired network

I think we can address this in a separate issue if it seems important to you.

Collaborator Author

Yeah, this is an interesting question. Imo it's somewhat out of scope for this pr, since it's a relatively simple change that can be implemented as various use cases arise and justify it. I created an issue here to track this conversation - and just going with DEFAULT_NETWORKS for now.

src/lib.rs Outdated
portalnet_config.bootnode_enrs
);

let (mut p2p, events) = PortalnetProtocol::new(portalnet_config).await.unwrap();
Member

This seems to run only one instance of the overlay network protocol and one talkreq handler. I think we should run one instance per network?

Collaborator

Yeah, this is where a bigger refactor needs to happen, where PortalnetProtocol becomes split into StatePortalProtocol::new(discovery, portalnet_config) and HistoryPortalProtocol::new(discovery, portalnet_config), so that discovery is run once and directly used by each subnet.
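
A sketch of that direction (the constructor names come from the comment above; Discovery and process_overlay_requests are placeholder names for whatever ends up wrapping the shared discv5 instance and its talkreq handling):

// Hypothetical shape, not the merged code: discv5 is started once by the
// orchestrator, and a shared handle is passed into each subnet protocol.
use std::sync::Arc;

let discovery = Arc::new(Discovery::new(portalnet_config.clone()).await.unwrap());

let state_protocol = StatePortalProtocol::new(Arc::clone(&discovery), portalnet_config.clone())
    .await
    .unwrap();
let history_protocol = HistoryPortalProtocol::new(Arc::clone(&discovery), portalnet_config)
    .await
    .unwrap();

// Each subnet registers its own overlay table / talkreq handling on top of the
// one shared discovery instance instead of launching discv5 itself.
tokio::spawn(state_protocol.process_overlay_requests());
tokio::spawn(history_protocol.process_overlay_requests());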

Collaborator Author
@njgheorghita njgheorghita Sep 3, 2021

Yup, this PR is not by any means intended to be conclusive; it's just laying the basic blueprint for the orchestrator. So I'm only running the base discv5 protocol here. There are still a lot of open questions and implementation work to be done regarding what the orchestrator is responsible for and how it initializes each subnet. With that in mind, I tried to keep this PR fairly simple, and created an issue to track the conversation and implementation.

src/lib.rs Outdated
pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Launching trin");

let trin_config = TrinConfig::new();
Member

My understanding is that we want to run the different networks on separate UDP ports. If that is the case, we need to extend the CLI parser (StructOpt) to support command line args for more than one network.

Collaborator

> My understanding is that we want to run the different networks on separate UDP ports.

This doesn't sound right to me. The subnets would all be running off of the same discv5 layer, which is what has opened the UDP port. Then the overlay messages are all happening over talkreq over the shared discv5.

Collaborator Author

Tracking this discussion here

src/lib.rs Outdated
Comment on lines 57 to 64
if trin_config.networks.iter().any(|val| val == "state") {
let (raw_state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
state_tx = Some(raw_state_tx);
state_handler = match trin_state::initialize(state_rx) {
Ok(val) => Some(val),
Err(msg) => panic!("Error while initializing state network: {:?}", msg),
}
}
Collaborator

Perhaps the following is a cleaner way than declaring the mut variables above?

Suggested change
if trin_config.networks.iter().any(|val| val == "state") {
let (raw_state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
state_tx = Some(raw_state_tx);
state_handler = match trin_state::initialize(state_rx) {
Ok(val) => Some(val),
Err(msg) => panic!("Error while initializing state network: {:?}", msg),
}
}
let (state_tx, state_handler) = if trin_config.networks.iter().any(|val| val == "state") {
    let (state_tx, state_rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
    let state_handler = match trin_state::initialize(state_rx) {
        Ok(val) => val,
        Err(msg) => panic!("Error while initializing state network: {:?}", msg),
    };
    (Some(state_tx), Some(state_handler))
} else {
    (None, None)
};

Similarly for history.

Collaborator

I think I liked this one even better:

for network in trin_config.networks.iter() {
        match network {....

Collaborator Author

It's very possible I'm missing how to accomplish this, but I wasn't able to find a way to use the iter() strategy @ogenev suggested without instantiating None mut variables in advance. Imo those mut variables are grosser than the clunky iteration handling - but I have no strong feelings towards either approach.
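
For reference, a sketch of the iteration shape being discussed, which is what forces the pre-declared mut Option variables (not code from this PR; HistoryNetworkEndpoint is assumed by symmetry with StateNetworkEndpoint):

// Illustrative only: the match-over-networks shape requires declaring the
// Option holders up front, which is the tradeoff discussed above.
let mut state_tx: Option<mpsc::UnboundedSender<StateNetworkEndpoint>> = None;
let mut state_handler: Option<StateRequestHandler> = None;
let mut history_tx: Option<mpsc::UnboundedSender<HistoryNetworkEndpoint>> = None;
let mut history_handler: Option<HistoryRequestHandler> = None;

for network in trin_config.networks.iter() {
    match network.as_str() {
        "state" => {
            let (tx, rx) = mpsc::unbounded_channel::<StateNetworkEndpoint>();
            state_tx = Some(tx);
            state_handler = Some(trin_state::initialize(rx).expect("state init failed"));
        }
        "history" => {
            let (tx, rx) = mpsc::unbounded_channel::<HistoryNetworkEndpoint>();
            history_tx = Some(tx);
            history_handler = Some(trin_history::initialize(rx).expect("history init failed"));
        }
        unknown => panic!("Unknown network: {}", unknown),
    }
}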

@mrferris
Collaborator

mrferris commented Sep 3, 2021

On the topic of storage, my thoughts are that we have separate database instances per network, each abstracted by a separate PortalStorage instance (#69) per network.

My main reasoning is simplicity. If we have every network's data combined in one DB instance, we'd need to store metadata on each key to denote which network it belongs to. The PortalStorage struct would need to get more complex in order to keep track of everything for every network at the same time. Or we could have multiple PortalStorage instances interacting with a single database instance, in which case we'd need to manage locking, etc. Debugging storage functionality also gets more complex with everything in one RocksDB.

Is there a benefit to combining the databases that I might be missing? I'm only seeing the extra complexity. That said, I have yet to look into the effects of multiple DBs vs 1 DB on disk i/o, so that might be what I'm missing.


carver commented Sep 3, 2021

> my assumption is that we would require multiple database instances because we would need to manage the storage capacity for each subnetwork individually.

I think most users won't want to manage the storage space of each subnetwork individually (they probably won't even know the internal detail that there are multiple subnetworks). Think of a typical bittorrent user, and how little they probably know about the details of the protocol.

Even if we decide to implement this per-network limit, I'm sure we could find other ways than running multiple DBs. So I think we should at least remain open to the idea of piping everything into a single DB (which I expect to have performance benefits due to I/O serialization).


carver commented Sep 3, 2021

> My concern about the current json-rpc implementation is that we may need to drop it soon 😄. I'm currently playing with paritytech's json-rpc library and I'm going to implement it in the testing framework branch. I guess next week we will have more clarity on whether we want to switch to this library or not.

In theory, that work of swapping json-rpc library should be able to happen after this refactor, no? I would really like to avoid pausing work on this PR.


carver commented Sep 3, 2021

> Is there a benefit to combining the databases that I might be missing? I'm only seeing the extra complexity. That said, I have yet to look into the effects of multiple DBs vs 1 DB on disk i/o, so that might be what I'm missing.

Yup, that's exactly the issue. Some client (Erigon?) had to do a big rewrite to a single DB in order to avoid causing big I/O delays by trying to hammer the disk a bunch in parallel from different threads.

FWIW, I think we're going to have to deal with locking logic anyway, even within a single subnet.

@njgheorghita
Collaborator Author

Moved the ongoing db discussion here since imo it's a bit out of scope for this PR.


njgheorghita commented Sep 3, 2021

Thanks everyone for the feedback! This PR was really intended to remain simple and just lay out the basic blueprint for the orchestrator. I did a lot of hand-waving across a bunch of important questions & implementation details, but the goal of this PR is not to solve them all right now. These questions still deserve some discussion imo, and rather than housing them all inside this PR I've created tracking issues...

Imo all of these issues are implementation details and don't inherently conflict with the orchestrator blueprint. It establishes a common ground & jsonrpc server between the subnets, and is flexible enough to accommodate whatever we decide for the various open questions. But if you feel differently, please say why.


ogenev commented Sep 3, 2021

> In theory, that work of swapping json-rpc library should be able to happen after this refactor, no? I would really like to avoid pausing work on this PR.

Sure, swapping json-rpc shouldn't be a blocker for this PR.

@@ -9,6 +9,7 @@ use structopt::StructOpt;
const DEFAULT_WEB3_IPC_PATH: &str = "/tmp/trin-jsonrpc.ipc";
const DEFAULT_WEB3_HTTP_PORT: &str = "8545";
const DEFAULT_DISCOVERY_PORT: &str = "9000";
const DEFAULT_SUBNETWORKS: &str = "history,gossip";
Collaborator

Suggested change
const DEFAULT_SUBNETWORKS: &str = "history,gossip";
const DEFAULT_SUBNETWORKS: &str = "history,state";

Comment on lines +91 to +96
if let Some(handler) = state_handler {
tokio::spawn(handler.handle_client_queries());
}
if let Some(handler) = history_handler {
tokio::spawn(handler.handle_client_queries());
}
Collaborator

If you feel like pursuing it, the direction I'd take is something like building a vector of subnet handlers and looping through them to spawn. 👌🏻 if you want to merge without trying it though.
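
A sketch of that direction, assuming each handler's handle_client_queries() future is Send and returns ():

// Sketch only: collect whichever subnet handlers were initialized into one
// vector of boxed futures, then spawn them in a single loop.
use std::future::Future;
use std::pin::Pin;

let mut subnet_handlers: Vec<Pin<Box<dyn Future<Output = ()> + Send>>> = vec![];
if let Some(handler) = state_handler {
    subnet_handlers.push(Box::pin(handler.handle_client_queries()));
}
if let Some(handler) = history_handler {
    subnet_handlers.push(Box::pin(handler.handle_client_queries()));
}
for handler in subnet_handlers {
    tokio::spawn(handler);
}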

"trin-state",
"trin-history",
"trin-core"
]
Contributor

What is the motivation for making trin-core a dependency and trin-history and trin-state members of the workspace? Are trin-core, trin-history, and trin-state packages or crates, with the entire directory acting as a workspace?

Collaborator Author

Overall my goal was to allow the CLI API written out in this pr's description. But, it's possible that I didn't choose the best configuration setup here. There's no need to expose trin-core to the user, since it's all internal code. But, I wanted to expose the ability for users to run trin-history and trin-state packages w/o the common trin orchestrator logic, if that's what they want to do for whatever reason. Did you see something I missed in this current setup?

Contributor

Haha no, that makes total sense. I'm just confused by all of Cargo's terminology around crates, workspaces, and packages, but what you mentioned helps make things clearer for me.

@@ -219,6 +219,8 @@ fn handle_request(
"result": "trin 0.0.1-alpha",
})
.to_string()),
"test_historyNetwork" => dispatch_portal_request(obj, portal_tx),
"test_stateNetwork" => dispatch_portal_request(obj, portal_tx),
Contributor

Is there a cleaner way to combine "test_historyNetwork", "test_stateNetwork", and the one below this, since they all call the same function?

Collaborator Author

Yeah, there is for sure. But it's not currently obvious to me what the best way to clean this up is, and I think the optimal refactor will emerge as we add more and more endpoints. I'm partial to leaving it as-is for now, and just leaving a comment that refactoring should be handled soon.
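
For reference, the most direct consolidation is probably an or-pattern on the method name; a sketch against the two arms visible in this diff (the third arm isn't shown above, so it's left out):

// Both endpoints dispatch to the same helper, so they can share a single match arm.
"test_historyNetwork" | "test_stateNetwork" => dispatch_portal_request(obj, portal_tx),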

@njgheorghita njgheorghita merged commit 2a0a580 into ethereum:master Sep 7, 2021
@njgheorghita njgheorghita deleted the trin-client branch September 7, 2021 16:15