Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Better naming of variables (aggregates v/s jobs v/s aggregate_stores) #370

Open
kidrahahjo opened this issue Nov 26, 2020 · 13 comments
Open

Comments

@kidrahahjo
Copy link
Collaborator

Feature description

The next branch has many ambiguous variable names, and anyone could get confused with those names.
A specific example would be the use of jobs v/s aggregate in the method parameters

In context with the current code structure, for every operation, an aggregate store is created and to get all the created aggregates (we only support default aggregation for now), we iterate over that aggregate store.
For example, in run, for every aggregate, an instance of _JobOperation is created by initially iterating over the aggregate store for every operation.

for group in groups:
    for aggregate in aggregate_store:
        group._create_run_job_operations(aggregate, ...)

Proposed solution

A better and consistent naming is required for all private methods for now. I also vote on deprecating the jobs argument and replacing it with aggregates or while keeping the jobs argument, we could also add an aggregates argument but it may not be internally consistent as everything is an aggregate now.

@kidrahahjo kidrahahjo changed the title Better naming of variables Better naming of variables (aggregates v/s jobs v/s aggregate_stores) Nov 26, 2020
@bdice
Copy link
Member

bdice commented Dec 8, 2020

Another example: naming is inconsistent between _AggregatesStore (plural Aggregates) and _DefaultAggregateStore (singular Aggregate).

@kidrahahjo
Copy link
Collaborator Author

@bdice do you think we should rename the private _JobOperation class to _JobsOperation (or `_AggregateOperation)?

@bdice
Copy link
Member

bdice commented Dec 8, 2020

@kidrahahjo Using aggregates in place of jobs would be API-breaking for the "most stable" public methods in signac-flow (it changes the name of many public keyword arguments). It would also point towards breaking the CLI job selection flags like -j / --job-id. I have some suggestions below that would help to ensure consistency across the code base with minimal API breakage for users relying on our "most stable" public methods. Tagging @glotzerlab/signac-committers to weigh in on the rules proposed below.

tl;dr @kidrahahjo suggested deprecating the jobs argument and replacing it with aggregates OR keeping the jobs argument and adding an aggregates argument. By the rules I propose below, we would keep using jobs (very few public API changes) and expand the docs to explain the definition.

Below, names like job1 are instances of signac.contrib.job.Job.

job1 = project.open_job(id='abc')
job2 = project.open_job(id='19a')
...

Names like agg1 are tuples of jobs.

agg1 = (job1, job2)
agg2 = (job3, job4, job5)
agg3 = (job6,)
...

Case A: One job defined like job1.

  • Methods accepting "a single job" should use the argument job.
  • Example: a FlowOperation that only takes one job.
  • Example: the deprecated classes/methods that don't support aggregates.

Case B: One aggregate defined like agg1.

  • Methods accepting "a single aggregate of one or more jobs" should use the argument *jobs (variadic positional arguments that are instances of signac.contrib.job.Job).
  • Example: a FlowOperation that takes a variable number of jobs.
  • Example: condition functions that work for aggregates.

Case C: One or more aggregates defined as [agg1, agg2, ..., aggN].

  • Methods accepting "multiple aggregates of one or more jobs" should use the argument aggregates.
  • This implies that aggregates is always a sequence of tuples.
  • I didn't see this case used in project.py at a quick glance. The places where I saw the word aggregates appeared to be Case D.

Case D: One or more jobs or aggregates defined as [agg1, job2, ..., aggN].

  • Methods accepting "multiple aggregates of jobs" (like FlowProject.run) should use the argument jobs (no star).
  • I could see the argument that this should be aggregates but it would break many public APIs. We specifically avoided this by using *jobs in other places, to ensure backwards compatibility. I don't think the change of a core public API name for the new feature is warranted.
  • This case seems common in project.py, but it's named jobs in some places and aggregates in others.

Did I cover all the cases? 😅 Is this a reasonable set of rules?

@bdice
Copy link
Member

bdice commented Dec 8, 2020

@bdice do you think we should rename the private _JobOperation class to _JobsOperation (or `_AggregateOperation)?

@kidrahahjo You suggested a counterexample to my proposal before I even posted it, so maybe we need to just talk this through as a team 😂 . Case B would be applicable here but the variadic positional args make no sense for a class. I am fine with _JobOperation or _AggregateOperation but then this raises the question of how we deal with groups! Why not _AggregateGroup? Perhaps we should adopt an entirely new name that is unrelated like _ExecutionUnit... multiple execution units could make up a submission unit, if bundled? ...I have no idea. We already have enough confusion with "signac jobs" and "scheduler jobs" / "cluster jobs" (which mean the same thing).

@vyasr
Copy link
Contributor

vyasr commented Dec 8, 2020

@bdice do you think we should rename the private _JobOperation class to _JobsOperation (or `_AggregateOperation)?

@kidrahahjo You suggested a counterexample to my proposal before I even posted it, so maybe we need to just talk this through as a team 😂 . Case B would be applicable here but the variadic positional args make no sense for a class. I am fine with _JobOperation or _AggregateOperation but then this raises the question of how we deal with groups! Why not _AggregateGroup? Perhaps we should adopt an entirely new name that is unrelated like _ExecutionUnit... multiple execution units could make up a submission unit, if bundled? ...I have no idea. We already have enough confusion with "signac jobs" and "scheduler jobs" / "cluster jobs" (which mean the same thing).

I stated my position on these naming issues somewhere in Slack recently, I think you're intending to paraphrase my proposal with the _ExecutionUnit idea. I definitely want to separate objects in that fashion, rather than thinking about them in terms of signac jobs/operations/groups etc. The basic mapping that I would propose would be something like AggregateOperation->ExecutionUnit and AggregateGroup->SubmissionUnit. Obviously those names are not final, but they describe semantically what I want to achieve. I don't think the renaming benefits us a whole lot with the current structure of the code though, refactoring and streamlining so that these objects are actually used consistently would go a long way in that direction.

@bdice
Copy link
Member

bdice commented Dec 8, 2020

@vyasr Could you link to that discussion and copy the relevant parts into this issue for posterity? (Also sorry if I took credit for your idea, I don’t remember the conversation you mentioned.)

@vyasr
Copy link
Contributor

vyasr commented Dec 8, 2020

No worries, between our various projects I frequently hear my ideas repeated back to me a month or two later (like at dev meeting yesterday), I think you often internalize parts of the ideas and forget the conversations haha. In this case I think you proposed this name and I provided reasoning for how I think many of our objects should be renamed in line with the idea of execution/submission units. I just decided it's probably more helpful best to point out when that happens in the event that the original discussions help jog people's memories and push things along faster. Found the Slack conversation (https://signac.slack.com/archives/CPCMN12EL/p1599749818013000?thread_ts=1599730647.002500&cid=CPCMN12EL).

@vyasr
Copy link
Contributor

vyasr commented Dec 8, 2020

And here's the full thread pasted in from Slack:

Hardik 3 months ago
Hello everyone
I was working on aggregation and at one point it felt like the class _JobOperation may create confusion.
Since it won't just hold a single job in it.
Should the class be renamed to _JobsOperation instead?
25 replies

Bradley Dice 3 months ago
I believe this class also handles groups, not just single operations. Is that correct?

Mike 3 months ago
I'm less worried about the name when it isn't in the public API, will it ever hold only a single job? I don't know the full context but if it will always have more than one job, would _AggregateOperation be better? If _JobOperation makes the most sense then I think it is okay, I don't like _JobsOperation but it may be the most correct
(adding my thoughts to the thread instead of the all chat) (edited)

Hardik 3 months ago
Yes, this class also handles groups

Vyas Ramasubramani 3 months ago
i’m pretty sure that in the original PR for aggregation we proposed this exact rename

Bradley Dice 3 months ago
So _AggregateGroup would be the most general term but I think that sounds a little odd.

Vyas Ramasubramani 3 months ago
right, now that groups exist (which wasn’t the case in the original version) we’d want to account for that too

Bradley Dice 3 months ago
Otherwise _JobsOperations which is also strange-sounding to be.

Hardik 3 months ago
In the #52 (the first PR for aggregation), it was named JobsOperation

Hardik 3 months ago
what about AggregateOperations ?

Bradley Dice 3 months ago
I’m wondering if an entirely different name might be appropriate. Like _ExecutionTask (not my favorite but just to illustrate the idea of being “a piece of an execution/submission”)

Vyas Ramasubramani 3 months ago
i wouldn’t worry too much about this naming for now. to an extent i’d like to rename all these internal classes to reflect their role in the execution/submission process rather than what they contain
👍
1

Bradley Dice 3 months ago
We have many levels of interacting object hierarchies...

Vyas Ramasubramani 3 months ago
@bradley Dice yes exactly

Vyas Ramasubramani 3 months ago
i would like the concepts of “an atomic runnable object” and “an atomic submittable object” to be more clearly delineated and evident in naming
👍
2

Bradley Dice 3 months ago
I would vote to do nothing in the short term. This refactor can be done later with no harm, and I think this question would be better resolved after aggregation is complete and we move towards 1.0

Vyas Ramasubramani 3 months ago
with the generalizations of groups and aggregates, an AggregateOperation is the fundamental execution unit

Vyas Ramasubramani 3 months ago
whereas an AggregateGroup is the fundamental submission unit

Vyas Ramasubramani 3 months ago
yes, i think we postpone on this until the feature is complete

Bradley Dice 3 months ago
Hmm, good point about differentiating submission/execution @Vyas Ramasubramani. I hadn’t considered that.

Hardik 3 months ago
Okay! I'll push the changes I've made and request reviews.

Vyas Ramasubramani 3 months ago
i think the big push that needs to be made prior to 1.0 is clarifying our concepts, and i think that’s a critical one

Vyas Ramasubramani 3 months ago
clearly separating submission from running is also necessity if we ever want to outsource that logic (i still think that would be a nice long term goal, although obviously it’s not a priority right now)

Hardik 3 months ago
Yes you're right.
But since we've differentiated the login of submission and execution (for JobOperation by introducing SubmissionJobOperation which inherits from the former class. I'm not sure if we should truly separate the logic here.

Vyas Ramasubramani 3 months ago
the _SubmissionJobOperation isn’t really a submission unit, though; it’s just what gets constructed during the submission-time check for eligibility, right? also even that name is problematic since it interacts with groups, not just operations, right?
👍
1

Bradley Dice 3 months ago
@Vyas Ramasubramani I agree with your last few messages.

@bdice
Copy link
Member

bdice commented Mar 18, 2021

I think the action items here are:

  1. Make a single pass through project.py to ensure consistency in variable names after Enable Aggregate Operation #464 is merged.
  2. At a dev meeting, decide on a new name for _JobOperation and _SubmissionJobOperation and decide on what parts of that API should be public and how they should be documented. The name must be changed because it handles both groups and aggregates, breaking its semantic connection to single jobs/operations. Some name proposals include ExecutionUnit / SubmissionUnit, ExecutionTask / SubmissionTask. @atravitz and I agree that we should discuss this in a dev meeting and have a concrete answer so that this doesn't become a long Slack thread with no clear conclusion.

@vyasr
Copy link
Contributor

vyasr commented Mar 21, 2021

I agree, let's not discuss any further on this thread.

My two cents: I'd actually vote to table this discussion entirely for the moment. At this point we're stuck with enough of these names for a while that I'd recommend that we focus on making more substantive changes to clean up the code base before we try to decide on names. This feels like more of a 1.0 (or at least closer to 1.0) decision to me, especially because I'm not even sure that we know what should be in our public API yet. Once aggregation is done I think we'll need to go through a few minor releases with deprecation cycles so that we can stabilize (and fix the numerous inevitable bugs), then we can revisit this.

@bdice
Copy link
Member

bdice commented Mar 21, 2021

@vyasr The only reason I think we should make a decision (rather than tabling the discussion) is that our choice here has significant effects on how we write docs, error/help messages, and explain the package functionality. It completes definitions like “A bundle is composed of SubmissionTasks” or “the provided job and operation filters determine which ExecutionUnits will be run.”

@vyasr
Copy link
Contributor

vyasr commented Mar 21, 2021

I'm sympathetic to that, but my concern is that in the current state the names that I would want will only be semi-accurate, and worse, may have to change again if we don't have the right abstractions in place to begin with. My original proposal was to recognize that an AggregateOperation would be the unit of execution and an AggregateGroup would be the unit of submission, and to name things accordingly. In that world, an AggregateGroup would be composed of AggregateOperation instances, and a Bundle (the total submission) would be composed of AggregateGroup instances. That's mostly true, but there are all sorts of reasons why this mapping isn't quite right. Some examples of the reasons that I think terminology weird edge cases we currently have:

  1. The way that groups translate to run vs submission operations.
  2. The way that we distinguish what a user sees with the true representation, which is not just an implementation detail but can actually be important. For example, some of the kinds of things you might want to write in the documentation are affected by the relationship between the operation decorator, the FlowOperation, and the JobOperation, and it would be fantastic if that was less complex.
  3. Conceptually, if you introduce things this way you'd expect a lot of these components to be transparently composable, but they're not in the current API.

That's just a handful of examples; looking at some of our other open issues will point to others. I'm not saying that they're necessarily dealbreakers, but in my view they point to lots of code smells and design flaws that we'd want to rework when refactoring, and they are likely to lead to meaningful changes in the API (at least the recommended API, even if we continue to indefinitely support older APIs). At the very least, I think we need to consider this before trying to move forward.

In the interest of avoiding exactly what we said we wouldn't do (discuss forever on this thread) I've asked our current community manager @mikemhenry to add this issue to the topic list for our next dev meeting. I'll stop posting responses after this one.

@csadorf
Copy link
Contributor

csadorf commented Mar 25, 2021

I think we need some nice overview diagram that demonstrates how all of these components work together. Otherwise the discussion is very different to follow.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants