Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

REF: Israel pipeline #518

Merged
merged 7 commits into from
Feb 22, 2021
Merged

Conversation

lucasrodes
Copy link
Member

@lucasrodes lucasrodes commented Feb 19, 2021

Refactored Israel vaccination script according to pipeline approach proposed in #465.

@covid19owid @ValentinMouret @edomt

Some open questions:

  • What do you think about using df instead of input (as the latter is a reserved keyword)?
  • As we refactor, we might identify which functions (steps) could be placed in a commons file, I believe that is the intent behind file utils.pipeline

@lucasrodes lucasrodes changed the title israel pipeline REF: Israel pipeline Feb 19, 2021
@covid19owid
Copy link
Contributor

@lucasrodes Thank you !

Will update the scripts with appropriate variables instead of input as it is reserved. Also will extract common functions into utils.

@lucasrodes
Copy link
Member Author

Awesome @covid19owid! Thanks for your work! I can work on that with you.

I'd suggest doing small incremental changes in the PRs affecting few files (e.g. changes in few pipelines, few methods in utils.pipeline, etc.). I think this way the code reviewing is simpler (let's make @edomt's life easier 😄) and it is less probable that there are code collisions (i.e. duplicate work)

@thkeeper
Copy link

thkeeper commented Feb 20, 2021 via email

@ValentinMouret
Copy link
Contributor

Hey!

  • No strong opinions regarding the same. df is as bland as input, and that’s kind of the point of the function: being agnostic of what’s fed to it (as long as it has the right type). I don’t mind the collision because I cannot remember the last time I  actually used input :D
    Could be data, could be df, no strong opinions.
    I know that some linters complain about using the reserved keyword, so it might be wise to use another one.
  • You are totally right in that some steps might be reused, and that’s one of the strengths of this approach. I put those functions inside of utils.pipeline because I wasn’t sure how the scripts were run, and that has an impact on how things should be packaged. I.e., they probably should keep this simplicity that if you only want to run one script you can basically run python script.py and it works, rather than with proper entry points.

@ValentinMouret
Copy link
Contributor

I tried to make those reusable steps emerge from working with the pipelines to:

  • avoid over-engineering / having too much abstraction
  • have steps naturally emerge as they are made: «oh, we always have a translate step, then some preprocessing, then...»
  • the bliss of deleting lines in pipelines by reusing functions

"vaccinated_cum": "people_vaccinated",
"vaccinated_seconde_dose_cum": "people_fully_vaccinated"
})
def format_date(df: pd.DataFrame) -> pd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps those should be decoupled into:

def format_date(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(date=df.date.str.slice(0, 10))

def filter_date(df: pd.DataFrame) -> pd.DataFrame:
    return df[df.date < str(datetime.date.today())]

This decouples the logic (cognitive load, reusability), and we don’t mutate inputs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks


def select_distinct(df: pd.DataFrame) -> pd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s return directly without a temporary variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@lucasrodes
Copy link
Member Author

lucasrodes commented Feb 20, 2021

Thanks for your feedback, @ValentinMouret. Very much appreciated.

I don’t mind the collision because I cannot remember the last time I  actually used input :D

Same 😄

Could be data, could be df, no strong opinions. I know that some linters complain about using the reserved keyword, so it might be wise to use another one.

I think it is a good practice not to use reserved keywords, just to avoid confusion if other people were to use the code. But like you, no strong opinion either.

You are totally right in that some steps might be reused, and that’s one of the strengths of this approach. I put those functions inside of utils.pipeline because I wasn’t sure how the scripts were run, and that has an impact on how things should be packaged. I.e., they probably should keep this simplicity that if you only want to run one script you can basically run python script.py and it works, rather than with proper entry points.

Agree. I think for now it is totally fine to place the re-usable code in utils.pipeline. If in the future we think that further packaging the project makes sense, then the work of "pre-packaging" will be extremely valuable. If not, well, we did reduce code redundancy.

I tried to make those reusable steps emerge from working with the pipelines to:

avoid over-engineering / having too much abstraction
have steps naturally emerge as they are made: «oh, we always have a translate step, then some preprocessing, then...»
the bliss of deleting lines in pipelines by reusing functions

I really have liked your proposal. I might tend to go for the "having too much abstraction" way often, so please control me 😄

Copy link
Member Author

@lucasrodes lucasrodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok


def select_distinct(df: pd.DataFrame) -> pd.DataFrame:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"vaccinated_cum": "people_vaccinated",
"vaccinated_seconde_dose_cum": "people_fully_vaccinated"
})
def format_date(df: pd.DataFrame) -> pd.DataFrame:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks

@edomt
Copy link
Collaborator

edomt commented Feb 22, 2021

Thank you @lucasrodes & @ValentinMouret :)

@edomt edomt merged commit 64cb291 into owid:master Feb 22, 2021
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants