Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Further package structure suggestions #88

Closed
Doppe1g4nger opened this issue Oct 3, 2018 · 8 comments
Closed

Further package structure suggestions #88

Doppe1g4nger opened this issue Oct 3, 2018 · 8 comments

Comments

@Doppe1g4nger
Copy link
Contributor

In addition to the issue in #87 I think it may also be worth looking at the way the training data generation and handling process is structured. E.g. what are currently the top level modules metalearning, featurization, and preprocessing.

I think it was agreed that metalearning isn't a very accurate descriptor of what that submodule does. I think a better name would be something like "heuristic featurizer selection" or maybe just "featurizer selection".

A bigger sell I'd also like to put forward is that preprocessing is a bit of a misnomer for what that submodule currently does. Really all three of the above submodules are part of a preprocessing pipeline. What is currently preprocessing is more accurately a combined data cleaning and feature pruning step. It would make sense to bundle each of these under a comprehensive "preprocessing" or "data handling" module and split out the functionality of each submodule into its constituent steps. That way every top level module produces some complete product for the user and submodules then represent steps to completing that product. See the below diagram on what I think the better structure would look like:

higher_res_pipeline

@ardunn
Copy link
Contributor

ardunn commented Oct 3, 2018

I am ok with this

@ardunn
Copy link
Contributor

ardunn commented Oct 3, 2018

One disadvantage to this scheme though is that the user would not get a featurized dataset without already having the pruning done on it. In other words it might be nice to get a featurized dataframe back (containing all features) before tons of features and/or samples are removed.

For example:
User: Oh here I will just put my dataset into this matbench preprocessor. Let's see what I get back.
Preprocessor: Removes 100 samples and eliminates a bunch of features the user might be expecting
User: Wait where did my 100 samples go? And why are there only 10 features returned?

Idk though, what do @albalu @utf @Qi-max think?

@albalu
Copy link
Contributor

albalu commented Oct 3, 2018

I am fine with the proposed pipeline (nice flowchart BTW). I think it's ok if we automatically do some preprocessing and logging it. That could be the default behavior. If someone wants to change something in their pipeline, they can define a new one (as simple as copy and pasting 10 lines of codes) and then change the args within each step or even taking out some steps (e.g. feature reduction). In other words, make their custom pipeline while taking advantage of convenience functions available in matbench. Similar philosophy is implemented in atomate where the user don't have much control over every step of the calculations, only the main stuff, but if they want more control they can define the same workflow but just change a small portion (as I have).

@Doppe1g4nger
Copy link
Contributor Author

I see your point, I think that's something that can be handled in the preprocessing module though. It wouldn't just successively apply the steps, but would also have some control logic on which steps are applied/how to apply them. The __init__ could look something like:

def __init__(input_data, prune_features=True, remove_malformed_samples=True, ...):
    ...
    ...

then the user can set these flags here or in the top level interface to specify how they want to have the data processed, with the default being to apply all steps in the preprocessing pipeline.

@utf
Copy link
Member

utf commented Oct 3, 2018

Agree that the flow chart is nice. As long as the packages are modular then this shouldn't be an issue for users wanting to customize their pipelines.

I guess we then see the package level classes of Preprocessing, ModelSelection and Analysis as really being convenience classes to bring together the rest of the functionality defined in that package.

In the same way that the top level class will be a convenience function for all the packages put together.

@ardunn
Copy link
Contributor

ardunn commented Oct 3, 2018

Still would be good to have solid sub-packages for Feature Selection, Featurization, Cleaning, etc. We should probably refactor Preprocessing class --> DataCleaner or something

@ardunn
Copy link
Contributor

ardunn commented Oct 19, 2018

@Doppe1g4nger I think we will hold off on moving Featurization, FeaturizerSelection, and Preprocessing into the same subpackage for now. Although I will raise a separate issue/todo for splitting [current] preprocessing into two classes as you suggested, DataCleaner and FeatureReduction or something

@ardunn ardunn closed this as completed Oct 19, 2018
@ardunn
Copy link
Contributor

ardunn commented Oct 19, 2018

We can always reopen later if moving everything into one subpackage for preprocessing is advantageous

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants