Feature preprocessors, Loss strategies #86

ravinkohli · 2021-02-05T15:41:46Z

Added the following components from old autopytorch code:

Feature preprocessors like FastICA, KernelPCA, RandomKitchenSinks, Nystroem, PolynomialFeatures, PowerTransformer, TruncatedSVD
Weighted loss strategies for binary and multiclass classification
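
For context, a weighted loss strategy typically rescales each class's contribution to the loss so that rare classes are not drowned out by frequent ones. The sketch below shows one common choice, inverse class frequency (the same formula scikit-learn uses for "balanced" class weights). The function name is hypothetical, and this is not necessarily the strategy this PR implements.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Hypothetical helper for illustration only; rare classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Three samples of class 0 and one of class 1: class 1 gets the larger weight.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
print(weights)
```

Weights of this kind are usually passed to the loss function as per-class multipliers. The same helper covers the binary and multiclass cases, since it only depends on the label counts.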

autoPyTorch/datasets/base_dataset.py

...reprocessing/tabular_preprocessing/feature_preprocessing/base_feature_preprocessor_choice.py

franchuterivera · 2021-02-08T10:26:29Z

+        # add only child hyperparameters of early_preprocessor choices
+        for name in preprocessor.choices:
+            updates = self._get_search_space_updates(prefix=name)
+            config_space = available_[name].get_hyperparameter_search_space(dataset_properties,  # type:ignore


Why the type: ignore here?

Since get_hyperparameter_search_space of the base component takes only dataset_properties as a parameter, the call gives a "too many arguments for get_hyperparameter_search_space" error. If I add **kwargs to the base component, it reports an invalid signature for every class inheriting from the base component. Searching on Google, I found this, where they suggest using # type: ignore.
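
A minimal sketch of the situation described in this reply, using hypothetical stand-in classes (BaseComponent, ChildPreprocessor, and the n_components parameter are illustrative assumptions, not autoPyTorch's actual code). The base method declares only dataset_properties, so a static checker such as mypy is expected to reject a call made through the base type that forwards extra arguments, even though the concrete class accepts them at runtime:

```python
from typing import Any, Dict, Optional, Tuple, Type


class BaseComponent:
    """Hypothetical base component: declares only dataset_properties."""

    @staticmethod
    def get_hyperparameter_search_space(
        dataset_properties: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        return {}


class ChildPreprocessor(BaseComponent):
    """Hypothetical child: adds one extra argument with a default value.

    A defaulted extra parameter keeps the override compatible with the base
    signature, so the class definition itself is expected to type-check.
    """

    @staticmethod
    def get_hyperparameter_search_space(
        dataset_properties: Optional[Dict[str, Any]] = None,
        n_components: Tuple[int, int] = (1, 10),
    ) -> Dict[str, Any]:
        return {"n_components": n_components}


# The registry is typed against the base class, as a choice object might do.
available: Dict[str, Type[BaseComponent]] = {"child": ChildPreprocessor}

# Hypothetical search space update to forward to the child component.
updates: Dict[str, Any] = {"n_components": (2, 5)}
props: Dict[str, Any] = {"task_type": "tabular_classification"}

# The checker only sees BaseComponent's one-parameter signature here, so the
# forwarded keyword arguments are flagged statically; the call itself works
# at runtime because the concrete class accepts them.
config_space = available["child"].get_hyperparameter_search_space(props, **updates)  # type: ignore
print(config_space)
```

The ignore comment silences the checker on that one line only. An alternative would be to widen every subclass signature together with the base, which is the larger change the reply is trying to avoid.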

…ocessors, add test for pipeline include

* ADD Weighted loss
* Now?
* Fix tests, flake, mypy
* Fix tests
* Fix mypy
* change back sklearn requirement
* Assert for fast ica sklearn bug
* Forgot to add skip
* Fix tests, changed num only data to float
* removed fast ica
* change num only dataset
* Increased number of features in num only
* Increase timeout for pytest
* ADD tensorboard to requirement
* Fix bug with small_preprocess
* Fix bug in pytest execution
* Fix tests
* ADD error is raised if default not in include
* Added dynamic search space for deciding n components in feature preprocessors, add test for pipeline include
* Moved back to random configs in tabular test
* Added floor and ceil and handling of logs
* Fix flake
* Remove TruncatedSVD from cs if num numerical ==1
* ADD flakyness to network accuracy test
* fix flake
* remove cla to pytest


ravinkohli added 5 commits February 5, 2021 13:04

ADD Weighted loss

cc583c1

Now?

2ea059c

Merge branch 'feature_preprocessing' into missing_components

14795cc

Fix tests, flake, mypy

9f0ed18

Fix tests

fb23cef

ravinkohli changed the title ~~ADD Weighted loss~~ Feature preprocessors, Loss strategies Feb 5, 2021

ravinkohli requested a review from franchuterivera February 5, 2021 15:42