Update dataloaders to use Aggregation Lists #264

al-rigazzi · 2023-02-27T22:40:19Z

This PR updates the TF and PyTorch data loaders to make use of SmartRedis's aggregation lists.

When training in parallel, we need to adopt a round-robin distribution of Datasets. This is OK as long as the simulation and the training have the same producing/consuming speed, but if the simulation is way faster (or there are many Datasets in the list when we start to train), we end up making many calls to get the interleaved batches. We could add another parameter in the future, to change the interleaving/stride across ranks, but for now, I think it will be fine.
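As a rough illustration of the trade-off described above, the following sketch shows which list entries each rank would take under round-robin, and how many separate fetches that implies. It assumes entries are retrieved by contiguous index range. The helper names `round_robin_indices` and `contiguous_runs` are hypothetical and are not part of SmartSim or SmartRedis.

```python
from typing import List, Sequence


def round_robin_indices(num_datasets: int, rank: int, num_ranks: int) -> List[int]:
    """Return the list positions assigned to ``rank`` under round-robin."""
    if not 0 <= rank < num_ranks:
        raise ValueError("rank must satisfy 0 <= rank < num_ranks")
    return list(range(rank, num_datasets, num_ranks))


def contiguous_runs(indices: Sequence[int]) -> List[range]:
    """Group sorted indices into contiguous runs.

    Each run stands in for one range-based fetch (an assumption of this sketch).
    """
    runs: List[range] = []
    start = prev = None
    for idx in indices:
        if start is None:
            start = prev = idx
        elif idx == prev + 1:
            prev = idx
        else:
            runs.append(range(start, prev + 1))
            start = prev = idx
    if start is not None:
        runs.append(range(start, prev + 1))
    return runs


if __name__ == "__main__":
    mine = round_robin_indices(num_datasets=10, rank=1, num_ranks=4)
    print(mine)                        # [1, 5, 9]
    print(len(contiguous_runs(mine)))  # 3
```

With more than one rank, every assigned index is isolated, so the number of fetches grows with the number of Datasets already in the list. A configurable stride, as suggested above, would let each rank take longer contiguous runs.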

MattToast

Looks great so far!!

I have marked a couple of places where I think some methods can be rewritten a bit more cleanly, and I am asking for a pretty substantial architecture change for the TF and Torch data generators, but overall, the meat of this PR looks fantastic!

As always, feel free to lmk what you think!!

tutorials/ml_training/surrogate/train_surrogate.ipynb

smartsim/ml/data.py

smartsim/ml/torch/data.py

smartsim/ml/data.py

mellis13

Just a couple of comments in addition to Matt's comments

smartsim/ml/data.py

codecov · 2023-03-10T16:25:24Z

Codecov Report

Merging #264 (7959514) into develop (fb967d9) will increase coverage by 2.64%.
The diff coverage is 93.98%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #264      +/-   ##
===========================================
+ Coverage    84.89%   87.53%   +2.64%     
===========================================
  Files           60       60              
  Lines         3423     3386      -37     
===========================================
+ Hits          2906     2964      +58     
+ Misses         517      422      -95

Impacted Files	Coverage Δ
smartsim/_core/control/controller.py	`84.45% <ø> (ø)`
smartsim/_core/control/manifest.py	`94.78% <ø> (ø)`
smartsim/experiment.py	`80.43% <ø> (ø)`
smartsim/ml/data.py	`93.71% <93.70%> (+31.26%)`	⬆️
smartsim/ml/torch/data.py	`95.65% <94.44%> (+38.50%)`	⬆️
smartsim/ml/tf/data.py	`85.36% <94.73%> (+3.54%)`	⬆️
smartsim/ml/__init__.py	`100.00% <100.00%> (ø)`

... and 3 files with indirect coverage changes

MattToast

Looks great!! I found a couple of last-minute things to address, but overall this looks about ready to go!

smartsim/ml/data.py

smartsim/ml/tf/data.py

smartsim/ml/torch/data.py

tests/backends/test_dataloader.py

smartsim/ml/tf/data.py

smartsim/ml/data.py

mellis13 · 2023-03-16T00:23:56Z

tutorials/ml_training/surrogate/train_surrogate.ipynb

@@ -41,7 +41,7 @@
   "outputs": [


I tried running the notebook in a new dev environment (numpy 1.24). I got the error below. Downgrading to 1.23 worked. Do you think we should pin to a version or update the heat transfer source files? It seems like the data type in question was deprecated several years ago.

Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[3], line 14
     11 size = 64
     13 for _ in range(3):
---> 14     u_s = fd2d_heat_steady_test01 (size, size)
     15     pcolor_list(u_s, "Left: initial temperature. Right: steady state.")

File ~/craylabs/SmartSim/tutorials/ml_training/surrogate/steady_state.py:270, in fd2d_heat_steady_test01(nx, ny)
    267 source_centers = 0.2+np.random.rand(np.random.randint(1,6),2)*0.6
    269 Xgrid, Ygrid = np.meshgrid(xvec, yvec)
--> 270 u_init = np.zeros_like(Xgrid).astype(np.bool)
    271 for center in source_centers:
    272     u_init |= (Xgrid-center[0])**2 + (Ygrid-center[1])**2 < 0.05**2

File ~/miniconda3/envs/ss_test/lib/python3.9/site-packages/numpy/__init__.py:305, in __getattr__(attr)
    300     warnings.warn(
    301         f"In the future `np.{attr}` will be defined as the "
    302         "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    304 if attr in __former_attrs__:
--> 305     raise AttributeError(__former_attrs__[attr])
    307 # Importing Tester requires importing all of UnitTest which is not a
    308 # cheap import Since it is mainly used in test suits, we lazy import it
    309 # here to save on the order of 10 ms of import time for most users
    310 # ...

AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Was able to confirm, as the deprecation message suggests, that swapping np.bool -> bool appears to give the desired output with both numpy==1.23.0 and numpy==1.24.2. I'm in favor of simply updating the simulation code!

Yeah, that code is old, fixing it won't hurt! Thanks for the catch!
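For reference, a minimal sketch of the change discussed in this thread, modeled on the lines shown in the traceback. The grid size, center, and radius below are made-up illustration values and are not taken from `steady_state.py`.

```python
import numpy as np

# Small illustrative grid (values here are assumptions, not the tutorial's).
x_grid, y_grid = np.meshgrid(np.linspace(0.0, 1.0, 8), np.linspace(0.0, 1.0, 8))

# Before (raises AttributeError on NumPy 1.24):
#     u_init = np.zeros_like(x_grid).astype(np.bool)
# After: the builtin bool works on both older and newer NumPy releases.
u_init = np.zeros_like(x_grid).astype(bool)

# Mark grid points inside a disc around a source center, as the simulation does.
center = (0.5, 0.5)
u_init |= (x_grid - center[0]) ** 2 + (y_grid - center[1]) ** 2 < 0.3 ** 2

print(u_init.dtype)  # bool
```

Using the builtin `bool` rather than pinning NumPy keeps the tutorial working across versions, which matches the preference stated above.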

MattToast

LGTM!! Thanks for all the hard work on this one! The use of aggregation lists here makes this whole section much easier to understand/utilize/extend imo!

mellis13

LGTM -- thanks!

al-rigazzi added 4 commits February 27, 2023 15:58

First commit

77e56ad

Tests work, need docs

7287875

Update docs

17c3a48

Update tutorial, changelog

766da9a

al-rigazzi marked this pull request as ready for review February 28, 2023 09:00

al-rigazzi assigned MattToast and mellis13 Feb 28, 2023

al-rigazzi added the type: refactor, area: ML, and API break labels Feb 28, 2023

Print output and error for failed dataloader tests

be114e4

al-rigazzi requested review from mellis13 and MattToast February 28, 2023 16:36

MattToast requested changes Mar 6, 2023

mellis13 suggested changes Mar 6, 2023

Address reviewers' feedback

2b8daa2

al-rigazzi requested review from mellis13 and MattToast March 10, 2023 16:26

al-rigazzi added 3 commits March 10, 2023 17:34

Merge branch 'develop' into update_dataloaders

b674caf

Add test for DataInfo.__repr__

7980435

Make style

67c4a9e

MattToast requested changes Mar 15, 2023

Address reviewer's comments

6563fb5

mellis13 reviewed Mar 16, 2023

al-rigazzi added 2 commits March 16, 2023 12:39

Fix numpy type in sim

c4b53f1

Merge branch 'develop' into update_dataloaders

1cd096c

MattToast approved these changes Mar 16, 2023

al-rigazzi added 2 commits March 17, 2023 23:51

Merge branch 'develop' into update_dataloaders

b303f92

Make tests local for coverage

c90c777

al-rigazzi added 14 commits March 19, 2023 17:48

Add test for wrong type

ea4b5f0

Remove unused files

9b9482b

Reduce workers in pytorch dataloader test

afc2f65

Remove train_torch

e9d8902

Augment coverage

e889219

Fix crashing torch dl

b52fbc0

Remove pragma dir

82bedb2

Make pytorch dataloader less aggressive

2c85470

Use main thread for workers

08781ab

Remove prefetch factor

70999e1

Revert last changes

c702914

Use external training process for torch

33e8eca

Fix coverage for unreachable code

cd92c9a

Merge branch 'develop' into update_dataloaders

7959514

mellis13 approved these changes Mar 23, 2023

al-rigazzi merged commit 35857f6 into CrayLabs:develop Mar 23, 2023

al-rigazzi deleted the update_dataloaders branch March 23, 2023 15:16
