added hyperparameter advanced tutorial #69
base: main
Conversation
Thanks for the PR, @Priyansi!
I've added some suggestions related to coding style.
" trainset = CIFAR10(\n", | ||
" root=data_dir, train=True, download=True, transform=transform)\n", | ||
"\n", | ||
" testset = CIFAR10(\n", | ||
" root=data_dir, train=False, download=True, transform=transform)\n", | ||
"\n", | ||
" return trainset, testset" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" trainset = CIFAR10(\n", | |
" root=data_dir, train=True, download=True, transform=transform)\n", | |
"\n", | |
" testset = CIFAR10(\n", | |
" root=data_dir, train=False, download=True, transform=transform)\n", | |
"\n", | |
" return trainset, testset" | |
" trainset = CIFAR10(\n", | |
" root=data_dir, train=True, download=True, transform=transform\n", | |
" )\n", | |
" testset = CIFAR10(\n", | |
" root=data_dir, train=False, download=True, transform=transform\n", | |
" )\n", | |
" return trainset, testset" |
" train_subset, val_subset = random_split(\n", | ||
" trainset, [test_abs, len(trainset) - test_abs])\n", | ||
"\n", | ||
" trainloader = idist.auto_dataloader(\n", | ||
" train_subset,\n", | ||
" batch_size=int(config[\"batch_size\"]),\n", | ||
" shuffle=True,\n", | ||
" num_workers=8)\n", | ||
" valloader = idist.auto_dataloader(\n", | ||
" val_subset,\n", | ||
" batch_size=int(config[\"batch_size\"]),\n", | ||
" shuffle=True,\n", | ||
" num_workers=8)\n", | ||
" \n", | ||
" return trainloader, valloader" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" train_subset, val_subset = random_split(\n", | |
" trainset, [test_abs, len(trainset) - test_abs])\n", | |
"\n", | |
" trainloader = idist.auto_dataloader(\n", | |
" train_subset,\n", | |
" batch_size=int(config[\"batch_size\"]),\n", | |
" shuffle=True,\n", | |
" num_workers=8)\n", | |
" valloader = idist.auto_dataloader(\n", | |
" val_subset,\n", | |
" batch_size=int(config[\"batch_size\"]),\n", | |
" shuffle=True,\n", | |
" num_workers=8)\n", | |
" \n", | |
" return trainloader, valloader" | |
" train_subset, val_subset = random_split(\n", | |
" trainset, [test_abs, len(trainset) - test_abs]\n", | |
" )\n", | |
" trainloader = idist.auto_dataloader(\n", | |
" train_subset,\n", | |
" batch_size=int(config[\"batch_size\"]),\n", | |
" shuffle=True,\n", | |
" num_workers=8\n", | |
" )\n", | |
" valloader = idist.auto_dataloader(\n", | |
" val_subset,\n", | |
" batch_size=int(config[\"batch_size\"]),\n", | |
" shuffle=True,\n", | |
" num_workers=8\n", | |
" )\n", | |
" return trainloader, valloader" |
"def initialize(config, checkpoint_dir):\n", | ||
" model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n", | ||
"\n", | ||
" device = idist.device()\n", | ||
"\n", | ||
" criterion = nn.CrossEntropyLoss()\n", | ||
" optimizer = idist.auto_optim(optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9))\n", | ||
"\n", | ||
" if checkpoint_dir:\n", | ||
" model_state, optimizer_state = torch.load(\n", | ||
" os.path.join(checkpoint_dir, \"checkpoint\"))\n", | ||
" model.load_state_dict(model_state)\n", | ||
" optimizer.load_state_dict(optimizer_state)\n", | ||
" \n", | ||
" return model, device, criterion, optimizer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"def initialize(config, checkpoint_dir):\n", | |
" model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n", | |
"\n", | |
" device = idist.device()\n", | |
"\n", | |
" criterion = nn.CrossEntropyLoss()\n", | |
" optimizer = idist.auto_optim(optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9))\n", | |
"\n", | |
" if checkpoint_dir:\n", | |
" model_state, optimizer_state = torch.load(\n", | |
" os.path.join(checkpoint_dir, \"checkpoint\"))\n", | |
" model.load_state_dict(model_state)\n", | |
" optimizer.load_state_dict(optimizer_state)\n", | |
" \n", | |
" return model, device, criterion, optimizer" | |
"def initialize(config, checkpoint_dir):\n", | |
" model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n", | |
"\n", | |
" device = idist.device()\n", | |
"\n", | |
" criterion = nn.CrossEntropyLoss()\n", | |
" optimizer = idist.auto_optim(\n", | |
" optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9)\n", | |
" )\n", | |
"\n", | |
" if checkpoint_dir:\n", | |
" model_state, optimizer_state = torch.load(\n", | |
" os.path.join(checkpoint_dir, \"checkpoint\")\n", | |
" )\n", | |
" model.load_state_dict(model_state)\n", | |
" optimizer.load_state_dict(optimizer_state)\n", | |
"\n", | |
" return model, device, criterion, optimizer" |
"def train_cifar(config, data_dir=None, checkpoint_dir=None):\n", | ||
" trainloader, valloader = get_train_val_loaders(config, data_dir)\n", | ||
" model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n", | ||
" \n", | ||
" trainer = create_supervised_trainer(model, optimizer, criterion, device=device, non_blocking=True)\n", | ||
" \n", | ||
" avg_output = RunningAverage(output_transform=lambda x: x)\n", | ||
" avg_output.attach(trainer, 'running_avg_loss')\n", | ||
" \n", | ||
" val_evaluator = create_supervised_evaluator(model, metrics={ \"accuracy\": Accuracy(), \"loss\": Loss(criterion)}, device=device, non_blocking=True)\n", | ||
" \n", | ||
" @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n", | ||
" def log_training_loss(engine):\n", | ||
" print(f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\")\n", | ||
"\n", | ||
"\n", | ||
" @trainer.on(Events.EPOCH_COMPLETED)\n", | ||
" def log_validation_results(trainer):\n", | ||
" val_evaluator.run(valloader)\n", | ||
" metrics = val_evaluator.state.metrics\n", | ||
" print(f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\")\n", | ||
"\n", | ||
" with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n", | ||
" path = os.path.join(checkpoint_dir, \"checkpoint\")\n", | ||
" torch.save((model.state_dict(), optimizer.state_dict()), path)\n", | ||
" \n", | ||
" tune.report(loss=metrics['loss'], accuracy=metrics['accuracy']) \n", | ||
"\n", | ||
" trainer.run(trainloader, max_epochs=10) " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"def train_cifar(config, data_dir=None, checkpoint_dir=None):\n", | |
" trainloader, valloader = get_train_val_loaders(config, data_dir)\n", | |
" model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n", | |
" \n", | |
" trainer = create_supervised_trainer(model, optimizer, criterion, device=device, non_blocking=True)\n", | |
" \n", | |
" avg_output = RunningAverage(output_transform=lambda x: x)\n", | |
" avg_output.attach(trainer, 'running_avg_loss')\n", | |
" \n", | |
" val_evaluator = create_supervised_evaluator(model, metrics={ \"accuracy\": Accuracy(), \"loss\": Loss(criterion)}, device=device, non_blocking=True)\n", | |
" \n", | |
" @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n", | |
" def log_training_loss(engine):\n", | |
" print(f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\")\n", | |
"\n", | |
"\n", | |
" @trainer.on(Events.EPOCH_COMPLETED)\n", | |
" def log_validation_results(trainer):\n", | |
" val_evaluator.run(valloader)\n", | |
" metrics = val_evaluator.state.metrics\n", | |
" print(f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\")\n", | |
"\n", | |
" with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n", | |
" path = os.path.join(checkpoint_dir, \"checkpoint\")\n", | |
" torch.save((model.state_dict(), optimizer.state_dict()), path)\n", | |
" \n", | |
" tune.report(loss=metrics['loss'], accuracy=metrics['accuracy']) \n", | |
"\n", | |
" trainer.run(trainloader, max_epochs=10) " | |
"def train_cifar(config, data_dir=None, checkpoint_dir=None):\n", | |
" trainloader, valloader = get_train_val_loaders(config, data_dir)\n", | |
" model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n", | |
"\n", | |
" trainer = create_supervised_trainer(\n", | |
" model, optimizer, criterion, device=device, non_blocking=True\n", | |
" )\n", | |
"\n", | |
" avg_output = RunningAverage(output_transform=lambda x: x)\n", | |
" avg_output.attach(trainer, \"running_avg_loss\")\n", | |
"\n", | |
" val_evaluator = create_supervised_evaluator(\n", | |
" model,\n", | |
" metrics={\"accuracy\": Accuracy(), \"loss\": Loss(criterion)},\n", | |
" device=device,\n", | |
" non_blocking=True,\n", | |
" )\n", | |
"\n", | |
" @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n", | |
" def log_training_loss(engine):\n", | |
" print(\n", | |
" f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\"\n", | |
" )\n", | |
"\n", | |
" @trainer.on(Events.EPOCH_COMPLETED)\n", | |
" def log_validation_results(trainer):\n", | |
" val_evaluator.run(valloader)\n", | |
" metrics = val_evaluator.state.metrics\n", | |
" print(\n", | |
" f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\"\n", | |
" )\n", | |
"\n", | |
" with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n", | |
" path = os.path.join(checkpoint_dir, \"checkpoint\")\n", | |
" torch.save((model.state_dict(), optimizer.state_dict()), path)\n", | |
"\n", | |
" tune.report(loss=metrics[\"loss\"], accuracy=metrics[\"accuracy\"])\n", | |
"\n", | |
" trainer.run(trainloader, max_epochs=10)" |
"def test_best_model(best_trial, data_dir=None):\n", | ||
" _, testset = load_data(data_dir)\n", | ||
" \n", | ||
" best_trained_model = idist.auto_model(Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"]))\n", | ||
" device = idist.device()\n", | ||
"\n", | ||
" best_checkpoint_dir = best_trial.checkpoint.value\n", | ||
" model_state, optimizer_state = torch.load(os.path.join(\n", | ||
" best_checkpoint_dir, \"checkpoint\"))\n", | ||
" best_trained_model.load_state_dict(model_state)\n", | ||
"\n", | ||
" test_evaluator = create_supervised_evaluator(best_trained_model, metrics={\"Accuracy\": Accuracy()}, device=device, non_blocking=True)\n", | ||
"\n", | ||
" testloader = idist.auto_dataloader(testset, batch_size=4, shuffle=False, num_workers=2)\n", | ||
"\n", | ||
" test_evaluator.run(testloader)\n", | ||
" print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"def test_best_model(best_trial, data_dir=None):\n", | |
" _, testset = load_data(data_dir)\n", | |
" \n", | |
" best_trained_model = idist.auto_model(Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"]))\n", | |
" device = idist.device()\n", | |
"\n", | |
" best_checkpoint_dir = best_trial.checkpoint.value\n", | |
" model_state, optimizer_state = torch.load(os.path.join(\n", | |
" best_checkpoint_dir, \"checkpoint\"))\n", | |
" best_trained_model.load_state_dict(model_state)\n", | |
"\n", | |
" test_evaluator = create_supervised_evaluator(best_trained_model, metrics={\"Accuracy\": Accuracy()}, device=device, non_blocking=True)\n", | |
"\n", | |
" testloader = idist.auto_dataloader(testset, batch_size=4, shuffle=False, num_workers=2)\n", | |
"\n", | |
" test_evaluator.run(testloader)\n", | |
" print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")" | |
"def test_best_model(best_trial, data_dir=None):\n", | |
" _, testset = load_data(data_dir)\n", | |
"\n", | |
" best_trained_model = idist.auto_model(\n", | |
" Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"])\n", | |
" )\n", | |
" device = idist.device()\n", | |
"\n", | |
" best_checkpoint_dir = best_trial.checkpoint.value\n", | |
" model_state, optimizer_state = torch.load(\n", | |
" os.path.join(best_checkpoint_dir, \"checkpoint\")\n", | |
" )\n", | |
" best_trained_model.load_state_dict(model_state)\n", | |
"\n", | |
" test_evaluator = create_supervised_evaluator(\n", | |
" best_trained_model,\n", | |
" metrics={\"Accuracy\": Accuracy()},\n", | |
" device=device,\n", | |
" non_blocking=True,\n", | |
" )\n", | |
"\n", | |
" testloader = idist.auto_dataloader(\n", | |
" testset, batch_size=4, shuffle=False, num_workers=2\n", | |
" )\n", | |
"\n", | |
" test_evaluator.run(testloader)\n", | |
" print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")" |
"def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n", | ||
" data_dir = os.path.abspath(\"./data\")\n", | ||
" load_data(data_dir)\n", | ||
" \n", | ||
" config = {\n", | ||
" \"l1\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | ||
" \"l2\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | ||
" \"lr\": tune.loguniform(1e-4, 1e-1),\n", | ||
" \"batch_size\": tune.choice([2, 4, 8, 16])\n", | ||
" }\n", | ||
" scheduler = ASHAScheduler(\n", | ||
" metric=\"loss\",\n", | ||
" mode=\"min\",\n", | ||
" max_t=max_num_epochs,\n", | ||
" grace_period=1,\n", | ||
" reduction_factor=2)\n", | ||
" reporter = CLIReporter(\n", | ||
" metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n", | ||
" result = tune.run(\n", | ||
" partial(train_cifar, data_dir=data_dir),\n", | ||
" resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n", | ||
" config=config,\n", | ||
" num_samples=num_samples,\n", | ||
" scheduler=scheduler,\n", | ||
" progress_reporter=reporter)\n", | ||
"\n", | ||
" best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n", | ||
" print(f\"Best trial config: {best_trial.config}\")\n", | ||
" print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n", | ||
" print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n", | ||
" \n", | ||
" test_best_model(best_trial, data_dir)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n", | |
" data_dir = os.path.abspath(\"./data\")\n", | |
" load_data(data_dir)\n", | |
" \n", | |
" config = {\n", | |
" \"l1\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | |
" \"l2\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | |
" \"lr\": tune.loguniform(1e-4, 1e-1),\n", | |
" \"batch_size\": tune.choice([2, 4, 8, 16])\n", | |
" }\n", | |
" scheduler = ASHAScheduler(\n", | |
" metric=\"loss\",\n", | |
" mode=\"min\",\n", | |
" max_t=max_num_epochs,\n", | |
" grace_period=1,\n", | |
" reduction_factor=2)\n", | |
" reporter = CLIReporter(\n", | |
" metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n", | |
" result = tune.run(\n", | |
" partial(train_cifar, data_dir=data_dir),\n", | |
" resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n", | |
" config=config,\n", | |
" num_samples=num_samples,\n", | |
" scheduler=scheduler,\n", | |
" progress_reporter=reporter)\n", | |
"\n", | |
" best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n", | |
" print(f\"Best trial config: {best_trial.config}\")\n", | |
" print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n", | |
" print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n", | |
" \n", | |
" test_best_model(best_trial, data_dir)" | |
"def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n", | |
" data_dir = os.path.abspath(\"./data\")\n", | |
" load_data(data_dir)\n", | |
"\n", | |
" config = {\n", | |
" \"l1\": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),\n", | |
" \"l2\": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),\n", | |
" \"lr\": tune.loguniform(1e-4, 1e-1),\n", | |
" \"batch_size\": tune.choice([2, 4, 8, 16]),\n", | |
" }\n", | |
" scheduler = ASHAScheduler(\n", | |
" metric=\"loss\",\n", | |
" mode=\"min\",\n", | |
" max_t=max_num_epochs,\n", | |
" grace_period=1,\n", | |
" reduction_factor=2,\n", | |
" )\n", | |
" reporter = CLIReporter(metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n", | |
" result = tune.run(\n", | |
" partial(train_cifar, data_dir=data_dir),\n", | |
" resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n", | |
" config=config,\n", | |
" num_samples=num_samples,\n", | |
" scheduler=scheduler,\n", | |
" progress_reporter=reporter,\n", | |
" )\n", | |
"\n", | |
" best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n", | |
" print(f\"Best trial config: {best_trial.config}\")\n", | |
" print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n", | |
" print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n", | |
"\n", | |
" test_best_model(best_trial, data_dir)" |
What about adding some summarizing sentences about the best trial and how to interpret the results?
"For every trial, Ray Tune will randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among these. \n", | ||
"We also use the `ASHAScheduler()` which is one of the trial schedulers that aggressively terminate low-performing trials.\n", | ||
"Apart from that, we leverage the `CLIReporter()` to prettify our outputs.\n", | ||
"And then, we wrap `train_cifar` in functools.partial and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n", |
nit
"And then, we wrap `train_cifar` in functools.partial and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n", | |
"And then, we wrap `train_cifar` in `functools.partial` and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n", |
"id": "vJgTaKWU8Doq" | ||
}, | ||
"source": [ | ||
"In this tutorial, we will see how [Ray Tune](https://docs.ray.io/en/stable/tune.html) can be used with Ignite for hyperparameter tuning. We will also compare it with other frameworks like [Optuna](https://optuna.org/) and [Ax](https://ax.dev/) for hyperparameter optimization.\n", |
Are we going to add comparisons with Ax and Optuna?
Fixes #29