KeyError: 'Message' when encountering an error in _send_metrics #4482

Closed
sziem opened this issue Mar 6, 2024 · 5 comments

@sziem
sziem commented Mar 6, 2024

Describe the bug
When an error occurs while calling run.log_metric, the SDK raises a KeyError instead of showing the actual error message.

To reproduce
This is a bit hard for me to describe, as it occurred randomly after training had worked for 42 epochs.

Expected behavior
Get a message stating the actual cause of the error.

Screenshots or logs

Train epoch 43:  68%|██████▊   | 622/921 [15:36<07:30,  1.51s/it]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/entrypoints/train_clrnet.py", line 69, in main
    runner.train()
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/engine/runner.py", line 185, in train
    self._train_epoch(_sagemaker_run)
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/engine/runner.py", line 269, in _train_epoch
    _log_training_metrics(
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/engine/runner.py", line 406, in _log_training_metrics
    run.log_metric(name="Learning Rate", value=lr, step=step)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_utils.py", line 90, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/run.py", line 297, in log_metric
    self._metrics_manager.log_metric(
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 138, in log_metric
    self.sink.log_metric(metric_data)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 173, in log_metric
    self._drain()
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 187, in _drain
    self._send_metrics(available_metrics)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 200, in _send_metrics
    message = errors[0]["Message"]
KeyError: 'Message'

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: '2.209.0'
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): -
  • Framework version: -
  • Python version: 3.10
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

Additional context

@sziem sziem added the type: bug label Mar 6, 2024
@sziem sziem changed the title from "KeyError: 'Message" to "KeyError: 'Message' when encountering an error in _send_metrics" Mar 6, 2024
@ananth102
Collaborator

Hi sziem, are you repeatedly seeing this issue? If so, can you share some sample code that we can use to replicate it?

@sziem
Author

sziem commented Apr 1, 2024

Hi, thanks for your reply. After seeing this about 2-3 times, I wrapped my calls in a try-except and just ignored it, so I'm not sure if this is still an issue, sorry. Also, it's been a while since I looked at it.

As I said above, it is a bit hard to create a minimal example for the issue because of the long delay before it occurs. Unfortunately, I'm not at liberty to share my code, but this is roughly how I've been using log_metric:

import boto3
from sagemaker.experiments import run
from sagemaker.session import Session

# sagemaker_session = Session(boto_session=boto3.Session(...))

my_run = run.Run(
    experiment_name="experiment_foo",
    run_name="run_foo",
    tags={"tag_key": "tag_value"},
    sagemaker_session=sagemaker_session,
)

n_steps = 1000000
lr = 0.0001
with my_run as my_run_:
    for step in range(n_steps):
        my_run_.log_metric(name="Learning Rate", value=lr, step=step)

Then something (maybe a connection error?) must have caused _send_metrics to fail at some point.
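The try-except workaround mentioned in this comment could look roughly like the sketch below. The helper name `safe_log_metric` is hypothetical and not part of the SageMaker SDK; it only assumes that `run` exposes a `log_metric(name=..., value=..., step=...)` method, as the `Run` object in the snippet above does.

```python
import logging

logger = logging.getLogger(__name__)


def safe_log_metric(run, name, value, step=None):
    """Log a metric without letting a logging failure stop training.

    Hypothetical helper, not part of the SageMaker SDK. `run` is any
    object with a log_metric(name=..., value=..., step=...) method.
    Returns True if the call succeeded and False if it raised.
    """
    try:
        run.log_metric(name=name, value=value, step=step)
        return True
    except Exception:
        # Broad on purpose: the failure reported here surfaced as a bare
        # KeyError, so catching a narrower type could miss it.
        logger.warning(
            "log_metric failed for %r at step %s", name, step, exc_info=True
        )
        return False
```

In the training loop this would replace the direct call, for example `safe_log_metric(my_run_, "Learning Rate", lr, step=step)`. Swallowing the exception keeps the run alive, but it also hides lost metrics, so the warning log is worth keeping.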

@ananth102
Collaborator

This seems to be an issue with the SDK.

message = errors[0]["Message"]

This statement needs to reference "Code" instead of "Message", as that is what the API returns (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-metrics/client/batch_put_metrics.html).

It would still error out in the next line:

raise Exception(f'{len(errors)} errors with message "{message}"')

but the error message would be more helpful.
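A minimal sketch of the change described in this comment, written as a standalone function rather than as the SDK's actual code. The function name `raise_on_batch_errors` is hypothetical, and the sketch assumes each entry in the response's "Errors" list carries "Code" and "MetricIndex" fields, as the linked batch_put_metrics documentation describes.

```python
def raise_on_batch_errors(response):
    """Raise if a batch_put_metrics-style response reports failed metrics.

    Hypothetical helper, not the SDK's implementation. Assumes entries in
    response["Errors"] look like {"Code": "...", "MetricIndex": 0}.
    """
    errors = response.get("Errors") or []
    if not errors:
        return
    # .get() avoids raising a second KeyError if an entry lacks a field.
    code = errors[0].get("Code", "<unknown>")
    index = errors[0].get("MetricIndex", "<unknown>")
    raise RuntimeError(
        f'{len(errors)} errors; first has code "{code}" at metric index {index}'
    )
```

Reading the fields with `.get()` means a malformed error entry still produces a readable exception instead of masking the original failure.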

@sziem
Author

sziem commented Apr 3, 2024

Yes, I agree. That should be the fix and the correct behavior.

@pintaoz-aws
Contributor

pintaoz-aws commented Feb 28, 2025

Fixed by #5068
