
ENH Add update metadata to repocard #844

Merged
LysandreJik merged 24 commits into main from add-update-metadata on May 9, 2022

Conversation

@lvwerra (Member) commented Apr 19, 2022

This PR adds a metadata_update function that allows the user to update the metadata in a repository on the Hub. The function accepts a dict with metadata (following the same pattern as the YAML in the README) and behaves as follows for all top-level fields except model-index:

  • if the field does not exist in the existing README, it is added;
  • if it already exists, an error is thrown unless overwrite=True is passed, as a safety guard.

For model-index the behaviour is more nuanced:

  • if an entry with the same task and dataset exists, then
    • if the same metric type/name does not exist, the metric is appended to the metrics list;
    • if the same metric type/name exists, the value is overwritten (given overwrite=True);
  • if no entry with the same task and dataset exists, the new result is appended to the results.

For reference, this is an example of a model's metadata structure as a dictionary:

{'datasets': ['lvwerra/codeparrot-clean-train'],
 'language': 'code',
 'model-index': [{'name': 'codeparrot',
                  'results': [{'dataset': {'name': 'HumanEval',
                                           'type': 'openai_humaneval'},
                               'metrics': [{'name': 'pass@1',
                                            'type': 'code_eval',
                                            'value': 3.99},
                                           {'name': 'pass@10',
                                            'type': 'code_eval',
                                            'value': 8.69},
                                           {'name': 'pass@100',
                                            'type': 'code_eval',
                                            'value': 17.88}],
                               'task': {'name': 'Code Generation',
                                        'type': 'code-generation'}}]}],
 'tags': ['code', 'gpt2', 'generation'],
 'widget': [{'example_title': 'Transformers',
             'text': 'from transformer import'},
            {'example_title': 'Hello World!',
             'text': 'def print_hello_world():\n\t'},
            {'example_title': 'File size',
             'text': 'def get_file_size(filepath):'},
            {'example_title': 'Numpy', 'text': 'import numpy as'}]}
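
For illustration, here is a minimal sketch of how such an update might be applied with the new function (the repo id, the added license field, and the new metric value are hypothetical; the call assumes the metadata_update signature described in this PR):

from huggingface_hub import metadata_update

# Hypothetical example: add a top-level field and change an existing metric value.
new_metadata = {
    'license': 'apache-2.0',
    'model-index': [{'name': 'codeparrot',
                     'results': [{'task': {'name': 'Code Generation',
                                           'type': 'code-generation'},
                                  'dataset': {'name': 'HumanEval',
                                              'type': 'openai_humaneval'},
                                  'metrics': [{'name': 'pass@1',
                                               'type': 'code_eval',
                                               'value': 4.20}]}]}],
}

# overwrite=True is required because pass@1 already exists with a different value.
metadata_update('lvwerra/codeparrot', new_metadata, overwrite=True)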

One minor issue I found is that I need to use force_download=True for the tests to pass; otherwise hf_hub_download uses the cached but outdated version of the README, even if the README has been updated on the remote. cc @LysandreJik
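
For example, the workaround in the tests looks roughly like this (the repo id and filename are illustrative, not the exact test code):

from huggingface_hub import hf_hub_download

# Bypass the local cache so the freshly updated README is fetched from the remote.
readme_path = hf_hub_download(
    repo_id="user/dummy-repo", filename="README.md", force_download=True
)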


This feature will be used for huggingface/evaluate#6 and closes #835.

@lvwerra requested review from julien-c and osanseviero on April 19, 2022 at 10:09
@HuggingFaceDocBuilderDev commented Apr 19, 2022

The documentation is not available anymore as the PR was closed or merged.

@julien-c (Member):

The behavior sounds reasonable to me, but isn't it weird in terms of API design that there's a special case for model-index? (Maybe not.)

@lvwerra (Member, Author) commented Apr 19, 2022

The main reason to do it differently for model-index is usability: if we took the same approach as for the other fields, one would always need to pass all the existing results/metrics too, even when just updating/adding a single entry.

Alternatively, we could outsource that logic to a helper function that grabs the existing model-index and updates it with the new results; one would then pass the complete (existing + new) model-index to metadata_update.
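
Roughly, such a helper could look like the following self-contained sketch (the name and the exact merge rules are assumptions, not what this PR implements):

from copy import deepcopy

def merge_model_index(existing_results, new_results):
    # Merge new results into the existing ones, matching entries on task and dataset.
    merged = deepcopy(existing_results)
    for new_result in new_results:
        match = next(
            (r for r in merged
             if r.get("task") == new_result.get("task")
             and r.get("dataset") == new_result.get("dataset")),
            None,
        )
        if match is None:
            merged.append(new_result)
        else:
            match["metrics"].extend(new_result["metrics"])
    return merged

# The complete (existing + new) model-index could then be passed to metadata_update.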

@adrinjalali (Contributor) left a comment:

Thanks for the PR @lvwerra.

In the future, please send PRs from your own branch. We're trying to clean up branches on the main repo. The CI works on PRs from forked repositories.

@adrinjalali changed the title from "Add update metadata" to "ENH Add update metadata to repocard" on Apr 21, 2022
lvwerra and others added 2 commits April 21, 2022 13:43
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@lvwerra (Member, Author) commented Apr 21, 2022

Code Update

I refactored the _update_metadata_model_index function:

  • outsourced the inner for-loops
  • documented the functions
  • added the unique identifying features as a configurable list since they might change (e.g. adding dataset args or configs)
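
For context, those configurable identifying features might look roughly like this (the values are an assumption based on the cases below; the constant names match the snippets further down):

# Fields that identify a result / a metric when deciding whether to merge or append.
UNIQUE_RESULT_FEATURES = ["task", "dataset"]
UNIQUE_METRIC_FEATURES = ["name", "type"]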

Cases

Given the following example of existing metadata on the Hub under model-index, there are three cases:

existing_results = [{'dataset': {'name': 'IMDb', 'type': 'imdb'},
                     'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.995}],
                     'task': {'name': 'Text Classification', 'type': 'text-classification'}}]

1 Overwrite existing metric value in existing result

This happens if the values of 'dataset' and 'task' are equal, as well as the 'name' and 'type' of the metric, but the metric value differs. This requires overwrite=True; otherwise a ValueError is raised:

new_results = deepcopy(existing_results)
new_results[0]["metrics"][0]["value"] = 0.999
_update_metadata_model_index(existing_results, new_results, overwrite=True)

# result:
[{'dataset': {'name': 'IMDb', 'type': 'imdb'},
  'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.999}],
  'task': {'name': 'Text Classification', 'type': 'text-classification'}}]

2 Add new metric to existing result

This happens if the values of 'dataset' and 'task' are equal but the 'name' and 'type' of the metric are not:

new_results = deepcopy(existing_results)
new_results[0]["metrics"][0]["name"] = "Recall"
new_results[0]["metrics"][0]["type"] = "recall"
_update_metadata_model_index(existing_results, new_results)

# result:
[{'dataset': {'name': 'IMDb', 'type': 'imdb'},
  'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.995},
              {'name': 'Recall', 'type': 'recall', 'value': 0.995}],
  'task': {'name': 'Text Classification', 'type': 'text-classification'}}]

3 Add new result

This happens if the values of 'dataset' and 'task' do not both match any existing result:

new_results = deepcopy(existing_results)
new_results[0]["dataset"] = {'name': 'IMDb-2', 'type': 'imdb_2'}
_update_metadata_model_index(existing_results, new_results)

# result:
[{'dataset': {'name': 'IMDb', 'type': 'imdb'},
  'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.995}],
  'task': {'name': 'Text Classification', 'type': 'text-classification'}},
 {'dataset': {'name': 'IMDb-2', 'type': 'imdb_2'},
  'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.995}],
  'task': {'name': 'Text Classification', 'type': 'text-classification'}}]

Hope that clarifies what _update_metadata_model_index is supposed to do.

Comment on lines +198 to +214
    for new_result in new_results:
        result_found = False
        for existing_result_index, existing_result in enumerate(existing_results):
            if all(
                new_result[feat] == existing_result[feat]
                for feat in UNIQUE_RESULT_FEATURES
            ):
                result_found = True
                existing_results[existing_result_index][
                    "metrics"
                ] = _update_metadata_results_metric(
                    new_result["metrics"],
                    existing_result["metrics"],
                    overwrite=overwrite,
                )
        if not result_found:
            existing_results.append(new_result)
Contributor:

Since dictionaries are mutable, you could also write this as:

        try:
            existing_result = next(
                x
                for x in existing_results
                if all(x[feat] == new_result[feat] for feat in UNIQUE_RESULT_FEATURES)
            )
            existing_result["metrics"] = _update_metadata_results_metric(
                new_result["metrics"],
                existing_result["metrics"],
                overwrite=overwrite,
            )
        except StopIteration:
            existing_results.append(new_result)

which should be faster since it avoids one slow for loop.

Member Author:

Added this.

Comment on lines 236 to 258
    for new_metric in new_metrics:
        metric_exists = False
        for existing_metric_index, existing_metric in enumerate(existing_metrics):
            if all(
                new_metric[feat] == existing_metric[feat]
                for feat in UNIQUE_METRIC_FEATURES
            ):
                if overwrite:
                    existing_metrics[existing_metric_index]["value"] = new_metric[
                        "value"
                    ]
                else:
                    # if metric exists and value is not the same throw an error without overwrite flag
                    if (
                        existing_metrics[existing_metric_index]["value"]
                        != new_metric["value"]
                    ):
                        raise ValueError(
                            f"""You passed a new value for the existing metric '{new_metric["name"]}'. Set `overwrite=True` to overwrite existing metrics."""
                        )
                metric_exists = True
        if not metric_exists:
            existing_metrics.append(new_metric)
Contributor:

Could do the same here:

    for new_metric in new_metrics:
        try:
            existing_metric = next(
                x
                for x in existing_metrics
                if all(x[feat] == new_metric[feat] for feat in UNIQUE_METRIC_FEATURES)
            )
            if overwrite:
                existing_metric["value"] = new_metric["value"]
            else:
                # if metric exists and value is not the same throw an error
                # without overwrite flag
                if existing_metric["value"] != new_metric["value"]:
                    existing_str = ",".join(
                        new_metric[feat] for feat in UNIQUE_METRIC_FEATURES
                    )
                    raise ValueError(
                        "You passed a new value for the existing metric"
                        f" '{existing_str}'. Set `overwrite=True` to overwrite existing"
                        " metrics."
                    )
        except StopIteration:
            existing_metrics.append(new_metric)

Member Author:

This as well :)

Comment on lines 191 to 192
    if os.path.exists(REPO_NAME):
        shutil.rmtree(REPO_NAME, onerror=set_write_permission_and_retry)
Contributor:

could we clone the repo under a tempfile.mkdtemp?

Member Author:

Also integrated tempfile.mkdtemp everywhere.
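
As an illustration of the pattern discussed here, a test could clone into a throwaway directory along these lines (the repo name is hypothetical and this is not the exact test code):

import tempfile

from huggingface_hub import Repository

# Clone into a fresh temporary directory instead of the working tree,
# so there is nothing to clean up with shutil.rmtree afterwards.
tmp_dir = tempfile.mkdtemp()
repo = Repository(local_dir=tmp_dir, clone_from="user/dummy-repo")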

@adrinjalali (Contributor) left a comment:

Could you please merge with the latest main and run black on these files again? Otherwise LGTM.

@lvwerra (Member, Author) commented Apr 26, 2022

Done - thanks for reviewing and the helpful suggestions @adrinjalali!

@adrinjalali (Contributor):

Nice. I'll wait for @osanseviero to have a look and merge.

@osanseviero (Contributor) left a comment:

This looks very neat! I left some minor suggestions and a couple of questions. Thanks for this PR 🔥 🔥

I wrote a Colab notebook while exploring this: https://colab.research.google.com/drive/1fG8OWYTnI6ucnafYKtrf-HwxtWbVBVQh?usp=sharing

            existing_result = next(
                x
                for x in existing_results
                if all(x[feat] == new_result[feat] for feat in UNIQUE_RESULT_FEATURES)
Contributor:

This assumes the existing metadata is valid and has both dataset and task tags; it will break if it's not valid. How should we handle those scenarios? Should we validate model-index beforehand? Show an error message? Override invalid metadata?

Member Author:

Since we pull the metadata from the Hub, can't we assume that it is valid? In your experiment, pushing invalid metadata was rejected, right? So it should not be there in the first place if it is invalid.

Contributor:

Hmm @julien-c WDYT? https://huggingface.co/osanseviero/llama-horse-zebra

The result has task and metrics, but no associated dataset. This is not rejected by the server, and the evaluation results are nicely shown, with the only con that there is no associated dataset, so there's no leaderboard.

IMO this is still valid metadata which is just incomplete. Most spaCy models are like this https://huggingface.co/models?library=spacy
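
For illustration, such an incomplete-but-accepted entry might look roughly like this (hypothetical values; note the missing 'dataset' key):

{'model-index': [{'name': 'my-model',
                  'results': [{'task': {'name': 'Token Classification',
                                        'type': 'token-classification'},
                               'metrics': [{'name': 'F1',
                                            'type': 'f1',
                                            'value': 0.91}]}]}]}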

Member:

Yes, we still try to display the eval results in this case, even though it's not perfect.

Comment on lines 210 to 211
        except StopIteration:
            existing_results.append(new_result)
Contributor:

Is this really needed? I think using try/except for this is a bit less readable and more prone to introducing bugs. Maybe the for loop could be more explicit. I think that was how it was before; I personally find that more readable and easier to maintain.

Member Author:

I have no strong opinion here - happy to revert. @adrinjalali?

Contributor:

The for loop is slower, and try/except clauses are quite pythonic. I do think the current state is quite a bit better than a for loop. One can always add a brief comment on when the exception is raised if that helps with readability.

Member Author:

I don't think speed is important here, as we iterate through a dictionary with a handful of entries while pulling from and pushing to the Hub in the same function, which is probably 1000x slower.

Contributor:

I would not optimize for the execution speed of an update method that won't be called thousands of times per second, and would rather optimize for readability and consistency with the rest of the codebase. This implementation iterates over new_results with a for loop and over existing_results with a try/except/next approach. Should we start changing all nested loops to match this?

If you feel strong about this let's go with it though 😄

Contributor:

I do find the for-loop implementation less readable. Removing the outer loop would make it quite complicated and not very readable; that's why I didn't suggest removing it.

As for consistency, I don't think it's about staying consistent with the rest of the code. Sometimes a for loop makes sense, sometimes it doesn't. And as a team when we work on a codebase, sometimes we find better ways to do things, and that's fine, and I don't think we should hold back because we have done things in a certain way so far. To me it's okay to look at a codebase and notice old stuff vs new stuff. The way people do things changes and new code can look different than the old code and that's fine to me.

But if both of you think the for loop is more readable than this, then sure, change it. To me it's the other way around.

@lvwerra (Member, Author) commented May 3, 2022

I integrated the feedback and reverted to for-loops for now. I can change it again if somebody has strong opinions. Let me know if this looks good now @osanseviero @adrinjalali.

@osanseviero (Contributor) left a comment:

Looks good! Let's wait for the other tests (unrelated to this PR) to be fixed so we can merge on green 🚀

Thanks!

@LysandreJik added this to the v0.6 milestone on May 9, 2022
@LysandreJik self-requested a review on May 9, 2022 at 12:20
@julien-c (Member) left a comment:

Left a few small nits; other than that, looks good to me!

@LysandreJik (Member) left a comment:

Looks good! Only left nits.

        repo_type (`str`, *optional*):
            Set to `"dataset"` or `"space"` if updating to a dataset or space,
            `None` or `"model"` if updating to a model. Default is `None`.
        overwrite (`bool`, *optional*):
Member:

Suggested change:
- overwrite (`bool`, *optional*):
+ overwrite (`bool`, *optional*, defaults to `False`):

lvwerra and others added 4 commits May 9, 2022 18:49
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
@LysandreJik (Member):

Thank you for addressing all comments! Merging.

@LysandreJik merged commit f6343cb into main on May 9, 2022
@LysandreJik deleted the add-update-metadata branch on May 9, 2022 at 17:39
LysandreJik added a commit that referenced this pull request May 24, 2022
* add `metadata_update` function

* add tests

* add docstring

* Apply suggestions from code review

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

* refactor `_update_metadata_model_index`

* Apply suggestions from code review

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

* fix style and imports

* switch to deepcopy everywhere

* load repo in repocard test into tmp folder

* simplify results and metrics checks when updating metadata

* run black

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* fix pyyaml version to work with `sort_keys` kwarg

* don't allow empty commits if file hasn't changed

* switch order of updates to first check model-index for easier readability

* expose repocard functions through `__init__`

* fix init

* make style & quality

* revert to for-loop

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* post suggestion fixes

* add example

* add type to list

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Merging this pull request closed the issue: Expose a function to update the metadata in the readme (#835)

6 participants