Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Should throw an error if repo_id is not valid #1008

Closed
Wauplin opened this issue Aug 23, 2022 · 8 comments
Closed

Should throw an error if repo_id is not valid #1008

Wauplin opened this issue Aug 23, 2022 · 8 comments

Comments

@Wauplin
Copy link
Contributor

Wauplin commented Aug 23, 2022

Related to huggingface/transformers#18691.

Currently if repo_id value is set to a wrong value, there is no validation before calling cache/the hub. Typically, if the value is a path "./resources/ltp" (by mistake from the user) in hf_hub_download, then it will raise a FileNotFoundError:

FileNotFoundError: [Errno 2] No such file or directory: '/home/chaizhihua/.cache/huggingface/hub/models--.--resources--ltp/refs/main'

It would be good to have a helper to validate a repo_id and raise a custom exception if it's obvious than it is not a valid repo_id. As a first step I would consider any value in the form "xxx/xxx" or "xxx".

@Wauplin
Copy link
Contributor Author

Wauplin commented Aug 23, 2022

EDIT: I just tried to create a repo on the hub with a weird value and got this warning. Those rules can be easily hardcoded in huggingface_hub but I am not 100% sure it's a good idea in case rules change on the Hub and previous versions of huggingface_hub cannot download them. IMO checking the subset "only regular characters and '-', '_', '.'" and "'--' and '..' are forbidden" is enough.
image

@Wauplin
Copy link
Contributor Author

Wauplin commented Aug 23, 2022

Hmm thinking twice about it, do we want to validate repo_ids in the form datasets/user/repo_name ?

It is not a valid repo_id in our definition, but I remember that I read in a thread that someone was using this trick to avoid passing repo_type="dataset". It is not an officially supported use case but not supporting it anymore would break some notebooks/scripts somewhere.

@julien-c
Copy link
Member

no strong opinion on whether we want to support this form datasets/user/repo_name or not (I'd be tempted to say no, to not have too many ≠ ways of doing things)

@Wauplin
Copy link
Contributor Author

Wauplin commented Aug 23, 2022

Just realized I didn't give my opinion. I would definitely avoid datasets/user/repo_name if possible because not documented but also we don't know which parts of the huggingface_hub handles it correctly and which ones don't.

We just need to be aware that will break some code elsewhere and we can hardly identify where beforehand.

@julien-c
Copy link
Member

yes, IMO breakage in the wild should be rather limited

@LysandreJik
Copy link
Member

I'd also definitely avoid datasets/user/repo_name. I don't think it'll break anything in the wild that we mentioned was supported. This should really be handled by repo_type="dataset".

Sounds good for validation!

@LysandreJik
Copy link
Member

(re last comment: it would have broken transformers, shame on me)

@PatrickStarMS
Copy link

validate_repo_id
raise HFValidationError(
huggingface_hub.utils.validators.HFValidationError: Repo id must use alphanumeric chars or '-', '', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: ''.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants