Skip to content

Impl TryFrom<char> for u8 #2854

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
ebkalderon opened this issue Jan 22, 2020 · 11 comments
Closed

Impl TryFrom<char> for u8 #2854

ebkalderon opened this issue Jan 22, 2020 · 11 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@ebkalderon
Copy link

ebkalderon commented Jan 22, 2020

I ran into a case using the csv crate in combination with structopt where I needed to parse a single char from an argument to use as a CSV delimiter. However, it appears that csv::StringBuilder only accepts u8 delimiters. I was surprised to see that there was no type-safe way to convert between char to u8 in the standard library. It would be nice if u8 implemented TryFrom<char>, where the conversion fails if the char value is outside of the U+0000 to U+00FF codepoint range (0-255).

An example use would look like this:

fn main() {
    let success = u8::try_from('A').expect("Conversion should have succeeded");
    let failure = u8::try_from('我').expect_err("Conversion should have failed");
}
@kennytm
Copy link
Member

kennytm commented Jan 22, 2020

ASCII range

This differs from impl From<u8> for char which supports U+0000 to U+00FF.

@ebkalderon
Copy link
Author

@kennytm Good point! Thanks for pointing out the semantic difference in From<u8>. In that case, I think it's necessary to revise my proposal to specifically U+0000 to U+00FF (value range of 0-255).

@Lokathor
Copy link
Contributor

if all of 0..=255 is allowed I believe what you want can be done as follows:

if c as u32 <= u8::max_value() as u32 {
  Some(c as u8)
} else {
  None
}

@ebkalderon
Copy link
Author

@Lokathor Thanks for the code sample! That's actually very similar to what I've incorporated into my own project, so it's nice to see external confirmation that this general pattern exists. I was hoping to at least raise some discussion on whether a similar conversion could or should be included as part of the standard library.

@SimonSapin
Copy link
Contributor

This also works: u8::try_from(u32::from(some_char))

@fstirlitz
Copy link

If this is added, I'd prefer if it's a method on char called something like try_into_latin1, to clarify that using it constitutes a semantic conversion (and not just type-checking bureaucracy), and to discourage using it as the least-effort/least-thought-out way to convert a stream of chars into a stream of u8s (which should explicitly refer to some character encoding).

@SimonSapin
Copy link
Contributor

“Latin-1” may refer to slightly different things though.

  • Originally, it is Part 1: Latin alphabet No. 1 of ISO/IEC 8859. It is a set of 191 characters, each assigned an 8-bit value. The ranges 0x00 to 0x1F and 0x7F to 0x9F are not assigned.

  • IANA calls ISO-8859-1 an extension of the above that assigns the C0 and C1 control codes to these two ranges. This makes it compatible with ASCII, and all 256 bytes are assigned a character or control code. Unicode chose to make its first 256 code points U+0000 to U+00FF compatible with this, so u8::try_from(u32::from(some_char)) effectively encodes to IANA’s ISO-8859-1. To avoid confusion, https://infra.spec.whatwg.org/#isomorphic-encode calls this operation isomorphic encode instead.

  • Windows code page 1252 is a different extension of ISO/IEC 8859-1 that assigns various punctuation or letters to some of the 0x80 to 0x9F range. (In 0x00 to 0x7F it is compatible with ASCII and IANA’s ISO-8859-1 though) Windows-1252 is sometimes incorrectly called ISO-8859-1 or Latin-1.

  • In web browsers, at this point for compatibility reasons, many labels (used for example in charset in Content-Type) including ascii, latin1 and iso-8859-1 map to an extension of Windows-1252 where the 5 unassigned byte values are mapped to corresponding C1 control codes like in IANA’s ISO-8859-1.

Back to Rust API naming, we already have the precedent of impl From<u8> for char that does the opposite conversion. And my pre-1.0 proposal to rename str::as_bytes to str::as_utf8 was rejected.

@scottmcm
Copy link
Member

scottmcm commented Nov 10, 2020

FWIW, I think this makes sense. In general, any time there's A: From<B>, I think it makes sense to have B: TryFrom<A> for going the other way. (Assuming, of course, that it's possible to implement, but if it isn't then I'm also skeptical of the From...)

EDIT: Said otherwise, we have u8: TryFrom<u32>, and char: From<u8> is precedent for "u8 is a USV in 0..256" (as opposed to a code unit, or something), so we might as well have u8: TryFrom<char>.

@scottmcm scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 10, 2020
@burdges
Copy link

burdges commented Nov 11, 2020

Oh? Is there a convention to not provide A: From<B> if B: TryFrom<A> makes no sense, like say if A is a hash of B or something? Or did you mean only in the context of std?

I think wrapper types improve code quality of course, but sometimes people then use From/Into more once they have specific types, which harms readability over named methods.

I suppose one guideline would be impl From<B> for A if code would be improved by using the bound B: Into<A> but avoid doing so otherwise.

@scottmcm
Copy link
Member

Is there a convention to [...]

Nothing official, as far as I know. It's just a heuristic I use, since if you can't recover the original, then it generally seems to me like it's doing something non-obvious enough that it would be better as a method with a more illustrative name. (Of course, this is also only for Froms with the same ownership type -- I'm not expecting to be able to recover the original from something like String: From<&str>.)

Scrolling through std's From impls they do mostly match this. I probably wouldn't bother ever writing out the the TryFrom opposite of RecvTimeoutError: From<RecvError>, since it's just matches!(foo, RecvTimeoutError::Disconnected), but it would be completely plausible if someone wanted to have it. Similarly for String: From<char> -- I'm not sure that char: TryFrom<String> would be a good API (would one really ever want to .try_into() such that it fails if there's not exactly one char?) but it's pretty clear what the behaviour would be.

The ones that are a bit weirder are things like BinaryHeap<T>: From<Vec<T>> -- there's a From back the other way, but the first way was actually a heapify operation, so the original value can't be recovered exactly. I think that's a place where I'd have actually preferred it just be BinaryHeap::new without the from existing -- it gives a more accessible place to document complexity and such.

All that said, #2484 includes some text proposing rules more along these lines.

ehuss pushed a commit to ehuss/rust that referenced this issue Jan 8, 2022
Previously suggested in rust-lang/rfcs#2854.

It makes sense to have this since `char` implements `From<u8>`. Likewise
`u32`, `u64`, and `u128` (since rust-lang#79502) implement `From<char>`.
ehuss added a commit to ehuss/rust that referenced this issue Jan 8, 2022
Implement `TryFrom<char>` for `u8`

Previously suggested in rust-lang/rfcs#2854.

It makes sense to have this since `char` implements `From<u8>`. Likewise `u32`, `u64`, and `u128` (since rust-lang#79502) implement `From<char>`.
@ebkalderon
Copy link
Author

Done in rust-lang/rust#84640, which landed in 1.59.0! 🎉

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests

7 participants