Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Lossy UTF8 conversion of owned types (Vec::<u8>::into_utf8_lossy). #116

Closed
thomcc opened this issue Sep 29, 2022 · 6 comments
Closed

Lossy UTF8 conversion of owned types (Vec::<u8>::into_utf8_lossy). #116

thomcc opened this issue Sep 29, 2022 · 6 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@thomcc
Copy link
Member

thomcc commented Sep 29, 2022

Proposal

Problem statement

We should have a function that performs the same lossy conversion as String::from_utf8_lossy, but which goes from Vec<u8> => String, instead of &'a [u8] to Cow<'a, str>.

Motivation, use-cases

Our current function String::from_utf8_lossy optimizes for the case where you need to borrowed the input (a &[u8]) and can work with a borrowed output. It's very nice for this purpose, as it avoids the potentially costly copy in the common1 case that the input is already valid UTF-8.

Sadly, if you an need owned output (e.g. a String), there is no function in the stdlib that avoids copying for already-valid bytes, even if you're happy giving up your owned input Vec<u8>. In practice, you generally do with an expression like String::from_utf8_lossy(&vec).to_string(), which has the downside of always performing an extra copy if the input was valid UTF8 -- in other words, it pessimizes the already-valid-UTF8 case (it also has the dowside of being slightly strange looking, although rewording it to avoid this is likely possible).

It seems desirable to solve this by adding an analogous function that transforms an owned Vec<u8> (of potentially invalid UTF-8 bytes) into an owned String.

Solution sketches

I think the following API would be a good solution. A possible implementation is provided as well.

impl Vec<u8> {
    pub fn into_utf8_lossy(self) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&self) {
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(self) }
        }
    }
}

I explored several other options in the past in the IRLO thread linked below.

Links and related work

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

Footnotes

  1. I'm speculating when I suggest that already valid input is the common case, but given that the usefulness of the result of this function is tied to the UTF-8 validity of the input (e.g. if it's mostly invalid UTF-8, then the output is likely to be less readable), it seems generally reasonable.

@thomcc thomcc added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Sep 29, 2022
@thomcc
Copy link
Member Author

thomcc commented Oct 11, 2022

(After a bit more thought, into_utf8_lossy is a bit of a weird name here, and into_string_lossy does seem better)

@the8472
Copy link
Member

the8472 commented Oct 29, 2022

I think that's the wrong place. String already has from_utf8(vec: Vec<u8>) -> Result<String, FromUtf8Error>.
A convenience method to resume the conversion in a lossy fashion could be added to FromUtf8Error.

@thomcc
Copy link
Member Author

thomcc commented Oct 31, 2022

Oh, I hadn't thought of that. Yeah, that's a really good place to put it.

@joshtriplett
Copy link
Member

We discussed this in today's @rust-lang/libs-api meeting. We ended up settling on a method on String: String::from_utf8_lossy_owned, taking a Vec<u8> and returning a String.

We'd also be happy to see a conversion method on FromUtf8Error.

@dtolnay dtolnay added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label Jun 27, 2023
@m-ou-se m-ou-se closed this as completed Jun 27, 2023
@SergioBenitez
Copy link

This is marked as completed, but there's no linked PR, and String::from_utf8_lossy_owned() doesn't seem to exist anywhere. It looks like the last status was ACP-accepted. Was this issue incorrectly closed?

@ChrisDenton
Copy link
Member

The ACP was accepted so there's nothing more to track here. Anyone can implement the function.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Sep 15, 2024
Implement feature `string_from_utf8_lossy_owned` for lossy conversion from `Vec<u8>` to `String` methods

Accepted ACP: rust-lang/libs-team#116
Tracking issue: rust-lang#129436

Implement feature for lossily converting from `Vec<u8>` to `String`
- Add `String::from_utf8_lossy_owned`
- Add `FromUtf8Error::into_utf8_lossy`

---
Related to rust-lang#64727, but unsure whether to mark it "fixed" by this PR.
That issue partly asks for in-place replacement of the original allocation. We fulfill the other half of that request with these functions.

closes rust-lang#64727
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Sep 15, 2024
Rollup merge of rust-lang#129439 - okaneco:vec_string_lossy, r=Noratrieb

Implement feature `string_from_utf8_lossy_owned` for lossy conversion from `Vec<u8>` to `String` methods

Accepted ACP: rust-lang/libs-team#116
Tracking issue: rust-lang#129436

Implement feature for lossily converting from `Vec<u8>` to `String`
- Add `String::from_utf8_lossy_owned`
- Add `FromUtf8Error::into_utf8_lossy`

---
Related to rust-lang#64727, but unsure whether to mark it "fixed" by this PR.
That issue partly asks for in-place replacement of the original allocation. We fulfill the other half of that request with these functions.

closes rust-lang#64727
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Sep 16, 2024
Implement feature `string_from_utf8_lossy_owned` for lossy conversion from `Vec<u8>` to `String` methods

Accepted ACP: rust-lang/libs-team#116
Tracking issue: #129436

Implement feature for lossily converting from `Vec<u8>` to `String`
- Add `String::from_utf8_lossy_owned`
- Add `FromUtf8Error::into_utf8_lossy`

---
Related to #64727, but unsure whether to mark it "fixed" by this PR.
That issue partly asks for in-place replacement of the original allocation. We fulfill the other half of that request with these functions.

closes #64727
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

7 participants