Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

to_json(std::filesystem::path) can create invalid UTF-8 chars on windows #4271

Open
2 tasks done
MHebes opened this issue Jan 19, 2024 · 13 comments
Open
2 tasks done
Labels
kind: bug state: please discuss please discuss the issue or vote for your favorite option

Comments

@MHebes
Copy link

MHebes commented Jan 19, 2024

Description

This conversion function:

https://github.com/nlohmann/json/blob/7efe875495a3ed7d805ddbb01af0c7725f50c88b/include/nlohmann/detail/conversions/to_json.hpp#L416C1-L420C2

template<typename BasicJsonType>
inline void to_json(BasicJsonType& j, const std_fs::path& p)
{
    j = p.string();
}

uses p.string(), which does not give a UTF-8-encoded string on windows (in some cases, maybe?). Trying to dump() the resultant JSON throws a "invalid UTF-8 byte" exception.

Reproduction steps

Convert a std::filesystem::path, which contains a unicode "Right Single Quotation Mark" character (U+2019), to a json implicitly or with to_json.

Inspect the new json (string_t)'s bytes, either by dump()ing, or converting to BSON.

Expected vs. actual results

Expected: "Strings are stored in UTF-8 encoding." per https://json.nlohmann.me/api/basic_json/string_t/

Actual: The string gets converted by std::filesystem::path::string(), which appears to convert it to Windows-1252 encoding. Its bytes end up as \x92 rather than \xe2\x80\x99.

Minimal code example

#include <filesystem>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  try {
    wchar_t wide_unicode_right_quote[2] = {0x2019, 0};  // came from a directory_iterator in reality
    nlohmann::json apost = std::filesystem::path(wide_unicode_right_quote);
    std::cout << apost << std::endl;
    return 0;
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return 1;
  }
}

Workaround I'm using is to use WideCharToMultiByte + .native() to get the string in UTF-8 before passing to nlohmann:

inline std::string Narrow(std::wstring_view wstr) {
  if (wstr.empty()) return {};
  int len = ::WideCharToMultiByte(CP_UTF8, 0, &wstr[0], wstr.size(), nullptr, 0, nullptr, nullptr);
  std::string out(len, 0);
  ::WideCharToMultiByte(CP_UTF8, 0, &wstr[0], wstr.size(), &out[0], len, nullptr, nullptr);
  return out;
}

int main() {
  try {
    wchar_t wide_unicode_right_quote[2] = {0x2019, 0};  // came from a directory_iterator in reality
    nlohmann::json apost = Narrow(std::filesystem::path(wide_unicode_right_quote).native());
    std::cout << apost << std::endl;
    return 0;
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return 1;
  }
}

Error messages

"[json.exception.type_error.316] invalid UTF-8 byte at index 0: 0x92

Compiler and operating system

MSVC 2022 Professional, C++ 20

Library version

develop - a259ecc

Validation

@MHebes
Copy link
Author

MHebes commented Feb 16, 2024

I can also workaround this problem by adding a manifest XML that sets my app's code page to CP_UTF8 on supported versions of windows.

In CMake I wrapped this in a function:

# target_add_manifest(<target> <manifest file>)
#
# You probably want to use ${MANIFEST_FILE_UTF8} defined below this function
#
# Adds a manifest file (https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests)
# to an EXE
function(target_add_manifest TARGET_NAME MANIFEST_FILE)
  if(NOT TARGET_NAME)
	  message(FATAL_ERROR "You must provide a target")
	endif()
	if(NOT MANIFEST_FILE)
	  message(FATAL_ERROR "You must provide a manifest file")
	endif()
	add_custom_command(
		TARGET ${TARGET_NAME}
		POST_BUILD
		COMMAND "mt.exe" -manifest \"${MANIFEST_FILE}\" \"-updateresource:$<TARGET_FILE:${TARGET_NAME}>\"
	)
endfunction()

which is used like this (probably want to wrap in a platform check):

add_executable(myapp main.cpp)
target_add_manifest(myapp "${CMAKE_CURRENT_SOURCE_DIR}/cmake/utf8.manifest")

with utf8.manifest being:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

This solves the problem, if the app is running on at least Windows Version 1903. Still a bug but wanted to share this workaround because it's useful for many libraries that have the same issue.

@MHebes
Copy link
Author

MHebes commented Feb 16, 2024

Proposed diff to do the conversion to UTF-8 when targeting windows:

diff --git a/include/nlohmann/detail/conversions/to_json.hpp b/include/nlohmann/detail/conversions/to_json.hpp
index 562089c3..a8b74688 100644
--- a/include/nlohmann/detail/conversions/to_json.hpp
+++ b/include/nlohmann/detail/conversions/to_json.hpp
@@ -413,10 +413,20 @@ inline void to_json(BasicJsonType& j, const T& t)
 }
 
 #if JSON_HAS_FILESYSTEM || JSON_HAS_EXPERIMENTAL_FILESYSTEM
+#if defined(_WIN32)
+#include <windows.h>
+#endif
 template<typename BasicJsonType>
 inline void to_json(BasicJsonType& j, const std_fs::path& p)
 {
+#if defined(_WIN32)
+    int len = ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), nullptr, 0, nullptr, nullptr);
+    std::string as_utf8(len, 0);
+    ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), &narrowed_string[0], len, nullptr, nullptr);
+    j = std::move(as_utf8);
+#else
     j = p.string();
+#endif
 }
 #endif

@zel1b08a
Copy link

zel1b08a commented Jun 1, 2024

path may be represented in some ways (native/generic_string/string/u8string/e.t.c), so, I think it should be decided on client side how to store it before put it to json object.

Just j = p.u8string(); may be should be there to corresponds the docs.

@nlohmann
Copy link
Owner

I am not sure how to proceed here as I am not a Windows user. Any idea?

@nlohmann nlohmann added the state: please discuss please discuss the issue or vote for your favorite option label Nov 18, 2024
@MHebes
Copy link
Author

MHebes commented Nov 18, 2024

I am not sure how to proceed here as I am not a Windows user. Any idea?

I'm happy to make a PR, up to you what solution to use though:

  1. Use OS APIs to convert to UTF-8 explicitly like in the above diff. Adds a Windows.h dependency. May not fix it for some theoretical non-windows but non-utf-8-native platforms (if they exist).
  2. Use std::filesystem::path::u8string() when supported. Use the current (buggy) behavior otherwise.
  3. Deprecate the built in path converters and/or document the deficiency but otherwise make no code changes.

With (1) and (2) I'm not sure the effect on custom StringTypes. The documentation says "Strings are stored in UTF-8 encoding" which implies that StringType has to have a char value type.

@gregmarr
Copy link
Contributor

gregmarr commented Nov 18, 2024

Use std::filesystem::path::u8string() when supported. Use the current (buggy) behavior otherwise.

When is it not supported? I don't see any indication that it's optional. I see that on Windows it can throw if the string can't be represented in utf-8, so that's a consideration.

@MHebes
Copy link
Author

MHebes commented Nov 18, 2024

When is it not supported?

Sorry, read the reference page wrong!

In that case, @zel1b08a's solution seems best by far. Replace with .u8string(), done & easy.

Small chance of changing the behavior of already-broken code I guess (e.g. someone calling path::path(const char*) on windows with UTF8-encoded chars instead of the active code page)? Not sure if that matters or is possible.

@nlohmann
Copy link
Owner

Awesome. Looking forward to a pull request :)

@risa2000
Copy link

The OS "favored" way how to process unicode on Windows is to use std::wstring, which uses UTF-16LE encoding. (Un)fortunately everyone else favors UTF-8. std::filesystem::path can be constructed from std::string, std::wstring and bunch of specialized unicode types, where in the first case it uses "multibyte" encoding (local code page), in the second case it uses UTF-16LE. Similarly, when converting the path back to the string, using .string() will emit local code page "multibyte" encoding, while .wstring() will emit UTF-16LE encoding. No one wants to mess with local code page encoding and/or converting.

In C++17 there were two functions std::filesystem::path::u8path() to create a path from UTF-8 encoded std::string and std::filesystem::path::u8string() to convert the path back to UTF-8 encoded std::string.
This has been changed in C++20 in a way that u8path has been dropped and u8string() now returns std::u8string, which is basically useless.

Since nlohmann::json assumes that strings are implicitly in UTF-8 the only way to handle this C++ discrepancy correctly on Windows would be to convert std::string into an "intermediate format", either std::wstring (UTF-16LE encoding), or std::u8string (with UTF-8), then pass it to std::filesystem::path constructor, and vice versa, i.e. when serializing to JSON, "extract" either of types from std::filesystem::path and then convert to UTF-8 encoded std::string.

Windows "native" way is to use already mentioned ::WideCharToMultiByte() for conversion from std::wstring to UTF-8 encoded std::string, or, which I have not tried, but I imagine, there might be a more straightforward way to convert between std::u8string and std::string (with UTF-8 encoding). Something like a char copy in either direction.

@nlohmann
Copy link
Owner

If it helps: the library contains code to convert UTF-16 or UTF-32 to UTF-8 (see wide_string_input_helper in include/nlohmann/detail/input/input_adapters.hpp. And since the issue of converting to UTF-8 is only relevant to Windows, maybe relying on ::WideCharToMultiByte() could be preferred.

@risa2000
Copy link

risa2000 commented Jan 19, 2025

I was thinking more about using all the power of the standard lib to avoid calling Win32 API and just before starting to write such an implementation I found exactly this already done here: https://github.com/alf-p-steinbach/C---how-to---make-non-English-text-work-in-Windows/blob/main/microlibs/cppm/stdlib_workarounds/fs_path.hpp#L19

I just cleaned it up a bit and did few cosmetic changes: removed external types and dependencies, and replaced string_view with std::string parameter in path_from_u8.
I also had to better define the C++ std detection as __cplusplus macro is only defined if /Zc:__cplusplus compiler option is defined (which is usually not), so therefore _MSVC_LANG is used instead.
I do not know if the library already uses conditionally compiled paths based on __cplusplus, but if it does, the already defined macro should be used instead.

The cleaned up code for an explicit conversion between std::filesystem::path and UTF-8 encoded std::string looks like this:

#if defined(_MSVC_LANG) && !defined(__clang__)
#define NLOHMANN_JSON_CPP_STD _MSVC_LANG
#else
#define NLOHMANN_JSON_CPP_STD __cplusplus
#endif

inline auto path_from_u8(const std::string& path) -> std::filesystem::path
{
#if NLOHMANN_JSON_CPP_STD < 202002 // `<` b/c `u8path` is deprecated in C++20; ⇨ warnings.
    return std::filesystem::u8path(path);
#else
    return std::filesystem::path(std::u8string_view(reinterpret_cast<const char8_t*>(path.data()), path.size()));
#endif
}

inline auto to_u8_string(const std::filesystem::path& p) -> std::string
{
#if NLOHMANN_JSON_CPP_STD < 202002
    return p.u8string(); // Returns a `std::string` in C++17.
#else
    const std::u8string s = p.u8string();
    return std::string(s.begin(), s.end()); // Needless copy except for C++20 nonsense.
#endif
}

I see certain beauty in this solution: The same codepath works on any platform, regardless its locale, and it does not need to use Win32 API call. The price to pay is one useless copy in to_u8_string. Is it worthy - I do not know.

BTW: Are the "other" platforms (e.g. linux and macOS) only allowing UTF-8 encoding in locale nowadays? I remember there were times when LC_LOCALE could be set to ISO page and in this case I believe even there the "default" std::string used by std::filesystem::path will not get UTF-8 encoded?

@risa2000
Copy link

@nlohmann
If you agree with the aforementioned approach I will submit a PR for this.

@nlohmann
Copy link
Owner

Sure, please go ahead.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
kind: bug state: please discuss please discuss the issue or vote for your favorite option
Projects
None yet
Development

No branches or pull requests

5 participants