[discussion] Native unicode support #352
Comments
I'm not aware of any case which relies on being able to change the buffer after
|
Thanks. I ended up (ab)using the ANSI So any trick busybox or busybox-w32 do (like with the empty values, or the clearenv implementation), should work out of the box also with the native unicode mode.
As far as I can tell, when spawn is used - it takes the process env - not the CRT But if you modify Still on the subject of I see that there were some commits related to UCRT. As far as I can tell, the current state is that we can't spawn with UCRT with non-null And so, if we need to spawn with a custom Is that correct? Originally I meant to handle NULL env to spawn by simply replacing it with Can you reproduce the crash issue with non-NULL env? Any pointers how to build busybox-w32 with UCRT? |
Nofork applets run in the same process as their parent shell. There's no spawn in this case. Passing a non-NULL env to spawn with UCRT causes any subsequent spawn in the child process to fail. If the child process doesn't spawn a grandchild there's no problem. The workaround is to arrange things so that the initial spawn always has a NULL env. I haven't reproduced this recently. When I worked on the problem I used the UCRT variant of MSYS2 on Windows. |
Thanks. So how does one trigger the crash scenario as far as a user test case goes? Also, I might want to put all the UTF8 APIs implementations in a new C file (maybe
I see it's in Do I need to mention the |
You'd need to:
That shouldn't be necessary. |
Thanks. I still didn't get to try UCRT, but I'll update here once I do. Regardless, before you make the next release, can I please review the release notes about unicode? |
Thanks. Generally looks good. Few minor comments:
That might sound as if we're changing the console CP. Assuming we don't want to imply this, maybe remove "this is handled transparently", or maybe reword as something like "Unicode output and interactive input work with any console code page."
I believe this refers only to UTF8_INPUT and/or UTF8_OUTPUT? (I don't recall other workarounds) If yes, then it might be useful to use something like "... deficiencies in Microsoft's current support for UTF-8 in interactive use and printouts to the console" Or some other wording to limit the statement to console IO, as this doesn't refer, as far as I can tell, to the ability to work with files with unicode names, like with
Maybe add "especially when it comes to rendering unicode". I don't think there's difference in editing as long as the glyphs are supported. As an additional note, I found out that other terminals also work better than the windows console, like vscode builtin terminal (amazing combining chars rendering), or hyper or wez (that's not an endorsement), and possibly more. All the mentioned ones seem to handle combining chars better than the windows terminal, tested with this - check "The Three Kingdoms" alignment thingy (the whole page pastes just fine into a here-document in an interactive sh prompt ;) ). However, all of those also don't have mouse-wheel events for console apps which support it (like But I agree that mentioning only the windows terminal is enough.
A reminder that while it doesn't fix all busybox editing issues, I do have a version with I don't know how much it affects practical use cases, but I believe it would affect that rocketship prompt which was mentioned in another issue, because that rocket glyph is double width, but current busybox[-w32] doesn't know that, so if it appears as part of PS1 at the last line of the prompt, busybox will get confused with the line wrap (of the user input which follows). The full update to Unicode 15 does include it as double width.
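To illustrate why the width data matters for the prompt, here is a minimal sketch of a cell-width lookup. It assumes a tiny hand-picked range table; `wide_ranges`, `cell_width` and `prompt_columns` are hypothetical names, not the busybox `wcwidth` code, and a real table would be generated from the Unicode data files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A few East Asian Wide ranges, hand-picked for illustration only. */
static const struct { uint32_t first, last; } wide_ranges[] = {
    { 0x1100,  0x115F  },  /* Hangul Jamo initial consonants */
    { 0x4E00,  0x9FFF  },  /* CJK Unified Ideographs */
    { 0xAC00,  0xD7A3  },  /* Hangul Syllables */
    { 0x1F680, 0x1F6C5 },  /* includes U+1F680 ROCKET */
};

/* Columns used by one code point: 2 if listed as wide, else 1.
 * Sketch only: combining marks and control characters are ignored. */
static int cell_width(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof(wide_ranges) / sizeof(wide_ranges[0]); i++) {
        if (cp >= wide_ranges[i].first && cp <= wide_ranges[i].last)
            return 2;
    }
    return 1;
}

/* Total columns used by a prompt given as decoded code points. */
static size_t prompt_columns(const uint32_t *cps, size_t n)
{
    size_t i, cols = 0;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++)
        cols += (size_t)cell_width(cps[i]);
    return cols;
}
```

With this table a prompt of rocket, space, `$`, space occupies 5 columns. An editor that counts the rocket as 1 column would compute 4 and wrap the input line one cell late.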
As I think it's feasible to get the native unicode working, maybe change it to Regardless of Unicode, this looks like another good release, and good notes. Thanks again for maintaining it. |
I originally wrote
but thought that might imply the code page was changed automatically. So I said 'transparently' instead. End users probably don't care too much about the exact nature of Microsoft's deficiencies so I prefer to leave them unspecified. I'll have a look at the The release certainly isn't going to be FRP-5179. Testing has uncovered problems on Wine which will require at least one workaround. |
Sure. Let me know if you need anything.
Hmm... I presume that's in your private test setup?
First of all, I was wrong. Mouse events do work in wezterm, but not the others (vscode, hyper). FWIW, this seems to be an issue of older (win10, but not win11) wezterm bundles a newer |
I've decided not to take the I'm not impressed with the latest version of Wine in Fedora. There are a number of regressions. I've worked around the most serious; the rest can wait. It's not like Wine is an important platform for busybox-w32.
Not private: I'll restart my testing with FRP-5181 and see how that goes. |
Sure. But if the unicode build gains traction, then I think it would be good to get this or something similar in. The upstream I'm not suggesting to push it upstream because TBH I don't think we should deal with the "compression" of the data at upstream wcwidth whenever a new Unicode version is released (I think Unicode 16 should be released in few weeks). I don't think there are existing scripts to automate it from the Unicode data to the busybox "compression", so taking it from someplace which does automate it sounds a lot better to me, especially where we might have slightly less concerns with the balance of program size vs features.
Do the linux parts refer to testing busybox-w32 on linux using wine? or to testing the native linux busybox? Anyway, back to the subject of this issue: native unicode support. I've noticed that in some win32 files, Now, So the point is that I think I don't think I can identify a reason it's included first. Is there such reason? If this is expected to work because I'm asking this because the utf8 API would need similar global mapping which would probably apply via (of course, names which are already mapped globally currently, like |
The latter. It's possible to build and test a native Linux binary using the busybox-w32 source. The test output doesn't exactly match that obtained using the upstream source. That's because some Windows tests are included when they really shouldn't be. I'll fix that.
I've just been following that existing practice. |
Would you be against moving For instance, mapping Anyway, next subject. I was trying to evaluate which/how-many APIs would need to be mapped from ansi to UTF8. Searching APIs at the code did not seem productive, so I wrote a script which produces this list of the ansi symbols used by List of ansi symbols used by `busybox.exe` (about 100)
List of all symbols used by `busybox.exe`, with the ansi ones marked, with the respective wide symbol
I'm guessing a good bunch of the ansi APIs don't need special handling, e.g. many IO APIs, like Once we have native UTF8 working, we can add a white list of allowed ansi symbols, and then use the script occasionally to check if we need to add more UTF8 wrappers etc. So this is largely an FYI, but I'd appreciate if you could glance over it and see if anything stands out as wrong. For reference, here's the current script: w32sym.zip |
Yeah, I think I overshot a bit with the matching wide symbols. These are correct matching symbols (e.g. Compared, for instance, to So here's the revised shorter list (about half) of ansi symbols + wide matches
And here's the revised script. By default it now produces the short list (limited ANSI APIs), but it can also produce the long list with the general ansi APIs: w32sym.v2.zip But I'm guessing it can't detect symbols which are loaded dynamically? Not sure... |
Not if it's strictly necessary.
I guess not. They'll appear as strings in the binary not as symbols. Easy enough to find in the source, though. |
Hmm.. I can't reproduce this. I've built busybox.exe with UCRT on windows, using winlibs, i386, adding to the path a dir with busybox and the output of I got many warnings where LL_FMT and OFF_FMT are used - these seem to be ignored (it thinks Example warning:
I then used a (wide) spawn variant which replaces NULL env with environ, so that spawn is never called with a NULL env, and does get called with environ env. I couldn't quite figure out how to setup the make/time example, so I tried these, and all seem to work:
It would really help to have some one-liner test case to reproduce it... Anyway, for now, I'll keep the current UCRT code which uses NULL env, and I'll handle exporting the UTF8 env to the system env, because it would also typically be much quicker to convert only the unicode values than to convert the whole UTF8 environ to wide. It's not a big function so that should be fine. But I would still like to be able to reproduce the UCRT crash when passing environ env, and then confirm that the NULL env fixes it. |
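As a rough illustration of "convert only the unicode values", here is a minimal sketch which assumes that means entries containing bytes outside 7-bit ASCII. `has_non_ascii` and `count_entries_needing_conversion` are hypothetical helpers, not existing busybox-w32 functions.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the string contains any byte outside 7-bit ASCII.
 * Assumption: only such entries need a UTF-8 to wide conversion. */
static int has_non_ascii(const char *s)
{
    const unsigned char *p;

    assert(s != NULL);
    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if (*p >= 0x80)
            return 1;
    }
    return 0;
}

/* Counts the entries of a NULL-terminated "NAME=value" array
 * which would need converting under that assumption. */
static size_t count_entries_needing_conversion(char *const envp[])
{
    size_t i, n = 0;

    assert(envp != NULL);
    for (i = 0; envp[i] != NULL; i++)
        n += (size_t)has_non_ascii(envp[i]);
    return n;
}
```

In a typical environment most entries are plain ASCII, so a scan like this would skip nearly all of them before any conversion work is done.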
I tried also with the latest winlibs x86-64, plain non-any-utf8 build, with only this diff to the source:

diff --git a/shell/ash.c b/shell/ash.c
index 5a5c947e8..ec11e8f56 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9126,7 +9126,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
# else
if (APPLET_IS_NOEXEC(applet_no)) {
# endif
-#if ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
+#if 1 || ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
/* If building for UCRT move this up into shellexec() to
* work around a bug. */
clearenv();
@@ -9203,7 +9203,7 @@ static void shellexec(char *prog, char **argv, const char *path, int idx)
int applet_no = -1; /* used only by FEATURE_SH_STANDALONE */
envp = listvars(VEXPORT, VUNSET, /*strlist:*/ NULL, /*end:*/ NULL);
-#if ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
+#if 0 && ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
/* Avoid UCRT bug by updating parent's environment and passing a
* NULL environment pointer to execve(). */
clearenv();

And tried also this case:

./busybox sh -c 'X=x ./busybox sh -c "Z=z ./busybox"'

Which also works fine as far as I can tell.

As for the warning, I think I previously missed also this line, which is in line with my statement that it's defined as

So I'm guessing in winlibs it doesn't like

#if ENABLE_PLATFORM_MINGW32 && \
(!defined(__USE_MINGW_ANSI_STDIO) || !__USE_MINGW_ANSI_STDIO)
#define LL_FMT "I64"
#else
#define LL_FMT "ll"
#endif

So maybe |
I'm also unable to reproduce the problem. Maybe Microsoft have fixed UCRT. |
OK, I'll leave it up to you to decide what to do with the UCRT workaround. Meanwhile, I'll just support the standard behavior with (utf8) spawn: allow and "forward" both NULL and non-NULL env, and if it's NULL then ensure the system wide env is up to date (instead of the smaller and slower hack of replacing NULL with environ and then converting all of it to wide). As for the UCRT I64 warnings, I have a patch and will send a PR soon. I'm just confirming first that it compiles without warnings with/without UCRT for i686/x86-64. |
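For context on the `I64` warnings, here is a minimal sketch of how a length-modifier macro like `LL_FMT` ends up inside a format string. It uses `_WIN32` as a stand-in for `ENABLE_PLATFORM_MINGW32` so it builds outside the busybox tree, and `format_ll` is a hypothetical helper. This is not the patch mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the busybox definition: _WIN32 replaces
 * ENABLE_PLATFORM_MINGW32 so this compiles outside the tree. */
#if defined(_WIN32) && \
	(!defined(__USE_MINGW_ANSI_STDIO) || !__USE_MINGW_ANSI_STDIO)
# define LL_FMT "I64"
#else
# define LL_FMT "ll"
#endif

/* Formats a long long by pasting LL_FMT between '%' and 'd'.
 * Returns what snprintf() returns. */
static int format_ll(char *buf, size_t size, long long v)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "%" LL_FMT "d", v);
}
```

The compiler checks the pasted string against whichever printf flavour it believes is in use, so a mismatch between the macro and the runtime would show up as format warnings at every call site of this kind.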
OK, this rabbit hole was a bit deeper than expected. It bothered me that I was unable to reproduce the problem with UCRT. The original issue (#234) was related to an interaction between GNU make and busybox-w32 when they were both compiled with UCRT. I was able to reproduce the problem using GNU make from back then but wanted to do it solely with current busybox-w32 compiled with the current MinGW-w64 UCRT toolchain. The issue in GNU make is that it passes a non-default environment to

diff --git a/win32/popen.c b/win32/popen.c
index 2208aa6bb..de2f2b1fb 100644
--- a/win32/popen.c
+++ b/win32/popen.c
@@ -190,6 +190,7 @@ static int mingw_popen_internal(pipe_data *p, const char *cmd,
int success;
int fd = -1;
int ip, ic, flags;
+ char env[] = "HELLO=1\0";
if ( cmd == NULL || *cmd == '\0' || mode == NULL ) {
return -1;
@@ -251,7 +252,7 @@ static int mingw_popen_internal(pipe_data *p, const char *cmd,
NULL, /* primary thread security attributes */
TRUE, /* handles are inherited */
0, /* creation flags */
- NULL, /* use parent's environment */
+ env,
NULL, /* use parent's current directory */
&siStartInfo, /* STARTUPINFO pointer */
&p->piProcInfo); /* receives PROCESS_INFORMATION */

Then I modified the shell to force it to spawn all applets and reverted the UCRT workaround in the shell:

diff --git a/shell/ash.c b/shell/ash.c
index 2ea87a049..0f1862424 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9103,7 +9103,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
{
#if ENABLE_FEATURE_SH_STANDALONE
if (applet_no >= 0) {
-# if ENABLE_PLATFORM_MINGW32
+# if ENABLE_PLATFORM_MINGW32 && 0
/* Treat all applets as NOEXEC, including the shell itself if
* this is a FS_SHELLEXEC shell. */
struct forkshell *fs = (struct forkshell *)sticky_mem_start;
@@ -9124,7 +9124,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
# else
if (APPLET_IS_NOEXEC(applet_no)) {
# endif
-#if ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
+#if ENABLE_PLATFORM_MINGW32 && !defined(_WHATEVER)
/* If building for UCRT move this up into shellexec() to
* work around a bug. */
clearenv();
@@ -9201,7 +9201,7 @@ static void shellexec(char *prog, char **argv, const char *path, int idx)
int applet_no = -1; /* used only by FEATURE_SH_STANDALONE */
envp = listvars(VEXPORT, VUNSET, /*strlist:*/ NULL, /*end:*/ NULL);
-#if ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
+#if ENABLE_PLATFORM_MINGW32 && defined(_WHATEVER)
/* Avoid UCRT bug by updating parent's environment and passing a
* NULL environment pointer to execve(). */
clearenv();

To make the call to
The shell macro assignment (
The UCRT version without the workaround gives:
So, the workaround is still necessary. Without it GNU make still triggers the problem previously reported. |
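The `env` array in the popen.c test patch above is a Windows environment block: NUL-terminated `NAME=value` strings followed by one extra NUL. A minimal sketch of building such a block from a NULL-terminated array follows. `build_env_block` is a hypothetical helper, not part of busybox-w32 or GNU make, and the two-NUL padding for an empty array is an assumption made to be safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Builds "A=1\0B=2\0\0" from a NULL-terminated array of strings.
 * Returns a malloc'd block (caller frees) or NULL on failure;
 * *out_len receives the block size in bytes. */
static char *build_env_block(char *const envp[], size_t *out_len)
{
	size_t i, total = 1;	/* the final terminating NUL */
	char *block, *p;

	assert(envp != NULL && out_len != NULL);
	for (i = 0; envp[i] != NULL; i++)
		total += strlen(envp[i]) + 1;
	if (total < 2)
		total = 2;	/* assumption: pad an empty block to two NULs */

	block = calloc(total, 1);	/* zero-filled, so terminators are in place */
	if (block == NULL)
		return NULL;

	p = block;
	for (i = 0; envp[i] != NULL; i++) {
		size_t n = strlen(envp[i]) + 1;	/* copy the NUL as well */
		memcpy(p, envp[i], n);
		p += n;
	}
	*out_len = total;
	return block;
}
```

For the single entry `HELLO=1` this produces the same 9 bytes as the `char env[] = "HELLO=1\0";` initialiser in the patch: 7 characters, the string's NUL, and the block terminator.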
It's good to know the workaround is still required, and apparently in Reference:
|
Here are some more thoughts (I'm not abandoning the native utf8, but still).

Mingw-w64 recently changed their default to ucrt (at git master, probably for v12, if there wouldn't be too much backlash), and while w64devkit doesn't currently have plans to switch to ucrt, I'm guessing the defaults would eventually trickle down to most mingw setups.

ucrt applications can run on XP. It does require installing the ucrt runtime (there is/was an official runtime for XP), but then it works, even when compiling with llvm 17 (with some XP compiler and linker options - if the mingw setup defaults to win 7 or later).

Using a utf8 locale does not actually require the utf8 manifest. It can also be set using Another difference with runtime And, also unlike the manifest, this only works with ucrt. With msvcrt on win 10/11 (and older versions) the setlocale call fails, which is why I mentioned the mingw default change and that ucrt can run on XP as well.

Here's an example of using the runtime setlocale approach: gwsw/less@ad440b6 It's relatively little code. More than the manifest (which basically requires no code changes), but not too much. I think it can trivially work in busybox-w32 as well.

A manifest can still be added, which would bypass the initial conversion of argv and environ, but I'm guessing that the goal of this is to support XP up to latest with the same binary, and the manifest breaks XP.

And, we might need to reconsider our position about the same binary being unicode on one system and ansi on another. Thoughts? |
So, I have a local patch to use this method as FEATURE_UTF8_UCRT, but it seems that it doesn't work fully, or I couldn't make it work, as I expected. TL;DR: So in busybox sh, I've reproduced the See also here gwsw/less#438 I don't know whether that's a bug of the UTF8 locale or a designed behavior - I don't think there's docs for this locale other than the linked page, and I don't quite get where the line goes between the manifest and the setlocale methods (other than argv, environ, etc). |
Or maybe I do. I'm guessing the Basically, roughly the libc API, which I guess kind of makes sense, but is not very helpful if one needs total UTF8 API, including the ANSI win32 APIs... |
@rmyorston |
Please stop it. The answer is no. Just replace the binary instead. |
So, I'm looking into adding native unicode support.

Basically similar to what the utf8 manifest does, but manually (adding `U` variant of win32 APIs which use utf8 prototypes and then internally calls the `W` APIs - instead of letting the manifest change the `A` APIs into utf8).

So I created this issue to ask related questions instead of opening an issue on each question.

First question: I might need to maintain a utf8 version of `environ` independent of the system `environ` (with its own *env API). Does busybox-w32 expect `putenv` to add the string to the environment, so that e.g. this prints `1`?

With POSIX *env API (and on linux) it prints 1, but on windows, as far as I can tell `_putenv` makes a copy of `xenv`, so the user's string doesn't actually become part of the environment, and it prints 0.

Also, with POSIX, it's valid to change `environ` to point to something else (user-provided array, for instance).

I see that `mingw_putenv` does a trick to set an empty value, and it's indeed aware that the buffer gets copied (hence `environ` is iterated to find the copy to truncate).

Do you know if these use cases (change the buffer after `putenv`, or change `environ`) happen with upstream busybox? If yes, how does it work on windows?

(`setenv` does make a copy, and so it can be more efficient to access using something less linear than `environ`).