Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[discussion] Native unicode support #352

Open
avih opened this issue Aug 11, 2023 · 27 comments
Open

[discussion] Native unicode support #352

avih opened this issue Aug 11, 2023 · 27 comments

Comments

@avih
Copy link
Contributor

avih commented Aug 11, 2023

So, I'm looking into adding native unicode support.

Basically similar to what the utf8 manifest does, but manually (adding U variant of win32 APIs which use utf8 prototypes and then internally calls the W APIs - instead of letting the manifest change the A APIs into utf8).

So I created this issue to ask related questions instead of opening an issue on each question.

First question: I might need to maintain a utf8 version of environ independent of the system environ (with its own *env API). Does busybox-w32 expect putenv to add the string to the environment, so that e.g. this prints 1 ?

char xenv[] = "x=0";
putenv(xenv);
xenv[2] = '1';
printf("%s\n", getenv("x"));

With POSIX *env API (and on linux) it prints 1, but on windows, as far as I can tell _putenv makes a copy of xenv, so the user's string doesn't actually become part of the environment, and it prints 0.

Also, with POSIX, it's valid to change environ to point to something else (user-provided array, for instance).

I see that mingw_putenv does a trick to set an empty value, and it's indeed aware that the buffer gets copied (hence environ is iterated to find the copy to truncate).

Do you know if these use cases (change the buffer after putenv, or change environ) happen with upstream busybox? If yes, how does it work on windows?

(setenv does make a copy, and so it can be more efficient to access using something less linear than environ).

@rmyorston
Copy link
Owner

I'm not aware of any case which relies on being able to change the buffer after putenv.

ash changes environ in evalcommand() when running a nofork applet. The existing value of environ is saved and restored after the applet completes.

@avih
Copy link
Contributor Author

avih commented Aug 15, 2023

Thanks.

I ended up (ab)using the ANSI environ to store the UTF8 values, same as I did 3 years ago here #17 (comment) so basically the native unicode thing uses the ANSI environ normally as far as the system goes, just that the encoding of non-ascii values is utf8 instead of ACP (and the mingw getenv/putenv/etc wrappers remain as is).

So any trick busybox or busybox-w32 do (like with the empty values, or the clearenv implementation), should work out of the box also with the native unicode mode.

ash changes environ in evalcommand()

As far as I can tell, when spawn is used - it takes the process env - not the CRT environ, but the process env is hopefully up to date because putenv updates both the crt environ (and possibly also _wenviron) and also the process env (e.g. using SetEnvironmentVariable[W]).

But if you modify environ to point to something else, how does the process env gets updated to the new values?

Still on the subject of environ.

I see that there were some commits related to UCRT. As far as I can tell, the current state is that we can't spawn with UCRT with non-null env, or else it will crash?

And so, if we need to spawn with a custom env array, it basically clears the current env and sets it from this custom array, and then calls spawn with a NULL env so that it would be taken from the environment (which is now a copy of the custom array).

Is that correct?

Originally I meant to handle NULL env to spawn by simply replacing it with environ (which is then converted to a wide array before calling _wspawnve), so that would presumably crash with UCRT? (it's not a problem to update the process env if it's null, it's just less code to replace NULL with environ, even if possibly slower to convert all of environ to a wide array).

Can you reproduce the crash issue with non-NULL env? Any pointers how to build busybox-w32 with UCRT?

@rmyorston
Copy link
Owner

Nofork applets run in the same process as their parent shell. There's no spawn in this case.

Passing a non-NULL env to spawn with UCRT causes any subsequent spawn in the child process to fail. If the child process doesn't spawn a grandchild there's no problem. The workaround is to arrange things so that the initial spawn always has a NULL env.

I haven't reproduced this recently. When I worked on the problem I used the UCRT variant of MSYS2 on Windows.

@avih
Copy link
Contributor Author

avih commented Aug 16, 2023

Passing a non-NULL env to spawn with UCRT causes any subsequent spawn in the child process to fail. If the child process doesn't spawn a grandchild there's no problem. The workaround is to arrange things so that the initial spawn always has a NULL env.

Thanks.

So how does one trigger the crash scenario as far as a user test case goes?

Also, I might want to put all the UTF8 APIs implementations in a new C file (maybe mingw-utf8.c or some such), possibly with a matching header.

How should I go about it as far as makefiles/other-requirements go? I tried grepping mingw.c and mingw.o (to check how mingw.c is compiled depending on CONFIG_PLATFORM_MINGW32) but couldn't find how this happens.

I see it's in win32/Kbuild. I used ag (the silver searcher) and apparently it skipped some files...

Do I need to mention the .h file someplace so that it rebuilds the .c file when the header changes?

@rmyorston
Copy link
Owner

So how does one trigger the crash scenario as far as a user test case goes?

You'd need to:

  • Revert the fix in the shell: sed -i 's/_UCRT/_WHATEVER/' shell/ash.c
  • Build busybox-w32 using the UCRT variant of MSYS2
  • Have the shell spawn an application that spawns some other application. In the original issue the first application was GNU make and I used busybox-w32 time as the second.

Do I need to mention the .h file someplace

That shouldn't be necessary.

@avih
Copy link
Contributor Author

avih commented Aug 17, 2023

Thanks.

I still didn't get to try UCRT, but I'll update here once I do.

Regardless, before you make the next release, can I please review the release notes about unicode?

@rmyorston
Copy link
Owner

Draft release notes

@avih
Copy link
Contributor Author

avih commented Aug 17, 2023

Draft release notes

Thanks. Generally looks good. Few minor comments:

It isn't necessary to adjust the console code page: this is handled transparently.

That might sound as if we're changing the console CP. Assuming we don't want to imply this, maybe remove "this is handled transparently", or maybe reword as something like "Unicode output and interactive input work with any console code page."

It has been necessary to apply workarounds for deficiencies in Microsoft's current support for UTF-8. There may be undiscovered problems and the workarounds themselves may have issues.

I believe this refers only to UTF8_INPUT and/or UTF8_OUTPUT? (I don't recall other workarounds)

If yes, then it might be useful to use something like "... deficiencies in Microsoft's current support for UTF-8 in interactive use and printouts to the console"

Or some other wording to limit the statement to console IO, as this doesn't refer, as far as I can tell, to the ability to work with files with unicode names, like with tar or cat.

The Windows Terminal works better than the Windows console.

Maybe add "especially when it comes to rendering unicode". I don't think there's difference in editing as long as the glyphs are supported.

As an additional note, I found out that other terminals also work better than the windows console, like vscode builtin terminal (amazing combining chars rendering), or hyper or wez (that's not an endorsement), and possibly more.

All the mentioned ones seem to handle combining chars better than the windows terminal, tested with this - check "The Three Kingdoms" alignment thingy (the whole page pastes just fine into a here-document in an interactive sh prompt ;) ).

However, all of those also don't have mouse-wheel events for console apps which support it (like less for windows), while such apps do work at the windows terminal/console.

But I agree that mentioning only the windows terminal is enough.

Editing (cursor movement, delete, etc) can be buggy, especially with double-width or combining characters.

A reminder that while it doesn't fix all busybox editing issues, I do have a version with wcwidth updated to latest Unicode (15). We already pushed a tiny update only for emojis, but the full update has much more. I can arrange quickly a PR with the full update if you want, which will be this, but it will use it unconditionally with UTF8_MANIFEST (and disable the existing table). Or just feel free to take this commit, edit it as you see fit, and push it yourself.

I don't know how much it affects practical use cases, but I believe it would affect that rocketship prompt which was mentioned in another issue, because that rocket glyph is double width, but current busybox[-w32] doesn't know that, so if it appears as part of PS1 at the last line of the prompt, busybox will get confused with the line wrap (of the user input which follows). The full update to Unicode 15 does include it as double width.

(mingw64-gcc 12.2.1-8.fc38; mingw64-crt 10.0.0-4.fc38; glob; Unicode)

As I think it's feasible to get the native unicode working, maybe change it to Unicode-manifest or Unicode-Win10 so that the difference becomes visible (I'm not suggesting again to also change the download file name).

Regardless of Unicode, this looks like another good release, and good notes. Thanks again for maintaining it.

@rmyorston
Copy link
Owner

I originally wrote

It isn't necessary to adjust the console code page: this is handled automatically.

but thought that might imply the code page was changed automatically. So I said 'transparently' instead.

End users probably don't care too much about the exact nature of Microsoft's deficiencies so I prefer to leave them unspecified.

I'll have a look at the wcwidth commit later.

The release certainly isn't going to be FRP-5179. Testing has uncovered problems on Wine which will require at least one workaround.

@avih
Copy link
Contributor Author

avih commented Aug 17, 2023

I'll have a look at the wcwidth commit later.

Sure. Let me know if you need anything.

The release certainly isn't going to be FRP-5179. Testing has uncovered problems on Wine which will require at least one workaround.

Hmm... I presume that's in your private test setup?

However, all of those also don't have mouse-wheel events for console apps which support it (like less for windows), while such apps do work at the windows terminal/console.

First of all, I was wrong. Mouse events do work in wezterm, but not the others (vscode, hyper).

FWIW, this seems to be an issue of older (win10, but not win11) conpty.dll, which apparently doesn't do mouse events well.

wezterm bundles a newer conpty.dll, so mouse wheel works.

@rmyorston
Copy link
Owner

I've decided not to take the wcwidth commit for now. I always seem to have cause to regret last-minute changes made just before a release.

I'm not impressed with the latest version of Wine in Fedora. There are a number of regressions. I've worked around the most serious; the rest can wait. It's not like Wine is an important platform for busybox-w32.

Hmm... I presume that's in your private test setup?

Not private: testsuite/runtest in busybox-w32 works on Windows as well as Linux. There are more failures on Windows, but that's only to be expected.

I'll restart my testing with FRP-5181 and see how that goes.

@avih
Copy link
Contributor Author

avih commented Aug 18, 2023

I've decided not to take the wcwidth commit for now

Sure. But if the unicode build gains traction, then I think it would be good to get this or something similar in. The upstream wcwidth is 15 years out of date, and the number of meme double-width codepoints/icons has grown considerably since then - both in official support and in various meme use cases, like very fancy PS1.

I'm not suggesting to push it upstream because TBH I don't think we should deal with the "compression" of the data at upstream wcwidth whenever a new Unicode version is released (I think Unicode 16 should be released in few weeks).

I don't think there are existing scripts to automate it from the Unicode data to the busybox "compression", so taking it from someplace which does automate it sounds a lot better to me, especially where we might have slightly less concerns with the balance of program size vs features.

Not private: testsuite/runtest in busybox-w32 works on Windows as well as Linux. There are more failures on Windows, but that's only to be expected.

Do the linux parts refer to testing busybox-w32 on linux using wine? or to testing the native linux busybox?

Anyway, back to the subject of this issue: native unicode support.

I've noticed that in some win32 files, libbb.h is included before other standard headers, for instance at fsync.c or glob.c or poll.c or even mingw.c.

Now, libbb.h includes mingw.h, and mingw.h has some mappings of default names, such as fopen to mingw_fopen, etc.

So the point is that I think libbb.h should be included last, so that it doesn't overrides system names before their respective system header is included.

I don't think I can identify a reason it's included first. Is there such reason?

If this is expected to work because libbb.h already includes these headers, and so they'll no-op the second #include, then I think it's kind of meh IMO, but would still be good to know.

I'm asking this because the utf8 API would need similar global mapping which would probably apply via mingw.h (like CreateFileA to createFile_U), but if windows.h is included after libbb.h, then this poses an issue (and same for other names which would require such mappings).

(of course, names which are already mapped globally currently, like mingw_fopen, would not have global mapping to the utf8 variants. These name would map to the utf8 variants at the C file where the current mingw/winansi wrapper is implemented).

@rmyorston
Copy link
Owner

Do the linux parts refer to testing busybox-w32 on linux using wine? or to testing the native linux busybox?

The latter. It's possible to build and test a native Linux binary using the busybox-w32 source. The test output doesn't exactly match that obtained using the upstream source. That's because some Windows tests are included when they really shouldn't be. I'll fix that.

libbb.h tends to be included first because that's how it's always been done:

fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    5) #include "libbb.h"
fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    6) #include <windows.h>
1ee308c75f (Ron Yorston          2021-12-22 08:22:12 +0000    7) #include "lazyload.h"
fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    8) #undef PACKED
fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    9)

I've just been following that existing practice.

@avih
Copy link
Contributor Author

avih commented Aug 19, 2023

I've just been following that existing practice.

Would you be against moving libbb.h it to the end?

For instance, mapping fprintf to mingw_printf would be disastarous if this happens before fprintf is declared...

Anyway, next subject.

I was trying to evaluate which/how-many APIs would need to be mapped from ansi to UTF8. Searching APIs at the code did not seem productive, so I wrote a script which produces this list of the ansi symbols used by busybox.exe.

List of ansi symbols used by `busybox.exe` (about 100)
$ w32sym.sh -A busybox.exe
GetUserNameA
CreateEventA
CreateFileA
CreateFileMappingA
CreateNamedPipeA
CreateProcessA
FillConsoleOutputCharacterA
FindFirstFileA
FindNextFileA
GetCompressedFileSizeA
GetDiskFreeSpaceExA
GetDriveTypeA
GetFileAttributesA
GetFileAttributesExA
GetFullPathNameA
GetModuleFileNameA
GetModuleHandleA
GetSystemDirectoryA
GetVersionExA
GetVolumeInformationA
LoadLibraryA
LoadLibraryExA
MoveFileExA
Process32First
Process32Next
ReadConsoleInputA
SetConsoleTitleA
SetEnvironmentVariableA
SetFileAttributesA
__argv
__getmainargs
__initenv
_access
_ctime64
_environ
_fdopen
_fullpath
_getch
_open
_putenv
_vscprintf
_vsnprintf
fgets
fopen
fprintf
fputc
fputs
getenv
isalnum
isalpha
iscntrl
isgraph
islower
isprint
ispunct
isspace
isupper
isxdigit
putc
putchar
rename
setlocale
sprintf
sscanf
strcat
strchr
strcmp
strcpy
strcspn
strftime
strlen
strncmp
strncpy
strpbrk
strrchr
strspn
strstr
strtok
strtol
strtoul
tolower
toupper
ungetc
vfprintf
_unlink
_spawnve
_rmdir
_open
_mktemp
_mkdir
_getcwd
_getche
_fdopen
_creat
_chmod
_chdir
DispatchMessageA
PeekMessageA
WSASocketA
List of all symbols used by `bysybox.exe`, with the ansi ones marked, with the respective wide symbol
$ w32sym.sh -a busybox.exe
ADVAPI32.dll:
    AccessCheck
    CheckTokenMembership
    DuplicateToken
    EqualSid
    GetSecurityInfo
    GetTokenInformation
[A] GetUserNameA (GetUserNameW)
    OpenProcessToken
    SaferComputeTokenFromLevel
    SaferCreateLevel
    SetTokenInformation
    SystemFunction036

KERNEL32.dll:
    CloseHandle
    CreateConsoleScreenBuffer
[A] CreateEventA (CreateEventW)
[A] CreateFileA (CreateFileW)
[A] CreateFileMappingA (CreateFileMappingW)
[A] CreateNamedPipeA (CreateNamedPipeW)
    CreatePipe
[A] CreateProcessA (CreateProcessW)
    CreateRemoteThread
    CreateToolhelp32Snapshot
    DeleteCriticalSection
    DeviceIoControl
    DuplicateHandle
    EnterCriticalSection
    FileTimeToSystemTime
    FillConsoleOutputAttribute
[A] FillConsoleOutputCharacterA (FillConsoleOutputCharacterW)
    FindClose
[A] FindFirstFileA (FindFirstFileW)
    FindFirstVolumeW
[A] FindNextFileA (FindNextFileW)
    FindNextVolumeW
    FindVolumeClose
    FlushFileBuffers
    FreeEnvironmentStringsW
    GenerateConsoleCtrlEvent
    GetACP
    GetCPInfo
    GetCommandLineW
[A] GetCompressedFileSizeA (GetCompressedFileSizeW)
    GetConsoleCP
    GetConsoleMode
    GetConsoleOutputCP
    GetConsoleScreenBufferInfo
    GetConsoleWindow
    GetCurrentProcess
[A] GetDiskFreeSpaceExA (GetDiskFreeSpaceExW)
    GetDiskFreeSpaceExW
[A] GetDriveTypeA (GetDriveTypeW)
    GetEnvironmentStringsW
    GetEnvironmentVariableW
    GetExitCodeProcess
[A] GetFileAttributesA (GetFileAttributesW)
[A] GetFileAttributesExA (GetFileAttributesExW)
    GetFileInformationByHandle
    GetFileSizeEx
    GetFileType
[A] GetFullPathNameA (GetFullPathNameW)
    GetLastError
    GetLogicalDrives
[A] GetModuleFileNameA (GetModuleFileNameW)
[A] GetModuleHandleA (GetModuleHandleW)
    GetNumberOfConsoleInputEvents
    GetProcAddress
    GetProcessAffinityMask
    GetProcessId
    GetProcessTimes
    GetStdHandle
[A] GetSystemDirectoryA (GetSystemDirectoryW)
    GetSystemInfo
    GetSystemTimeAsFileTime
    GetTickCount
[A] GetVersionExA (GetVersionExW)
[A] GetVolumeInformationA (GetVolumeInformationW)
    GetVolumeInformationW
    InitializeCriticalSection
    IsDBCSLeadByteEx
    IsValidCodePage
    IsWow64Process
    LeaveCriticalSection
[A] LoadLibraryA (LoadLibraryW)
[A] LoadLibraryExA (LoadLibraryExW)
    LocalFree
    MapViewOfFile
[A] MoveFileExA (MoveFileExW)
    MultiByteToWideChar
    OpenProcess
    PeekConsoleInputW
    PeekNamedPipe
[A] Process32First (Process32FirstW)
[A] Process32Next (Process32NextW)
[A] ReadConsoleInputA (ReadConsoleInputW)
    ReadConsoleInputW
    ReadDirectoryChangesW
    ReadProcessMemory
    ResetEvent
    SetConsoleActiveScreenBuffer
    SetConsoleCP
    SetConsoleCtrlHandler
    SetConsoleCursorPosition
    SetConsoleMode
    SetConsoleOutputCP
    SetConsoleScreenBufferSize
    SetConsoleTextAttribute
[A] SetConsoleTitleA (SetConsoleTitleW)
    SetEndOfFile
[A] SetEnvironmentVariableA (SetEnvironmentVariableW)
    SetEnvironmentVariableW
    SetErrorMode
[A] SetFileAttributesA (SetFileAttributesW)
    SetFilePointer
    SetFileTime
    SetHandleInformation
    SetLastError
    SetSystemTime
    SetUnhandledExceptionFilter
    Sleep
    SleepEx
    TerminateProcess
    TlsGetValue
    UnmapViewOfFile
    VirtualProtect
    VirtualQuery
    WaitForMultipleObjects
    WaitForSingleObject
    WideCharToMultiByte
    WriteConsoleW

msvcrt.dll:
    __C_specific_handler
    ___mb_cur_max_func
[A] __argv (__wargv)
[A] __getmainargs (__wgetmainargs)
[A] __initenv (__winitenv)
    __iob_func
    __set_app_type
    __setusermatherr
[A] _access (_waccess)
    _amsg_exit
    _cexit
    _commode
[A] _ctime64 (_wctime64)
[A] _environ (_wenviron)
    _errno
    _exit
[A] _fdopen (_wfdopen)
    _filbuf
    _fmode
    _fstat64
[A] _fullpath (_wfullpath)
    _get_osfhandle
[A] _getch (_getwch)
    _gmtime64
    _initterm
    _isatty
    _localtime64
    _lseeki64
    _mktime64
    _onexit
[A] _open (_wopen)
    _open_osfhandle
    _pipe
[A] _putenv (_wputenv)
    _setjmp
    _setmode
    _stricmp
    _strnicmp
    _telli64
    _time64
    _timezone
    _tzset
[A] _vscprintf (_vscwprintf)
[A] _vsnprintf (_vsnwprintf)
    _wspawnve
    abort
    atof
    atoi
    bsearch
    calloc
    clearerr
    clock
    exit
    fclose
    feof
    ferror
    fflush
[A] fgets (fgetws)
[A] fopen (_wfopen)
[A] fprintf (fwprintf)
[A] fputc (fputwc)
[A] fputs (fputws)
    fread
    free
    fseek
    fwrite
[A] getenv (_wgetenv)
[A] isalnum (iswalnum)
[A] isalpha (iswalpha)
[A] iscntrl (iswcntrl)
[A] isgraph (iswgraph)
[A] islower (iswlower)
[A] isprint (iswprint)
[A] ispunct (iswpunct)
[A] isspace (iswspace)
[A] isupper (iswupper)
[A] isxdigit (iswxdigit)
    localeconv
    longjmp
    malloc
    mbstowcs
    memchr
    memcmp
    memmove
[A] putc (putwc)
[A] putchar (putwchar)
    qsort
    rand
    realloc
[A] rename (_wrename)
    setbuf
[A] setlocale (_wsetlocale)
    signal
[A] sprintf (swprintf)
    srand
[A] sscanf (swscanf)
[A] strcat (wcscat)
[A] strchr (wcschr)
[A] strcmp (wcscmp)
[A] strcpy (wcscpy)
[A] strcspn (wcscspn)
    strerror
[A] strftime (wcsftime)
[A] strlen (wcslen)
[A] strncmp (wcsncmp)
[A] strncpy (wcsncpy)
[A] strpbrk (wcspbrk)
[A] strrchr (wcsrchr)
[A] strspn (wcsspn)
[A] strstr (wcsstr)
[A] strtok (wcstok)
[A] strtol (wcstol)
[A] strtoul (wcstoul)
[A] tolower (towlower)
[A] toupper (towupper)
[A] ungetc (ungetwc)
[A] vfprintf (vfwprintf)
    wcschr
    wcscpy
    wcslen
    wcsncmp
    wcstombs
    _strtoui64
    _strtoi64
    _write
    _wcsnicmp
[A] _unlink (_wunlink)
    _umask
    _tzset
    _strdup
[A] _spawnve (_wspawnve)
[A] _rmdir (_wrmdir)
    _read
[A] _open (_wopen)
[A] _mktemp (_wmktemp)
[A] _mkdir (_wmkdir)
    _getpid
[A] _getcwd (_wgetcwd)
[A] _getche (_getwche)
    _fileno
[A] _fdopen (_wfdopen)
    _dup2
    _dup
[A] _creat (_wcreat)
    _close
[A] _chmod (_wchmod)
[A] _chdir (_wchdir)
    _stricmp

SHELL32.dll:
    CommandLineToArgvW

USER32.dll:
[A] DispatchMessageA (DispatchMessageW)
    MsgWaitForMultipleObjects
[A] PeekMessageA (PeekMessageW)
    TranslateMessage

WS2_32.dll:
    WSACleanup
    WSAEnumNetworkEvents
    WSAEventSelect
    WSAGetLastError
    WSASetLastError
[A] WSASocketA (WSASocketW)
    WSAStartup
    __WSAFDIsSet
    accept
    bind
    closesocket
    connect
    freeaddrinfo
    getaddrinfo
    gethostbyaddr
    gethostname
    getnameinfo
    getpeername
    getservbyname
    inet_addr
    inet_ntoa
    listen
    recv
    select
    setsockopt
    shutdown

I'm guessing a good bunch of the ansi APIs don't need special handling, e.g. many IO APIs, like fprintf, already work fine with UTF8 too (i.e. no need to add a UTF8 wrapper) , and similarly for string APIs like strlen (which has an equivalent wcslen), or we probably don't need iswlower, etc.

Once we have native UTF8 working, we can add a white list of allowed ansi symbols, and then use the script occasionally to check if we need to add more UTF8 wrappers etc.

So this is largely an FYI, but I'd appreciate if you could glance over it and see if anything stands out as wrong.

For reference, here's the current script: w32sym.zip

@avih
Copy link
Contributor Author

avih commented Aug 21, 2023

I'm guessing a good bunch of the ansi APIs don't need special handling...

Yeah, I think I overshoot a bit with the matching wide symbols.

These are correct matching symbols (e.g. strlen, wcslen) but they're only different because they operate on different type.

Compared, for instance, to [_w]fopen or CreateFile{A,W}, which have different reach - the ANSI ones are more limited compared to the wide ones, and these are the ones we mainly care about.

So here's the revised shorter list (about half) of ansi symbols + wide matches
$ w32sym.sh -a ./busybox.exe | grep '\['
[A] GetUserNameA (GetUserNameW)
[A] CreateEventA (CreateEventW)
[A] CreateFileA (CreateFileW)
[A] CreateFileMappingA (CreateFileMappingW)
[A] CreateNamedPipeA (CreateNamedPipeW)
[A] CreateProcessA (CreateProcessW)
[A] FillConsoleOutputCharacterA (FillConsoleOutputCharacterW)
[A] FindFirstFileA (FindFirstFileW)
[A] FindNextFileA (FindNextFileW)
[A] GetCompressedFileSizeA (GetCompressedFileSizeW)
[A] GetDiskFreeSpaceExA (GetDiskFreeSpaceExW)
[A] GetDriveTypeA (GetDriveTypeW)
[A] GetFileAttributesA (GetFileAttributesW)
[A] GetFileAttributesExA (GetFileAttributesExW)
[A] GetFullPathNameA (GetFullPathNameW)
[A] GetModuleFileNameA (GetModuleFileNameW)
[A] GetModuleHandleA (GetModuleHandleW)
[A] GetSystemDirectoryA (GetSystemDirectoryW)
[A] GetVersionExA (GetVersionExW)
[A] GetVolumeInformationA (GetVolumeInformationW)
[A] LoadLibraryA (LoadLibraryW)
[A] LoadLibraryExA (LoadLibraryExW)
[A] MoveFileExA (MoveFileExW)
[A] Process32First (Process32FirstW)
[A] Process32Next (Process32NextW)
[A] ReadConsoleInputA (ReadConsoleInputW)
[A] SetConsoleTitleA (SetConsoleTitleW)
[A] SetEnvironmentVariableA (SetEnvironmentVariableW)
[A] SetFileAttributesA (SetFileAttributesW)
[A] __argv (__wargv)
[A] __getmainargs (__wgetmainargs)
[A] __initenv (__winitenv)
[A] _access (_waccess)
[A] _ctime64 (_wctime64)
[A] _environ (_wenviron)
[A] _fdopen (_wfdopen)
[A] _fullpath (_wfullpath)
[A] _open (_wopen)
[A] _putenv (_wputenv)
[A] fopen (_wfopen)
[A] getenv (_wgetenv)
[A] rename (_wrename)
[A] setlocale (_wsetlocale)
[A] _unlink (_wunlink)
[A] _spawnve (_wspawnve)
[A] _rmdir (_wrmdir)
[A] _open (_wopen)
[A] _mktemp (_wmktemp)
[A] _mkdir (_wmkdir)
[A] _getcwd (_wgetcwd)
[A] _fdopen (_wfdopen)
[A] _creat (_wcreat)
[A] _chmod (_wchmod)
[A] _chdir (_wchdir)
[A] DispatchMessageA (DispatchMessageW)
[A] PeekMessageA (PeekMessageW)
[A] WSASocketA (WSASocketW)

And here's the revised script. By default it now produces the short list (limited ANSI APIs), but it can also produce the long list with the general ansi APIs: w32sym.v2.zip

But I'm guessing it can't detect symbols which are loaded dynamically? Not sure...

@rmyorston
Copy link
Owner

Would you be against moving libbb.h it to the end?

Not if it's strictly necessary.

But I'm guessing it can't detect symbols which are loaded dynamically? Not sure...

I guess not. They'll appear as strings in the binary not as symbols. Easy enough to find in the source, though.

@avih
Copy link
Contributor Author

avih commented Aug 24, 2023

  • Have the shell spawn an application that spawns some other application. In the original issue the first application was GNU make and I used busybox-w32 time as the second.

Hmm.. I can't reproduce this.

I've built busybox.exe with UCRT on windows, using winlibs, i386, adding to the path a dir with busybox and the output of busybox.exe --install . minus ar, plus xxd, pkg-config and make from the w64devkit bin dir (it probably doesn't need make because winlibs has it, but it does seem to need a non-busybox xxd).

I got many warnings where LL_FMT and OFF_FMT are used - these seem to be ignored (it thinks "...%"LL_FMT"d" is %d, and I'm not sure why, but I know that it definitely goes through the #define LL_FMT "I64" line), but it did build, and it does depend/linked with ucrtbase.dll.

Example warning:

editors/awk.c:925:44: warning: format '%d' expects argument of type 'int', but argument 4 has type 'long long int' [-Wformat=]
  925 |                 snprintf(g_buf, MAXVARFMT, "%"LL_FMT"d", (long long)n);
      |                                            ^~~           ~~~~~~~~~~~~
      |                                                          |
      |                                                          long long int
editors/awk.c:925:54: note: format string is defined here
  925 |                 snprintf(g_buf, MAXVARFMT, "%"LL_FMT"d", (long long)n);
      |                                             ~~~~~~~~~^
      |                                                      |
      |                                                      int
      |                                             %"LL_FMT"lld

I then used a (wide) spawn variant which replaces NULL env with environ, so that spawn is never called with a NULL env, and does get called with environ env.

I couldn't quite figure out how to setup the make/time example, so I tried these, and all seem to work:

  • ./busybox sh -c './busybox sh -c busybox'
  • ./busybox sh -c 'sh -c "sh -c busybox"'
  • ./busybox sh -c 'sh -c "./busybox printf xxx | ./busybox grep xxx"'

It would really help to have some one-liner test case to reproduce it...

Anyway, for now, I'll keep the current UCRT code which uses NULL env, and I'll handle exporting the UTF8 env to the system env, because that would also typically be much quicker to convert only he unicode values instead of converting the whole UTF8 environ to wide. It's not a big function so that should be fine.

But I would still like to be able to reproduce the UCRT crash when passing environ env, and then confirm that the NULL env fixes it.

@avih
Copy link
Contributor Author

avih commented Aug 26, 2023

I tried these, and all seem to work:

./busybox sh -c './busybox sh -c busybox'
./busybox sh -c 'sh -c "sh -c busybox"'
./busybox sh -c 'sh -c "./busybox printf xxx | ./busybox grep xxx"'

I tried also with the latest winlibs x86-64, plain non-any-utf8 build, with only this diff to the source:

diff --git a/shell/ash.c b/shell/ash.c
index 5a5c947e8..ec11e8f56 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9126,7 +9126,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
 # else
 		if (APPLET_IS_NOEXEC(applet_no)) {
 # endif
-#if ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
+#if 1 || ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
 			/* If building for UCRT move this up into shellexec() to
 			 * work around a bug. */
 			clearenv();
@@ -9203,7 +9203,7 @@ static void shellexec(char *prog, char **argv, const char *path, int idx)
 	int applet_no = -1; /* used only by FEATURE_SH_STANDALONE */
 
 	envp = listvars(VEXPORT, VUNSET, /*strlist:*/ NULL, /*end:*/ NULL);
-#if ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
+#if 0 && ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
 	/* Avoid UCRT bug by updating parent's environment and passing a
 	 * NULL environment pointer to execve(). */
 	clearenv();

And tried also this case:

./busybox sh -c 'X=x ./busybox sh -c "Z=z ./busybox"'

Which also works fine as far as I can tell..

As for the warning, I think I previously missed also this line, which is in line with my statement that it's defined as I64:

include/libbb.h:293:17: note: format string is defined here
  293 | #define LL_FMT "I64"
      |                 ^

So I'm guessing in winlibs it doesn't like I64? which is weird. It's defined like this:

#if ENABLE_PLATFORM_MINGW32 && \
	(!defined(__USE_MINGW_ANSI_STDIO) || !__USE_MINGW_ANSI_STDIO)
#define LL_FMT "I64"
#else
#define LL_FMT "ll"
#endif

So maybe __USE_MINGW_ANSI_STDIO is ignored/doesn't-exist with UCRT? this might make sense, because the mingw stdio exists to work around msvcrt issues, which maybe don't exist with UCRT, and so this define is gone?

@rmyorston
Copy link
Owner

I'm also unable to reproduce the problem. Maybe Microsoft have fixed UCRT.

@avih
Copy link
Contributor Author

avih commented Aug 26, 2023

OK, I'll leave it up to you to decide what to do with the UCRT workaround.

Meanwhile, I'll just support the standard behavior with (utf8) spawn: allow and "forward" both NULL and non-NULL env, and if it's NULL then ensure the system wide env is up to date (instead of the smaller and slower hack of replacing NULL with environ and then converting all of it to wide).

As for the UCRT I64 warnings, I have a patch and will send a PR soon.

I'm Just confirming first it compiles without warnings with/without UCRT for i686/x86-64.

@rmyorston
Copy link
Owner

OK, this rabbit hole was a bit deeper than expected.

It bothered me that I was unable to reproduce the problem with UCRT. The original issue (#234) was related to an interaction between GNU make and busybox-w32 when they were both compiled with UCRT.

I was able to reproduce the problem using GNU make from back then but wanted to do it solely with current busybox-w32 compiled with the current MinGW-w64 UCRT toolchain.

The issue in GNU make is that it passes a non-default environment to CreateProcess(). In busybox-w32 CreateProcess() is used in the implementation of popen(3), so I hacked that to have a non-default environment:

diff --git a/win32/popen.c b/win32/popen.c
index 2208aa6bb..de2f2b1fb 100644
--- a/win32/popen.c
+++ b/win32/popen.c
@@ -190,6 +190,7 @@ static int mingw_popen_internal(pipe_data *p, const char *cmd,
    int success;
    int fd = -1;
    int ip, ic, flags;
+   char env[] = "HELLO=1\0";

    if ( cmd == NULL || *cmd == '\0' || mode == NULL ) {
        return -1;
@@ -251,7 +252,7 @@ static int mingw_popen_internal(pipe_data *p, const char *cmd,
                NULL,              /* primary thread security attributes */
                TRUE,              /* handles are inherited */
                0,                 /* creation flags */
-               NULL,              /* use parent's environment */
+               env,
                NULL,              /* use parent's current directory */
                &siStartInfo,      /* STARTUPINFO pointer */
                &p->piProcInfo);   /* receives PROCESS_INFORMATION */

Then I modified the shell to force it to spawn all applets and reverted the UCRT workaround in the shell:

diff --git a/shell/ash.c b/shell/ash.c
index 2ea87a049..0f1862424 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9103,7 +9103,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
 {
 #if ENABLE_FEATURE_SH_STANDALONE
    if (applet_no >= 0) {
-# if ENABLE_PLATFORM_MINGW32
+# if ENABLE_PLATFORM_MINGW32 && 0
        /* Treat all applets as NOEXEC, including the shell itself if
         * this is a FS_SHELLEXEC shell. */
        struct forkshell *fs = (struct forkshell *)sticky_mem_start;
@@ -9124,7 +9124,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
 # else
        if (APPLET_IS_NOEXEC(applet_no)) {
 # endif
-#if ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
+#if ENABLE_PLATFORM_MINGW32 && !defined(_WHATEVER)
            /* If building for UCRT move this up into shellexec() to
             * work around a bug. */
            clearenv();
@@ -9201,7 +9201,7 @@ static void shellexec(char *prog, char **argv, const char *path, int idx)
    int applet_no = -1; /* used only by FEATURE_SH_STANDALONE */

    envp = listvars(VEXPORT, VUNSET, /*strlist:*/ NULL, /*end:*/ NULL);
-#if ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
+#if ENABLE_PLATFORM_MINGW32 && defined(_WHATEVER)
    /* Avoid UCRT bug by updating parent's environment and passing a
     * NULL environment pointer to execve(). */
    clearenv();

To make the call to popen(3) I used this Makefile:

sum != sum Makefile

target:
	@echo sum is $(sum)

The shell macro assignment (!=) uses popen(3). Using a UCRT-compiled version of busybox-w32 make from current git master gives:

~ $ make
sum is 22355 1

The UCRT version without the workaround gives:

~ $ make
sum is

So, the workaround is still necessary. Without it GNU make still triggers the problem previously reported.

@avih
Copy link
Contributor Author

avih commented Aug 28, 2023

Using a UCRT-compiled version of busybox-w32 make

It's good to know the workaround is still required, and apparently in _wspawnve as well, judging by the links below.

Reference:

@avih
Copy link
Contributor Author

avih commented Oct 4, 2023

Here are some more thoughts (i'm not abandoning the native utf8, but still).

Mingw-w64 recently changed their default to ucrt (at git master, probably for v12, if there wouldn't be too much backlash), and while w64devkit doesn't currently have plans to switch to ucrt, I'm guessing the defaults would eventually trickle down to most mingw setups.

ucrt applications can run on XP. It does require installing the ucrt runtime (the is/was an official runtime for XP), but then it works, even when compiling with llvm 17 (with some XP compiler and linker options - if the mingw setup defaults to win 7 or later).

Using a utf8 locale does not actually require the utf8 manifest. It can also be set using setlocale(".UTF8"). Similar to the manifest, this would only succeed on win10 1803+ (the manifest is 1903+ - not the same version), but unlike the manifest, it also runs on XP, and if setlocale failed then no problem - it remains ANSI.

Another difference with runtime setlocale is that argv and environ are already ANSI - unlike with the manifest, so they need to become utf8. Also, apparently, CP_ACP and GetACP() are as if setlocale was never called (but of course, the current locale - which is affected by setlocale - can be determined at runtime).

And, also unlike the manifest, this only works with ucrt. With msvcrt on win 10/11 (and older versions) the setlocale call fails, which is why I mentioned the mingw default change and that ucrt can run on XP as well.

Here's an example of using the runtime setlocale approach: gwsw/less@ad440b6

It's relatively little code. More than the manifest (which basically requires no code changes), but not too much. I think it can trivially work in busybox-w32 as well.

A manifest can still be added, which would bypass the initial conversion of argv and environ, but I'm guessing that the goal of this is to support XP up to latest with the same binary, and the manifest breaks XP.

And, we might need to reconsider our position about the same binary being unicode on one system and ansi on another.

Thoughts?

@avih
Copy link
Contributor Author

avih commented Feb 5, 2024

Using a utf8 locale does not actually require the utf8 manifest. It can also be set using setlocale(".UTF8"). Similar to the manifest

So, I have a local patch to use this method as FEATURE_UTF8_UCRT, but it seems that it doesn't work fully, or I couldn't make it work, as I expected.

TL;DR: fopen seems to work and does open UTF-8 names correctly, and the docs mention that _mkdir and _getcwd do become UTF-8, and (with this busybox build) the spawn* family does invoke apps with correct unicode args (i.e. seemingly argv is converted correctly to UTF8 by new code in this patch, and spawn uses this UTF8 without further conversion to wide chars), but as far as I can tell Find{First,Next}FileA seem to return question marks instead of non-ASCII codepoints.

So in busybox sh, cat unicode-path, cd unicode-path, pwd, non-applet.exe unicode-arg seem to work OK, but ls doesn't list unicode names correctly.

I've reproduced the fopen and FindFirstFileA cases in a test program, where with setlocale only the former becomes UTF8, while with the UTF8 manifest both seem to become UTF8.

See also here gwsw/less#438

I don't know whether that's a bug of the UTF8 locale or a designed behavior - I don't think there's docs for this locale other than the linked page, and I don't quite get where the line goes between the manifest and the setlocale methods (other than argv, environ, etc).

@avih
Copy link
Contributor Author

avih commented Feb 5, 2024

and I don't quite get where the line goes between the manifest and the setlocale methods (other than argv, environ, etc).

Or maybe I do. I'm guessing the setlocale method only applies to msvcrt/ucrt API, like fopen, but indeed not like FindFirstFileA.

Basically, roughly the libc API, which I guess kind of makes sense, but is not very helpful if one needs total UTF8 API, including the ANSI win32 APIs...

@ale5000-git
Copy link

@rmyorston
@avih
Wouldn't it be possible to allow to user to enable/disable the interactive UTF-8 editing at runtime using a special BB_* var?

@avih
Copy link
Contributor Author

avih commented Aug 18, 2024

Please stop it. The answer is no. Just replace the binary instead.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants