Skip to content

CHR is only ansi; we need a CHRW function #451

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
JoseRoca opened this issue Apr 8, 2025 · 4 comments
Open

CHR is only ansi; we need a CHRW function #451

JoseRoca opened this issue Apr 8, 2025 · 4 comments

Comments

@JoseRoca
Copy link

JoseRoca commented Apr 8, 2025

After achieving fully integration of my dynamic unicode string (DWSTRING), working with all the intrinsic FreeBasic string functions, I am attempting to solve the problem of unicode surrogate pairs and I already have code to do it when using my additional string functions (more than 30) The problem is that, as surrogates use two characters, when manipulating strings we can end breaking a surrogate by getting only half of it. The solution could be to check if the string to manipulate has surrogate pairs. If it has, they have to be replaced with unicode code points; once the string has been processed, these unicode points will be replaced with surrogate pairs. To do this last change, I needed to use CHR, but it is unusable for the purpose because it is only ansi. So I have needed to write my own:

FUNCTION ChrW (BYVAL codepoint AS UInteger) AS DWSTRING
   If codepoint <= &HFFFF Then Return WString(1, codepoint)
   ' Convert to UTF-16 surrogate pair for higher codepoints
   Dim As UShort highSurrogate = &HD800 Or ((codepoint - &H10000) Shr 10)
   Dim As UShort lowSurrogate = &HDC00 Or ((codepoint - &H10000) And &H3FF)
   Return WString(1, highSurrogate) + WString(1, lowSurrogate)
END FUNCTION

Other functions that I have written to deal with the surrogates are:

' ========================================================================================
' Converts surrogate pair to unicode code point
' Extracts the actual Unicode code point from a valid surrogate pair.
' ========================================================================================
FUNCTION SurrogatePairToCodePoint (BYVAL high AS USHORT, BYVAL low AS USHORT) AS ULONG
   IF IsValidSurrogatePair(high, low) THEN
      RETURN ((high - &HD800) * &H400) + (low - &HDC00) + &H10000
  END IF
  RETURN 0  ' Invalid surrogate pair
END FUNCTION
' ========================================================================================

' ========================================================================================
' Encode unicode code point as surrogate pair
' Converts a Unicode code point (above U+FFFF) back into its high and low surrogate pair.
' ========================================================================================
SUB CodePointToSurrogatePair (BYVAL codePoint AS ULONG, BYREF high AS USHORT, BYREF low AS USHORT)
   IF codePoint >= &H10000 AND codePoint <= &H10FFFF THEN
      high = &HD800 + ((codePoint - &H10000) \ &H400)
      low = &HDC00 + ((codePoint - &H10000) MOD &H400)
   ELSE
      high = 0
      low = 0
   END IF
END SUB
' ========================================================================================

An to check if the string has surrogates:

' ========================================================================================
FUNCTION HasSurrogates (BYREF text AS WSTRING) AS BOOLEAN
   FOR i AS LONG = 1 TO LEN(text)
      IF ASC(text, i) >= &HD800 AND ASC(text, i) <= &HDBFF THEN RETURN TRUE
   NEXT
   RETURN False
END FUNCTION
' ========================================================================================

' ========================================================================================
' Checks whether a UTF-16 character is the high part of a surrogate pair.
' ========================================================================================
FUNCTION IsHighSurrogate (BYVAL ch AS USHORT) AS BOOLEAN
   RETURN (ch >= &HD800 AND ch <= &HDBFF)
END FUNCTION
' ========================================================================================
' ========================================================================================
' Checks whether a UTF-16 character is the low part of a surrogate pair.
' ========================================================================================
FUNCTION IsLowSurrogate (BYVAL ch AS USHORT) AS BOOLEAN
   RETURN (ch >= &HDC00 AND ch <= &HDFFF)
END FUNCTION
' ========================================================================================
' ========================================================================================
' Checks whether a UTF-16 encoded string contains valid high-low surrogate pairs.
' ========================================================================================
FUNCTION IsValidSurrogatePair (BYVAL high AS USHORT, BYVAL low AS USHORT) AS BOOLEAN
   RETURN (high >= &HD800 AND high <= &HDBFF) AND (low >= &HDC00 AND low <= &HDFFF)
END FUNCTION
' ========================================================================================
@JoseRoca
Copy link
Author

JoseRoca commented Apr 10, 2025

I have added code to my DWSTRING class to detect and repair broken surrogates. And they work transparently when DWSTRING is used with the FreeBasic intrinsic procedures and operators. Instead of DIM s AS STRING, you only need to do DIM dws AS DWSTRING and work in the same way that with the ansi strings.

' ========================================================================================
' Batch surrogate search
' Scans in chunks of 64 characters for efficiency.
' Stops immediately when it detects a surrogate pair or a broken surrogate.
' Works well for large strings by iterating over segments of 64 characters at a time.
' ========================================================================================
PRIVATE FUNCTION ScanForSurrogates (BYREF inputStr AS WSTRING) AS BOOLEAN
   DIM AS LONG i, chunkSize = 64
   FOR i = 0 TO LEN(inputStr) - 1 STEP chunkSize
      DIM AS LONG endPos = i + chunkSize - 1
      IF endPos >= LEN(inputStr) THEN endPos = LEN(inputStr) - 1
      ' Process each chunk separately
      DIM AS LONG j
      FOR j = i TO endPos
         DIM AS USHORT ch = inputStr[j]
         ' Check if the character is a high surrogate
         IF ch >= &HD800 AND ch <= &HDBFF THEN
            ' // Full subrogate pair
            IF j + 1 <= endPos THEN
               DIM AS USHORT nextCh = inputStr[j + 1]
               ' Check if the next character is a low surrogate
               IF nextCh >= &HDC00 AND nextCh <= &HDFFF THEN RETURN TRUE
            END IF
            ' // Broken high subrogate
            RETURN TRUE
         ELSEIF ch>= &HDC00 AND ch <= &HDFFF THEN
            RETURN TRUE   ' // low subrogate
         END IF
      NEXT
   NEXT
   RETURN FALSE
END FUNCTION
' ========================================================================================
' ========================================================================================
' Appends the specified number of characters from the specified memory address to the end of the buffer.
' ========================================================================================
PRIVATE FUNCTION DWSTRING.AppendBuffer (BYVAL memAddr AS ANY PTR, BYVAL nChars AS UINT) AS BOOLEAN
   DWSTRING_DP("DWSTRING AppendBuffer - nChars = " & ..WSTR(nChars))
   IF memAddr = NULL OR nChars = 0 THEN RETURN FALSE
   ' // Check if the string contains surrogates
   DIM fHasSurrogates AS BOOLEAN = ScanForSurrogates(*cast(WSTRING PTR, memAddr))
   IF fHasSurrogates THEN
      DIM i AS LONG = 1
      WHILE i <= nChars
         DIM wch AS USHORT = ASC(*cast(WSTRING PTR, memAddr), i)
         IF IS_HIGH_SURROGATE(wch) THEN
            IF i + 1 <= nChars THEN
               DIM nextwch AS USHORT = ASC(*cast(WSTRING PTR, memAddr), i + 1)
               IF IS_LOW_SURROGATE(nextwch) THEN
                  i += 2
                  CONTINUE WHILE
               ELSE
                  ' // Replace broken surrogate with the &hFFFD symbol
                  DIM pStr AS WSTRING PTR = memAddr
                  pStr[i-1] = &hFFFD
               END IF
            ELSE
               ' Replace broken high surrogate at end of buffer with the &hFFFD symbol
               DIM pStr AS WSTRING PTR = memAddr
               pStr[i-1] = &hFFFD
            END IF
         ELSEIF IS_LOW_SURROGATE(wch) THEN
            ' Replace isolated low surrogate with the &hFFFD symbol
            DIM pStr AS WSTRING PTR = memAddr
            pStr[i-1] = &hFFFD
         ELSE
            ' // Normal character
            ' PRINT chr(wch)
         END IF
         i += 1
      WEND
   END IF
   ' // Number of characters to append
   DIM nSize AS UINT = m_BufferLen + nChars
   ' // If there is not enough capacity, resize the buffer
   IF nSize > m_Capacity THEN this.ResizeBuffer(nSize * 2)
   ' // Copy the passed buffer
   IF m_pBuffer = NULL THEN RETURN FALSE
   wmemmove(m_pBuffer + m_BufferLen, memAddr, nChars)
   ' // Update the length of the buffer
   m_BufferLen += nChars
   ' // Mark the end of the string with a double null
   m_pBuffer[m_BufferLen] = 0
   RETURN TRUE
END FUNCTION
' ========================================================================================

@countingpine
Copy link
Collaborator

Hi. It's worth mentioning that there is a Wchr function provided in FB.

I've added a link to it on the Chr page.

@JoseRoca
Copy link
Author

True. I missed it. Thanks for changing the documentation.

@JoseRoca
Copy link
Author

JoseRoca commented Apr 10, 2025

But my function ChrW remains useful because it works with high value codepoints.

With WChr, to display the 128512 codepoint (😀) I have to use the concatenation of the high and the low surrogate pairs.

MessageBoxW(GetActiveWindow, WChr(55357) + WChr(56832), "", MB_APPLMODAL)

Whereas ChrW can display the codepoint

MessageBoxW(GetActiveWindow, ChrW(128512), "", MB_APPLMODAL)

I have changed the code of ChrW to use WChr:

FUNCTION ChrW (BYVAL codepoint AS UInteger) AS DWSTRING
   If codepoint <= &HFFFF Then RETURN WCHR(codepoint)
   ' Convert to UTF-16 surrogate pair for higher codepoints
   Dim AS USHORT highSurrogate = &HD800 OR ((codepoint - &H10000) SHR 10)
   Dim AS USHORT lowSurrogate = &HDC00 OR ((codepoint - &H10000) AND &H3FF)
   Return WCHR(highSurrogate) + WCHR(lowSurrogate)
END FUNCTION

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants