Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

UTF-8 support #244

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

UTF-8 support #244

wants to merge 4 commits into from

Conversation

idea--list
Copy link

@idea--list idea--list commented Aug 26, 2019

Wanted to display glyphs from font.h files generated from UTF-8 font.ttf files.
Found some discussions in #185 suggesting the use of iso-8859-x chars, which IMHO is not a good idea (one needs to find a proper tool to convert UTF-8 .ttf files to iso first, then generate the font.h files, that will be still larger than the ASCII font.h files boundled with the library. Even in that case you might end up not beeing able to display a glyph if your iso-8859-x coded font did contain it. Of course a font.h generated from a UTF-8 .ttf will need more memory, but it's 2019 and newer boards tend to have a right amount of memory).

While looking for a solution i also found #200 from Bodmer who also provided me some help. He forked V1.3.6 and modified it but that version does not compile on my board... and since that version even the code has been reordered here in master branch...

So i ended up comparing Bodmer's modified version with the original master and as result here is my PR that enables UTF-8 with custom made fonts, while not changing anything with the default fonts.
UTF-8 support is turned off by default ensure not breaking anything on boards with few ram, you only have to enable it by adding display.utf8() to the sketch.

Tested with a MAX32630FTHR and 2.13" Flexible Monochrome eInk / ePaper Display.
For generating UTF-8 font.h files #185 is also needed

Please test on other board/display combinations and merge ;-)

@ladyada
Copy link
Member

ladyada commented Aug 26, 2019

hiya - this PR looks good, it does need a little doxygening
please read https://learn.adafruit.com/the-well-automated-arduino-library/doxygen

@idea--list
Copy link
Author

Hi @ladyada i never used Doxygen and for me using any unix/linux tools is a challange.
If there is anyone who knows how to use Doxygen i am happy if i do not need to touch it :-)

By the way i also ported the whole GFX and EPD libraries to Mbed-os. Am not sure if you are interested and if so where shall i post it?

@ladyada
Copy link
Member

ladyada commented Aug 27, 2019

hiya doxygen runs on mac, windows, linux - anything. its important to have all PR's with doxygen so that there is documentation to help support the code, we won't accept a PR without it. its very easy to doxygen, please check the guide we wrote

@param x true = enable (new behavior), false = disable (old behavior)
*/
/**********************************************************************/
void utf8(boolean x=true) { _utf8 = x; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x could have a better name. Maybe enable?

Also, giving it a default of true makes it a little ambiguous if no argument is given, because it looks like a getter. That is, gfx->utf8() can be read as a getter of the utf8 state of gfx. Better to just make them write true if they mean true.

But see above because I don't think utf8 should be a runtime switch like this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems i was not paying enough attention while copy&pasting code...
Of course this should default to false to prevent creating headache on lower-end boards after the update.


return (uint16_t)c; // fall-back to extended ASCII
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of new blank lines in this change. Seems like they should be omitted.


decoderState = 0;

return (uint16_t)c; // fall-back to extended ASCII
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion would happen without the C cast, and the cast could suppress warnings as the code evolves. Less is more when it comes to casts.

return (uint16_t)c;
}

if (decoderState == 0)
Copy link
Contributor

@BillyDonahue BillyDonahue Aug 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a departure in bracing style from the rest of the library, which is more vertically compact. When in Rome...

We don't have to treat decoderState 2 and 1 differently, we can shift bits up as we consume them and that's a simpler algorithm.

We need to error out if non-leading bytes start with the 10 bitpattern.

There should be some provision for the result of the decode to be the NUL code point.
Returning 0 as an in-band "no-result" indicator prevents this. The decoder should return a status indicator and a separate result payload, or pick some other value as the indicator.

If I can suggest an alternate implementation:
https://gcc.godbolt.org/z/E0sEcZ
or https://gist.github.com/BillyDonahue/232420eb6eeeee4130c7803c4d59f1bd

#include <cstdint>


struct Utf8Decoder {
    Utf8Decoder() : state(0), value(0) {}

    /// \return
    ///     -1 : decoding error occurred.
    ///      0 : success but no value emitted yet.
    ///      1 : success and `value` member holds a Unicode16 payload.
    int decode(uint8_t c);

    int state;
    uint16_t value;
};

inline int Utf8Decoder::decode(uint8_t c) {
    if(state > 0){
        if((c>>6) == 0x2){  // 10XX'XXXX
            value = (value<<6) | (c & ((1<<6)-1));
            --state;
            return (state == 0) ? 1 : 0;
        }
    }else{
        if((c>>7) == 0x0){  // 0XXX'XXXX
            value = c;
            return 1;
        }
        if((c>>5) == 0x6){  // 110X'XXXX
            value = (c & ((1<<5)-1));
            state = 1;
            return 0;
        }
        if((c>>4) == 0xE){  // 1110'XXXX
            value = (c & ((1<<4)-1));
            state = 2;
            return 0;
        }
    }
    // error, but retain value 'c'
    state = 0;
    value = c;
    return -1;
}

Test:

#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstdio>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  // little test for the decoder
  using namespace std::literals;
  auto test = [](const std::string& s) {
    std::ostringstream report;
    report << std::hex;
    Utf8Decoder d;
    for(uint8_t c : s){
      switch (d.decode(c)) {
        case -1: report << "!"; break;
        case 0: report << '.'; break;
        case 1: report << "(" << d.value << ")"; break;
      }
    }
    fprintf(stderr, "[%s] => [%s]\n", s.c_str(), report.str().c_str());
    return report.str();
  };
  assert(test("") == "");
  assert(test("hello"s) == "(68)(65)(6c)(6c)(6f)");
  assert(test(u8"Привет"s) == ".(41f).(440).(438).(432).(435).(442)");
  assert(test(u8"Γειά σου"s) == ".(393).(3b5).(3b9).(3ac)(20).(3c3).(3bf).(3c5)");
  assert(test(u8"¿Qué?"s) == ".(bf)(51)(75).(e9)(3f)");
  assert(test(u8"värld"s) == "(76).(e4)(72)(6c)(64)");
  assert(test("\0"s) == "(0)");
  assert(test("\x80"s) == "!");
  assert(test("\xc0\x80"s) == ".(0)");
  assert(test("\xc0\x01"s) == ".!");
  assert(test("\xe0\x80\x80"s) == "..(0)");
  assert(test("\xe0\xe0\x80\x80\x41"s) == ".!!!(41)");
  return 0;
}

/**************************************************************************/
/*!
@brief Print one byte/character of data, used to support print()
@param c The 8-bit ascii character to write
@param utf8 The 8-bit UTF-8 or ascii code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8 isn't a parameter of this function, though. Is this left over from an earlier draft?

size_t Adafruit_GFX::write(uint8_t data) {

uint16_t c = (uint16_t)data;
if (_utf8) c = decodeUTF8(data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency: if( not if (.

@@ -107,6 +108,10 @@ class Adafruit_GFX : public Print {
setTextSize(uint8_t sx, uint8_t sy),
setFont(const GFXfont *f = NULL);

// Serial UTF-8 decoder
uint16_t decodeUTF8(uint8_t c);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that this should be a free function or class, not a member of Adafruit_GFX. It only modifies decoderState and decoderBuffer and doesn't interact with the rest of Adafruit_GFX. It's also reusable outside the context of graphics so there's some reuse potential if it's factored out.

So maybe what we really have here is a minimal Utf8Decoder API and some new member function variants. We'd still upgrade drawChar as you have done, and write would expand its input type.

size_t Adafruit_GFX::write(uint16_t ch);

Maybe provide a new write adaptor member function to make things easy to use?

size_t Adafruit_GFX::writeUtf8(uint8_t ch, Utf8Decoder& decoder);

This would feed ch to decoder and call gfx.write whenever this results in emission of a unicode character. It's more flexible to leave that to the users to do for themselves. The important thing is that the statefulness of the UTF8 decoder isn't part of the GFX object. The GFX object should always be ready to write() characters (say alerts from a timer interrupt), even if it's in the middle of writing other payload strings. If UTF8 decoder state is part of GFX you can get unlucky and have your alert strings corrupt your payload strings.

If writeUtf8 or Utf8Decoder::decode aren't called, their code can be optimized out of the build image. Such a footprint reduction would not be possible if utf8 support is configured at runtime as a GFX internal. We might reasonably say that it's not the responsibility of Adafruit_GFX to handle character decoding. It just knows how to draw and write Unicode16 characters, and users have to get those in whatever way is appropriate. They might just want to store their Unicode strings as arrays of uint16_t in ROM or something and skip UTF-8 altogether. Who knows?

@@ -211,6 +228,10 @@ class Adafruit_GFX : public Print {
/************************************************************************/
int16_t getCursorY(void) const { return cursor_y; };

// Set or get an arbitrary library attribute or configuration option
void setAttribute(uint8_t id = 0, uint8_t a = 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears not to have been implemented. Vestigial?

@idea--list
Copy link
Author

@BillyDonahue Thanks for the code review and comments i learn a lot from it though i can not understand everything. I began learning C/C++ just a few months ago. Was just happy to find a working fork and then updated the master branch based on that as meanwhile the code got rearranged. That way it compiled on my board and did not run into issues, so i thought the code was ok.

Please feel free to modify the PR, i think the result will be much better as if i would do that :-)

@idea--list
Copy link
Author

@BillyDonahue just added you as collaborator to my branch in case you can not change the code directly in this PR.

@adafruit adafruit deleted a comment from Jarvis785 Dec 6, 2020
@adafruit adafruit deleted a comment from Jarvis785 Dec 6, 2020
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants