Emoji with multiple code units not detected #99

foost · 2020-06-03T18:10:10Z

First: apologies if I provide insufficient information or use wrong terminology. This is my first GitHub issue ever, so please be kind.

Demo code works fine for me, including emojis. However, the demo emoji are each described by a single code unit. Emojis with more than one, e.g. "red heart" (2764 FE0F), are not detected, despite being in the lexicon.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentences = ["Catch utf-8 emoji such as such as 💘 and 💋 and 😁",  # emojis handled
             "Not bad at all",  # Capitalized negation
             "Me and Fay are 4 years old today ❤️ (ft Grumio)…"
             ]
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

returns

Catch utf-8 emoji such as such as 💘 and 💋 and 😁------------------ {'neg': 0.0, 'neu': 0.615, 'pos': 0.385, 'compound': 0.875}
Not bad at all--------------------------------------------------- {'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'compound': 0.431}
Me and Fay are 4 years old today ❤️ (ft Grumio)… {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
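
To see why "red heart" differs from the demo emoji, the character can be taken apart with the standard library. This is a small sketch; the variable name `red_heart` is only illustrative, and the two values are the ones quoted above (2764 FE0F).

```python
import unicodedata

# "red heart" as written in this issue: 2764 followed by FE0F
red_heart = "\u2764\ufe0f"

# A Python str is a sequence of code points, so len() reports two here.
print(len(red_heart))  # 2

# Print each code point with its official Unicode name.
for ch in red_heart:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+2764 HEAVY BLACK HEART
# U+FE0F VARIATION SELECTOR-16
```

Any loop that walks the string one character at a time will therefore see the heart and the variation selector as two separate items.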


foost · 2020-06-10T15:49:22Z

I have been doing some digging and testing (within my limited skill set). Could it be that the problem is in the polarity_scores function:

for chr in text:
    if chr in self.emojis:
        # get the textual description
        description = self.emojis[chr]
        if not prev_space:
            text_no_emoji += ' '
        text_no_emoji += description
        prev_space = False
    else:
        text_no_emoji += chr
        prev_space = chr == ' '
text = text_no_emoji.strip()

because the loop only looks at single characters, missing emojis with multiple code units? Although I have to admit that, after looking at the code, I do not understand how any emojis are found, because the above loop creates a string without emojis (replacing them with their description), which is then passed on to the sentiment_valence function, which only checks against the lexicon (which does not contain the emoji descriptions). But clearly I am missing something here.
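
The loop can be tried outside the library with a stand-alone copy. This is only a sketch under assumptions: `TOY_EMOJIS` is a hypothetical one-entry stand-in for `self.emojis`, keyed on the first code point alone (as a later comment in this thread concludes), and the initial `prev_space = True` is assumed because the quoted snippet does not show it.

```python
# Hypothetical stand-in for self.emojis (assumption: the lookup matches
# on the first code point of the emoji only).
TOY_EMOJIS = {"\u2764": "red heart"}

def replace_emojis(text, emojis):
    """Replace emojis character by character, mirroring the quoted loop."""
    text_no_emoji = ""
    prev_space = True  # assumed initial state, not shown in the snippet
    for ch in text:
        if ch in emojis:
            # swap the emoji for its textual description
            description = emojis[ch]
            if not prev_space:
                text_no_emoji += " "
            text_no_emoji += description
            prev_space = False
        else:
            # every other character, including U+FE0F, is copied as-is
            text_no_emoji += ch
            prev_space = ch == " "
    return text_no_emoji.strip()

result = replace_emojis("old today \u2764\ufe0f (ft Grumio)", TOY_EMOJIS)
print(ascii(result))   # ascii() makes invisible characters visible
print(result.split())  # the unigrams that would be looked up
```

Printing with `ascii()` rather than plain `print()` is deliberate here, since it shows exactly which code points end up in each word.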

foost · 2020-07-01T08:12:05Z

Found the problem, though I'm not sure about a fix. I understand now that the sentiment scores are derived from the words of the emoji description, like normal text, in the sentiment_valence function. So, in the case of ☹️ ("frowning face") the word found in the lexicon is "frowning", and in the case of ❤️ ("red heart") it is "heart". I tested this extensively: the former works (so the problem does not occur with all emojis with multiple code points), the latter does not.
The problem only occurs when the lexicon word is the last word of the emoji description, because the loop (see my previous comment) uses only the first code point to find the description, and then appends the second code point to the last word of that description. This changes the end of that unigram, so the sentiment_valence lookup misses it. The change is barely visible in debug printouts (e.g. the letter "t" becomes a tiny bit smaller), which is why it took me so long to figure out what was going on.
How to fix it? Dealing properly with emojis with multiple code points would need some serious changes in the loop. I have opted for a quick and dirty fix: since the major culprit for sentiment-relevant emojis is "FE0F", I changed the "else:" statement into "elif ord(chr) != 65039:" so that it is ignored completely. It seems to work.

            if chr in self.emojis:
                # get the textual description
                description = self.emojis[chr]
                if not prev_space:
                    text_no_emoji += ' '
                text_no_emoji += description
                prev_space = False
            elif ord(chr) != 65039:
                text_no_emoji += chr
                prev_space = chr == ' '
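
The workaround can be checked in isolation with a stand-alone version of the patched loop. This is a sketch, not the library's code: `TOY_EMOJIS` is a hypothetical stand-in for `self.emojis`, the function name is made up for illustration, and the initial `prev_space = True` is an assumption.

```python
VARIATION_SELECTOR_16 = 0xFE0F  # 65039 in decimal, the value tested above

# Hypothetical stand-in for self.emojis, keyed on the first code point.
TOY_EMOJIS = {"\u2764": "red heart"}

def replace_emojis_skipping_vs16(text, emojis):
    """Patched loop: drop U+FE0F instead of copying it into the output."""
    text_no_emoji = ""
    prev_space = True  # assumed initial state
    for ch in text:
        if ch in emojis:
            description = emojis[ch]
            if not prev_space:
                text_no_emoji += " "
            text_no_emoji += description
            prev_space = False
        elif ord(ch) != VARIATION_SELECTOR_16:
            # U+FE0F falls through both branches and is discarded
            text_no_emoji += ch
            prev_space = ch == " "
    return text_no_emoji.strip()

tokens = replace_emojis_skipping_vs16(
    "old today \u2764\ufe0f (ft Grumio)", TOY_EMOJIS
).split()
print(tokens)  # ['old', 'today', 'red', 'heart', '(ft', 'Grumio)']
```

With the selector dropped, "heart" comes out as a clean unigram. The check covers only U+FE0F, so emojis built from other extra code points would presumably still need the larger change to the loop mentioned above.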

protrolium · 2021-04-21T18:35:35Z

Thank you for elucidating what is happening here with emoji interpretation. We are running tests and were surprised that :red heart: etc. consistently returned neutral scores. I will try your fix for now, but agree that it is merely a substitute until the logic is more substantially addressed.

foost · 2021-06-01T07:14:51Z

> Thank you for elucidating what is happening here with emoji interpretation. We are running tests and were surprised that :red heart: etc. consistently returned neutral scores. I will try your fix for now, but agree that it is merely a substitute until the logic is more substantially addressed.

Thanks for getting back about this. After a hiatus in research, I plan to get back to sentiment analysis soon, so I'll watch this space and see whether there is anything I can contribute.

rehovicova · 2022-10-12T18:45:49Z

Hi, I am working on my project and came to the conclusion that it is completely skipping emojis. It does not replace them with the text (I ran it in debugger mode, and if the sentence is an emoji it completely skips any evaluation). Any updates on this? It is really crucial for my research.

rezabrizi · 2023-02-04T16:29:25Z

@rehovicova
Did you ever find a fix for this?

foost changed the title ~~Emoji with multiple code points not detected~~ Emoji with multiple code units not detected Jun 3, 2020

foost mentioned this issue Jan 25, 2021

Vader does not predict correctly the sentiment of some emojis. #117

Open
