Building a Semantic Emoji Prediction NLU

Gregory Whiteside
7 min read · Jan 5, 2021

The iPhone’s built-in keyboard predicts emojis based on what seems to be mostly a mix of keyword-based matching and next-token prediction.

However, it’s not great at understanding the semantics of your text, i.e., the more subtle emotions that you might want to communicate via emojis.

I often find myself either

  • typing out the magic word that I know will translate to the desired emoji (e.g., happy = 😊)
  • manually searching through the emoji keyboard

I know you find yourself here as well :)

LinkedIn (like other apps) also predicts emojis; however, every implementation I’ve seen seems to be based on a keyword index.

I wondered what it would take for my keyboard to predict emojis that take into account the entire meaning of a sentence like this one (as you can see, the iPhone doesn’t propose any emojis in this case):

“I cannot say I really know what emoji this sentence should have” = NO emoji suggestions

Emojis are another dimension of information

Emojis are especially useful for conveying what’s not already explicitly said in the text, so keyword-based recommendations fall short of grasping the bigger context (and mood) of the utterance.

I’ll explore two ways to build an emoji recommendation engine based on semantic analysis of the text, i.e., attaching emojis based on the meaning of the text (rather than individual words).

Approach 1: Weakly Supervised

The first approach relies on “noisy” labeled data to train a classifier. Per Wikipedia’s definition of weak supervision:

Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting. This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or impractical. Instead, inexpensive weak labels are employed with the understanding that they are imperfect, but can nonetheless be used to create a strong predictive model. (https://en.wikipedia.org/wiki/Weak_supervision)

MIT applied this approach to build an emoji prediction model called DeepMoji.

In this case, the model relied on a large corpus of text already containing emojis, which served as the “pre-labeled” data.

This approach works if you have large amounts of already-labeled data; however, “weak” supervision implies that you don’t control the quality of those labels. It’s a similar problem to using the results of unsupervised clustering to identify and train intents: while it’s a good first step, you end up with a lot of data (or clusters) you haven’t labeled yourself, so a lot of data-engineering work still has to go into fixing and cleaning it to disambiguate and ensure the desired behaviour.

Another disadvantage (or hurdle) is that it’s harder to bootstrap: getting that initial corpus of labeled data isn’t always easy (or possible), and extracting the labels and preparing it for a classification model is where a lot of time/effort can be sunk.
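To make the label-extraction step a bit more concrete, here is a minimal Python sketch of how emojis already present in a message could be peeled off and used as noisy labels. This is not DeepMoji’s actual pipeline: `EMOJI_RE`, `weak_label` and `build_corpus` are hypothetical names, and the regex is an assumption that only covers a few common Unicode ranges (it misses flags, skin tones and ZWJ sequences).

```python
import re
from collections import Counter

# Assumption: a rough matcher for a few common emoji ranges. It is not
# exhaustive and treats multi-codepoint emojis as separate characters.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def weak_label(message):
    """Return (text without emojis, list of emojis found in the message)."""
    labels = EMOJI_RE.findall(message)
    # Also drop the variation selector that often trails an emoji.
    text = EMOJI_RE.sub("", message).replace("\ufe0f", "")
    return " ".join(text.split()), labels

def build_corpus(messages):
    """Turn raw messages into (text, emoji) training pairs, one per emoji."""
    pairs = []
    for message in messages:
        text, labels = weak_label(message)
        # Messages without emojis carry no weak label, so they are skipped.
        pairs.extend((text, emoji) for emoji in labels)
    return pairs

messages = [
    "It is an emergency I really need to get on that flight. 🙁",
    "Great, that's all I need to know 👍🙂",
    "I was just calling to say hi",
]
corpus = build_corpus(messages)
label_counts = Counter(emoji for _, emoji in corpus)
print(corpus)
print(label_counts)
```

Even in this toy version, the third message is dropped because it has no emoji, and nothing checks whether the emoji a writer happened to pick is a good label: that is the “noise” you inherit with weak supervision.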

Approach 2: Supervised (cascading intents)

The second approach involves creating an intent classifier from scratch, where each intent is an emoji, and training examples are utterances where that emoji could be valid.

Tip: using a labeling and NLU data engineering tool (like HumanFirst) gives even non-technical people the means to do this type of work easily and quickly.

You can then train an out-of-the-box NLU model with this data in one click with DialogFlow, Luis, Rasa, HumanFirst, etc.

Divide and conquer

The emoji-intent discovery and creation process is easier when organizing intents in a cascading hierarchy.

The idea behind this strategy is to first identify and train a few emojis that represent the broad “buckets” that all utterances can fall into. For example:

  • 🙂 (positive)
  • 🙁 (negative)
  • ❓ (inquisitive)

Adding one of these emojis to an utterance will typically be “valid” (even if it’s not the most specific emoji one could or would want to use).

Indeed, the following 3 sentences are semantically very different; however, adding the 🙁 emoji to any one of them would still be valid:

  • “No current and food here. I am alone also 🙁”
  • “Babe? You said 2 hours and it’s been almost 4 … Is your internet down ? 🙁”
  • “It is an emergency I really need to get on that flight. 🙁”

Once this initial list of general emoji intents is defined, more specific emoji intents can be organized as children underneath them.

For example, you will be 🙂 when:

  • 👍 (encouraging, thanking etc): “Great, that’s all I need to know 👍🙂”
  • 🎉 (delighted, good news, celebration etc): “So that takes away some money worries 🎉🙂”
  • 👋 (engaging, notifying, following-up etc): “I was just calling to say hi 👋🙂”

And when you 👍, it might be because:

  • 🤓 (learning, teaching): “hahaha maybe I can teach you 🤓👍🙂”
  • 🙏 (thankful): “Okay, I think that’s all I needed 🙏👍🙂”
  • 😎 (you’re proud, colloquial): “you guys are absolutely crushing it 😎👍🙂”

Cascading intents allow you to match related intents at different levels of “abstraction” for the same utterance.

Below, you can see training examples on the right for the 🤓 intent (and its parent intents):

The training examples on the right are for 🤓 (but 👍 and 🙂 will also be predicted)
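One way to picture the cascade is as a simple child-to-parent map, sketched below in Python. The links come from the examples in this post, but the `PARENT` table and the `cascade` and `examples_for` helpers are hypothetical. In particular, rolling a child’s training examples up into its parents is my assumption about how such a hierarchy could be used, not a description of how HumanFirst or any other NLU platform implements it.

```python
# Child -> parent links, taken from the examples above.
PARENT = {
    "👍": "🙂",
    "🎉": "🙂",
    "👋": "🙂",
    "🤓": "👍",
    "🙏": "👍",
    "😎": "👍",
}

def cascade(intent):
    """Return the intent followed by its ancestors, most specific first."""
    chain = [intent]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def examples_for(intent, labeled):
    """Assumption: an intent also 'owns' the examples of its descendants."""
    return [text for text, leaf in labeled if intent in cascade(leaf)]

labeled = [
    ("hahaha maybe I can teach you", "🤓"),
    ("Okay, I think that's all I needed", "🙏"),
    ("So that takes away some money worries", "🎉"),
]

print(cascade("🤓"))                  # most specific emoji first, then parents
print(examples_for("👍", labeled))    # examples labeled 🤓 or 🙏
print(examples_for("🙂", labeled))    # every example under the 🙂 bucket
```

With this layout, labeling an utterance only with its most specific emoji is enough: the broader emojis follow from the hierarchy.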

So the problem is simply one of creating intents for all the types of semantic “emoji states” you want to match, and finding and labeling 15–30 training phrases for each of those intents: with that training data, you can then quickly and easily train an NLU classifier on DialogFlow, Luis, HumanFirst or others.
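As a rough illustration of what that training data could look like, the Python sketch below counts labeled utterances per emoji intent and flags intents that fall outside the 15–30 range mentioned above. The `utterance` and `intent` column names, the sample rows and both helper functions are hypothetical; this is not the layout of any particular platform’s import format.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout: one labeled utterance per row.
SAMPLE_CSV = """utterance,intent
"Great, that's all I need to know",👍
So that takes away some money worries,🎉
I was just calling to say hi,👋
"""

def count_examples(csv_text):
    """Count how many training utterances each intent has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["intent"] for row in reader)

def intents_outside_range(counts, low=15, high=30):
    """Return the intents whose example count is not within [low, high]."""
    return {intent: n for intent, n in counts.items() if not low <= n <= high}

counts = count_examples(SAMPLE_CSV)
print(counts)
print(intents_outside_range(counts))  # all three sample intents need more examples
```

A check like this is only a sanity test on dataset size; it says nothing about whether the examples for an intent are actually good ones.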

What makes this approach (traditionally) difficult is:

  • Data-engineering the intent hierarchy (i.e., deciding what intents need to be trained, and how to organize this data)
  • Labeling or sourcing training examples for each intent

Building this type of intent hierarchy is part art, part science: when done well, it provides a very simple and powerful way to extend and accommodate new intents over time at various levels of abstraction — while maintaining complete control and understanding of the AI output.

Tip: Bottom-up NLU techniques can also drastically accelerate the process. This blog post covers them in more detail: https://medium.com/humanfirst-blog/a-bottom-up-approach-to-intent-discovery-and-training-4abf21f1624a

Curating the NLU dataset

I spent ~4 hours building and training the first POC dataset using HumanFirst, and a few more hours over the Christmas holidays — definitely less time building the data than writing this blog post ;)

For the training examples, I labeled utterances from public datasets that didn’t have any emojis to start with:

I describe the dataset and go into a bit more detail below:

Example Prediction Results

Test: “All I can say is good luck”
Cascading NLU: |🙂|👍|🙏|😎|💪
MIT: |🙏|👍|✌️|👌|💯
iPhone: 😉

Test: “I cannot say that I really know what emoji this sentence should have”
Cascading NLU: |❓|🤔|🤷‍♀️|😅|😥
MIT: |🙅|😠|😑|😐|😬
iPhone: N/A

Test: “Can you please tell me more about that?”
Cascading NLU: |❓|ℹ️|🤔|🤷‍♀️
MIT: |😞|😫|😓|😕|😒
iPhone: N/A

Test: “this is crazy”
Cascading NLU: |🙂|👍|😎|🙁|😠
MIT: |😓|😅|😫
iPhone: |😜|😝

While this dataset is a very small POC (a few hours of labeling and data engineering), it’s exciting to see that the NLU’s predictions still feel “right” and “human”, even with more abstract (and keyword-less) utterances like the ones above.

You can download the raw data here (CSV). Let me know if you use this data to train the model on another NLU platform (e.g., DialogFlow, Luis, Watson, etc.)!

You can also test the pre-trained Cascading NLU model yourself directly in HumanFirst: it’s available as a demo workspace (don’t have an account? Sign up for free).

Advantages of a cascading NLU approach

The dataset powering this POC contains 900 training examples over 64 intents: it’s a small dataset, but it already works pretty well — and it shows how easy it is to build an understandable NLU model that can scale to hundreds or thousands of intents.

Is this useful?

LinkedIn, Slack and others (let me know if you’re interested!) could extend and incorporate this approach into their emoji recommendations (with additional labeling and data-engineering) — so yes 😅

As for me, until iOS makes it possible to plug this emoji predictor into the native keyboard, it’s likely that this will remain more of a fun experiment / POC than an everyday enhancement :)

However, the intent data itself is quite interesting when de-coupled from the emoji use-case, as it represents “generalized” emotions/topics that can power conversation analytics.

I’ll explore how to apply this data to text analytics use-cases in a future blog post — stay tuned :)

Written by Gregory Whiteside

CEO at HumanFirst.ai, dad, songwriter, amateur tennis player and ad-hoc participant in a few other sports and hobbies :)
