This series of notebooks is aimed at helping fellow NLP enthusiasts think about applying new tools and techniques to practical tasks. My goal is to keep the code and workflow simple, and to focus on actual use cases.
Notebook 5.0 is an adaptation of my new repo on using transformer models to detect state trolls on Twitter. I reckon many may not be interested in the subject matter, so I only ported over the Colab notebook for fine-tuning on a custom dataset, for folks who are specifically looking for examples like this.
This notebook took about 5.5 hours to run on a Colab Pro account with TPU and "high-RAM" settings. It could go slower or faster depending on your setup. The datasets needed, train_raw.csv and validate.csv, are in the data folder of this repo.
Machine translation doesn't generate as much excitement as other emerging areas in NLP, but recent advances have opened up interesting new possibilities in this space. Over 5 short notebooks, I'll demo a simple workflow for using Hugging Face's version of MarianMT, as well as Facebook's Fairseq toolkit for translation.
The Hugging Face MarianMT (HF-MMT) demos cover:
- translating 3 English speeches of varying lengths to Chinese
- translating 5 English news stories on Covid-19 (under 500 words) to Chinese
- translating 3 Chinese speeches to English
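For readers who want a feel for the MarianMT workflow before opening the notebooks, here is a minimal sketch rather than the notebooks' actual code. It splits a text into sentences with a naive regex, then translates them in small batches. The helper names (`split_sentences`, `translate_text`, `make_marian_translator`), the batch size and the `Helsinki-NLP/opus-mt-en-zh` checkpoint are my assumptions for illustration; swap in whichever checkpoint suits your language pair.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_text(text: str,
                   translate_batch: Callable[[List[str]], List[str]],
                   batch_size: int = 8) -> List[str]:
    """Translate a text sentence by sentence, in batches of `batch_size`."""
    sentences = split_sentences(text)
    translated: List[str] = []
    for start in range(0, len(sentences), batch_size):
        translated.extend(translate_batch(sentences[start:start + batch_size]))
    return translated


def make_marian_translator(model_name: str = "Helsinki-NLP/opus-mt-en-zh"):
    """Build a batch translator from a MarianMT checkpoint (assumed model name)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(sentences: List[str]) -> List[str]:
        inputs = tokenizer(sentences, return_tensors="pt",
                           padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch


# Usage (downloads the model on first run):
# translator = make_marian_translator()
# chinese_sentences = translate_text("Hello there. How are you?", translator)
```

Translating sentence by sentence keeps each input well within the model's sequence limit, at the cost of losing some cross-sentence context.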
The FB-Fairseq demos cover (Added Dec 29 2020):
Results from neural machine translation models are not (yet) as artful or precise as those produced by a skilled human translator. But they get 60% or more of the job done, in my view. Depending on your use case, that could be a huge time saver.
Fuller background and details in this Medium post here.
AI text generation is one of the most exciting fields in NLP, but also a daunting one for beginners. These 4 notebooks aim to speed up the learning process for newcomers by combining and adapting various existing tutorials into a practical end-to-end walkthrough. They come with sample data and build towards a conversational chatbot that can be used in an interactive app.
- 3.0: Data preparation
- 3.1: Fine-tuning a pretrained DialoGPT-medium model on Colab
- 3.2: Testing the model's performance on an interactive Dash app
- 3.3: CPU alternative to text generation
Fuller background and details in this Medium post here.
Text summarization is a far less common downstream NLP task compared to, say, classification or sentiment analysis. The resources and time needed to do it well are considerable. Hugging Face's transformers pipeline, however, has made the first part of the task much faster and more efficient. More time can then be devoted to analysing the results, and/or building your own benchmarks for assessing the summaries. This notebook incorporates minor workarounds for longer speeches, which are trickier to handle due to sequence length limits in the transformer models/pipeline.
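One common way around the sequence length limit is to cut a long speech into smaller pieces, summarize each piece, and stitch the partial summaries together. The sketch below illustrates that idea; it is not necessarily the workaround used in the notebook. The function names and the 400-word limit are assumptions, and the word count is only a rough proxy for the model's token limit.

```python
from typing import Callable, List


def chunk_by_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long_text(text: str,
                        summarize: Callable[[str], str],
                        max_words: int = 400) -> str:
    """Summarize each chunk separately, then join the partial summaries."""
    chunks = chunk_by_words(text, max_words)
    return " ".join(summarize(chunk) for chunk in chunks)


# Usage with the Hugging Face pipeline (downloads a model on first run):
# from transformers import pipeline
# summarizer = pipeline("summarization")
# summary = summarize_long_text(
#     speech_text,
#     lambda chunk: summarizer(chunk)[0]["summary_text"],
# )
```

Splitting on a fixed word count can cut a sentence in half at a chunk boundary; splitting on paragraph or sentence boundaries is a gentler alternative if the summaries read oddly.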
Fuller background and details in this Medium post here.
Sentiment analysis is a fairly common task in machine learning. Hugging Face's new pipeline feature, however, has made it incredibly easy to use a transformer-based model for this task. In this notebook, I'll explore how the HF pipeline can be used together with Plotly and Google Sheets to produce a detailed analysis of one speech, as well as how the same technique can be adapted for longer-term analysis of political speeches on one topic, or those by a common group of speakers.
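As a rough illustration of the per-speech analysis, the sketch below scores each sentence and converts the result into a signed value that is easy to plot over the course of a speech. It is a simplified stand-in for the notebook's code: `score_sentences` is a hypothetical helper, and it assumes the classifier returns `POSITIVE`/`NEGATIVE` labels with a `score`, as the default Hugging Face sentiment pipeline does.

```python
from typing import Callable, Dict, List


def score_sentences(sentences: List[str],
                    classify: Callable[[List[str]], List[Dict]]) -> List[Dict]:
    """Attach a signed sentiment score to each sentence.

    `classify` takes a list of strings and returns one dict per string with
    "label" and "score" keys. Positive labels keep their score; anything
    else is negated, so scores range from -1 to 1.
    """
    results = classify(sentences)
    rows = []
    for idx, (sentence, result) in enumerate(zip(sentences, results)):
        score = result["score"]
        signed = score if result["label"] == "POSITIVE" else -score
        rows.append({
            "index": idx,
            "sentence": sentence,
            "label": result["label"],
            "signed_score": signed,
        })
    return rows


# Usage with the Hugging Face pipeline (downloads a model on first run):
# from transformers import pipeline
# classifier = pipeline("sentiment-analysis")
# rows = score_sentences(["What a great day.", "This is terrible."], classifier)
```

The resulting rows load straight into a pandas DataFrame, which can then be charted with Plotly or exported to a spreadsheet.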
Fuller background in this post here.