Twitter Tweet Classification Using BERT

  1. Introduction and Background
  2. BERT
  3. Dataset
  4. Tweet Preprocessing
  5. Classification Model
    5.1 Baseline Model with LSTM and GloVe Embedding
    5.2 BERT Sequence Classification Model
    5.3 BERT with LSTM Classification Model
    5.4 Helper Functions
  6. Model Training
  7. Results

1. Introduction and Background

During a mass casualty event such as a natural disaster or a mass shooting, social networks such as Twitter or Facebook act as a conduit of information. This information includes the location and type of personal injuries, infrastructure damage, donations, advice, and emotional support. Useful information can be harnessed by first responders and agencies to assess damage and coordinate rescue operations. However, the speed and volume at which information comes in make it challenging for rescue personnel to discern relevant posts from extraneous ones. In this light, we want to create a machine learning model that can automatically classify social media posts into different categories. The model aims to extract useful information from a sea of posts and sort it into several classes.

In this project, we will classify Twitter tweets during natural disasters. We attempt to classify tweets into 7 categories:

  1. not informative
  2. caution and advice
  3. affected individuals
  4. infrastructure and utilities damage
  5. donations and volunteering
  6. sympathy and support
  7. other useful information

Our baseline model will be an LSTM model using the Stanford GloVe Twitter word embeddings. We will compare this baseline with a Google BERT base classifier model and a BERT model modified with an LSTM. The models will be written in PyTorch.

2. BERT

In natural language processing, a word is represented by a vector of numbers before being input into a machine learning model for processing. These word vectors are commonly referred to as word embeddings. A word embedding should not be a random vector, but rather should express the meaning of the word and its relations to other words. For example, subtracting the embedding for "woman" from the embedding for "queen" and then adding the embedding for "man" should result in a vector close to the embedding for "king".

These word representations thus allow mathematical operations to be conducted on the "meaning" of the words in the machine learning model, and allow the model to discern the overall meaning of an utterance.
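As an aside, this kind of arithmetic can be checked directly with pretrained vectors. The sketch below uses gensim and its downloadable GloVe Twitter vectors purely for illustration; neither gensim nor this snippet is part of the project code, and the "glove-twitter-50" model name is an assumption about gensim's downloader catalogue.

```python
# Illustration only: embedding arithmetic with pretrained GloVe Twitter
# vectors via gensim's downloader (not part of this project's code).
import gensim.downloader as api

glove = api.load("glove-twitter-50")   # 50-dimensional GloVe Twitter vectors (assumed name)

# "queen" - "woman" + "man" should land near "king".
print(glove.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))
```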

Word representations are often trained on large text corpora based on co-occurrence statistics, i.e. the number of times words appear close to each other in a sentence. Pre-trained representations can be broadly divided into two classes, contextual and non-contextual. Contextual representations can be further divided into unidirectional and bidirectional.

Non-contextual representations use the same vector to represent a word, even when the word has multiple meanings, regardless of the context in which it appears. For example, in both of the sentences "He flipped the light switch" and "His steps were light", the same vector is used for the word "light", even though the word has completely different meanings in the two contexts. The commonly used word2vec and GloVe embeddings belong to this class of representation.

Contextual representations use different vectors to represent a word depending on the context in which it appears. This type of representation is more accurate and performs better. This class of word representation can be divided into unidirectional and bidirectional. In unidirectional contextual representations, only the context on one side of the word, either left or right, is used to generate the representation. The ELMo representation, for example, is built from separate left-to-right and right-to-left language models, so each of its component models is unidirectional.

BERT, or Bidirectional Encoder Representations from Transformers, is a bidirectional contextual word representation. Bidirectional representations are better because the meaning of a word depends on both the words before it and the words after it. Unidirectional contextual embeddings are commonly generated by scanning forward or backward from the word of interest; scanning both sides of the word at once would cause the word to "see itself" in the context. BERT solves this problem by first masking a certain percentage of the words in the sequence, then using a bidirectional Transformer encoder to scan the entire sequence, and finally predicting the masked words. For example,

Input: He took his children to the [mask1] to read some books after [mask2] ended.

Labels: mask1 = library, mask2 = school
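To make the masking idea concrete, here is a small sketch using the HuggingFace transformers fill-mask pipeline with a pretrained bert-base-uncased model. It predicts one masked word at a time and is only an illustration of the masked language modelling objective, not part of this project's code.

```python
# Illustration of BERT's masked-word prediction with the HuggingFace
# transformers fill-mask pipeline (one [MASK] token at a time).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("He took his children to the [MASK] to read some books."):
    print(prediction["token_str"], round(prediction["score"], 3))
```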

 

3. Dataset

Labelled tweets are gathered from two online sources, CrisisNLP and CrisisLex. Each website contains several repositories, and each repository contains tweets gathered during various natural disasters in csv files. We will only use English tweets. The labels from each repository are not identical but largely similar, so we will map them to a set of common labels.

4. Tweet Preprocessing

Let's preprocess the tweets into the appropriate format before feeding them into the network. We download the English tweets from the CrisisNLP and CrisisLex online repositories and store them in separate file directories.

First, we list all the csv files we have saved in the CrisisNLP directory.
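A minimal sketch of this step, assuming the files were saved under a local crisisNLP/ directory (the directory name is an assumption):

```python
# List all csv files saved in the (assumed) crisisNLP directory.
import glob

crisisnlp_files = sorted(glob.glob("crisisNLP/**/*.csv", recursive=True))
print(len(crisisnlp_files), "csv files found")
```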

 

All the csv files are loaded and appended into a single Pandas data frame.
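A sketch of the loading step, assuming all the files share the same column layout:

```python
# Load every csv file and append it into a single Pandas data frame.
import pandas as pd

frames = [pd.read_csv(path) for path in crisisnlp_files]
crisisnlp_df = pd.concat(frames, ignore_index=True)
crisisnlp_df.head(2)
```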

 

Let's examine the resulting data frame.

_unit_id  | _golden | _unit_state | _trusted_judgments | _last_judgment_at  | choose_one_category       | choose_one_category:confidence | choose_one_category_gold | tweet_id             | tweet_text
879819338 | False   | finalized   | 3                  | 2/11/2016 11:41:40 | other_useful_information  | 0.3429                         | NaN                      | '382813388503412736' | USGS reports a M1.7 #earthquake 70km ESE of He...
879819339 | False   | finalized   | 3                  | 2/11/2016 09:52:13 | not_related_or_irrelevant | 0.6345                         | NaN                      | 382813391435206659   | QuakeFactor M 3.0, Southern Alaska: Sunday, Se...

As one can see, there are 10 columns; the relevant ones are "choose_one_category", "choose_one_category:confidence", and "tweet_text". In particular, "choose_one_category" contains the tweet labels, and "choose_one_category:confidence" is the confidence of the labels.

We will drop the columns that are not relevant, and also only keep the rows that have a confidence level greater than 0.6.
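A sketch of this filtering step, using the column names from the header shown above:

```python
# Keep only the relevant columns and the confidently labelled rows.
crisisnlp_df = crisisnlp_df[["choose_one_category",
                             "choose_one_category:confidence",
                             "tweet_text"]]
crisisnlp_df = crisisnlp_df[crisisnlp_df["choose_one_category:confidence"] > 0.6]
```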

We do the same for the CrisisLex data, dropping the columns we do not need.

We combine the two data frames into a common data frame. Before we do that, we need to give the same name to the corresponding columns in both data frames: we will name the tweet column "tweet" and the category column "type". We will then drop the confidence column in the CrisisNLP data and finally concatenate the two data frames.
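A sketch of the renaming and concatenation; the CrisisLex column names used below ("tweet_text" and "label") are assumptions about that dataset's layout:

```python
# Rename the corresponding columns, drop the confidence column, and
# concatenate the two data frames.
crisisnlp_df = crisisnlp_df.rename(columns={"tweet_text": "tweet",
                                            "choose_one_category": "type"})
crisisnlp_df = crisisnlp_df.drop(columns=["choose_one_category:confidence"])

# The CrisisLex column names below are assumed, not confirmed.
crisislex_df = crisislex_df.rename(columns={"tweet_text": "tweet",
                                            "label": "type"})

df = pd.concat([crisisnlp_df, crisislex_df], ignore_index=True)
```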

Now that we have stripped the necessary data from the csv file, we need to clean up the tweets. We will

  1. Remove the URL links
  2. Remove the "RT", "&", "<", ">", and "@" symbols
  3. Remove non-ASCII characters
  4. Remove extra spaces
  5. Insert spaces between punctuation marks
  6. Strip leading and trailing spaces
  7. Convert all words to lower-case

We will calculate the length of each tweet and only keep unique tweets that are 3 words or longer.
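A sketch of the cleaning and filtering steps; the exact regular expressions are assumptions that simply follow the list above:

```python
# Clean each tweet following the steps listed above.
import re

def clean_tweet(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)               # remove URL links
    text = re.sub(r"\bRT\b|&amp;|&lt;|&gt;|[&<>@]", " ", text)  # remove RT, &, <, >, @
    text = text.encode("ascii", errors="ignore").decode()       # remove non-ASCII characters
    text = re.sub(r"([!?.,;:])", r" \1 ", text)                 # space out punctuation
    text = re.sub(r"\s+", " ", text)                            # collapse extra spaces
    return text.strip().lower()                                 # strip ends and lower-case

df["tweet"] = df["tweet"].apply(clean_tweet)

# Keep unique tweets that are at least 3 words long.
df["length"] = df["tweet"].str.split().str.len()
df = df.drop_duplicates(subset="tweet")
df = df[df["length"] >= 3]
```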

Now we have our set of working data. We will consolidate all the labels into the following 7 labels.

  1. Not informative
  2. Other useful information
  3. Caution and advice
  4. Infrastructure and utilities damage
  5. Donations and volunteering
  6. Sympathy and support
  7. Affected individuals

Afterwards, we will assign a numeric type to each of the labels.
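A sketch of the numeric mapping; the exact consolidated label strings and the integer assignment used here are assumptions, only consistency matters:

```python
# Map each consolidated label to a numeric type.
label_map = {
    "not_informative": 0,
    "other_useful_information": 1,
    "caution_and_advice": 2,
    "infrastructure_and_utilities_damage": 3,
    "donations_and_volunteering": 4,
    "sympathy_and_support": 5,
    "affected_individuals": 6,
}
df["label"] = df["type"].map(label_map)
```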

The final step is to preprocess the tweet texts for input into the BERT classifier. The BERT classifier requires the input to be prefixed with the "[CLS]" token. We will also tokenize the tweet text with the BERT tokenizer and calculate the length of the tokenized text.
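A sketch of this step, shown here with the HuggingFace BertTokenizer (which package version the original code used is not specified, so treat the import as an assumption):

```python
# Prefix each tweet with [CLS], tokenize it with the BERT tokenizer,
# and record the tokenized length.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

df["bert_tokens"] = df["tweet"].apply(lambda t: tokenizer.tokenize("[CLS] " + t))
df["bert_length"] = df["bert_tokens"].apply(len)
```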

5. Classification Model

The models will be programmed using PyTorch. We will compare 3 different classification models. The baseline model is an LSTM network using the GloVe Twitter word embedding. It will be compared with two BERT-based models. The basic BERT model is the pretrained BertForSequenceClassification model, which we will fine-tune on the Twitter dataset. The second BERT-based model stacks an LSTM on top of BERT.

5.1 Baseline Model with LSTM and GloVe Embedding

We use a single-layer bi-directional LSTM neural network model as our baseline. The hidden size of the LSTM cell is 256. Tweets are first embedded using the 50-dimensional GloVe Twitter embedding. The stacked final states of the LSTM cell are connected to a softmax classifier through a fully connected layer.
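A sketch of this architecture under the stated hyperparameters; loading the pretrained GloVe weights into the embedding layer is omitted, and the class and argument names are my own:

```python
# Baseline: GloVe-embedded tweets fed through a single-layer bidirectional
# LSTM, whose stacked final states feed a classifier via a fully connected
# layer (softmax is applied inside the cross-entropy loss).
import torch
import torch.nn as nn

class BaselineLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hidden_size=256, num_classes=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden)
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # stack forward/backward states
        return self.fc(final)                       # class logits
```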

 

5.2 BERT Sequence Classification Model

We fine-tune the basic BERT sequence classification model, BertForSequenceClassification. We use the English uncased BERT base model, which has 12 transformer layers, 12 self-attention heads, and a hidden size of 768.
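A sketch of loading the pretrained classifier, shown with the HuggingFace transformers package (the exact package the original code used is not specified):

```python
# Pretrained BERT base uncased with a 7-way classification head.
from transformers import BertForSequenceClassification

bert_clf = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # 12 layers, 12 attention heads, hidden size 768
    num_labels=7)
```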

 

5.3 BERT with LSTM Classification Model

We stack a bidirectional LSTM on top of BERT. The input to the LSTM is the sequence of BERT final hidden states for the entire tweet. Then, as in the baseline model, the stacked final hidden states of the LSTM are connected to a softmax classifier through an affine layer.
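A sketch of this model, again using the transformers BertModel; the layer sizes mirror the baseline, and the exact wiring is an assumption:

```python
# BERT + LSTM: the final hidden states of every token are fed to a
# bidirectional LSTM, whose stacked final states go through an affine
# layer to a softmax classifier.
import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTM(nn.Module):
    def __init__(self, hidden_size=256, num_classes=7):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state   # (batch, seq, 768)
        _, (h_n, _) = self.lstm(sequence_output)
        final = torch.cat([h_n[0], h_n[1]], dim=1)    # stack forward/backward states
        return self.fc(final)
```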

 

5.4 Helper Functions

These are the helper functions used to convert batches of tweets to tensors, padding them to a common length in the process.
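A sketch of such a helper, assuming the tweets arrive as lists of BERT tokens and that the tokenizer from section 4 is available; the function name and exact return values are my own:

```python
# Convert a batch of token lists to padded id tensors plus an attention mask.
import torch

def batch_to_tensors(token_lists, tokenizer, device="cpu"):
    max_len = max(len(tokens) for tokens in token_lists)
    input_ids, attention_mask = [], []
    for tokens in token_lists:
        ids = tokenizer.convert_tokens_to_ids(tokens)
        padding = [0] * (max_len - len(ids))          # 0 is [PAD] for bert-base-uncased
        input_ids.append(ids + padding)
        attention_mask.append([1] * len(ids) + padding)
    return (torch.tensor(input_ids, device=device),
            torch.tensor(attention_mask, device=device))
```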

 

6. Model Training

We use the BertAdam optimizer for all BERT-related models. The initial learning rate for the BERT part of the model is set to 0.00002, and for the non-BERT parts it is set to 0.001. The batch size is set to 64. Patience is set to 5, and the learning rate is halved each time this number of epochs passes without improvement. Training is stopped early after 5 learning-rate reductions. Cross-entropy loss is used as the loss function, with each category weighted by .
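A sketch of the optimizer setup with the two learning rates; torch.optim.AdamW and ReduceLROnPlateau are used here in place of the original BertAdam and the manual patience logic, and the per-category loss weights are omitted:

```python
# Separate learning rates for the BERT and non-BERT parameters.
import torch

model = BertLSTM()   # the model sketched in section 5.3
bert_params  = [p for n, p in model.named_parameters() if n.startswith("bert.")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("bert.")]

optimizer = torch.optim.AdamW([
    {"params": bert_params,  "lr": 2e-5},   # BERT part
    {"params": other_params, "lr": 1e-3},   # LSTM and classifier
])
# Halve the learning rate after 5 epochs without validation improvement;
# call scheduler.step(val_loss) once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=5)

criterion = torch.nn.CrossEntropyLoss()     # per-category weights omitted here
```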

7. Results

We examine the accuracy and the Matthews correlation coefficient (MCC) of all the models. Both BERT models outperform the baseline model with GloVe embedding, as expected. The LSTM BERT model, however, has a lower accuracy and MCC score than the base BERT sequence classification model.

Model     | Accuracy | Matthews Correlation Coefficient
Baseline  | 0.6323   | 0.5674
Base BERT | 0.6948   | 0.6401
LSTM BERT | 0.6853   | 0.6311

We give the confusion matrices below. All models have trouble properly classifying the "not related or not informative" class; the accuracy for that class is around 50%.

Baseline Model Confusion Matrix


 

BERT Sequence Classification Confusion Matrix

 

LSTM + BERT Confusion Matrix