Twitter Tweet Classification Using BERT

  1. Introduction and Background
  2. BERT
  3. Dataset
  4. Tweet Preprocessing
  5. Classification Model
    5.1 Baseline Model with LSTM and GloVe Embedding
    5.2 BERT Sequence Classification Model
    5.3 BERT with LSTM Classification Model
    5.4 Helper Functions
  6. Model Training
  7. Results

1. Introduction and Background

During a mass casualty event such as a natural disaster or a mass shooting, social networks such as Twitter or Facebook act as a conduit of information. This information includes the location and type of personal injuries, infrastructure damage, donations, advice, and emotional support. Useful information can be harnessed by first responders and agencies to assess damage and coordinate rescue operations. However, the speed and volume at which information comes in make it challenging for rescue personnel to discern relevant posts from extraneous ones. In this light, we want to create a machine learning model that can automatically classify social media posts into different categories. The model aims to extract useful information from a sea of posts and sort it into several classes.

In this project, we will classify Twitter tweets during natural disasters. We attempt to classify tweets into 7 categories:

  1. not informative
  2. caution and advice
  3. affected individuals
  4. infrastructure and utilities damage
  5. donations and volunteering
  6. sympathy and support
  7. other useful information

Our baseline model will be an LSTM model using the Stanford GloVe Twitter word embeddings. We will compare this baseline with a Google BERT base classifier model and a BERT model modified with an LSTM. The models will be written in PyTorch.

2. BERT

In natural language processing, a word is represented by a vector of numbers before being input into a machine learning model for processing. These word vectors are commonly referred to as word embeddings. A word embedding should not be a random vector, but rather should express the meaning of the word and its relations to other words. For example, subtracting the embedding for "woman" from the embedding for "queen" and then adding the embedding for "man" should result in a vector close to the embedding for "king".

These word representations thus allow mathematical operations to be conducted on the "meaning" of the words in the machine learning model, and allow the model to discern the overall meaning of an utterance.
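As an aside, this kind of arithmetic can be checked directly with pretrained vectors. The sketch below uses gensim and its downloadable GloVe Twitter vectors purely for illustration; neither gensim nor this snippet is part of the project code, and the "glove-twitter-50" model name is an assumption about gensim's downloader catalogue.

```python
# Illustration only: embedding arithmetic with pretrained GloVe Twitter
# vectors via gensim's downloader (not part of this project's code).
import gensim.downloader as api

glove = api.load("glove-twitter-50")   # 50-dimensional GloVe Twitter vectors (assumed name)

# "queen" - "woman" + "man" should land near "king".
print(glove.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))
```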

Word representations are often trained on large text corpora based on co-occurrence statistics, i.e. the number of times words appear close to each other in a sentence. Pre-trained representations can be broadly divided into two classes, contextual and non-contextual. Contextual representations can be further divided into unidirectional and bidirectional.

Non-contextual representations use the same vector to represent a word, even when the word has multiple meanings, regardless of the context in which it appears. For example, in both of the sentences "He flipped the light switch" and "His steps were light", the same vector is used for the word "light", even though the word has completely different meanings in the two contexts. The commonly used word2vec and GloVe embeddings belong to this class of representation.

Contextual representations use different vectors to represent a word depending on the context in which it appears. This type of representation is more accurate and performs better. This class of word representation can be divided into unidirectional and bidirectional. In unidirectional contextual representations, only the context on one side of the word, either left or right, is used to generate the representation. The ELMo representation, for example, is built from separate left-to-right and right-to-left language models, so each of its component models is unidirectional.

BERT, or Bidirectional Encoder Representations from Transformers, is a bidirectional contextual word representation. Bidirectional representations are better because the meaning of a word depends on both the words before it and the words after it. Unidirectional contextual embeddings are commonly generated by scanning forward or backward from the word of interest; scanning both sides of the word at once would cause the word to "see itself" in the context. BERT solves this problem by first masking a certain percentage of the words in the sequence, then using a bidirectional Transformer encoder to scan the entire sequence, and finally predicting the masked words. For example,

Input: He took his children to the [mask1] to read some books after [mask2] ended.

Labels: mask1 = library, mask2 = school
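To make the masking idea concrete, here is a small sketch using the HuggingFace transformers fill-mask pipeline with a pretrained bert-base-uncased model. It predicts one masked word at a time and is only an illustration of the masked language modelling objective, not part of this project's code.

```python
# Illustration of BERT's masked-word prediction with the HuggingFace
# transformers fill-mask pipeline (one [MASK] token at a time).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("He took his children to the [MASK] to read some books."):
    print(prediction["token_str"], round(prediction["score"], 3))
```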

 

3. Dataset

Labelled tweets are gathered from two online sources, CrisisNLP and CrisisLex. Each website contains several repositories, and each repository contains tweets gathered during various natural disasters in csv files. We will only use English tweets. The labels from each repository are not identical but largely similar, so we will map them to a set of common labels.

4. Tweet Preprocessing

Let's preprocess the tweets into the appropriate format before feeding them into the network. We download the English tweets from the CrisisNLP and CrisisLex online repositories and store them in separate file directories.

First, we list all the csv files we have saved in the CrisisNLP directory.
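A minimal sketch of this step, assuming the files were saved under a local crisisNLP/ directory (the directory name is an assumption):

```python
# List all csv files saved in the (assumed) crisisNLP directory.
import glob

crisisnlp_files = sorted(glob.glob("crisisNLP/**/*.csv", recursive=True))
print(len(crisisnlp_files), "csv files found")
```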

 

All the csv files are loaded and appended into a single Pandas data frame.
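A sketch of the loading step, assuming all the files share the same column layout:

```python
# Load every csv file and append it into a single Pandas data frame.
import pandas as pd

frames = [pd.read_csv(path) for path in crisisnlp_files]
crisisnlp_df = pd.concat(frames, ignore_index=True)
crisisnlp_df.head(2)
```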

 

Let's examine the resulting data frame.

_unit_id  | _golden | _unit_state | _trusted_judgments | _last_judgment_at  | choose_one_category       | choose_one_category:confidence | choose_one_category_gold | tweet_id             | tweet_text
879819338 | False   | finalized   | 3                  | 2/11/2016 11:41:40 | other_useful_information  | 0.3429                         | NaN                      | '382813388503412736' | USGS reports a M1.7 #earthquake 70km ESE of He...
879819339 | False   | finalized   | 3                  | 2/11/2016 09:52:13 | not_related_or_irrelevant | 0.6345                         | NaN                      | 382813391435206659   | QuakeFactor M 3.0, Southern Alaska: Sunday, Se...

As one can see, there are 10 columns; the relevant ones are "choose_one_category", "choose_one_category:confidence", and "tweet_text". In particular, "choose_one_category" contains the tweet labels, and "choose_one_category:confidence" is the confidence of the labels.

We will drop the columns that are not relevant, and also only keep the rows that have a confidence level greater than 0.6.
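A sketch of this filtering step, using the column names from the header shown above:

```python
# Keep only the relevant columns and the confidently labelled rows.
crisisnlp_df = crisisnlp_df[["choose_one_category",
                             "choose_one_category:confidence",
                             "tweet_text"]]
crisisnlp_df = crisisnlp_df[crisisnlp_df["choose_one_category:confidence"] > 0.6]
```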

We do the same for the CrisisLex data, dropping the columns we do not need.

We combine the two data frames into a common data frame. Before we do that, we need to give the same name to the corresponding columns in both data frames: we will name the tweet column "tweet" and the category column "type". We will then drop the confidence column in the CrisisNLP data and finally concatenate the two data frames.
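A sketch of the renaming and concatenation; the CrisisLex column names used below ("tweet_text" and "label") are assumptions about that dataset's layout:

```python
# Rename the corresponding columns, drop the confidence column, and
# concatenate the two data frames.
crisisnlp_df = crisisnlp_df.rename(columns={"tweet_text": "tweet",
                                            "choose_one_category": "type"})
crisisnlp_df = crisisnlp_df.drop(columns=["choose_one_category:confidence"])

# The CrisisLex column names below are assumed, not confirmed.
crisislex_df = crisislex_df.rename(columns={"tweet_text": "tweet",
                                            "label": "type"})

df = pd.concat([crisisnlp_df, crisislex_df], ignore_index=True)
```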

Now that we have stripped the necessary data from the csv file, we need to clean up the tweets. We will

  1. Remove the URL links
  2. Remove the "RT", "&", "<", ">", and "@" symbols
  3. Remove non-ASCII characters
  4. Remove extra spaces
  5. Insert spaces between punctuation marks
  6. Strip leading and trailing spaces
  7. Convert all words to lower-case

We will calculate the length of each tweet and only keep unique tweets that are 3 words or longer.
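A sketch of the cleaning and filtering steps; the exact regular expressions are assumptions that simply follow the list above:

```python
# Clean each tweet following the steps listed above.
import re

def clean_tweet(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)               # remove URL links
    text = re.sub(r"\bRT\b|&amp;|&lt;|&gt;|[&<>@]", " ", text)  # remove RT, &, <, >, @
    text = text.encode("ascii", errors="ignore").decode()       # remove non-ASCII characters
    text = re.sub(r"([!?.,;:])", r" \1 ", text)                 # space out punctuation
    text = re.sub(r"\s+", " ", text)                            # collapse extra spaces
    return text.strip().lower()                                 # strip ends and lower-case

df["tweet"] = df["tweet"].apply(clean_tweet)

# Keep unique tweets that are at least 3 words long.
df["length"] = df["tweet"].str.split().str.len()
df = df.drop_duplicates(subset="tweet")
df = df[df["length"] >= 3]
```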

Now we have our set of working data. We will consolidate all the labels into the following 7 labels.

  1. Not informative
  2. Other useful information
  3. Caution and advice
  4. Infrastructure and utilities damage
  5. Donations and volunteering
  6. Sympathy and support
  7. Affected individuals

Afterwards, we will assign a numeric type to each of the labels.
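A sketch of the numeric mapping; the exact consolidated label strings and the integer assignment used here are assumptions, only consistency matters:

```python
# Map each consolidated label to a numeric type.
label_map = {
    "not_informative": 0,
    "other_useful_information": 1,
    "caution_and_advice": 2,
    "infrastructure_and_utilities_damage": 3,
    "donations_and_volunteering": 4,
    "sympathy_and_support": 5,
    "affected_individuals": 6,
}
df["label"] = df["type"].map(label_map)
```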

The final step is to preprocess the tweet texts for input into the BERT classifier. The BERT classifier requires the input to be prefixed with the "[CLS]" token. We will also tokenize the tweet text with the BERT tokenizer and calculate the length of the tokenized text.
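A sketch of this step, shown here with the HuggingFace BertTokenizer (which package version the original code used is not specified, so treat the import as an assumption):

```python
# Prefix each tweet with [CLS], tokenize it with the BERT tokenizer,
# and record the tokenized length.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

df["bert_tokens"] = df["tweet"].apply(lambda t: tokenizer.tokenize("[CLS] " + t))
df["bert_length"] = df["bert_tokens"].apply(len)
```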

5. Classification Model

The models will be programmed using PyTorch. We will compare 3 different classification models. The baseline model is an LSTM network using the GloVe Twitter word embedding. It will be compared with two BERT-based models. The basic BERT model is the pretrained BertForSequenceClassification model, which we will fine-tune on the Twitter dataset. The second BERT-based model stacks an LSTM on top of BERT.

5.1 Baseline Model with LSTM and GloVe Embedding

We use a single-layer bi-directional LSTM neural network model as our baseline. The hidden size of the LSTM cell is 256. Tweets are first embedded using the 50-dimensional GloVe Twitter embedding. The stacked final states of the LSTM cell are connected to a softmax classifier through a fully connected layer.
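A sketch of this architecture under the stated hyperparameters; loading the pretrained GloVe weights into the embedding layer is omitted, and the class and argument names are my own:

```python
# Baseline: GloVe-embedded tweets fed through a single-layer bidirectional
# LSTM, whose stacked final states feed a classifier via a fully connected
# layer (softmax is applied inside the cross-entropy loss).
import torch
import torch.nn as nn

class BaselineLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hidden_size=256, num_classes=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden)
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # stack forward/backward states
        return self.fc(final)                       # class logits
```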

 

5.2 BERT Sequence Classification Model

We fine-tune the basic BERT sequence classification model, BertForSequenceClassification. We use the English uncased BERT base model, which has 12 transformer layers, 12 self-attention heads, and a hidden size of 768.
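A sketch of loading the pretrained classifier, shown with the HuggingFace transformers package (the exact package the original code used is not specified):

```python
# Pretrained BERT base uncased with a 7-way classification head.
from transformers import BertForSequenceClassification

bert_clf = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # 12 layers, 12 attention heads, hidden size 768
    num_labels=7)
```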

 

5.3 BERT with LSTM Classification Model

We stack a bidirectional LSTM on top of BERT. The input to the LSTM is the sequence of BERT final hidden states for the entire tweet. Then, as in the baseline model, the stacked final hidden states of the LSTM are connected to a softmax classifier through an affine layer.
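A sketch of this model, again using the transformers BertModel; the layer sizes mirror the baseline, and the exact wiring is an assumption:

```python
# BERT + LSTM: the final hidden states of every token are fed to a
# bidirectional LSTM, whose stacked final states go through an affine
# layer to a softmax classifier.
import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTM(nn.Module):
    def __init__(self, hidden_size=256, num_classes=7):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state   # (batch, seq, 768)
        _, (h_n, _) = self.lstm(sequence_output)
        final = torch.cat([h_n[0], h_n[1]], dim=1)    # stack forward/backward states
        return self.fc(final)
```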

 

5.4 Helper Functions

These are the helper functions used to convert batches of tweets to tensors, padding them to a common length in the process.
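A sketch of such a helper, assuming the tweets arrive as lists of BERT tokens and that the tokenizer from section 4 is available; the function name and exact return values are my own:

```python
# Convert a batch of token lists to padded id tensors plus an attention mask.
import torch

def batch_to_tensors(token_lists, tokenizer, device="cpu"):
    max_len = max(len(tokens) for tokens in token_lists)
    input_ids, attention_mask = [], []
    for tokens in token_lists:
        ids = tokenizer.convert_tokens_to_ids(tokens)
        padding = [0] * (max_len - len(ids))          # 0 is [PAD] for bert-base-uncased
        input_ids.append(ids + padding)
        attention_mask.append([1] * len(ids) + padding)
    return (torch.tensor(input_ids, device=device),
            torch.tensor(attention_mask, device=device))
```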

 

6. Model Training

We use the BertAdam optimizer for all BERT-related models. The initial learning rate for the BERT part of the model is set to 0.00002, and for the non-BERT parts it is set to 0.001. The batch size is set to 64. Patience is set to 5, and the learning rate is halved each time this number of epochs passes without improvement. Training is stopped early after 5 learning-rate reductions. Cross-entropy loss is used as the loss function, with each category weighted by .
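A sketch of the optimizer setup with the two learning rates; torch.optim.AdamW and ReduceLROnPlateau are used here in place of the original BertAdam and the manual patience logic, and the per-category loss weights are omitted:

```python
# Separate learning rates for the BERT and non-BERT parameters.
import torch

model = BertLSTM()   # the model sketched in section 5.3
bert_params  = [p for n, p in model.named_parameters() if n.startswith("bert.")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("bert.")]

optimizer = torch.optim.AdamW([
    {"params": bert_params,  "lr": 2e-5},   # BERT part
    {"params": other_params, "lr": 1e-3},   # LSTM and classifier
])
# Halve the learning rate after 5 epochs without validation improvement;
# call scheduler.step(val_loss) once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=5)

criterion = torch.nn.CrossEntropyLoss()     # per-category weights omitted here
```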

7. Results

We examine the accuracy and the Matthews correlation coefficient (MCC) of all the models. Both BERT models outperform the baseline model with GloVe embedding, as expected. The LSTM BERT model, however, has a lower accuracy and MCC score than the base BERT sequence classification model.

Model     | Accuracy | Matthews Correlation Coefficient
Baseline  | 0.6323   | 0.5674
Base BERT | 0.6948   | 0.6401
LSTM BERT | 0.6853   | 0.6311

We give the confusion matrices below. All models have trouble properly classifying the "not related or not informative" class; the accuracy for that class is around 50%.

Baseline Model Confusion Matrix


 

BERT Sequence Classification Confusion Matrix

 

LSTM + BERT Confusion Matrix