Twitter Tweet Classification Using BERT

1. Introduction and Background
2. BERT
3. Dataset
4. Tweet Preprocessing
5. Classification Model
5.1 Baseline Model with LSTM and GloVe Embedding
5.2 BERT Sequence Classification Model
5.3 BERT with LSTM Classification Model
5.4 Helper Functions
6. Model Training
7. Results
During a mass casualty event such as a natural disaster or a mass shooting, social networks such as Twitter or Facebook act as a conduit of information. This information includes the location and type of personal injury, infrastructure damage, donations, advice, and emotional support. First responders and agencies can harness it to assess damage and coordinate rescue operations. However, the speed and volume at which the information arrives make it hard for rescue personnel to separate the relevant posts from the extraneous ones. We therefore want to create a machine learning model that automatically classifies social media posts into different categories. The model aims to extract useful information from a sea of posts and sort it into several classes.
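In this project, we will classify tweets posted during natural disasters. We attempt to classify tweets into 7 categories:
- not informative
- other useful information
- caution and advice
- affected individuals
- infrastructure and utilities damage
- donations and volunteering
- sympathy and support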
Our baseline model will be an LSTM model using the Stanford GloVe Twitter word embedding. We will compare the baseline with a Google BERT base classifier model and a BERT model modified with an LSTM. The models will be written in PyTorch.
In natural language processing, a word is represented by a vector of numbers before it is fed into a machine learning model. These word vectors are commonly referred to as word embeddings. A word embedding should not be a random vector; it should express the meaning of the word and its relations to other words. For example, subtracting the embedding for "woman" from the embedding for "queen" and then adding the embedding for "man" should give approximately the embedding for "king".
These word representations thus allow mathematical operations to be conducted on the "meaning" of the words inside the machine learning model, and allow the model to discern the overall meaning of an utterance.
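The analogy above can be sketched with a handful of hand-made vectors. The three-dimensional values below are purely illustrative assumptions (they are not GloVe or word2vec vectors), and `analogy` is a hypothetical helper that picks the nearest remaining word by cosine similarity.

```python
import math

# Hand-made 3-dimensional "embeddings" (royalty, maleness, femaleness).
# The numbers are illustrative assumptions, not real pretrained vectors.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings):
    """Return the word closest to embeddings[a] - embeddings[b] + embeddings[c],
    excluding the three input words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


result = analogy("queen", "woman", "man", toy_embeddings)
print(result)
```

With real pretrained embeddings the result of the arithmetic is only approximately equal to the target vector, which is why a nearest-neighbour search is used instead of an exact comparison.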
Word representations are often trained on large text corpora based on co-occurrence statistics, which count how often words appear close to each other in a sentence. Pre-trained representations can be broadly divided into two classes, contextual and non-contextual. Contextual representations can be further divided into unidirectional and bidirectional.
Non-contextual representations use the same vector for a word that has multiple meanings, regardless of the context in which the word appears. For example, in both of the sentences "He flipped the light switch" and "His steps were light", the same vector is used for the word "light", even though the word has completely different meanings in the two contexts. The commonly used word2vec and GloVe embeddings belong to this class of representation.
Contextual representations use different vectors to represent a word depending on the context in which it appears. This type of representation captures word meaning more accurately and generally performs better on downstream tasks. In unidirectional contextual representations, only the context on one side of the word, either left or right, is used to generate the representation. The ELMo representation builds on unidirectional models: it concatenates the outputs of independently trained left-to-right and right-to-left language models.
BERT, or Bidirectional Encoder Representations from Transformers, is a bidirectional contextual word representation. Bidirectional representations are better because the meaning of a word depends on both the words before and after it. Unidirectional contextual embeddings are commonly generated by scanning forward or backward from the word of interest. Scanning both sides of the word together would let the word "see itself" in the context. BERT solves this problem by first masking a certain percentage of the words in the sequence, then using a bidirectional encoder built from Transformer layers to scan the entire sequence, and finally predicting the masked words. For example,
Input: He took his children to the [mask1] to read some books after [mask2] ended.
Labels: mask1 = library, mask2 = school
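The input/label pair above can be reproduced with a small sketch. The helper `mask_tokens` is hypothetical and takes the positions to mask as an argument so that the output is deterministic. In actual BERT pre-training the positions are drawn at random (about 15% of the tokens), and a single `[MASK]` token is used rather than numbered masks.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Replace the tokens at `positions` with `mask_token`.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model has to predict.
    """
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


sentence = "he took his children to the library to read some books after school ended".split()
masked, labels = mask_tokens(sentence, positions=[6, 12])
print(" ".join(masked))
print(labels)
```

Because the encoder sees the whole masked sequence at once, the prediction for each masked position can use the words on both sides without the target word leaking into its own context.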
Labelled tweets are gathered from 2 online sources, CrisisNLP and CrisisLex. Each website contains several repositories. Each repository contains tweets gathered during various natural disasters in a csv file. We will only choose English tweets. The labels from each repository are not identical but largely similar. We will map the labels to a set of common labels.
Let's preprocess the tweets into the appropriate format before feeding them into the network. We download the English tweets from the CrisisNLP and CrisisLex online repositories and store them in separate file directories.
First we list all the csv files we have saved in the crisisNLP directory.
import os
import pandas as pd

data_dir = 'data/crisisNLP'
files_csv = []
for file in os.listdir(data_dir):
    if file.endswith(".csv"):
        # os.path.join adds the path separator that plain concatenation would miss
        files_csv.append(os.path.join(data_dir, file))
All the csv files are loaded and appended into a single Pandas data frame.
crisisNLP_df = pd.DataFrame()
for file in files_csv:
    temp_df = pd.read_csv(file, encoding='utf-8')
    crisisNLP_df = pd.concat([crisisNLP_df, temp_df])
Let's examine the combined data frame.
crisisNLP_df.head(10)
_unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | choose_one_category | choose_one_category:confidence | choose_one_category_gold | tweet_id | tweet_text |
---|---|---|---|---|---|---|---|---|---|
879819338 | False | finalized | 3 | 2/11/2016 11:41:40 | other_useful_information | 0.3429 | NaN | '382813388503412736' | USGS reports a M1.7 #earthquake 70km ESE of He... |
879819339 | False | finalized | 3 | 2/11/2016 09:52:13 | not_related_or_irrelevant | 0.6345 | NaN | 382813391435206659 | QuakeFactor M 3.0, Southern Alaska: Sunday, Se... |
As one can see, there are 10 columns. The relevant ones are the "choose_one_category", "choose_one_category:confidence", and "tweet_text" columns. In particular, "choose_one_category" holds the tweet labels, and "choose_one_category:confidence" holds the confidence of the labels.
We will drop the columns that are not relevant, and keep only the rows that have a confidence level of at least 0.6.
drop_cols = ['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
             '_last_judgment_at', 'choose_one_category_gold',
             'tweet_id']
crisisNLP_df.drop(columns=drop_cols, inplace=True)
# assign the filtered frame back; otherwise the low-confidence rows are kept
crisisNLP_df = crisisNLP_df[crisisNLP_df['choose_one_category:confidence'] >= 0.6]
We do the same for CrisisLex data. We drop the columns not needed by us.
data_dir = 'data/CrisisLexT26/'
files_csv = []
for (dirpath, dirnames, filenames) in os.walk(data_dir):
    for file in filenames:
        if file.endswith("labeled.csv"):
            files_csv.append(os.path.join(dirpath, file))
crisisLex_df = pd.DataFrame()
for file in files_csv:
    temp_df = pd.read_csv(file, encoding='utf-8')
    crisisLex_df = pd.concat([crisisLex_df, temp_df])
# the CrisisLex column names start with a space
drop_cols = ['Tweet ID', ' Information Source',
             ' Informativeness']
crisisLex_df.drop(columns=drop_cols, inplace=True)
We combine the two data frames into a common data frame. Before we do that, we need to give the same name to the corresponding columns in both data frames. We will name the tweet column "tweet" and the category column "type". We will drop the confidence column in CrisisNLP and finally concatenate the two data frames.
crisisLex_df.rename(index=str, columns={" Information Type": "type", " Tweet Text": "tweet"}, inplace=True)
crisisNLP_df.rename(index=str, columns={"choose_one_category": "type", "tweet_text": "tweet"}, inplace=True)
crisisNLP_df.drop(columns='choose_one_category:confidence', inplace=True)
df=pd.concat([crisisLex_df,crisisNLP_df])
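Now that we have extracted the necessary data from the csv files, we need to clean up the tweets. We will
- remove URLs,
- remove retweet markers ("RT") and @mentions,
- remove non-ASCII words and characters,
- replace the HTML entities for &, < and >,
- remove extra spaces,
- insert spaces between words and punctuation marks,
- convert to lower case and strip white space at both ends.
# remove URLs (regex=True is passed explicitly; pandas 2.x treats the pattern as a literal string by default)
df['tweet_proc'] = df['tweet'].str.replace(r'http(\S)+', r'', regex=True)
df['tweet_proc'] = df['tweet_proc'].str.replace(r'http ...', r'', regex=True)
df['tweet_proc'] = df['tweet_proc'].str.replace(r'http', r'', regex=True)
df[df['tweet_proc'].str.contains(r'http')]  # sanity check: should return no rows
# remove RT, @
df['tweet_proc'] = df['tweet_proc'].str.replace(r'(RT|rt)[ ]*@[ ]*[\S]+', r'', regex=True)
df[df['tweet_proc'].str.contains(r'RT[ ]?@')]  # sanity check: should return no rows
df['tweet_proc'] = df['tweet_proc'].str.replace(r'@[\S]+', r'', regex=True)
# remove non-ascii words and characters
df['tweet_proc'] = [''.join([i if ord(i) < 128 else '' for i in text]) for text in df['tweet_proc']]
df['tweet_proc'] = df['tweet_proc'].str.replace(r'_[\S]?', r'', regex=True)
# replace the HTML entities &amp;, &lt; and &gt;
df['tweet_proc'] = df['tweet_proc'].str.replace(r'&amp;?', r'and', regex=True)
df['tweet_proc'] = df['tweet_proc'].str.replace(r'&lt;', r'<', regex=True)
df['tweet_proc'] = df['tweet_proc'].str.replace(r'&gt;', r'>', regex=True)
# remove extra spaces (no space inside the braces, otherwise the quantifier is read literally)
df['tweet_proc'] = df['tweet_proc'].str.replace(r'[ ]{2,}', r' ', regex=True)
# insert space between punctuation marks
df['tweet_proc'] = df['tweet_proc'].str.replace(r'([\w\d]+)([^\w\d ]+)', r'\1 \2', regex=True)
df['tweet_proc'] = df['tweet_proc'].str.replace(r'([^\w\d ]+)([\w\d]+)', r'\1 \2', regex=True)
# lower case and strip white spaces at both ends
df['tweet_proc'] = df['tweet_proc'].str.lower()
df['tweet_proc'] = df['tweet_proc'].str.strip()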
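To see what the cleaning steps do to a single tweet, here is a minimal sketch that applies comparable rules with the standard `re` module. The function `clean_tweet` and the example tweet are hypothetical. As a simplification, the extra-space removal is moved after the punctuation step, and the underscore rule is omitted.

```python
import re


def clean_tweet(text):
    """Apply the cleaning rules to one tweet (simplified sketch)."""
    text = re.sub(r'http\S+', '', text)                      # remove URLs
    text = re.sub(r'(RT|rt)[ ]*@[ ]*\S+', '', text)          # remove retweet marker and its handle
    text = re.sub(r'@\S+', '', text)                         # remove remaining @mentions
    text = ''.join(ch for ch in text if ord(ch) < 128)       # drop non-ASCII characters
    text = text.replace('&amp;', 'and').replace('&lt;', '<').replace('&gt;', '>')
    text = re.sub(r'([\w\d]+)([^\w\d ]+)', r'\1 \2', text)   # space between a word and following punctuation
    text = re.sub(r'([^\w\d ]+)([\w\d]+)', r'\1 \2', text)   # space between punctuation and a following word
    text = re.sub(r'[ ]{2,}', ' ', text)                     # collapse runs of spaces
    return text.lower().strip()


example = "RT @user: Flood warning &amp; evacuations in Boulder!! http://t.co/abc123"
cleaned = clean_tweet(example)
print(cleaned)
```

Running the rules on one string at a time like this is a convenient way to check each regular expression before applying it to the whole data frame.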
We will calculate the length of each tweet and only keep unique tweets that are longer than 3 words.
df['tweet_proc_length'] = [len(text.split(' ')) for text in df['tweet_proc']]
df['tweet_proc_length'].value_counts()
df = df[df['tweet_proc_length']>3]
df = df.drop_duplicates(subset=['tweet_proc'])
Now we have our set of working data. We will consolidate all the labels into 7 common labels, which are the values of the mapping `type_map`.
Afterwards, we will assign a numeric ID to each of the labels.
type_map={'Not labeled': 'not informative',
'Other Useful Information': 'other useful information',
'Caution and advice': 'caution and advice',
'Affected individuals': 'affected individuals',
'Infrastructure and utilities': 'infrastructure and utilities damage',
'Donations and volunteering': 'donations and volunteering',
'Sympathy and support':'sympathy and support',
'Not applicable': 'not informative',
'other_useful_information': 'other useful information',
'not_related_or_irrelevant': 'not informative',
'donation_needs_or_offers_or_volunteering_services': 'donations and volunteering',
'injured_or_dead_people':'affected individuals',
'missing_trapped_or_found_people':'affected individuals',
'caution_and_advice':'caution and advice',
'infrastructure_and_utilities_damage':'infrastructure and utilities damage',
'sympathy_and_emotional_support':'sympathy and support',
'displaced_people_and_evacuations':'affected individuals'
}
df['type'] = df['type'].map(type_map)
df['type'].value_counts()
label_dict = dict()
# for each unique label, assign a numeric identifier (the most frequent label gets 0)
for i, l in enumerate(list(df.type.value_counts().keys())):
    label_dict.update({l: i})
df['type_label'] = [label_dict[label] for label in df.type]  # create a column in df to store the numeric ids
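The numbering scheme can be illustrated without pandas. The sketch below assumes a hypothetical list of already-consolidated labels and uses `collections.Counter` in place of `value_counts()`; as in the code above, the most frequent label receives ID 0.

```python
from collections import Counter

# Hypothetical consolidated labels for six tweets.
labels = ["not informative", "caution and advice", "not informative",
          "sympathy and support", "caution and advice", "not informative"]

counts = Counter(labels)
# most_common() orders labels by descending frequency, like value_counts().
label_dict = {label: idx for idx, (label, _) in enumerate(counts.most_common())}
type_label = [label_dict[label] for label in labels]

print(label_dict)
print(type_label)
```

Note that an ordering based on frequency depends on the data, so the mapping should be saved and reused for the validation and test sets rather than recomputed.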
The final step is to preprocess the tweet texts for input into the BERT classifier. The BERT classifier requires the input to be prefixed by the "[CLS]" token. We will also tokenize the tweet text with the BERT tokenizer and calculate the length of the tokenized text.
df['tweet_proc_bert'] = '[CLS] '+df['tweet_proc']
from pytorch_pretrained_bert import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
df['tweet_proc_BERTbase_length'] = [len(tokenizer.tokenize(sent)) for sent in df['tweet_proc_bert']]
The models will be programmed using PyTorch. We will compare 3 different classification models. The baseline model is an LSTM network using the GloVe Twitter word embedding. It will be compared with two BERT-based models. The basic BERT model is the pretrained BertForSequenceClassification model, which we fine-tune on the Twitter dataset. The second BERT-based model stacks an LSTM on top of BERT.
We use a single-layer bidirectional LSTM neural network model as our baseline. The hidden size of the LSTM cell is 256. Tweets are first embedded using the GloVe Twitter embedding with 50 dimensions. The stacked final state of the LSTM cell is linked to a softmax classifier through a fully connected layer.
import torch
import torch.nn as nn
import torch.nn.utils
from torch.nn.utils.rnn import pack_padded_sequence
import sys
import pickle
from vocab import VocabEntry
import numpy as np


class BaselineModel(nn.Module):
    def __init__(self, rnn_state_size, embedding, vocab, num_tweet_class, dropout_rate=0):
        """
        @param rnn_state_size (int): size of lstm hidden layer
        @param embedding (torch.Tensor): shape (num_of_words_in_dict, embed_dim), glove embedding matrix
        @param vocab (VocabEntry): GloVe word dictionary/index
        @param num_tweet_class (int): number of labels / classes
        @param dropout_rate (float): dropout rate for training
        """
        super(BaselineModel, self).__init__()
        self.rnn_state_size = rnn_state_size
        self.embed_dim = embedding.size(1)
        self.vocab = vocab
        self.num_tweet_class = num_tweet_class
        self.padding_idx = self.vocab['<pad>']
        self.dropout_rate = dropout_rate
        # Create an embedding layer using the GloVe pretrained weights
        self.embedding_layer = nn.Embedding.from_pretrained(embeddings=embedding,
                                                            freeze=True,  # keep the pretrained weights fixed during training
                                                            padding_idx=self.padding_idx,
                                                            max_norm=None,
                                                            norm_type=2.0,
                                                            scale_grad_by_freq=False,
                                                            sparse=False)
        self.lstm = nn.LSTM(input_size=self.embed_dim,
                            hidden_size=self.rnn_state_size,
                            num_layers=1,
                            bias=True,
                            batch_first=False,
                            dropout=0,
                            bidirectional=True  # use a bi-directional LSTM
                            )
        # fully connected layer before softmax
        self.affine = nn.Linear(in_features=2*self.rnn_state_size,  # the hidden states of the biLSTM are stacked
                                out_features=self.num_tweet_class,
                                bias=True)
        self.dropout = nn.Dropout(p=self.dropout_rate)

    def forward(self, sents):
        """
        @param sents (list[list[str]]): a list of a list of words, sorted in descending length
        @return output (torch.Tensor): logits to put into softmax function to calculate prob
        """
        text_lengths = torch.tensor([len(sent) for sent in sents])
        sents_tensor = self.vocab.to_input_tensor(sents)  # Convert from list to tensor (max_sent_length, batch_size)
        x_embed = self.embedding_layer(sents_tensor)  # create embedding for words (max_sent_length, batch_size, embed_size)
        seq = pack_padded_sequence(x_embed.float(), text_lengths)
        enc_hiddens, (last_hidden, last_cell) = self.lstm(seq)
        output_hidden = torch.cat((last_hidden[0], last_hidden[1]), dim=1)  # (batch_size, 2*hidden_size)
        output_hidden = self.dropout(output_hidden)
        output = self.affine(output_hidden)  # (batch_size, n_class)
        return output

    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        @return model (nn.Module): model with saved parameters
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = BaselineModel(vocab=params['vocab'], embedding=params['embedding'], **args)
        model.load_state_dict(params['state_dict'])
        return model

    def save(self, path: str):
        """ Save the model to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)
        params = {
            'args': dict(rnn_state_size=self.rnn_state_size,
                         dropout_rate=self.dropout_rate,
                         num_tweet_class=self.num_tweet_class),
            'vocab': self.vocab,
            'embedding': self.embedding_layer.weight,
            'state_dict': self.state_dict()
        }
        torch.save(params, path)
We finetune the basic BERT sequence classification model, BertForSequenceClassification. We use the English BERT uncased base model, which has 12 transformer layers, 12 self-attention heads, and a hidden size of 768.
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForSequenceClassification
import torch
from torch.nn.utils.rnn import pack_padded_sequence
import sys
import numpy as np


class default_bert(torch.nn.Module):
    def __init__(self, num_class, bert_config='bert-base-uncased'):
        """
        :param num_class: number of classification categories
        :param bert_config: str, BERT model used
        """
        super(default_bert, self).__init__()
        self.num_class = num_class
        self.bert_config = bert_config
        self.model = BertForSequenceClassification.from_pretrained(self.bert_config, num_labels=self.num_class)
        self.tokenizer = BertTokenizer.from_pretrained(self.bert_config)

    def forward(self, sents):
        """
        :param sents: list[str], list of untokenized sentences
        :return: logits of shape (batch_size, num_class)
        """
        # helper function to convert strings to tensors and create the padding mask
        sents_tensor, masks_tensor, sents_lengths = sents_to_tensor(self.tokenizer, sents)
        output = self.model(input_ids=sents_tensor,
                            attention_mask=masks_tensor)
        return output

    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        @return model (nn.Module): model with saved parameters
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = default_bert(**args)
        model.load_state_dict(params['state_dict'])
        return model

    def save(self, path: str):
        """ Save the model to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)
        params = {
            'args': dict(bert_config=self.bert_config, num_class=self.num_class),
            'state_dict': self.state_dict()
        }
        torch.save(params, path)
We stack a bidirectional LSTM on top of BERT. The input to the LSTM is the sequence of BERT final hidden states for the entire tweet. Then, as in the baseline model, the stacked final hidden states of the LSTM are connected to a softmax classifier through an affine layer.
class LSTM_bert(torch.nn.Module):
    def __init__(self, num_class, dropout_rate, bert_config='bert-base-uncased'):
        """
        :param num_class: int, number of categories
        :param bert_config: str, BERT configuration description
        :param dropout_rate: float
        """
        super(LSTM_bert, self).__init__()
        self.num_class = num_class
        self.bert_config = bert_config
        self.tokenizer = BertTokenizer.from_pretrained(self.bert_config)
        self.bert = BertModel.from_pretrained(self.bert_config)
        self.dropout_rate = dropout_rate
        self.lstm_input_size = self.bert.config.hidden_size
        self.lstm_hidden_size = self.bert.config.hidden_size
        self.lstm = torch.nn.LSTM(input_size=self.lstm_input_size,
                                  hidden_size=self.lstm_hidden_size,
                                  bidirectional=True)
        self.dropout = torch.nn.Dropout(p=self.dropout_rate)
        self.fc = torch.nn.Linear(in_features=2*self.lstm_hidden_size,  # LSTM stacked hidden state
                                  out_features=self.num_class,
                                  bias=True)

    def forward(self, sents):
        """
        :param sents: list[str], list of untokenized sentences, sorted in descending length
        :return: torch.tensor of shape (batch_size, num_class)
        """
        sents_tensor, masks_tensor, sents_lengths = sents_to_tensor(self.tokenizer, sents)
        # 1. The tweet is first input to the BERT model
        encoded_layers, pooled_output = self.bert(input_ids=sents_tensor,
                                                  attention_mask=masks_tensor,
                                                  output_all_encoded_layers=False)
        # 2. The output is permuted to (max_length, batch_size, hidden_size) to fit the LSTM input
        encoded_layers = encoded_layers.permute(1, 0, 2)
        # 3. The encoded layers are fed into the biLSTM
        enc_hiddens, (last_hidden, last_cell) = self.lstm(pack_padded_sequence(encoded_layers, sents_lengths))
        # 4. The final hidden states of the biLSTM are concatenated together
        # h_n has shape (num_layers * num_directions, batch, hidden_size)
        output_hidden = torch.cat((last_hidden[0, :, :], last_hidden[1, :, :]), dim=1)
        # 5. Dropout applied
        output_hidden = self.dropout(output_hidden)
        # 6. Affine layer before softmax
        output = self.fc(output_hidden)
        return output

    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        @return model (nn.Module): model with saved parameters
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = LSTM_bert(**args)
        model.load_state_dict(params['state_dict'])
        return model

    def save(self, path: str):
        """ Save the model to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)
        params = {
            'args': dict(bert_config=self.bert_config, num_class=self.num_class, dropout_rate=self.dropout_rate),
            'state_dict': self.state_dict()
        }
        torch.save(params, path)
These are the helper functions used to convert batches of tweets to tensors, and pad to a common length in the process.
def pad_sents(sents, pad_token):
    """ Pad list of sentences to the longest length in the batch.
    @param sents (list[list[str]]): list of tokenized strings
    @param pad_token (str): pad token
    @returns sents_padded (list[list[str]]): list of tokenized sentences with padding, shape (batch_size, max_sentence_length)
    """
    sents_padded = []
    max_len = max(len(s) for s in sents)
    for s in sents:
        padded = [pad_token] * max_len
        padded[:len(s)] = s
        sents_padded.append(padded)
    return sents_padded


def sents_to_tensor(tokenizer, sents):
    """
    :param tokenizer: BERT tokenizer
    :param sents: list[str], list of untokenized strings
    :return: token id tensor (batch_size, max_length), padding mask tensor (batch_size, max_length), tensor of unpadded lengths (batch_size,)
    """
    tokens_list = [tokenizer.tokenize(sent) for sent in sents]
    sents_lengths = [len(tokens) for tokens in tokens_list]
    sents_lengths = torch.tensor(sents_lengths)
    tokens_list_padded = pad_sents(tokens_list, '[PAD]')
    masks = np.asarray(tokens_list_padded) != '[PAD]'  # True for real tokens, False for padding
    masks_tensor = torch.tensor(masks, dtype=torch.long)
    tokens_id_list = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list_padded]
    sents_tensor = torch.tensor(tokens_id_list, dtype=torch.long)
    return sents_tensor, masks_tensor, sents_lengths
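The padding and masking logic can be checked on a tiny batch without a tokenizer or PyTorch. The sketch below is a plain-Python stand-in: `pad_and_mask` is a hypothetical helper, and the token lists are made up rather than produced by the BERT tokenizer.

```python
def pad_and_mask(token_lists, pad_token="[PAD]"):
    """Pad every token list to the longest one and build the attention mask.

    Returns (padded token lists, masks with 1 for real tokens and 0 for
    padding, original lengths).
    """
    max_len = max(len(tokens) for tokens in token_lists)
    padded, masks, lengths = [], [], []
    for tokens in token_lists:
        n_pad = max_len - len(tokens)
        padded.append(tokens + [pad_token] * n_pad)
        masks.append([1] * len(tokens) + [0] * n_pad)
        lengths.append(len(tokens))
    return padded, masks, lengths


# Made-up, already tokenized batch, sorted by descending length.
batch = [["[CLS]", "flood", "warning", "issued"], ["[CLS]", "stay", "safe"]]
padded, masks, lengths = pad_and_mask(batch)
print(padded)
print(masks)
print(lengths)
```

The mask tells BERT which positions to ignore in self-attention, while the lengths are what `pack_padded_sequence` needs so that the LSTM skips the padded positions.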
We use the BertAdam optimizer for all BERT-related models. The initial learning rate is set to 0.00002 for the BERT part of the model and to 0.001 for the non-BERT part. The batch size is set to 64. Patience is set to 5: when the validation loss has failed to improve this many times, the learning rate is reduced by half. Training is stopped early after 5 such reductions in learning rate. Cross entropy loss is used as the loss function, with each category weighted by the ratio of the largest class count to that category's count in the training set.
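As a small worked example of the class weighting, the sketch below computes the weights for hypothetical class counts; the numbers are made up and `class_weights` is an illustrative helper, not part of the training script.

```python
def class_weights(label_counts):
    """Weight each class by (largest class count) / (its own count).

    `label_counts` maps the numeric class id (0..n-1) to its count.
    """
    largest = max(label_counts.values())
    return [largest / label_counts[i] for i in range(len(label_counts))]


toy_counts = {0: 400, 1: 200, 2: 100}  # hypothetical training-set counts
weights = class_weights(toy_counts)
print(weights)
```

The most frequent class gets weight 1, and rarer classes get proportionally larger weights, so that errors on them contribute more to the loss.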
from pytorch_pretrained_bert import BertAdam
from bert import default_bert, LSTM_bert
import pickle
import numpy as np
import torch
import pandas as pd
import time
import sys
from utils import batch_iter


def validation(model, df_val, loss_func, bert_size):
    """ validation of model during training.
    @param model (nn.Module): the model being trained
    @param df_val (dataframe): validation dataset
    @param loss_func (nn.Module): loss function
    @param bert_size (str): BERT size tag used in the token-length column name, e.g. 'base'
    @return avg loss value across validation dataset
    """
    was_training = model.training
    model.eval()  # put all layers in eval mode so that batchnorm and dropout layers behave as in inference
    # sort in descending token length, as required by pack_padded_sequence
    df_val = df_val.sort_values(by='tweet_proc_BERT'+bert_size+'_length', ascending=False)
    tweet_proc_bert = list(df_val['tweet_proc_bert'])
    type_label = list(df_val['type_label'])
    val_batch_size = 32
    num_val_samples = df_val.shape[0]
    n_batch = int(np.ceil(num_val_samples/val_batch_size))
    total_loss = 0.
    with torch.no_grad():
        for i in range(n_batch):
            sents = tweet_proc_bert[i*val_batch_size: (i+1)*val_batch_size]
            targets = torch.tensor(type_label[i*val_batch_size: (i+1)*val_batch_size],
                                   dtype=torch.long)
            batch_size = len(sents)
            output = model(sents)
            batch_loss = loss_func(output, targets)
            total_loss += batch_loss.item()*batch_size
    if was_training:
        model.train()
    return total_loss/num_val_samples


def train(args):
    label_name = ['not informative',
                  'other useful information',
                  'caution and advice',
                  'affected individuals',
                  'infrastructure and utilities damage',
                  'donations and volunteering',
                  'sympathy and support',
                  ]
    # save_file_name = args['--model']+'_model.bin'
    bert_size = args['--bert-config'].split('-')[1]
    start_time = time.time()
    print('Importing data...', file=sys.stderr)
    df_train = pd.read_csv(args['--train'], index_col=0)
    df_val = pd.read_csv(args['--dev'], index_col=0)
    train_label = dict(df_train['type_label'].value_counts())
    label_max = float(max(train_label.values()))
    print(train_label, file=sys.stderr)
    # weight each class by (largest class count) / (class count)
    train_label_weight = torch.tensor([label_max/train_label[i] for i in range(len(train_label))])
    print('Done! time elapsed %.2f sec' % (time.time() - start_time), file=sys.stderr)
    print('-' * 80, file=sys.stderr)
    start_time = time.time()
    print('Set up model...', file=sys.stderr)
    if args['--model'] == 'default_bert':
        model = default_bert(num_class=len(label_name), bert_config=args['--bert-config'])
        optimizer_grouped_parameters = [
            {'params': model.model.bert.parameters()},
            {'params': model.model.classifier.parameters(), 'lr': float(args['--lr'])}
        ]
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=float(args['--lr-bert']),
                             max_grad_norm=float(args['--clip-grad'])
                             )
    elif args['--model'] == 'LSTM_bert':
        model = LSTM_bert(num_class=len(label_name), dropout_rate=float(args['--dropout']), bert_config=args['--bert-config'])
        optimizer_grouped_parameters = [
            {'params': model.bert.parameters()},
            {'params': model.lstm.parameters(), 'lr': float(args['--lr'])},
            {'params': model.fc.parameters(), 'lr': float(args['--lr'])}]
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=float(args['--lr-bert']),
                             max_grad_norm=float(args['--clip-grad'])
                             )
    else:
        # stop here; continuing without a model would fail further down
        print('wrong model...', file=sys.stderr)
        sys.exit(1)
    print('Done! time elapsed %.2f sec' % (time.time() - start_time), file=sys.stderr)
    print('-' * 80, file=sys.stderr)
    model.train()  # set model for training mode
    criterion = torch.nn.CrossEntropyLoss(weight=train_label_weight, reduction='mean')
    torch.save(criterion, 'loss_func')  # for later testing
    train_batch_size = int(args['--batch-size'])
    valid_niter = int(args['--valid-niter'])
    display_num = int(args['--display_num'])
    model_save_path = args['--save-to']
    num_restarts = 0
    train_iter = patience = cum_loss = report_loss = 0
    total_samples = display_samples = epoch = 0
    valid_loss_hist = []
    train_time = begin_time = time.time()
    print('Begin training...')
    while True:
        epoch += 1
        for sents, targets in batch_iter(df_train, batch_size=train_batch_size, shuffle=False, bert=(args['--bert-config'])):  # for each epoch
            train_iter += 1
            batch_size = len(sents)
            labels = torch.tensor(targets, dtype=torch.long)
            optimizer.zero_grad()  # restarting the grad accumulations between mini-batches
            output = model(sents)  # pass through model
            loss = criterion(output, labels)  # calculate loss
            loss.backward()  # back prop
            optimizer.step()  # update weights
            batch_losses_val = loss.item() * batch_size
            report_loss += batch_losses_val
            cum_loss += batch_losses_val
            display_samples += batch_size
            total_samples += batch_size
            if train_iter % display_num == 0:
                print('epoch %d, iter %d, avg. loss %.2f, '
                      'total samples %d, speed %.2f samples/sec, '
                      'time elapsed %.2f sec' %
                      (epoch, train_iter, report_loss / display_samples,
                       total_samples, display_samples / (time.time() - train_time),
                       time.time() - begin_time), file=sys.stderr)
                train_time = time.time()
                report_loss = display_samples = 0.
            # perform validation
            if train_iter % valid_niter == 0:
                print('epoch %d, iter %d, cum. loss %.2f, cum. examples %d' %
                      (epoch, train_iter, cum_loss / total_samples, total_samples), file=sys.stderr)
                cum_loss = total_samples = 0.
                print('begin validation ...', file=sys.stderr)
                valid_loss = validation(model, df_val, criterion, bert_size=bert_size)
                print('validation: iter %d, loss %f' % (train_iter, valid_loss), file=sys.stderr)
                # scheduler.step(valid_loss)
                improved_loss = len(valid_loss_hist) == 0 or valid_loss < min(valid_loss_hist)
                valid_loss_hist.append(valid_loss)
                if improved_loss:
                    patience = 0
                    # parentheses are needed because % binds tighter than +
                    print('save currently the best model to [%s]' % (args['--model']+'_model.bin'), file=sys.stderr)
                    model.save(args['--model']+'_model.bin')
                    # also save the optimizers' state
                    torch.save(optimizer.state_dict(), args['--model'] + '.optim')
                else:  # if valid loss did not improve
                    patience += 1
                    print('hit patience %d out of %d' % (patience, int(args['--patience'])), file=sys.stderr)
                    if patience >= int(args['--patience']):
                        num_restarts += 1
                        print('hit #%d restart out of max %d restarts' % (num_restarts, int(args['--max-num-trial'])), file=sys.stderr)
                        if num_restarts >= int(args['--max-num-trial']):
                            print('early termination!', file=sys.stderr)
                            exit(0)
                        # decay lr, and restore from previously best checkpoint;
                        # each group is decayed separately so the BERT and non-BERT rates stay distinct
                        new_lrs = [group['lr'] * float(args['--lr-decay']) for group in optimizer.param_groups]
                        print('load previously best model and decay learning rates to %s' % new_lrs, file=sys.stderr)
                        # load model from the same path it was saved to
                        params = torch.load(args['--model']+'_model.bin', map_location=lambda storage, loc: storage)
                        model.load_state_dict(params['state_dict'])
                        print('restore parameters of the optimizers', file=sys.stderr)
                        optimizer.load_state_dict(torch.load(args['--model'] + '.optim'))
                        # set new lr
                        for param_group, lr in zip(optimizer.param_groups, new_lrs):
                            param_group['lr'] = lr
                        # reset patience
                        patience = 0
        if epoch == int(args['--max-epoch']):
            print('reached maximum number of epochs!', file=sys.stderr)
            exit(0)
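The patience and learning-rate-decay logic of the loop above can be isolated in a few lines. The sketch below is a simplified simulation under stated assumptions: `simulate_schedule` is a hypothetical helper, it tracks a single learning rate, and it leaves out the checkpoint saving and restoring that the real loop performs.

```python
def simulate_schedule(valid_losses, lr, patience_limit=5, lr_decay=0.5, max_trials=5):
    """Replay a sequence of validation losses through the patience logic.

    Returns (learning rate after each validation step, whether training
    was stopped early).
    """
    best, patience, trials = float("inf"), 0, 0
    lr_history = []
    for loss in valid_losses:
        if loss < best:
            # improvement: remember the best loss and reset patience
            best, patience = loss, 0
        else:
            patience += 1
            if patience >= patience_limit:
                trials += 1
                if trials >= max_trials:
                    return lr_history, True  # early termination
                lr *= lr_decay
                patience = 0
        lr_history.append(lr)
    return lr_history, False


losses = [1.0, 0.9, 0.95, 0.96, 0.97]
history, stopped = simulate_schedule(losses, lr=2e-5, patience_limit=2)
print(history, stopped)
```

In this toy run the loss stops improving after the second validation, so with a patience of 2 the learning rate is halved at the fourth validation step.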
We examine the accuracy and the Matthews correlation coefficient (MCC) of all the models. Both BERT models outperform the baseline model with GloVe embedding, as expected. The LSTM BERT model, however, has a slightly lower accuracy and MCC score than the base BERT sequence classification model.
Model | Accuracy | Matthews Correlation Coefficient |
---|---|---|
Baseline | 0.6323 | 0.5674 |
Base BERT | 0.6948 | 0.6401 |
LSTM BERT | 0.6853 | 0.6311 |
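For reference, both metrics can be computed from a confusion matrix. The sketch below uses the standard multiclass generalization of the Matthews correlation coefficient. It is an illustration on a made-up 2x2 matrix, not the evaluation code used to produce the table, and `accuracy_and_mcc` is a hypothetical helper.

```python
import math


def accuracy_and_mcc(confusion):
    """Compute accuracy and multiclass MCC.

    `confusion[i][j]` is the number of samples with true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]                       # row sums
    pred_counts = [sum(confusion[r][k] for r in range(n)) for k in range(n)]  # column sums
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt((total ** 2 - sum(p * p for p in pred_counts)) *
                            (total ** 2 - sum(t * t for t in true_counts)))
    mcc = numerator / denominator if denominator else 0.0
    return correct / total, mcc


toy_confusion = [[2, 1],
                 [1, 2]]  # made-up counts
acc, mcc = accuracy_and_mcc(toy_confusion)
print(round(acc, 4), round(mcc, 4))
```

Unlike accuracy, the MCC takes the row and column totals into account, which makes it more informative when the classes are imbalanced, as they are in this dataset.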
We give the confusion matrices below. All models have trouble properly classifying the "not related or not informative" class. The accuracy for that class is around 50%.
Baseline Model Confusion Matrix
BERT Sequence Classification Confusion Matrix
LSTM + BERT Confusion Matrix