Build a BiLSTM NER Model with PyTorch and GloVe

Introduction to Named Entity Recognition (NER)

Named Entity Recognition is a fundamental task in Natural Language Processing (NLP) that identifies and classifies named entities in text, such as person names, organizations, locations, dates, and more. NER is widely used in applications like chatbots, search engines, and content recommendation systems. In this tutorial, we'll build a Bidirectional LSTM (BiLSTM) model for NER using PyTorch, inspired by the CoNLL-2003 dataset. This approach is similar to what you might encounter in a CSCI544 homework assignment. By the end, you'll understand the architecture, training process, and how to leverage pre-trained word embeddings like GloVe.

NER has become increasingly important in modern AI systems. For example, when you ask a virtual assistant about the latest NBA scores or a trending movie, NER helps extract the relevant entities from your query. Similarly, in finance, NER can identify company names and stock symbols in news articles. This tutorial will give you hands-on experience with a core deep learning technique used in these applications.

Understanding the Dataset: CoNLL-2003 Format

The CoNLL-2003 dataset is a standard benchmark for NER. The original release has four columns per token (word, part-of-speech tag, chunk tag, and NER tag); this tutorial uses a simplified version in which each line contains three fields separated by whitespace: the word index in the sentence, the word itself, and the NER tag (e.g., B-PER, I-ORG, O). Sentences are separated by blank lines. Here's an example:

1 U.N. B-ORG
2 officials O
3 in O
4 New B-LOC
5 York I-LOC
6 . O

This format is simple but powerful. For our PyTorch model, we'll need to convert words to indices and tags to numeric labels. We'll also handle unknown words and lowercase/uppercase variations carefully, as capitalization is important for NER (e.g., 'Apple' vs 'apple').
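
The conversion described above could be sketched as follows. This is a minimal illustration, not a fixed recipe: the helper names (read_conll, build_vocab, encode) and the <pad>/<unk> tokens are assumptions introduced here, and index 0 is reserved for padding so that it lines up with the padding_idx=0 used later. It also assumes labeled data with all three columns present.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def read_conll(lines):
    """Parse 'idx word tag' lines into a list of (words, tags) sentences."""
    sentences, words, tags = [], [], []
    for line in lines:
        parts = line.split()
        if not parts:  # a blank line closes the current sentence
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        _, word, tag = parts  # the index column is not needed for training
        words.append(word)
        tags.append(tag)
    if words:  # the last sentence may not be followed by a blank line
        sentences.append((words, tags))
    return sentences


def build_vocab(sentences, min_freq=1):
    """Map words and tags to integer ids; id 0 is reserved for padding."""
    counts = Counter(w for words, _ in sentences for w in words)
    word2idx = {PAD: 0, UNK: 1}
    for word, freq in counts.items():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    tag2idx = {PAD: 0}
    for _, tags in sentences:
        for tag in tags:
            tag2idx.setdefault(tag, len(tag2idx))
    return word2idx, tag2idx


def encode(words, word2idx):
    """Replace each word by its id, falling back to the <unk> id."""
    return [word2idx.get(w, word2idx[UNK]) for w in words]


sample = """1 U.N. B-ORG
2 officials O
3 in O
4 New B-LOC
5 York I-LOC
6 . O
""".splitlines()

sentences = read_conll(sample)
word2idx, tag2idx = build_vocab(sentences)
print(encode(["New", "York", "Times"], word2idx))  # [5, 6, 1]
```

Note that the vocabulary keeps the original casing, so 'Apple' and 'apple' would receive different ids; unseen words such as "Times" fall back to the <unk> id.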

Task 1: Simple Bidirectional LSTM Model

Our first model is a straightforward BiLSTM with an embedding layer, a bidirectional LSTM, a linear layer with ELU activation, and a classifier. The architecture is: Embedding → BiLSTM → Linear → ELU → Classifier. We'll use the following hyperparameters:

Embedding dimension: 100
Number of LSTM layers: 1
LSTM hidden dimension: 256
LSTM Dropout: 0.33
Linear output dimension: 128

We'll train with the SGD optimizer and tune the batch size, learning rate, and learning rate schedule. A reasonable F1 score on dev data is around 77%.

Implementing the BiLSTM in PyTorch

Let's start by defining the model class. We'll use PyTorch's nn.Module and nn.LSTM with bidirectional=True.

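import torch
import torch.nn as nn

class BiLSTMNER(nn.Module):
    def __init__(self, vocab_size, tag_size, embedding_dim=100, hidden_dim=256, dropout=0.33, linear_dim=128):
        super().__init__()
        # Index 0 is reserved for padding and keeps a zero vector.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        # nn.LSTM's own dropout argument only acts between stacked layers, so with
        # num_layers=1 it has no effect (and triggers a warning). Apply dropout to
        # the LSTM output with a separate nn.Dropout instead.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_dim * 2, linear_dim)  # *2 for bidirectional
        self.elu = nn.ELU()
        self.classifier = nn.Linear(linear_dim, tag_size)

    def forward(self, x):
        # x: (batch, seq_len) word indices
        emb = self.embedding(x)  # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(emb)  # (batch, seq_len, 2 * hidden_dim)
        lstm_out = self.dropout(lstm_out)
        lin = self.linear(lstm_out)
        act = self.elu(lin)
        logits = self.classifier(act)  # (batch, seq_len, tag_size)
        return logits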

This model takes a batch of tokenized sentences (word indices) and outputs logits for each token. We'll use cross-entropy loss and ignore padding indices.
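
Because sentences differ in length, each batch has to be padded into a rectangular tensor before it reaches the model. One way to do this is sketched below with torch.nn.utils.rnn.pad_sequence. The function name collate_batch and the extra "lengths" entry are assumptions made for illustration; the "input_ids" and "labels" keys match the training loop in the next section, and id 0 is assumed to be the padding id for both words and tags.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_batch(examples):
    """Pad a list of (word_ids, tag_ids) pairs to the longest sentence.

    Assumes id 0 is reserved for padding in both vocabularies.
    """
    inputs = [torch.tensor(word_ids, dtype=torch.long) for word_ids, _ in examples]
    labels = [torch.tensor(tag_ids, dtype=torch.long) for _, tag_ids in examples]
    return {
        "input_ids": pad_sequence(inputs, batch_first=True, padding_value=0),
        "labels": pad_sequence(labels, batch_first=True, padding_value=0),
        "lengths": torch.tensor([len(x) for x in inputs]),  # true sentence lengths
    }


batch = collate_batch([([5, 6, 7], [3, 4, 2]), ([2, 3], [1, 2])])
print(batch["input_ids"].shape)  # torch.Size([2, 3])
```

A function like this can be passed to torch.utils.data.DataLoader through its collate_fn argument.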

Training Loop and Hyperparameter Tuning

We'll train for a fixed number of epochs (e.g., 10) with SGD. Learning rate scheduling like StepLR can help. Batch size of 32 or 64 works well. Here's a snippet:

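model = BiLSTMNER(vocab_size, tag_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Multiply the learning rate by 0.1 every 3 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
# ignore_index=0 skips padded positions; this assumes tag id 0 is reserved for padding.
criterion = nn.CrossEntropyLoss(ignore_index=0)

for epoch in range(10):
    model.train()  # training mode (matters for dropout)
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch['input_ids'])  # (batch, seq_len, tag_size)
        # Flatten so that every token is one classification example.
        loss = criterion(logits.view(-1, tag_size), batch['labels'].view(-1))
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the schedule once per epoch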

After training, evaluate on dev data using the official CoNLL-2003 evaluation script (conll03eval.pl). You'll need to format predictions as: idx word gold pred.
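
Before predictions can be written out, the logits have to be turned back into tag strings. The sketch below takes the highest-scoring tag per token and drops padded positions. The function name decode_predictions and the idx2tag mapping are illustrative assumptions, as is the convention that tag id 0 is the padding label (which is why that column is excluded from the argmax).

```python
import torch


def decode_predictions(logits, lengths, idx2tag):
    """Turn (batch, seq_len, num_tags) logits into lists of tag strings.

    Assumes tag id 0 is the padding label, so it is never predicted.
    """
    # Skip column 0 (padding) and shift the indices back by one.
    pred_ids = logits[..., 1:].argmax(dim=-1) + 1
    return [
        [idx2tag[i] for i in row[:n].tolist()]  # keep only the real tokens
        for row, n in zip(pred_ids, lengths)
    ]


idx2tag = {0: "<pad>", 1: "O", 2: "B-LOC"}
logits = torch.tensor([[[9.0, 1.0, 0.0], [0.0, 0.5, 2.0]],
                       [[0.0, 3.0, 1.0], [5.0, 0.0, 0.0]]])
decoded = decode_predictions(logits, [2, 1], idx2tag)
print(decoded)  # [['O', 'B-LOC'], ['O']]
```

At evaluation time, this would typically run under model.eval() and torch.no_grad() so that dropout is disabled and no gradients are tracked.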

Task 2: Using GloVe Word Embeddings

Pre-trained GloVe embeddings (100d) can significantly improve performance. However, the glove.6B vectors are uncased (every entry in their vocabulary is lowercase), while NER benefits from case information. Here, we'll initialize the embedding layer with GloVe vectors and fine-tune them during training. To preserve case, we keep the original word casing in our vocabulary and look each word up in GloVe by its lowercase form, so 'Apple' and 'apple' start from the same vector but have separate rows that can diverge during training. Two common complements are to add a binary feature indicating capitalization, or to concatenate the GloVe embeddings with character-level features.
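
As a small illustration of the capitalization idea, the sketch below pairs the lowercase lookup key with a binary casing flag. Both helper names (glove_key, casing_feature) are hypothetical, and "starts with an uppercase letter" is only one possible definition of the flag.

```python
def casing_feature(word):
    """Return 1 if the word starts with an uppercase letter, else 0."""
    return int(word[:1].isupper())


def glove_key(word):
    """Lowercase form used to look the word up in an uncased GloVe file."""
    return word.lower()


words = ["Apple", "apple", "U.N.", "2003"]
features = [(glove_key(w), casing_feature(w)) for w in words]
print(features)  # [('apple', 1), ('apple', 0), ('u.n.', 1), ('2003', 0)]
```

If such a flag is concatenated to each word embedding, the LSTM input size grows by one.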

Let's load GloVe and create an embedding matrix:

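import numpy as np
import torch

def load_glove_embeddings(glove_path, word2idx, embedding_dim=100):
    # Words missing from GloVe keep a small random vector.
    embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), embedding_dim))
    embeddings[0] = 0.0  # index 0 is the padding row
    # glove.6B is lowercased, so group the cased vocabulary by lowercase form.
    lower2idx = {}
    for word, idx in word2idx.items():
        lower2idx.setdefault(word.lower(), []).append(idx)
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            word = values[0]
            if word in lower2idx:
                vector = np.asarray(values[1:], dtype='float32')
                # Every cased variant starts from the same GloVe vector.
                embeddings[lower2idx[word]] = vector
    return torch.tensor(embeddings, dtype=torch.float32)

embedding_matrix = load_glove_embeddings('glove.6B.100d.txt', word2idx)
with torch.no_grad():
    model.embedding.weight.copy_(embedding_matrix)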

With GloVe, the F1 score on dev should reach around 88%. This improvement comes from the rich semantic information in pre-trained embeddings.

Bonus: LSTM-CNN Model for Character-Level Features

To capture character-level patterns (e.g., prefixes, suffixes, capitalization), we can add a CNN module before the LSTM. The character embedding dimension is 30. We'll use 1D convolutions with kernel sizes like 3, 4, 5 to extract n-gram features.

class CharCNN(nn.Module):
    def __init__(self, char_vocab_size, char_emb_dim=30, kernel_sizes=(3, 4, 5), output_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
        # One convolution per kernel size, each with output_dim filters.
        # padding=k - 1 keeps words shorter than the kernel from raising an error.
        self.convs = nn.ModuleList([
            nn.Conv1d(char_emb_dim, output_dim, k, padding=k - 1) for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # x: (batch, seq_len, word_len) character indices
        batch, seq, word_len = x.shape
        x = x.reshape(batch * seq, word_len)  # (batch*seq, word_len)
        char_emb = self.char_emb(x)  # (batch*seq, word_len, char_emb_dim)
        char_emb = char_emb.permute(0, 2, 1)  # (batch*seq, char_emb_dim, word_len)
        # Max-pool each feature map over the character positions.
        conv_outputs = [torch.max(torch.relu(conv(char_emb)), dim=2)[0] for conv in self.convs]
        out = torch.cat(conv_outputs, dim=1)  # (batch*seq, output_dim * len(kernel_sizes))
        out = self.dropout(out)
        out = out.view(batch, seq, -1)
        return out

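Then concatenate the character-level features with the word embeddings before feeding them into the LSTM. The LSTM input size must grow accordingly: with the defaults above, it becomes 100 + 3 × 100 = 400. This can boost F1 further, especially on test data.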
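
The dimension bookkeeping for this concatenation can be checked with a short sketch. Random tensors stand in for the outputs of the word embedding layer and the character CNN, so this only illustrates the shapes under the default sizes used in this tutorial, not a trained model.

```python
import torch
import torch.nn as nn

batch, seq_len = 2, 5
word_dim, char_dim, hidden_dim = 100, 300, 256  # 300 = 100 filters x 3 kernel sizes

word_emb = torch.randn(batch, seq_len, word_dim)   # stand-in for the embedding layer output
char_feat = torch.randn(batch, seq_len, char_dim)  # stand-in for the character CNN output

# Join the two feature vectors of every token along the last dimension.
combined = torch.cat([word_emb, char_feat], dim=-1)

# The LSTM input size must equal the concatenated feature size.
lstm = nn.LSTM(word_dim + char_dim, hidden_dim, bidirectional=True, batch_first=True)
lstm_out, _ = lstm(combined)
print(combined.shape, lstm_out.shape)
```

The rest of the model (linear layer, ELU, classifier) is unaffected, because the BiLSTM output size still depends only on the hidden dimension.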

Evaluation and Submission

Use the official conll03eval.pl script to compute precision, recall, and F1. Format your prediction file as:

1 U.N. B-ORG B-ORG
2 officials O O
3 in O O
4 New B-LOC B-LOC
5 York I-LOC I-LOC
6 . O O

Then run: perl conll03eval.pl < pred_file
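
A prediction file in this format could be produced with a helper like the one below. The function name write_predictions and the file name dev.out are placeholders chosen for illustration; the sketch assumes gold and predicted tags are already available as lists of strings per sentence.

```python
import os
import tempfile


def write_predictions(path, sentences, predictions):
    """Write 'idx word gold pred' lines, with a blank line after each sentence."""
    with open(path, "w", encoding="utf-8") as f:
        for (words, gold_tags), pred_tags in zip(sentences, predictions):
            rows = zip(words, gold_tags, pred_tags)
            for i, (word, gold, pred) in enumerate(rows, start=1):
                f.write(f"{i} {word} {gold} {pred}\n")
            f.write("\n")  # sentence separator


sentences = [(["New", "York", "."], ["B-LOC", "I-LOC", "O"])]
predictions = [["B-LOC", "O", "O"]]
out_path = os.path.join(tempfile.mkdtemp(), "dev.out")
write_predictions(out_path, sentences, predictions)

with open(out_path, encoding="utf-8") as f:
    content = f.read()
print(content)
```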

For submission, you'll need to save trained models (blstm1.pt, blstm2.pt), prediction files for dev and test, your Python code, and a README with instructions.

Conclusion

In this tutorial, we built a Bidirectional LSTM for NER from scratch, using PyTorch and GloVe embeddings. We covered data preprocessing, model architecture, training, and evaluation. This approach is not only relevant for academic assignments but also for real-world NLP applications like extracting entities from social media posts about trending events or financial news. With the bonus CNN module, you can further improve performance. Keep experimenting with hyperparameters and stay updated with the latest NLP trends!