Named Entity Recognition with Python

Named entity recognition (NER) is an essential subtask of natural language processing. It aims to recognize and classify words and multi-word phrases that refer to specific things, e.g. people, organizations, places, and dates.

Named entity recognition comes from information extraction (IE). IE's job is to transform unstructured data into structured information. In Named Entity Recognition, the unstructured data is text written in natural language, and we want to extract the important information into a well-defined format, e.g. a relational database.

The Named Entity Recognition task attempts to correctly detect and classify text expressions into a set of predefined classes. Classes can vary, but very often classes like people (PER), organizations (ORG) or places (LOC) are used.
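To make the idea of classifying expressions into predefined classes concrete, here is a minimal sketch. It assumes a simplified labeling in which every token carries one of the classes above (or "O" for "no entity") and consecutive tokens with the same class form one entity. The sentence, the labels, and the function name extract_entities are made up for illustration:

```python
from itertools import groupby

def extract_entities(tokens, tags):
    """Group consecutive tokens that share the same non-"O" tag into one entity."""
    entities = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(tok for tok, _ in group), tag))
    return entities

# Hypothetical hand-labeled sentence
tokens = ["Steve", "Jobs", "worked", "at", "Apple", "in", "California"]
tags = ["PER", "PER", "O", "O", "ORG", "O", "LOC"]

print(extract_entities(tokens, tags))
# [('Steve Jobs', 'PER'), ('Apple', 'ORG'), ('California', 'LOC')]
```

Note that this simplified grouping cannot separate two different entities of the same class that stand directly next to each other.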

I will start this task by importing the necessary Python libraries and the dataset:

In [1]:
import pandas as pd
data = pd.read_csv('ner_dataset.csv', encoding= 'unicode_escape')
data.head()
Out[1]:
Sentence # Word POS Tag
0 Sentence: 1 Thousands NNS O
1 NaN of IN O
2 NaN demonstrators NNS O
3 NaN have VBP O
4 NaN marched VBN O

I will train a neural network for the Named Entity Recognition (NER) task, so we need to prepare the data so that it can be fed into a neural network. I'll start by building the mappings between tokens, tags, and the integer indices that the network works with:

In [2]:
from itertools import chain
def get_dict_map(data, token_or_tag):
    tok2idx = {}
    idx2tok = {}
    
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    
    idx2tok = {idx:tok for  idx, tok in enumerate(vocab)}
    tok2idx = {tok:idx for  idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok


token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
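The mapping step can be tried on a handful of words without the dataset. The sketch below is a variation, not the function used above: it sorts the vocabulary before numbering it, because list(set(...)) can order strings differently from one Python process to the next, which makes the indices hard to reproduce. The name build_vocab_maps and the sample words are assumptions for illustration:

```python
def build_vocab_maps(items):
    """Return (item -> index, index -> item) dictionaries over the unique items."""
    vocab = sorted(set(items))  # sorting makes the numbering reproducible
    item2idx = {item: idx for idx, item in enumerate(vocab)}
    idx2item = {idx: item for item, idx in item2idx.items()}
    return item2idx, idx2item

words = ["Thousands", "of", "demonstrators", "have", "marched", "of"]
word2idx, idx2word = build_vocab_maps(words)

# Encode the sentence as integers and decode it again
encoded = [word2idx[w] for w in words]
decoded = [idx2word[i] for i in encoded]

print(encoded)  # [0, 4, 1, 2, 3, 4]
print(decoded == words)  # True
```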

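Now, I'm going to transform the columns in the data to extract the sequential data for our neural network. The sentence identifier is only set on the first word of each sentence, so I forward-fill it and then collect each sentence's words and tags into lists: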

In [3]:
# Forward-fill the sentence identifier, which is set only on the first word of each sentence
data_fillna = data.ffill(axis=0)
# Group by sentence and collect each column into a list
data_group = data_fillna.groupby(
['Sentence #'],as_index=False
)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))
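To see what the forward-fill and the grouping do, here is a small sketch on a hand-made frame with the same layout as the dataset. The rows are invented for illustration, and only the Word and Tag columns are kept to make it short:

```python
import pandas as pd

# Toy frame: the sentence id appears only on the first word of each sentence
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "marched"],
    "Tag": ["O", "O", "O", "O", "O"],
})

# Copy each sentence id down onto the rows that follow it
toy["Sentence #"] = toy["Sentence #"].ffill()

# One row per sentence, with the words and tags collected into lists
grouped = toy.groupby("Sentence #", as_index=False)[["Word", "Tag"]].agg(list)

print(grouped["Word"].tolist())
# [['Thousands', 'of', 'demonstrators'], ['They', 'marched']]
```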

I will now split the data into training, validation, and test sets. LSTM layers only accept sequences of the same length, so I am going to create a function that first pads every sentence (stored as a list of integer indices) to the length of the longest sentence and then splits the data:

In [4]:
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
def get_pad_train_test_val(data_group, data):

    #get max token and tag length
    n_token = len(list(set(data['Word'].to_list())))
    n_tag = len(list(set(data['Tag'].to_list())))

    #Pad tokens (X var)    
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])
    # Pad with n_token, an index no real word uses (word indices run from 0 to n_token - 1)
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, dtype='int32', padding='post', value= n_token)

    #Pad Tags (y var) and convert it into one hot encoding
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen, dtype='int32', padding='post', value= tag2idx["O"])
    n_tags = len(tag2idx)
    pad_tags = [to_categorical(i, num_classes=n_tags) for i in pad_tags]
    
    #Split train, test and validation set
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2020)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_,tags_,test_size = 0.25,train_size =0.75, random_state=2020)

    print(
        'train_tokens length:', len(train_tokens),
        '\ntrain_tags length:', len(train_tags),
        '\ntest_tokens length:', len(test_tokens),
        '\ntest_tags:', len(test_tags),
        '\nval_tokens:', len(val_tokens),
        '\nval_tags:', len(val_tags),
    )
    
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group, data)
Using TensorFlow backend.
train_tokens length: 32372 
train_tags length: 32372 
test_tokens length: 4796 
test_tags: 4796 
val_tokens: 10791 
val_tags: 10791
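The padding and one-hot steps are easier to follow on two tiny sentences. The sketch below reimplements them in plain NumPy rather than calling Keras, so it is only an approximation of pad_sequences with padding='post' and of to_categorical. The helper names pad_post and one_hot and all index values are assumptions for illustration:

```python
import numpy as np

def pad_post(sequences, maxlen, value):
    """Right-pad each sequence with `value` up to `maxlen` (keeps the first maxlen items)."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq[:maxlen]
    return padded

def one_hot(indices, num_classes):
    """Turn an integer array of shape (...) into one-hot vectors of shape (..., num_classes)."""
    return np.eye(num_classes, dtype="float32")[indices]

sentences = [[3, 1, 4], [1, 5]]   # word indices, two sentences of different length
tag_seqs = [[0, 2, 0], [1, 0]]    # tag indices for the same sentences
pad_token_idx = 6                 # hypothetical index reserved for padding
o_tag_idx = 0                     # hypothetical index of the "O" tag

X = pad_post(sentences, maxlen=3, value=pad_token_idx)
y = one_hot(pad_post(tag_seqs, maxlen=3, value=o_tag_idx), num_classes=3)

print(X.tolist())  # [[3, 1, 4], [1, 5, 6]]
print(y.shape)     # (2, 3, 3)
```

Every sentence now has the same length, and each tag is a vector with a single 1 at its class index, which is the target format that categorical cross-entropy expects.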

Training a Neural Network for NER

I will now define the neural network architecture of our model. Let's start by importing all the packages we need. Next, I'll set the dimensions that the layers will use: the embedding input size (the vocabulary size plus one), the output size of the embedding and LSTM layers, the maximum sentence length, and the number of tags:

In [5]:
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from tensorflow.keras.utils import plot_model

from numpy.random import seed
seed(1)
tensorflow.random.set_seed(2)

input_dim = len(list(set(data['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group['Word_idx'].tolist()])
n_tags = len(tag2idx)

Now I will create a helper function that builds and compiles the neural network model for the task of recognizing named entities with Python and prints a summary of each of its layers:

In [6]:
def get_bilstm_lstm_model():
    model = Sequential()

    # Add Embedding layer
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))

    # Add bidirectional LSTM
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode = 'concat'))

    # Add LSTM
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

    # Add timeDistributed Layer
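    # softmax gives a probability distribution over the tags for each token,
    # which is what categorical_crossentropy expects
    model.add(TimeDistributed(Dense(n_tags, activation="softmax")))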

    #Optimiser 
    # adam = k.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

Now I will create a function to train our model:

In [7]:
def train_model(X, y, model):
    loss = list()
    for i in range(25):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss
    
results = pd.DataFrame()
model_bilstm_lstm = get_bilstm_lstm_model()
plot_model(model_bilstm_lstm)
results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 104, 64)           2251456   
_________________________________________________________________
bidirectional (Bidirectional (None, 104, 128)          66048     
_________________________________________________________________
lstm_1 (LSTM)                (None, 104, 64)           49408     
_________________________________________________________________
time_distributed (TimeDistri (None, 104, 17)           1105      
=================================================================
Total params: 2,368,017
Trainable params: 2,368,017
Non-trainable params: 0
_________________________________________________________________
26/26 [==============================] - 79s 3s/step - loss: 0.9465 - accuracy: 0.9178 - val_accuracy: 0.9681 - val_loss: 0.4600
26/26 [==============================] - 78s 3s/step - loss: 0.3821 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.3507
26/26 [==============================] - 78s 3s/step - loss: 0.3583 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.3141
26/26 [==============================] - 78s 3s/step - loss: 0.3061 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2781
26/26 [==============================] - 78s 3s/step - loss: 0.2905 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2648
26/26 [==============================] - 78s 3s/step - loss: 0.2787 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2583
26/26 [==============================] - 78s 3s/step - loss: 0.2705 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2547
26/26 [==============================] - 79s 3s/step - loss: 0.3674 - accuracy: 0.9676 - val_accuracy: 0.9670 - val_loss: 0.6538
26/26 [==============================] - 78s 3s/step - loss: 0.4729 - accuracy: 0.9664 - val_accuracy: 0.9682 - val_loss: 0.2367
26/26 [==============================] - 79s 3s/step - loss: 0.2188 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1946
26/26 [==============================] - 78s 3s/step - loss: 0.2012 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1882
26/26 [==============================] - 78s 3s/step - loss: 0.1905 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1954
26/26 [==============================] - 78s 3s/step - loss: 0.1877 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.3138
26/26 [==============================] - 79s 3s/step - loss: 0.2475 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.1898
26/26 [==============================] - 78s 3s/step - loss: 0.1790 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.1622
26/26 [==============================] - 78s 3s/step - loss: 0.1530 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1506
26/26 [==============================] - 79s 3s/step - loss: 0.1464 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.1740
26/26 [==============================] - 80s 3s/step - loss: 0.1813 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.1599
26/26 [==============================] - 79s 3s/step - loss: 0.1380 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.1367
26/26 [==============================] - 78s 3s/step - loss: 0.1274 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1330
26/26 [==============================] - 78s 3s/step - loss: 0.1217 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1282
26/26 [==============================] - 80s 3s/step - loss: 0.1175 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1243
26/26 [==============================] - 78s 3s/step - loss: 0.1129 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1210
26/26 [==============================] - 81s 3s/step - loss: 0.1108 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1199
26/26 [==============================] - 80s 3s/step - loss: 0.1094 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1201
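The parameter counts in the model summary can be reproduced by hand, which is a useful check that the layers are wired as intended. In the sketch below the vocabulary size 35179 is inferred by dividing the embedding's 2,251,456 parameters by the embedding size of 64, and the 17 tags are read off the last output shape. The helper function names are made up for illustration:

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

vocab_size, emb_dim, units, n_tags = 35179, 64, 64, 17

embedding = embedding_params(vocab_size, emb_dim)   # 2251456
bilstm = 2 * lstm_params(emb_dim, units)            # two directions: 66048
lstm = lstm_params(2 * units, units)                # input is the concatenated BiLSTM output: 49408
dense = dense_params(units, n_tags)                 # 1105

total = embedding + bilstm + lstm + dense
print(total)  # 2368017
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.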

Testing the Named Entity Recognition Model

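Now, I will use the spaCy library in Python to see named entity recognition in action. Note that this cell runs spaCy's pretrained en_core_web_sm pipeline, not the BiLSTM model trained above. I will pass in a few lines about myself and let's see what we get after running the code: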

In [9]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp('Hi, My name is Ramesh Kumar \n I am from India \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style = 'ent', jupyter=True)
Hi, My name is Ramesh Kumar PERSON
I am from India GPE
I want to work with Google ORG
Steve Jobs PERSON is My Inspiration
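The entities shown here come from the pretrained pipeline loaded with spacy.load. To inspect the BiLSTM trained earlier instead, its per-token probabilities (for example the result of model_bilstm_lstm.predict(test_tokens)) would have to be mapped back to tag names. The sketch below shows only that decoding step on a hand-written probability array. The three-tag mapping, the numbers, and the function name decode_tags are assumptions for illustration, not output of the trained model:

```python
import numpy as np

def decode_tags(probabilities, idx2tag, lengths):
    """Pick the most probable tag per token and drop the padded positions."""
    best = probabilities.argmax(axis=-1)  # shape: (sentences, tokens)
    return [[idx2tag[int(i)] for i in row[:n]] for row, n in zip(best, lengths)]

# Hypothetical mapping and model output: 1 sentence, 4 time steps, 3 tags
idx2tag_demo = {0: "O", 1: "PER", 2: "ORG"}
probs = np.array([[
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],   # padded position, ignored below
]])

# The real sentence has 3 tokens; the 4th step is padding
decoded = decode_tags(probs, idx2tag_demo, lengths=[3])
print(decoded)  # [['PER', 'PER', 'O']]
```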