Autocorrect with Python

Have you ever wondered how the autocorrect feature on a smartphone keyboard works? Almost every smartphone, regardless of price, now ships with autocorrect in its keyboard. In this article, I will take you through how to build a simple autocorrect with Python.

In machine learning terms, autocorrect is a natural language processing task. As the name suggests, it corrects spelling mistakes and other errors as you type.

Here I am using the text of a book as the dataset, stored in a file named Data.txt.

First, we import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

Now we read the Data.txt file, convert it to lowercase, and split it into words:

In [2]:
words = []
with open('Data.txt', 'r') as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # Raw string so that \w is passed to the regex engine unchanged
    words = re.findall(r'\w+', file_name_data)
# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")
The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.
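
To see what the `\w+` pattern keeps and drops, here is a minimal sketch on a short sample sentence (the sentence and the names `sample` and `tokens` are illustrative and not part of the original code). Note that punctuation is discarded, so a contraction such as "don't" is split into two tokens:

```python
import re

# Illustrative sample text, not taken from Data.txt
sample = "Call me Ishmael. Don't panic!"

# Same steps as above: lowercase, then keep runs of word characters
tokens = re.findall(r"\w+", sample.lower())
print(tokens)  # ['call', 'me', 'ishmael', 'don', 't', 'panic']
```

If contractions matter for your use case, a different pattern may be a better fit; the article's pipeline keeps the simple `\w+` split.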

In the code above, we built a list of words. Next we need the frequency of each word, which is easy to compute with the Counter class from Python's collections module:

In [3]:
# Counter maps each word to the number of times it occurs
word_freq_dict = Counter(words)
print(word_freq_dict.most_common()[0:10])
[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]

Relative Frequency of words

Now we want the probability of occurrence of each word, which equals its relative frequency (its count divided by the total number of words):

In [4]:
probs = {}
Total = sum(word_freq_dict.values())
for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k]/Total
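
As a quick sanity check of the same calculation, here is a small sketch on a four-word toy list (the list and the `toy_` names are made up for illustration). The relative frequencies should sum to 1:

```python
from collections import Counter

# Made-up toy corpus, only for illustration
toy_words = ["the", "whale", "the", "sea"]

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Relative frequency = count of the word / total number of words
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}
print(toy_probs)  # {'the': 0.5, 'whale': 0.25, 'sea': 0.25}
```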

Finding Similar Words

Now we score every vocabulary word against the input using Jaccard similarity computed on 2-grams (pairs of adjacent characters, i.e. q = 2). We then return the 5 most similar words, ordered by similarity and then by probability:

In [5]:
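def my_autocorrect(input_word):
    """Return the 5 vocabulary words most similar to input_word."""
    input_word = input_word.lower()
    if input_word in V:
        return 'Your word seems to be correct'
    # Jaccard similarity on character 2-grams between the input and every word
    jaccard = textdistance.Jaccard(qval=2)
    similarities = [1 - jaccard.distance(v, input_word) for v in word_freq_dict.keys()]
    # probs has the same key order as word_freq_dict, so the rows line up
    df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
    df = df.rename(columns={'index': 'Word', 0: 'Prob'})
    df['Similarity'] = similarities
    # Rank by similarity first, then by word probability
    output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
    return output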

Let's check the correctly spelled word 'nevertheless', which is already in the vocabulary:

In [6]:
my_autocorrect('nevertheless')
Out[6]:
'Your word seems to be correct'

Now we check the misspelled word 'nevrtless'. The function returns the 5 most similar words, ordered by similarity and probability:

In [7]:
my_autocorrect('nevrtless')
Out[7]:
               Word      Prob  Similarity
2571   nevertheless  0.000225    0.461538
10481     heartless  0.000018    0.454545
13600        nestle  0.000004    0.444444
16146 heartlessness  0.000004    0.428571
12513    subtleness  0.000004    0.416667
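
The Similarity values above can be reproduced by hand. The following is a minimal pure-Python sketch of Jaccard similarity on 2-grams, written for illustration rather than taken from textdistance; the helper names `bigram_counts` and `jaccard_similarity` are hypothetical, and treating the 2-grams as multisets and returning 0.0 for words too short to have any 2-gram are assumptions of this sketch:

```python
from collections import Counter

def bigram_counts(word):
    """Count the 2-grams (adjacent character pairs) in a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def jaccard_similarity(a, b):
    """Jaccard similarity between the 2-gram multisets of a and b."""
    grams_a, grams_b = bigram_counts(a), bigram_counts(b)
    shared = sum((grams_a & grams_b).values())  # size of the intersection
    total = sum((grams_a | grams_b).values())   # size of the union
    # Assumption of this sketch: no 2-grams at all means similarity 0.0
    return shared / total if total else 0.0

print(round(jaccard_similarity("nevrtless", "nevertheless"), 6))  # 0.461538
print(round(jaccard_similarity("nevrtless", "heartless"), 6))     # 0.454545
```

For 'nevrtless' and 'nevertheless', six 2-grams are shared (ne, ev, rt, le, es, ss) out of 13 distinct ones in total, which gives 6/13 ≈ 0.4615 and matches the first row of the table.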