Build Autocorrect in Python


Have you ever wondered how the autocorrect feature on a smartphone keyboard works? Almost every smartphone, regardless of price, ships with autocorrect in its keyboard today. In this article, I will take you through how to build a simple autocorrect in Python.

In the context of Machine Learning, autocorrect is an application of Natural Language Processing. As the name suggests, it is designed to correct spelling mistakes and other errors as you type.

Here I am using the text of a book, saved as Data.txt, as the dataset for building the autocorrect.

First, we import the necessary libraries:

In [1]:

import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

Now we read the Data.txt file and split its text into words:

In [2]:

with open('Data.txt', 'r') as f:
    # Read the whole book and lowercase it so 'The' and 'the' count as one word
    file_name_data = f.read().lower()
# Raw string avoids an invalid-escape warning; \w+ matches runs of letters, digits and underscores
words = re.findall(r'\w+', file_name_data)
# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")

The first ten words in the text are:
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.
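
To see what the `\w+` pattern does without needing Data.txt, here is a minimal sketch on a made-up sentence. The helper name `tokenize` and the sample string are assumptions for illustration only, not part of the original notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Hypothetical sample sentence, used only for illustration
sample = "The whale! Don't lose the whale."
tokens = tokenize(sample)
vocabulary = set(tokens)

print(tokens)           # ['the', 'whale', 'don', 't', 'lose', 'the', 'whale']
print(len(vocabulary))  # 5
```

Note that the apostrophe is not a word character, so "Don't" is split into 'don' and 't'. That is good enough for this article, but it is worth keeping in mind if contractions matter for your text.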

We now have a list of words. Next we need the frequency of each word, which is easy to compute with the Counter class from Python's collections module:

In [3]:

# Count how many times each word occurs in the text
word_freq_dict = Counter(words)
# Show the ten most frequent words with their counts
print(word_freq_dict.most_common(10))

[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]

Relative Frequency of Words

Now we want the probability of occurrence of each word, which is its relative frequency: the word's count divided by the total number of words in the text.

In [4]:

probs = {}
Total = sum(word_freq_dict.values())
for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k]/Total
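
As a quick sanity check of the idea, the same calculation can be run on a tiny hand-made word list. The names `toy_words`, `toy_counts` and `toy_probs` are assumptions introduced for this sketch; the point is only that the relative frequencies are counts divided by the total and that they sum to 1.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration
toy_words = ["the", "whale", "the", "sea", "the", "whale"]

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Relative frequency = count of the word / total number of words
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_probs["the"])         # 0.5
print(sum(toy_probs.values()))  # approximately 1.0
```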

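Finding Similar Words

Now we score every word in the vocabulary against the input word by Jaccard similarity, computed on the character 2-grams of the two words (the qval=2 argument). The similarity is 1 minus the Jaccard distance. We then return the 5 most similar words, ordered by similarity and, for ties, by probability: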

In [5]:

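def my_autocorrect(input_word):
    input_word = input_word.lower()
    # A word that is already in the vocabulary is treated as correctly spelled
    if input_word in V:
        return 'Your word seems to be correct'
    else:
        # Jaccard similarity on 2-grams between the input and every vocabulary word
        similarities = [1-(textdistance.Jaccard(qval=2).distance(v,input_word)) for v in word_freq_dict.keys()]
        # probs was built from word_freq_dict, so its rows line up with similarities
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index':'Word', 0:'Prob'})
        df['Similarity'] = similarities
        # Most similar first; ties are broken by the higher probability
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return output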
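
To make the scoring less of a black box, here is a minimal pure-Python sketch of a Jaccard similarity on character 2-grams, followed by the same kind of ranking. It is not the textdistance implementation: the function names `bigrams`, `jaccard_bigram_similarity` and `suggest` are assumptions for illustration, the 2-grams are treated as a multiset, and the handling of words too short to have any 2-gram is an arbitrary choice. The three probabilities are copied from the output table for 'nevrtless' in this article.

```python
from collections import Counter

def bigrams(word):
    """Return the overlapping character 2-grams of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def jaccard_bigram_similarity(a, b):
    """Size of the 2-gram intersection divided by the size of the 2-gram union."""
    grams_a, grams_b = Counter(bigrams(a)), Counter(bigrams(b))
    intersection = sum((grams_a & grams_b).values())
    union = sum((grams_a | grams_b).values())
    if union == 0:
        # Assumption: neither word has a 2-gram, so fall back to exact match
        return 1.0 if a == b else 0.0
    return intersection / union

def suggest(word, word_probs, n=5):
    """Rank candidates by similarity first, then by probability."""
    ranked = sorted(
        word_probs,
        key=lambda w: (jaccard_bigram_similarity(w, word), word_probs[w]),
        reverse=True,
    )
    return ranked[:n]

# Probabilities taken from the article's output table, for illustration
toy_probs = {"nestle": 0.000004, "heartless": 0.000018, "nevertheless": 0.000225}

print(jaccard_bigram_similarity("nevertheless", "nevrtless"))  # 6/13
print(suggest("nevrtless", toy_probs, n=2))  # ['nevertheless', 'heartless']
```

For 'nevertheless' and 'nevrtless' the two words share 6 of 13 distinct 2-grams, which gives 6/13 ≈ 0.4615 and agrees with the Similarity column in the article's output.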

Let's try the function on the word 'nevertheless', which is already in the vocabulary:

In [6]:

my_autocorrect('nevertheless')

Out[6]:

'Your word seems to be correct'

Now we check the misspelled word 'nevrtless'. The function returns the 5 most similar words, ordered by similarity and probability:

In [7]:

my_autocorrect('nevrtless')

Out[7]:

	Word	Prob	Similarity
2571	nevertheless	0.000225	0.461538
10481	heartless	0.000018	0.454545
13600	nestle	0.000004	0.444444
16146	heartlessness	0.000004	0.428571
12513	subtleness	0.000004	0.416667
