Back

Published on July 19th, 2019

Algorithms

# Build your own toy spellchecker

Wrote a simple toy spellchecker using C++ by taking inspiration from Peter Norvig's article.

Spellcheckers and autocorrect, aren't they magical? They do feel like magic to me. I mean, how is it able to predict what word do you want to type?

According to Norvig, some of the most talented engineers at Google don't have a complete understanding of how a spellchecker works.

Peter Norvig's solution in Python is so elegant and simple, I don't think anyone can write better than that. However, I wanted to know how it works, so I thought of building it to understand it's functionality. Also, an excuse to exercise my C++ skills.

So, are you as curious as I am? If you are, I think you're in the right spot.

## How it works?

Okay, so think about it? What does a spellchecker do? You type in a misspelled word and it returns a word with the highest probability, right?

Well, there's a little bit more to it.

### Create a word dictionary

First, we must create a dictionary, in order to do that, you need to extract words from a large piece of text and store it in a Hashmap in which each word will have a word count. In this example, I've used a Sherlock Holmes novel (which is around 6MB). The words are extracted from a novel instead of an actual dictionary because it can be used to create a simple Language Model.

Source code to create a dictionary:

```    ```
void SpellChecker::extractWords(string &filename)
{
ifstream infile;
infile.open(filename);
string x;
while(infile >> x)
{
x = filterAlpha(x);
dictionary[x]++;
}
}

string SpellChecker::filterAlpha(string &word)
{
for(int i=0; i<word.size(); i++)
{
char ch = word[i];

if(ch < 0)
{
word[i] = '-';
}

if(isalpha(ch))
{
word[i] = tolower(ch);
}
}

return word;
}
```
```

### Create a list of candidates

Second, we must able to predict/hypothesize the ways of editing text when the user types. It could be one of the following types of editing:

Based on the types of edits a user could make, we can generate a list of possible candidates by creating permutations using these edit methods mentioned above.

#### Adding a letter

In this method, you generate a list of candidates by inserting a letter in every iteration.

```    ```
void SpellChecker::inserts(string &word, Vector &result)
{
for(int i=0; i<word.size()+1; i++)
{
for(int j=0; j<alphabets.size(); j++)
{
char ch = alphabets[j];
result.push_back(word.substr(0,i) + ch + word.substr(i));
}
}
}
```
```

#### Replacing a letter

In this method, you generate a list of candidates by replacing each character with a letter from a list of alphabets in every iteration.

```    ```
void SpellChecker::replaces(string &word, Vector &result)
{
for(int i=0; i<word.size(); i++)
{
for(int j=0; j<alphabets.size(); j++)
{
char ch = alphabets[j];
result.push_back(word.substr(0,i) + ch + word.substr(i+1));
}
}
}
```
```

#### Switching two adjacent letters

In this method, you generate a list of candidates by switcing two adjacent letters in every iteration. For example: the word "ornage" would look like this: "orange", when the letters "n" and "a" are swapped.

```    ```
void SpellChecker::transposes(string &word, Vector &result)
{
for(int i=0; i<word.size()-1; i++)
{
result.push_back(word.substr(0,i) + word[i+1] + word[i] + word.substr(i+2));
}
}
```
```

#### Removing a letter

In this method, you generate a list of candidates by removing a letter in every iteration.

```    ```
void SpellChecker::deletes(string &word, Vector &result)
{
for(int i=0; i<word.size(); i++)
{
result.push_back(word.substr(0,i) + word.substr(i+1));
}
}
```
```

All of these methods are called in one wrapper method:

```    ```
void SpellChecker::edits(string &word, Vector &result)
{
//Deletion
deletes(word, result);

//Transposition
transposes(word, result);

//Replacement
replaces(word, result);

//Insertion
inserts(word, result);
}
```
```

### Extract the known words

Third, at this stage, the above step would've generated a huge list of words but 90% of them would be gibberish, so we need to "clean" the list and extract the known words using the dictionary we've created.

```    ```
void SpellChecker::known_words(Vector& results, Dictionary &candidates)
{
Dictionary::iterator end = dictionary.end();

for(int i=0; i<results.size(); i++)
{
Dictionary::iterator val = dictionary.find(results[i]);

if(val != end)
{
candidates[val->first] = val->second;
}
}
}
```
```

The edits() method apply to words that have a edit distance of 1, what if it was 2 or more? Like if the user typed "the", it could've been "then" or "they". So, all you have to do is create a method that generates a new set of permutations based on the already generated list of edited words and extract the known words.

```    ```
void SpellChecker::edits2(Vector &result, Dictionary &candidates)
{
for(int i=0; i<result.size(); i++)
{
Vector edit2;

edits(result[i], edit2);
known_words(edit2, candidates);
}
}
```
```

### Display the correct word

In order to determine the correct word, the following possibilities are considered:

1. Check if this word is in the dictionary, if it does, display it.
2. Generate known words that have an edit distance of 1 and check in the dictionary, if it does, display it.
3. Generate known words that have an edit distance of 2 and check in the dictionary, if it does, display it.
4. If all else fails, this word is unique or not a known word.
```    ```
string SpellChecker::correct(string &word)
{
Vector result;
Dictionary candidates;

string file = "big.txt";

//1. if it's in the dictionary, display it
if(dictionary.find(word) != dictionary.end())
{
return word;
}

extractWords(file);

edits(word, result);
known_words(result, candidates);

//2. if it's a known word but one edit away
if(candidates.size() > 0)
{
return max_element(candidates.begin(), candidates.end())->first;
}

//3. if it's a known word but two edits away
edits2(result, candidates);

if(candidates.size() > 0)
{
return max_element(candidates.begin(), candidates.end())->first;
}

//4. Display nothing if it doesn't exist
return "This word doesn't exist!";
}
```
```

However, for conditions 2 and 3, the word displayed would most likely have the highest word frequency in the dictionary.

## Conclusion

Phew! I hope that wasn't a long read. Although, I've written this on C++, I plan to rewrite this on Javascript for some of my future projects.

To be honest, I don't think it's completely accurate, although, I got most of the words correct when tested.

The source code can be found on my GitHub repository.