Data Structures

Project: Spell Checker

Objective

During this activity, students will be able to:


Description

This activity must be developed in the pre-assigned teams of two.

In a source file called spellchecker.cpp write a complete C++ program that takes as a command line argument the name of an input text file containing some text in English and checks to see if it has any spelling errors, in which case it gives suggestions of alternative words to use.

For example, assume you have the following file called tooinkle.txt:

Tooinkle, tooinkle, little bat!
How I wandor what you're at!
Up above the wurlt you flai,
Like a tea-tray in the Jabberwockybandersnatch.

When running your program from the terminal like this:

./spellchecker tooinkle.txt

the expected output should be:

Unrecognized word: "Tooinkle". First found at line 1, column 1.
Suggestions: tanaquil, taneously, tangail, tangala, tangela,
tangelo, tanglao, tangle, tanjil, tansley, temecula, tenaglia,
tencel, tensely, tensile, tenuously, timeously, timshel, tingle,
tingley, tingly, tinguely, tinkle, tinsel, tinsley, tinuously,
tnweakle, tomasulo, tomsal, tonsil, townsley, tunceli, twinkle,
twinkly

Unrecognized word: "wandor". First found at line 2, column 7.
Suggestions: wander, wanter, wender, wendouree, whynter, winder,
windeyer, windir, windrow, winter, wintery, winther, wintour,
wintry, wmweather, wntr, wonder, wondir, wounder, wunder,
wwwamateur, wwwmature, wynder, wynter

Unrecognized word: "wurlt". First found at line 3, column 14.
Suggestions: wereld, whirled, whorled, wiiralt, world, worlde,
worldy, worlwide, woworld, wrld, wurld

Unrecognized word: "flai". First found at line 3, column 24.
Suggestions: fahl, faial, fail, faile, faily, fal, fala, falah,
fale, fali, falu, faul, fawley, feal, feel, feeley, feely, feil,
feile, fel, fela, fele, feli, feliu, feliway, feola, fhl, fial,
fiala, fiel, fil, fila, file, fileio, filey, filho, fili, filia,
filii, filio, filo, fiol, fl, fla, flaw, flay, fle, flea, flee,
flew, flh, fli, flie, flo, floe, floo, flow, flowe, flowy, floy,
flu, flue, fluo, flw, fly, flyaway, flye, flyway, foal, foale,
foaol, foel, foil, fol, fola, foley, folha, folia, folie, folio,
foloe, folow, fooal, fooel, fool, foola, foole, foolo, foolw,
fooly, foooel, foool, fooool, foul, foula, foule, fowl, fowle,
fowley, fowlie, foyil, foyle, fuel, ful, fula, fule, fwl, fyle

Unrecognized word: "Jabberwockybandersnatch". First found at line 4,
column 24.
No suggestions.

You must note the following:

  1. A word is considered misspelled (unrecognized) if it’s not one of the words listed in the words.txt file (2.69 MiB).

  2. A misspelled word is only reported once when it’s found for the first time, even if it appears multiple times later on.

  3. When reporting a misspelled word you must display in what row and column it was found. Also, you must provide a comma separated list of word suggestions. These suggestions are all the words from words.txt that have the same Soundex code as the misspelled word. If no such words exist, the program should print “No suggestions” for that specific misspelled word.

  4. An empty line should follow each reported misspelled word.

  5. The program should display nothing if there are no misspelled words in the input file.

  6. The program should be implemented so that it runs as fast as possible. Be wise and choose the most convenient data structures and algorithms in order to achieve this. You’re allowed to implement your own algorithms/data structures, use the ones provided by the C++ standard library, or a combination of both.

Test Input Files

Make sure to test your program with the following input files:

Soundex

Soundex is a phonetic algorithm that can locate words with similar sounds. A Soundex search method takes a word as input and outputs a character string that identifies a group of words that are (roughly) phonetically similar or sound (approximately) the same.

The Soundex algorithm to find Soundex index is as follows:

For example, to compute the Soundex code of “Monterrey”:

So, the Soundex code of “Monterrey” is M536600.

[source]

Starter Code

You can use the following code snippet to read the full contents of an existing text file and extract its individual words. The function read_words receives the name of an input file and a vector of word structures passed by reference where the result is to be placed. At the end, each element in the vector will contain the text of a single word as well as the line and column numbers where it was found. The function returns true if the operation was successful, or false otherwise.

#include <fstream>
#include <regex>
#include <vector>

struct word {
    std::string text;
    int line;
    int column;
};

bool read_words(
        const std::string input_file_name,
        std::vector<word>& words)
{
    std::ifstream input_file(input_file_name);
    if (input_file.fail()) {
        return false;
    }
    std::regex reg_exp("[a-zA-Z]+");
    std::smatch match;
    std::string text;
    int line = 0;
    int column = 0;
    while (std::getline(input_file, text)) {
        ++line;
        column = 1;
        while (std::regex_search(text, match, reg_exp)) {
            column += match.position();
            words.push_back({match.str(), line, column});
            column += match.length();
            text = match.suffix().str();
        }
    }
    return true;
}

The following code shows how you can use the read_words function from some other part of your program:

std::string file_name = "tooinkle.txt";
std::vector<word> words;

if (read_words(file_name, words)) {
    for (word w : words) {
        std::cout << w.text << "  (line " << w.line
            << ", column " << w.column << ")\n";
    }
} else {
    std::cout << "Unable to read file: "
        << file_name << "\n";
}

The expected output is:

Tooinkle  (line 1, column 1)
tooinkle  (line 1, column 11)
little  (line 1, column 21)
bat  (line 1, column 28)
How  (line 2, column 1)
I  (line 2, column 5)
wandor  (line 2, column 7)
what  (line 2, column 14)
you  (line 2, column 19)
re  (line 2, column 23)
at  (line 2, column 26)
Up  (line 3, column 1)
above  (line 3, column 4)
the  (line 3, column 10)
wurlt  (line 3, column 14)
you  (line 3, column 20)
flai  (line 3, column 24)
Like  (line 4, column 1)
a  (line 4, column 6)
tea  (line 4, column 8)
tray  (line 4, column 12)
in  (line 4, column 17)
the  (line 4, column 20)
Jabberwockybandersnatch  (line 4, column 24)

Deliverables

Place in a comment at the top of the spellchecker.cpp source file the authors’ personal information (student ID and name), for example:

/*----------------------------------------------------------
 * Project: Spell Checker
 *
 * Date: 30-Nov-2022
 * Authors:
 *           A01770771 Kamala Khan
 *           A01777771 Carol Danvers
 *----------------------------------------------------------*/

Create a ZIP file called spellchecker.zip containing the spellchecker.cpp source file plus any other files required to build and run your application (Makefile, words.txt, test input files, etc.).


Upload Instructions

To deliver the spellchecker.zip file, please provide the following information:

Request PIN

Only one team member needs to upload the file.

Due date is Wednesday, November 30.