Build a Vowel Restoration FST: CS539 Homework Guide

Introduction: Why Remove Vowels?

Have you ever seen a text like THS S NT NGLSH WTHT VWLS? While it might look like a puzzle, in many languages such as Arabic and Hebrew, vowels are often omitted in writing. This assignment from CS539 explores building a finite-state machine to automatically restore vowels. But why is this relevant today? In 2026, AI-powered text restoration is used in apps like autocorrect, predictive text, and even in decoding ancient manuscripts. Understanding FSTs gives you a powerful tool for natural language processing (NLP).

Understanding the Assignment

The task involves creating a finite-state acceptor (FSA) for English words and a finite-state transducer (FST) that removes vowels. Then, by composing these, you restore vowels with improved accuracy. The dataset includes a vocabulary list (vocab) and sample strings.

Step 1: Building the FSA for English Words

First, you need to create an FSA english.fsa that accepts all strings of English words separated by underscores. The FSA should be letter-based. Write a Python script make.py that reads vocab and outputs the FSA in Carmel format. A prefix tree (trie) approach works well. For example, for words "AN", "AT", "AGE", the FSA would have transitions for each letter. Use carmel -c english.fsa to count states and transitions—expect around 250k states.

Step 2: Creating the Vowel-Removal FST

Next, build remove-vowels.fst that maps input letters to output letters, deleting vowels (A, E, I, O, U). The FST must preserve word boundaries (underscores). For instance, Y O U _ A R E should output Y _ R. Test it with carmel -slibOEWk 1 remove-vowels.fst on the provided strings.

Step 3: Vowel Restoration via Backward Application

Using the FST in reverse, you can restore vowels. Run carmel -sribIEWk 1 remove-vowels.fst on strings.novowels. But accuracy is only about 1.3% because the FST alone has no language model—it guesses vowels randomly.

Step 4: Improving Accuracy with FSA Composition

To boost accuracy, compose your FSA with the FST. This constrains the output to valid English words. Use carmel -sribIEWk 1 english.fsa remove-vowels.fst. Accuracy jumps to around 30%, but still low because the FSA is a simple word list without context.

Step 5: Further Improvements

To exceed 90%, incorporate n-gram language models. For example, use a bigram or trigram WFSA instead of a unigram. Alternatively, use a weighted FST that assigns probabilities to vowel sequences. In 2026, such techniques power modern autocorrect and speech recognition.

Conclusion

By combining FSAs and FSTs, you can build effective vowel restoration systems. This tutorial gives you a solid foundation for finite-state methods in NLP.