Build a Python Information Retrieval Engine: Inverted Index & TF-IDF Tutorial

Introduction: Why Build Your Own Search Engine?

Every time you search Google, ask ChatGPT a question, or look up a product on Amazon, an information retrieval (IR) system works behind the scenes to deliver relevant results. In this tutorial, you'll learn how to build your own simple IR engine in Python, similar to what you'd implement for a CS 1656 assignment. By constructing an inverted index and computing TF-IDF scores, you'll gain hands-on experience with core concepts that power modern search, from academic databases to AI-powered knowledge retrieval.

What You'll Learn

Text preprocessing: lowercasing, punctuation removal, number elimination, and stemming using NLTK
Building an inverted index with term frequencies and document frequencies
Computing TF-IDF weights: w(key, doc) = (1 + log2 tf) * log2(N / df)
Ranking documents by relevance to a keyword query
Storing and loading data structures as JSON for persistence

Project Overview: pine-index.py and pine-search.py

You'll create two Python programs. The first, pine-index.py, reads text files from an input/ directory, preprocesses the text, and builds an inverted index saved as inverted-index.json. The second, pine-search.py, loads that index, reads keywords from keywords.txt, and outputs ranked document lists with scores and weight breakdowns.

This assignment structure mirrors real-world search pipelines: indexing offline, then querying online. Just like how a search engine pre-indexes the web so that your query returns results in milliseconds, your indexer does the heavy lifting once, and the searcher reuses it.

Step 1: Text Preprocessing

Before you can index documents, you need to clean them. The assignment specifies four preprocessing steps:

Convert to lowercase – ensures case-insensitive matching (e.g., 'Python' and 'python' are the same).
Eliminate punctuation – remove characters like .,!?;:"'()[] so that 'hello!' becomes 'hello'.
Eliminate numbers – remove numeric tokens (e.g., '2026', '3.14') because they rarely carry semantic weight in text retrieval.
Stemming – reduce words to their root form using the NLTK Porter stemmer (e.g., 'running' becomes 'run', 'better' becomes 'better').

Here's a sample preprocessing function:

import re
from nltk.stem import PorterStemmer

def preprocess(text):
    stemmer = PorterStemmer()
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)       # remove numbers
    words = text.split()
    return [stemmer.stem(word) for word in words if word]

This step is crucial: without it, your inverted index would treat 'Search', 'search', and 'searching' as different terms, reducing recall.

Step 2: Building the Inverted Index

An inverted index maps each term to a list of documents that contain it, along with the term frequency in each document. You also need to track the document frequency (number of documents containing the term) and total document count N.

Your data structure could be a dictionary of dictionaries:

inverted_index = {
    'word': {
        'doc_freq': 5,  # number of documents containing 'word'
        'postings': {
            'doc1.txt': 3,  # term frequency in doc1
            'doc2.txt': 1,
            ...
        }
    },
    ...
}

To build it, iterate over each file in the input/ directory, preprocess the text, and for each stemmed term, update the postings list and document frequency. After processing all documents, store the index as JSON:

import json
with open('inverted-index.json', 'w') as f:
    json.dump(inverted_index, f, indent=2)

This index is your search engine's brain—it lets you quickly find which documents contain a query term without scanning every file.

Step 3: Computing TF-IDF Weights

TF-IDF (Term Frequency-Inverse Document Frequency) is a classic weighting scheme that balances how often a term appears in a document (TF) against how rare it is across the collection (IDF). The formula from the assignment:

w(key, doc) = (1 + log2 freq(key, doc)) * log2 (N / n(doc))

Where:

freq(key, doc) = number of times the term appears in the document
n(doc) = number of documents containing the term (document frequency)
N = total number of documents

For example, if a term appears 4 times in a document (freq=4), appears in 2 out of 9 documents (n(doc)=2, N=9), then:

w = (1 + log2(4)) * log2(9/2) = (1 + 2) * log2(4.5) ≈ 3 * 2.17 = 6.51

The final relevance score for a document is the sum of TF-IDF weights across all query keywords.

Step 4: Implementing pine-search.py

Your search program should:

Load the inverted index from inverted-index.json.
Read each line of keywords.txt (space-separated keywords).
For each keyword, preprocess it the same way (lowercase, stem, etc.).
Look up each stemmed keyword in the index to get its postings and document frequency.
Compute TF-IDF weights for each document that contains any keyword.
Sum weights per document to get a relevance score.
Sort documents by descending score, breaking ties by filename order.
Output in the specified format, including a breakdown of weights per keyword.

Here's a snippet for computing scores:

import math

def compute_score(query_terms, index, N):
    scores = {}
    for term in query_terms:
        if term not in index:
            continue
        doc_freq = index[term]['doc_freq']
        idf = math.log2(N / doc_freq)
        for doc, tf in index[term]['postings'].items():
            weight = (1 + math.log2(tf)) * idf
            if doc not in scores:
                scores[doc] = {'total': 0.0, 'weights': {}}
            scores[doc]['total'] += weight
            scores[doc]['weights'][term] = weight
    return scores

Then sort and print in the required format, handling tied ranks by assigning the same rank number.

Real-World Connections: From Class to Career

Understanding inverted indexes and TF-IDF is not just for assignments. Modern AI systems like ChatGPT use retrieval-augmented generation (RAG) to pull relevant documents before generating answers. Search engines like Elasticsearch rely on inverted indexes for lightning-fast full-text search. Even recommendation systems use TF-IDF-like weighting to find similar items.

As of 2026, the rise of AI-powered search tools has made IR skills more valuable than ever. Whether you're building a chatbot that answers course FAQs or a research tool for a startup, the principles you learn here are directly applicable.

Common Pitfalls and Tips

Consistent preprocessing: Ensure both indexer and searcher apply identical preprocessing to query terms. Otherwise, a query for 'Running' might not match 'run' due to stemming mismatch.
Handle missing terms: If a keyword doesn't appear in the index, skip it gracefully.
JSON serialization: Make sure your dictionary keys are strings (they are by default) and that you use json.dump correctly.
Logarithm base: The formula uses log base 2. Use math.log2() in Python 3.
Testing: Use the provided sample documents and keywords to verify your output matches the expected format.

Conclusion

By completing this tutorial, you've built a functional information retrieval system from scratch. You've learned how to preprocess text, construct an inverted index, compute TF-IDF weights, and rank documents by relevance. These skills are foundational for anyone pursuing data science, AI, or software engineering. Now go ahead and implement pine-index.py and pine-search.py—you've got all the tools you need.