Search Engines - a Beginner-Friendly Deep Dive

TF-IDF

Published 2/7/2025

Updated 2/7/2025

⚡️ Recap

In the last page, we analyzed the vocabulary of Æsop’s fables and identified that it follows a Zipfian distribution.

We used this intuition to shrink the size of our index by throwing away extremely high frequency terms, which tend to contain little context. We also threw away low-frequency terms which likely are noise from data processing or spelling mistakes.

In this page, we will look into using TF-IDF, a well-established information retrieval technique, to enhance our indexing and searching by boosting the terms which matter and throwing away the terms which don’t.

Scoring Pages & Terms

In the past pages, we explored simple algorithms to build indices and relied on a raw application of Zipf’s law to optimize them. But our method was not perfect:

  1. Just because a term is uncommon does not mean it is relevant: a rare term may simply be a typo or processing noise.
  2. Our search results scoring isn’t optimal: ranking on raw term counts alone doesn’t tell us how characteristic a term is of a particular document.

TF-IDF gives us a more principled way to address both of these problems.

💡 TF-IDF: Term Frequency - Inverse Document Frequency

Not all terms in a document are equally important. Some terms appear frequently across many documents and tell us little about each document’s unique content. Other terms appear rarely, perhaps only in one or two documents, making them especially significant indicators of what those documents are about.

TF-IDF captures this intuition and assigns a high value to terms which are frequent within a particular document (or a few documents) but infrequent throughout the entire indexed corpus. This allows TF-IDF to identify terms which uniquely characterize certain documents.

This significantly improves the quality of search and allows us to shrink the size of our index.

💡 Term Frequency (TF)

In the last page, where we applied Zipf’s law to reduce our index size, we focused on term frequency (TF). In TF-IDF, that is just one piece of the puzzle.

The role of TF is to boost terms which are frequently seen in a document. We can define it mathematically as follows:

TF(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

where

  • t is a term
  • d is a document
  • TF(t,d) is the term frequency of term t in document d
  • f_{t,d} is the count of the number of occurrences of t in d
  • \sum_{t' \in d} f_{t',d} is the count of all terms in document d, including duplicates.

[1] Term frequency is the frequency of a term, t, within a document. The numerator, f_{t,d}, is the raw count of the number of times the term t occurs in a document d.

However, there are other ways to define term frequency; some examples from Wikipedia [1] are listed below (a short code sketch follows the list):

  • Raw count: f_{t,d}
  • Boolean ‘frequency’: TF(t, d) = 1 if term t occurs in document d, and 0 otherwise
  • Log-scaled: TF(t, d) = \log(1 + f_{t,d})
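
To make these definitions concrete, here is a minimal Python sketch of the relative-frequency form of TF used in this series (the helper name term_frequency is my own, not part of any library):

from collections import Counter

def term_frequency(term, document):
    # Relative term frequency: raw count of `term` divided by the total number of terms in `document`
    counts = Counter(document)
    return counts[term] / len(document)

doc = 'the cat and dog play'.split()
print(term_frequency('cat', doc))  # 1 occurrence out of 5 terms -> 0.2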

💡 Inverse Document Frequency (IDF)

The role of inverse document frequency (IDF) is to measure how much a term helps distinguish a document from the rest of the corpus. It’s a logarithmically-scaled inverse fraction: the count of all documents divided by the count of documents which contain the term t.

Mathematically noted,

IDF(t, D) = \log\left(\frac{N}{n_t}\right)

Where

  • t is a term
  • D is a collection of documents (the corpus)
  • N is the number of documents in D
  • n_t is the number of documents in D which contain t

In this document, I’ll be using my own flavor of IDF, with add-one smoothing, to prevent weird logarithmic behavior in the extreme case where a term appears in no documents (which would otherwise mean dividing by zero):

IDF(t, D) = \log\left(\frac{N + 1}{n_t + 1}\right)
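
As a quick sanity check of this smoothed variant, here is a small sketch (inverse_document_frequency is my own helper name; the two-document corpus is made up for illustration):

import math

def inverse_document_frequency(term, corpus):
    # Smoothed IDF: log((N + 1) / (n_t + 1)), where n_t is the number of documents containing `term`
    n_t = sum(1 for document in corpus if term in document)
    return math.log((len(corpus) + 1) / (n_t + 1))

corpus = [doc.split() for doc in ['the cat and dog play', 'dog jump under the tree']]
print(inverse_document_frequency('cat', corpus))  # log(3 / 2) ≈ 0.41
print(inverse_document_frequency('the', corpus))  # log(3 / 3) = 0.0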

💡 TF-IDF

TF-IDF is defined as the product of TF and IDF.

TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)

Given a term, t, a document, d_0, and a corpus, D, a high TF-IDF score is achieved when d_0 has a high TF of t, and t appears in few of the other documents in D.
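
As a purely hypothetical worked example: suppose a document d_0 contains 100 terms, 5 of which are “cat”, and suppose the corpus D holds 1,000 documents of which 9 contain “cat”. Then TF = 5/100 = 0.05, the smoothed IDF = \ln\left(\frac{1000 + 1}{9 + 1}\right) \approx 4.61, and TF-IDF ≈ 0.05 \cdot 4.61 \approx 0.23. A term like “the”, which appears in nearly every document, would get an IDF close to 0 and therefore a TF-IDF close to 0 no matter how high its TF is.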

🤔 Some Definitions (putting it all together)

TF
Acronym for term frequency
[2] The frequency of a term in a document
Higher frequency means higher importance of a term
TF(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, where f_{t,d} is the count of the number of occurrences of term t in document d
IDF
Acronym for inverse document frequency
[2] Reduces the weight of terms common across multiple documents
If a term appears in fewer documents, it’s more likely to be important
IDF(t, D) = \log\left(\frac{N}{n_t}\right), where:
  • N: total number of documents in the corpus
  • n_t: number of documents in which t appears
TF-IDF
Acronym for term frequency - inverse document frequency
Captures the intuition that terms which are frequent within a document but rare across the corpus are important
[1] A measure of the importance of a term to a document relative to the other documents in a corpus (collection of documents)
TF-IDF(t, d, D) = TF(t, d) \cdot IDF(t, D)

🧑‍💻 Code

Let’s use some Python code to visualize and understand what TF-IDF does:

from collections import Counter
import pandas as pd
import numpy as np
import math

# Here, we create 4 sample "documents" to implement TF-IDF on, as an array of words (terms)
corpus = [ document.split()
    for document in [
        'the cat and dog play',
        'the cat love dog and mouse',
        'dog jump under the tree',
        'cat play in the sun',
    ]
]

N = len(corpus)

# Count document frequency: the number of documents each term appears in
idf_counter = Counter()
for d in corpus:
    vocab = set(d)
    for t in vocab:
        idf_counter[t] += 1

# Create a list of (term, IDF) tuples and a mapping dictionary for quick lookup
idf_list = [(term, math.log((1 + N) / (1 + freq))) for term, freq in idf_counter.items()]
idf_map = { term: idf for term, idf in idf_list }

tf_vectors = []

# Calculate TF for each document
for d in corpus:
    counter = Counter(d)
    doc_length = len(d)  # total number of terms in the document, including duplicates

    # Compute TF for each term in the vocabulary (0 if the term is absent from this document)
    tf_values = { term: counter.get(term, 0) / doc_length for term in idf_map.keys() }
    tf_vectors.append(tf_values)

# Prepare DataFrame
df_data = {
    'Term': [ term for term, _ in idf_list ],
    'IDF': [ str(round(idf, 2)) for _, idf in idf_list]
}

df = pd.DataFrame(df_data).set_index('Term')

for index, d in enumerate(corpus):
    column_name = f'Doc {index + 1}'
    tf_values = tf_vectors[index]  # TF values computed above for this document
    tf_with_tfidf = {
        term: (
            round(tf_values.get(term, 0.0), 2),
            round(tf_values.get(term, 0.0) * idf_map[term], 2)
        ) for term in idf_map.keys()
    }

    df[column_name] = df.index.map(lambda term: tf_with_tfidf[term])

for index, d in enumerate(corpus):
    print(f'Doc {index + 1}: {" ".join(d)}')

df.style.set_caption('TF-IDF Demonstration. Tuple values are (TF, TF-IDF)')
Console Logs
Doc 1: the cat and dog play
Doc 2: the cat love dog and mouse
Doc 3: dog jump under the tree
Doc 4: cat play in the sun
Display Output
TF-IDF Demonstration. Tuple values are (TF, TF-IDF)
Term    IDF   Doc 1         Doc 2         Doc 3        Doc 4
and     0.51  (0.2, 0.1)    (0.17, 0.09)  (0.0, 0.0)   (0.0, 0.0)
play    0.51  (0.2, 0.1)    (0.0, 0.0)    (0.0, 0.0)   (0.2, 0.1)
dog     0.22  (0.2, 0.04)   (0.17, 0.04)  (0.2, 0.04)  (0.0, 0.0)
cat     0.22  (0.2, 0.04)   (0.17, 0.04)  (0.0, 0.0)   (0.2, 0.04)
the     0.0   (0.2, 0.0)    (0.17, 0.0)   (0.2, 0.0)   (0.2, 0.0)
mouse   0.92  (0.0, 0.0)    (0.17, 0.15)  (0.0, 0.0)   (0.0, 0.0)
love    0.92  (0.0, 0.0)    (0.17, 0.15)  (0.0, 0.0)   (0.0, 0.0)
tree    0.92  (0.0, 0.0)    (0.0, 0.0)    (0.2, 0.18)  (0.0, 0.0)
jump    0.92  (0.0, 0.0)    (0.0, 0.0)    (0.2, 0.18)  (0.0, 0.0)
under   0.92  (0.0, 0.0)    (0.0, 0.0)    (0.2, 0.18)  (0.0, 0.0)
sun     0.92  (0.0, 0.0)    (0.0, 0.0)    (0.0, 0.0)   (0.2, 0.18)
in      0.92  (0.0, 0.0)    (0.0, 0.0)    (0.0, 0.0)   (0.2, 0.18)

🤨 Analysis

Given our corpus, D:

  • ‘the cat and dog play’
  • ‘the cat love dog and mouse’
  • ‘dog jump under the tree’
  • ‘cat play in the sun’

Notice how the word ‘the’ has an IDF score of 0. Why? This low IDF score is due to the word “the” being in every single document in our corpus. Since x \cdot 0 = 0, every instance of the word “the” will have a TF-IDF score of 0.

On the other hand, the word “jump” has an IDF score of 0.92. That is because it is only seen in 1 of the 4 documents. Thus, \ln\left(\frac{1 + 4}{1 + 1}\right) \approx 0.92.

Looking at the above table, you can see that many of the words have a TF-IDF score of 0 in most documents, meaning they are unlikely to hold much important context.

We can attempt to shrink our search problem by only indexing terms whose TF-IDF score in a document meets a minimum threshold, as sketched below. Let’s try it with Æsop’s fables!
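
A rough sketch of that pruning idea, building on the tf_vectors and idf_map variables from the code above (the MIN_TFIDF threshold here is an arbitrary value chosen only for illustration):

MIN_TFIDF = 0.05  # arbitrary cutoff, chosen only for illustration

# Keep only the (term, document) pairs whose TF-IDF clears the threshold
pruned_index = {}
for doc_id, tf_values in enumerate(tf_vectors):
    for term, tf in tf_values.items():
        if tf * idf_map[term] >= MIN_TFIDF:
            pruned_index.setdefault(term, []).append(doc_id)

print(pruned_index)  # e.g. 'the' disappears entirely, while 'mouse' maps only to the second document (index 1)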

📚 Sources