Document Embeddings and TF-IDF
Last updated on 2025-03-21
Estimated time: 30 minutes
Overview
Questions
- What is a document embedding?
- What is TF-IDF?
Objectives
- Produce TF-IDF matrix on a corpus
- Understand how TF-IDF relates to rare/common words
The method of using word counts is just one way we might embed a
document in vector space.
Let’s talk about more complex and representational ways of constructing
document embeddings.
To start, imagine we want to represent each word in our model individually, instead of considering an entire document. Representations of individual words in vector space are called "word embeddings", and they are an important concept in NLP.
One hot encoding: Limitations
How would we make word embeddings for a simple document such as “Feed the duck”?
Let’s imagine we have a vector space with a million different words in our corpus, and we are just looking at part of the vector space below.
|          | dodge | duck | … | farm | feather | feed | … | tan | the |
|----------|-------|------|---|------|---------|------|---|-----|-----|
| feed     | 0     | 0    | … | 0    | 0       | 1    | … | 0   | 0   |
| the      | 0     | 0    | … | 0    | 0       | 0    | … | 0   | 1   |
| duck     | 0     | 1    | … | 0    | 0       | 0    | … | 0   | 0   |
| Document | 0     | 1    | … | 0    | 0       | 1    | … | 0   | 1   |
Similar to what we did in the previous lesson, we can see that each word embedding gives a 1 for a dimension corresponding to the word, and a zero for every other dimension. This kind of encoding is known as “one hot” encoding, where a single value is 1 and all others are 0.
Once we have all the word embeddings for each word in the document, we sum them all up to get the document embedding. This is the simplest and most intuitive way to construct a document embedding from a set of word embeddings.
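As a concrete illustration, here is a minimal sketch of this summing step. The tiny vocabulary and the one_hot helper are invented for this example, standing in for the million-word space above.
PYTHON
import numpy as np

# A tiny vocabulary standing in for the full million-word space (illustrative only)
vocab = ["dodge", "duck", "farm", "feather", "feed", "tan", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # 1 in the dimension for this word, 0 everywhere else
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

# Sum the word embeddings to get the document embedding
document = "feed the duck".split()
doc_embedding = sum(one_hot(word) for word in document)
print(doc_embedding)  # [0 1 0 0 1 0 1]
The resulting vector matches the Document row of the table above for the columns shown.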
But does it accurately represent the importance of each word?
Our next model, TF-IDF, will embed words with different values rather than just 0 or 1.
TF-IDF Basics
Currently our model assumes all words are created equal and are all equally important. However, in the real world we know that certain words are more important than others.
For example, in a set of novels, knowing one novel contains the word "the" 100 times does not tell us much about it. However, if the novel contains a rarer word such as "whale" 100 times, that may tell us quite a bit about its content.
A more accurate model would weigh these rarer words more heavily, and more common words less heavily, so that their relative importance is part of our model.
However, rare is a relative term. In a corpus of documents about blue whales, the term "whale" may be present in nearly every document. In that case, other words may be rarer and more informative. How do we determine these weights?
One method for constructing more advanced word embeddings is a model called TF-IDF.
TF-IDF stands for term frequency-inverse document frequency and can be calculated for each document, d, and term, t, in a corpus. The calculation consists of two parts: term frequency and inverse document frequency. We multiply the two terms to get the TF-IDF value.
Term frequency(t,d) is a measure of how frequently a term, t, occurs in a document, d. The simplest way to calculate term frequency is to count the number of times a term occurs in a document and divide by the total word count of the document.
Inverse document frequency measures how informative a term is across the corpus. Document frequency, DF(t), is the number of documents a term occurs in, and inverse document frequency gives higher scores to words that occur in fewer documents. This is represented by the equation:
IDF(t) = ln[(N+1) / (DF(t)+1)]
where…
- N represents the total number of documents in the corpus
- DF(t) represents document frequency for a particular term/word, t. This is the number of documents a term occurs in.
The key thing to understand is that words that occur in many documents produce smaller IDF values since the denominator grows with DF(t).
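To make the two parts concrete, here is a minimal sketch that computes TF, the smoothed IDF above, and their product for a tiny invented corpus of three documents (made up purely for illustration).
PYTHON
import math

# Three invented documents standing in for a corpus (illustrative only)
corpus = [
    "feed the duck".split(),
    "the farm had a tan feather".split(),
    "the duck swam past the farm".split(),
]
N = len(corpus)

def term_frequency(term, document):
    # Times the term occurs in the document, divided by the document's word count
    return document.count(term) / len(document)

def inverse_document_frequency(term):
    # IDF(t) = ln[(N + 1) / (DF(t) + 1)], where DF(t) is the number of documents containing t
    df = sum(1 for document in corpus if term in document)
    return math.log((N + 1) / (df + 1))

# TF-IDF of a few terms in the first document ("feed the duck")
for term in ["the", "duck", "feed"]:
    score = term_frequency(term, corpus[0]) * inverse_document_frequency(term)
    print(term, round(score, 3))
With this smoothing, "the" (present in every document) scores 0, while the rarer "feed" keeps the highest weight.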
We can also embed documents in vector space using TF-IDF scores rather than simple word counts. This also weakens the impact of stop words since, due to their common nature, they have very low scores.
Now that we’ve seen how TF-IDF works, let’s put it into practice.
Worked Example: TF-IDF
Earlier, we preprocessed our data to lemmatize each file in our corpus, then saved our results for later.
Let’s load our data back in to continue where we left off:
PYTHON
from pandas import read_csv

# Load the lemmatized corpus data we saved in the previous step
data = read_csv("/content/drive/My Drive/Colab Notebooks/text-analysis/data/data.csv")
TF-IDF Vectorizer
Next, let's load a vectorizer from sklearn that will help represent our corpus in TF-IDF vector space for us.
PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer

# input='filename' makes the vectorizer read each document from a path on disk
vectorizer = TfidfVectorizer(input='filename', max_df=.6, min_df=.1)
Here, max_df=.6 removes terms that appear in more than 60% of our documents (overly common words like the, a, an) and min_df=.1 removes terms that appear in less than 10% of our documents (overly rare words like specific character names, typos, or punctuation the tokenizer doesn't understand). We're looking for that sweet spot where terms are frequent enough for us to build theoretical understanding of what they mean for our corpus, but not so frequent that they can't help us tell our documents apart.
Now that we have our vectorizer loaded, let's use it to represent our data.
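A minimal sketch of the fitting step, assuming the CSV we loaded contains a column of file paths to the lemmatized texts (called "File" here, which is an assumption about how the paths were saved):
PYTHON
# Fit the vectorizer on the corpus files and transform them into TF-IDF vectors
tfidf = vectorizer.fit_transform(list(data["File"]))
print(tfidf.shape)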
OUTPUT
(41, 9879)
Here, tfidf.shape shows us the number of rows (books) and columns (words) in our model.
Check Your Understanding: max_df and min_df
Try different values for max_df and min_df. How does increasing/decreasing each value affect the number of columns (words) that get included in the model?
Increasing max_df results in more words being included in the model, since a higher max_df corresponds to accepting more common words into the model. A higher max_df accepts more words likely to be stopwords.
Inversely, increasing min_df reduces the number of words in the model, since a higher min_df corresponds to removing more rare words from the model. A higher min_df removes more words likely to be typos, names of characters, and so on.
Inspecting Results
We have a huge number of dimensions in the columns of our matrix (just shy of 10,000), each of which represents a word. We also have a number of documents (about forty), each represented as a row.
Let’s take a look at some of the words in our documents. Each of these represents a dimension in our model.
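One way to peek at them is the vectorizer's get_feature_names_out() method, which returns the vocabulary in alphabetical order; slicing off the first five entries is an illustrative choice.
PYTHON
# Peek at the first few words (columns/dimensions) in the vocabulary
vectorizer.get_feature_names_out()[0:5]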
OUTPUT
array(['15th', '1st', 'aback', 'abandonment', 'abase'], dtype=object)
What is the weight of those words?
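These weights come from the fitted vectorizer's idf_ attribute, which stores one IDF weight per vocabulary word; note that scikit-learn's smoothed IDF adds 1 to the logarithm described earlier, which is why the values sit above 1. A minimal sketch:
PYTHON
# IDF weight for each of those first five words
vectorizer.idf_[0:5]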
OUTPUT
[2.79175947 2.94591015 2.25276297 2.25276297 2.43508453]
Let’s show the weight for all the words:
PYTHON
from pandas import DataFrame

# Put each vocabulary word and its IDF weight into a table
tfidf_data = DataFrame(vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["Weight"])
tfidf_data
OUTPUT
Weight
15th 2.791759
1st 2.945910
aback 2.252763
abandonment 2.252763
abase 2.435085
... ...
zealously 2.945910
zenith 2.791759
zest 2.791759
zigzag 2.945910
zone 2.791759
That was ordered alphabetically. Let's try sorting from lowest to highest weight:
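A minimal sketch of the sort, using pandas:
PYTHON
# Sort by IDF weight, most common (lowest-weight) words first
tfidf_data.sort_values(by="Weight")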
OUTPUT
Weight
unaccountable 1.518794
nest 1.518794
needless 1.518794
hundred 1.518794
hunger 1.518794
... ...
incurably 2.945910
indecent 2.945910
indeed 2.945910
incantation 2.945910
gentlest 2.945910
Your Mileage May Vary
The results above will differ based on how you configured your tokenizer and vectorizer earlier.
Values are no longer just whole numbers such as 0, 1 or 2. Instead, they are weighted according to how often they occur. More common words have lower weights, and less common words have higher weights.
TF-IDF Summary
In this lesson, we learned about document embeddings and how they can be constructed in multiple ways. While one-hot encoding is a simple way of creating embeddings, it may not be the best representation. TF-IDF is another way of performing these embeddings that improves the representation of words in our model by weighting them. TF-IDF is often used as an intermediate step in some of the more advanced models we will construct later.
Key Points
- Some words convey more information about a corpus than others
- One-hot encodings treat all words equally
- TF-IDF encodings give overly common words lower weights