
Natural Language Processing: Fundamentals

Term Frequency
  • $tf(t,d) = f_{t,d}$, where $t \in T$ and $d \in D$
  • $f_{t,d}$: the number of times that term $t$ occurs in document $d$
  • Often normalized by document length: occurrences of $t$ in doc $d$ divided by the total words in $d$, i.e. $\frac{\text{term count}}{\text{total terms}}$
Inverse Document Frequency
  • $idf(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|} = \log \frac{|D|}{n_t}$, i.e. "total docs" / "# docs w/ t"
    • $N$: total number of documents in the corpus, $N = |D|$
    • $n_t = |\{d \in D : t \in d\}|$: the number of documents in which the term $t$ appears
  • A measure of how much information the word provides, i.e., whether it is common or rare across all documents (see the tf-idf sketch after this list).
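A minimal from-scratch sketch of the two formulas above (the toy corpus and function names are illustrative, not from any library):

```python
import math

# Toy corpus: each document is a list of lowercase terms (made-up data).
docs = [
    ["dogs", "like", "cats"],
    ["cats", "like", "fish"],
    ["dogs", "like", "dogs"],
]

def tf(term, doc):
    # Normalized term frequency: term count / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / |{d in D : t in d}|); assumes the term appears in at least one doc.
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("dogs", docs[0], docs))  # ~0.135: "dogs" is common -> low idf
print(tf_idf("fish", docs[1], docs))  # ~0.366: "fish" is rare   -> high idf
```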
Cosine Similarity
  • $\cos(\theta) = \frac{A \cdot B}{||A|| \cdot ||B||}$
  • $\cos(\theta)$: the cosine of the angle $\theta$ between the two vectors $A$ and $B$
  • $A \cdot B$: dot product of the two vectors
  • $||A|| \cdot ||B||$: product of the magnitudes of the two vectors
  • Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.
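A minimal NumPy sketch of the formula above (the example vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||); assumes non-zero vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # parallel to a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0  (same direction)
print(cosine_similarity(a, c))  # 0.0  (perpendicular)
```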
Tokens

In a language model, the atomic unit that the model trains on and makes predictions on. A token is typically one of the following:

  • A word: Example: the phrase "dogs like cats" consists of 3 word tokens: "dogs", "like", "cats".
  • A subword: a single word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix. Example: "dogs" -> 2 tokens -> root "dog" & suffix "s"; "taller" -> 2 tokens -> root "tall" & suffix "er".
  • A character: Example -> the phrase "bike fish" consists of 9 character tokens. (note: blank space counts as one of the tokens)
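A minimal sketch of the word-level and character-level cases (subword tokenization needs a trained vocabulary such as BPE, so it is left out here):

```python
phrase = "dogs like cats"

# Word tokens: a whitespace split is the simplest scheme; real tokenizers
# also handle punctuation, casing, etc.
word_tokens = phrase.split()
print(word_tokens)       # ['dogs', 'like', 'cats'] -> 3 word tokens

# Character tokens: every character counts, including the blank space.
char_tokens = list("bike fish")
print(len(char_tokens))  # 9 character tokens
```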
Stemming

It is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form. A great tool to use is the PorterStemmer2 library.
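As a minimal sketch, NLTK's PorterStemmer (a standard implementation of the Porter algorithm, used here in place of the library named above; assumes NLTK is installed) performs exactly this reduction:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["running", "dogs", "connection", "easily"]:
    # Stems are not always dictionary words, e.g. "easily" -> "easili".
    print(word, "->", stemmer.stem(word))
```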

Lemmatization

It looks beyond simple word reduction and uses a language's full vocabulary to apply morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma.

  • meeting => meet (core-word extraction)
  • was => be (tense conversion to present tense)
  • mice => mouse (plural to singular)
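A minimal sketch of the three examples above with NLTK's WordNetLemmatizer (assumes NLTK is installed and the WordNet data has been downloaded):

```python
import nltk
from nltk.stem import WordNetLemmatizer  # pip install nltk

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))              # mouse (default POS is noun)
print(lemmatizer.lemmatize("was", pos="v"))      # be    (verbs need pos="v")
print(lemmatizer.lemmatize("meeting", pos="v"))  # meet
```

Note that the lemmatizer needs the part of speech to resolve forms like "was"; without it, the default POS is noun.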
Word Embeddings

Word embeddings are capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
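For example, a minimal Word2Vec sketch with gensim (the toy sentences are made up; `vector_size` is the gensim 4.x parameter name):

```python
from gensim.models import Word2Vec  # pip install gensim

# Toy corpus of tokenized sentences; real training needs far more text.
sentences = [
    ["dogs", "like", "cats"],
    ["cats", "like", "fish"],
    ["dogs", "chase", "cats"],
]

# Train a small Word2Vec model: each word gets a dense vector (embedding).
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=42)

print(model.wv["dogs"].shape)               # (10,): the embedding for "dogs"
print(model.wv.similarity("dogs", "cats"))  # cosine similarity of the two vectors
```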