
Natural Language Processing: Fundamentals

Term Frequency
  • $tf(t,d) = f_{t,d}$, where $t \in T$ and $d \in D$
  • $f_{t,d}$: the number of times that term $t$ occurs in document $d$
  • Often normalized by document length: occurrences of $t$ in doc $d$ divided by the total words in $d$, i.e. $\frac{\text{term count}}{\text{total terms}}$
Inverse Document Frequency
  • $idf(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|} = \log \frac{|D|}{n_t}$, i.e. "total docs" / "# docs w/ t"
    • $N$: total number of documents in the corpus, $N = |D|$
    • $n_t = |\{d \in D : t \in d\}|$: the number of documents in which the term $t$ appears
  • A measure of how much information the word provides, i.e., whether it is common or rare across all documents (see the tf-idf sketch after this list).
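A minimal from-scratch sketch of the two formulas above (the toy corpus and function names are illustrative, not from any library):

```python
import math

# Toy corpus: each document is a list of lowercase terms (made-up data).
docs = [
    ["dogs", "like", "cats"],
    ["cats", "like", "fish"],
    ["dogs", "like", "dogs"],
]

def tf(term, doc):
    # Normalized term frequency: term count / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / |{d in D : t in d}|); assumes the term appears in at least one doc.
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("dogs", docs[0], docs))  # ~0.135: "dogs" is common -> low idf
print(tf_idf("fish", docs[1], docs))  # ~0.366: "fish" is rare   -> high idf
```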
Cosine Similarity
  • $\cos(\theta) = \frac{A \cdot B}{||A|| \cdot ||B||}$
  • $\cos(\theta)$: the cosine of the angle $\theta$ between the two vectors $A$ and $B$
  • $A \cdot B$: dot product of the two vectors
  • $||A|| \cdot ||B||$: product of the magnitudes of the two vectors
  • Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.
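A minimal NumPy sketch of the formula above (the example vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||); assumes non-zero vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # parallel to a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0  (same direction)
print(cosine_similarity(a, c))  # 0.0  (perpendicular)
```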
Tokens

In a language model, the atomic unit that the model trains on and makes predictions on. A token is typically one of the following:

  • A word: Example: the phrase "dogs like cats" consists of 3 word tokens: "dogs", "like", "cats".
  • A subword: a single word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix. Example: "dogs" -> 2 tokens -> root "dog" & suffix "s"; "taller" -> 2 tokens -> root "tall" & suffix "er".
  • A character: Example -> the phrase "bike fish" consists of 9 character tokens. (note: blank space counts as one of the tokens)
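A minimal sketch of the word-level and character-level cases (subword tokenization needs a trained vocabulary such as BPE, so it is left out here):

```python
phrase = "dogs like cats"

# Word tokens: a whitespace split is the simplest scheme; real tokenizers
# also handle punctuation, casing, etc.
word_tokens = phrase.split()
print(word_tokens)       # ['dogs', 'like', 'cats'] -> 3 word tokens

# Character tokens: every character counts, including the blank space.
char_tokens = list("bike fish")
print(len(char_tokens))  # 9 character tokens
```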
Stemming

It is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form. A great tool to use is the PorterStemmer2 library.
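As a minimal sketch, NLTK's PorterStemmer (a standard implementation of the Porter algorithm, used here in place of the library named above; assumes NLTK is installed) performs exactly this reduction:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["running", "dogs", "connection", "easily"]:
    # Stems are not always dictionary words, e.g. "easily" -> "easili".
    print(word, "->", stemmer.stem(word))
```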

Lemmatization

It looks beyond simple word reduction and uses a language's full vocabulary to apply morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma.

  • meeting => meet (core-word extraction)
  • was => be (tense conversion to present tense)
  • mice => mouse (plural to singular)
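A minimal sketch of the three examples above with NLTK's WordNetLemmatizer (assumes NLTK is installed and the WordNet data has been downloaded):

```python
import nltk
from nltk.stem import WordNetLemmatizer  # pip install nltk

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))              # mouse (default POS is noun)
print(lemmatizer.lemmatize("was", pos="v"))      # be    (verbs need pos="v")
print(lemmatizer.lemmatize("meeting", pos="v"))  # meet
```

Note that the lemmatizer needs the part of speech to resolve forms like "was"; without it, the default POS is noun.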
Word Embeddings

Word embeddings are capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
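For example, a minimal Word2Vec sketch with gensim (the toy sentences are made up; `vector_size` is the gensim 4.x parameter name):

```python
from gensim.models import Word2Vec  # pip install gensim

# Toy corpus of tokenized sentences; real training needs far more text.
sentences = [
    ["dogs", "like", "cats"],
    ["cats", "like", "fish"],
    ["dogs", "chase", "cats"],
]

# Train a small Word2Vec model: each word gets a dense vector (embedding).
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=42)

print(model.wv["dogs"].shape)               # (10,): the embedding for "dogs"
print(model.wv.similarity("dogs", "cats"))  # cosine similarity of the two vectors
```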