Natural Language Processing: Fundamentals
Term Frequency
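- $\mathrm{tf}(t, d) = \dfrac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$
- $\mathrm{tf}(t, d)$: number of occurrences of $t$ in document $d$, divided by the total number of words in $d$.
- $t, d$: the term and the document; the raw count is divided by the document length so that long documents do not dominate.
- $f_{t,d}$: number of times the term $t$ occurs in document $d$.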
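As a minimal sketch of the definition above (occurrences of the term divided by the number of words in the document), assuming the document is already split into a list of word tokens. The function name `term_frequency` and the sample sentence are illustrative, not part of the original notes.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Return the count of `term` divided by the number of tokens in the document."""
    if not document_tokens:
        # Assumption: an empty document gives a frequency of 0 rather than an error.
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # "the" occurs 2 times out of 6 tokens
print(term_frequency("dog", doc))  # a term that never occurs has frequency 0
```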
Inverse Document Frequency
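- $\mathrm{idf}(t, D) = \log \dfrac{N}{|\{d \in D : t \in d\}|}$
- $\mathrm{idf}(t, D)$: where $t$ is the term and $D$ is the corpus, the log of "total docs" / "# docs with $t$".
- $N$: total number of documents in the corpus, $N = |D|$.
- $|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears.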
- IDF measures how much information a term provides, i.e., whether it is common or rare across all documents.
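The "total docs / # docs with the term" ratio can be sketched as follows. This assumes each document is represented as a set of tokens and uses the natural logarithm (the notes do not fix a base). The name `inverse_document_frequency` and the toy corpus are illustrative, and raising an error for a term that appears nowhere is one possible choice; many libraries apply smoothing instead.

```python
import math


def inverse_document_frequency(term, corpus):
    """Return log(N / number of documents that contain `term`)."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # Assumption: the ratio is undefined for unseen terms, so fail loudly.
        raise ValueError(f"term {term!r} does not appear in the corpus")
    return math.log(n_docs / n_containing)


# Toy corpus: each document is a set of its word tokens.
corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
]

print(inverse_document_frequency("the", corpus))  # in every document: log(3/3)
print(inverse_document_frequency("dog", corpus))  # in one document: log(3/1)
```

A term that appears in every document scores 0, which matches the idea that a very common word carries little information.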
Cosine Similarity
- $\cos\theta = \dfrac{A \cdot B}{\|A\| \, \|B\|}$
- $\theta$: angle between the two vectors, $A$ & $B$
- $A \cdot B$: dot product of the two vectors
- $\|A\| \, \|B\|$: product of the magnitudes of the two vectors
- Cosine similarity measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them.
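A small sketch of the computation described above: the dot product of the two vectors divided by the product of their magnitudes. Vectors are assumed to be plain Python sequences of numbers; the function name `cosine_similarity` and the sample vectors are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The measure is only defined for non-zero vectors.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, different length
print(cosine_similarity([1, 0], [-1, 0]))       # opposite directions
```

Because the result depends only on the angle, scaling a vector (as in the second call) does not change the similarity.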
Tokens
A token is the atomic unit that a language model trains on and makes predictions on. A token is typically one of the following:
- A word: Example: the phrase "dogs like cats" consists of 3 word tokens: "dogs", "like", "cats".
- A subword: A single word can be a single token or multiple tokens. A subword is a root word, a prefix, or a suffix. Example: "dogs" -> 2 tokens -> root "dog" & suffix "s", or "taller" -> 2 tokens -> root "tall" & suffix "er".
- A character: Example -> the phrase "bike fish" consists of 9 character tokens. (note: blank space counts as one of the tokens)
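The three granularities above can be illustrated with the examples from the list. This is a rough sketch: `split_suffix` is a hypothetical helper with a hand-picked suffix list, whereas real subword tokenizers learn their vocabulary from data.

```python
def split_suffix(word, suffixes=("er", "s")):
    """Toy subword split: peel off the first matching suffix from a fixed list."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]


# Word tokens: split on whitespace.
word_tokens = "dogs like cats".split()

# Subword tokens: root word plus suffix, where a known suffix is present.
subword_tokens = [split_suffix(w) for w in ["dogs", "taller", "like"]]

# Character tokens: every character counts, including the blank space.
char_tokens = list("bike fish")

print(word_tokens)
print(subword_tokens)
print(len(char_tokens), char_tokens)
```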
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. A widely used tool is the Porter stemmer (for example, NLTK's PorterStemmer) or its successor, Porter2 (the Snowball English stemmer).
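To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies a much larger set of ordered rules. The function name `naive_stem`, the suffix list, and the minimum stem length of 3 are all assumptions made for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "ly", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["meeting", "jumped", "cats", "quickly", "running", "was"]:
    print(w, "->", naive_stem(w))
```

The output for "running" is "runn", which shows why a stem is not always a dictionary word: stemming only trims characters and does not consult a vocabulary.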
Lemmatization
Lemmatization goes beyond simple word reduction: it considers a language's full vocabulary and applies morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
- meeting => meet (core-word extraction)
- was => be (conversion of an inflected verb to its base form)
- mice => mouse (plural to singular)
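The three examples above can be reproduced with a dictionary lookup, sketched below. The table `LEMMA_TABLE` and the function `lemmatize` are hypothetical and cover only these words; a real lemmatizer relies on a full vocabulary and morphological analysis rather than a hand-written table.

```python
# Hypothetical lookup table containing only the examples from the list above.
LEMMA_TABLE = {
    "meeting": "meet",
    "was": "be",
    "mice": "mouse",
}


def lemmatize(word, table=LEMMA_TABLE):
    """Return the dictionary form of `word`, or the lowercased word if unknown."""
    word = word.lower()
    return table.get(word, word)


for w in ["meeting", "was", "mice", "cat"]:
    print(w, "->", lemmatize(w))
```

Unlike suffix stripping, a vocabulary lookup can handle irregular forms such as "mice" and "was", where no suffix rule leads to the lemma.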
Word Embedding
A word embedding is a vector representation of a word that is capable of capturing the context of the word in a document, semantic & syntactic similarity, relations with other words, etc.