#cs451

Probabilistic Model

By the chain rule:

P(“I saw a van”) = P(“I”) × P(“saw” | “I”) × P(“a” | “I saw”) × P(“van” | “I saw a”)

Estimating these probabilities directly is not possible because sentence length is unbounded: most long word histories never appear in any training corpus.

Smaller Limit: N-Gram

The probability of the next word depends only on the previous N-1 words (the Markov assumption).

Unigram: P(w_i) = count(w_i) / total number of tokens
Bigram: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})

If a single n-gram in the sentence has never been seen, its estimated probability is 0, so the probability of the entire sentence becomes 0. One unusual word can turn a sentence from “likely” into “impossible”.

Solution: we can remove the 0s - Laplace Smoothing (add-one smoothing)

  • start each count at 1

P(w_i | w_{i-1}) = (count(w_{i-1} w_i) + 1) / (count(w_{i-1}) + V), where V is the vocabulary size. The denominator gains +V because every possible next word received a +1.
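A minimal sketch of bigram counting with add-one smoothing, matching the formula above (class and method names are my own, not from the course):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LaplaceBigram {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    public void observe(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            vocab.add(tokens[i]);
            unigramCounts.merge(tokens[i], 1, Integer::sum);
            if (i > 0) {
                bigramCounts.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
            }
        }
    }

    // P(w | prev) = (count(prev w) + 1) / (count(prev) + V)
    public double prob(String prev, String w) {
        int bigram = bigramCounts.getOrDefault(prev + " " + w, 0);
        int unigram = unigramCounts.getOrDefault(prev, 0);
        return (bigram + 1.0) / (unigram + vocab.size());
    }
}

Because of the +1, prob() never returns 0, even for unseen pairs.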

Other Smoothing Techniques

  • Good-Turing
  • Katz backoff
  • Jelinek-Mercer
  • Dirichlet Smoothing
  • Kneser-Ney

Searching - Information Retrieval

Inverted Index

Maps content (terms) to the documents that contain them.

Ex:

blue -> 2
cat -> 3
fish -> 1 -> 2

The key is a term; the value is the list of IDs of the documents it appears in (its postings list).
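A minimal in-memory sketch of building and querying such an index (names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // index a document's tokens; call in increasing docId order so postings stay sorted
    public void add(int docId, String[] tokens) {
        for (String term : tokens) {
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);   // record each document at most once per term
            }
        }
    }

    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}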

Zipf’s Law

  • A few elements occur very frequently
  • Many elements occur very infrequently
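In its common mathematical form (notation mine): the frequency f of the term with frequency rank r falls off roughly as a power law,

f(r) \propto \frac{1}{r^s}, \quad s \approx 1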

Delta Compression

  • if a term is rare => it has few postings, so its list is short anyway
  • if a term is common => its postings are dense, so the average delta between consecutive document IDs is small

Hence we can use delta encoding (gap encoding)

Ex:

1, 6, 11, 15, 22 // sequence
1, 5, 5, 4, 7 // gaps

This saves space because small gaps fit in smaller encodings than fixed 4-byte ints.

VInt (variable-width integer type) uses 1-5 bytes to represent an int.

Also, use BytesWritable instead of ArrayWritable of VIntWritable, because serializing an array of Writable objects carries heavy per-element overhead.
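A minimal sketch of gap encoding followed by a variable-length byte encoding (this uses the 7-bits-per-byte varint format for illustration; Hadoop’s actual VInt wire format differs, but the space-saving idea is the same):

import java.io.ByteArrayOutputStream;

public class GapEncoder {
    // docIds must be strictly increasing
    public static byte[] encode(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : docIds) {
            int gap = id - prev;   // common terms => dense postings => small gaps
            prev = id;
            while ((gap & ~0x7F) != 0) {         // more than 7 bits remain:
                out.write((gap & 0x7F) | 0x80);  // emit 7 bits with continuation bit set
                gap >>>= 7;
            }
            out.write(gap);                      // final byte, continuation bit clear
        }
        return out.toByteArray();
    }
}

For the sequence above, encode(new int[] {1, 6, 11, 15, 22}) needs one byte per gap: 5 bytes instead of 20 as raw ints.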

WORD Level: Simple-9 - Simple-14 (different ways of storing a variable number of values in a single word)
BYTE Level: VInt - VLong
BIT Level: Elias codes

Elias Gamma Code

Assumptions:

  • Natural numbers with no upper bound
  • Small numbers are more common than large numbers
  • Values are distributed by a power law
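A minimal sketch of the Elias gamma code: a value n >= 1 is written as ⌊log₂ n⌋ zeros followed by the binary representation of n, so small (frequent) values get short codes. Bit-string output is used here for readability:

public class EliasGamma {
    // n must be >= 1
    public static String encode(int n) {
        String binary = Integer.toBinaryString(n);        // e.g. 9 -> "1001"
        return "0".repeat(binary.length() - 1) + binary;  // 9 -> "0001001"
    }
}

encode(1) is just "1"; encode(9) is "0001001", 7 bits instead of a 32-bit int.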

Golomb Code

Each term gets its own encoding parameter b, tuned to its gap distribution (a common choice is b ≈ 0.69 × the average gap), so each term effectively has a custom encoding scheme.

Useful when we only care about whether a document contains a term, not how many times the term appears in it.
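A minimal sketch of Golomb-encoding one gap with parameter b: the quotient (gap - 1) / b is written in unary, the remainder in truncated binary. Bit-string output again for readability:

public class Golomb {
    public static String encode(int gap, int b) {   // gap >= 1
        int q = (gap - 1) / b;
        int r = (gap - 1) % b;
        StringBuilder code = new StringBuilder("1".repeat(q)).append('0');  // unary quotient
        int c = 32 - Integer.numberOfLeadingZeros(b - 1);   // c = ceil(log2 b)
        int cutoff = (1 << c) - b;
        if (r < cutoff) {
            code.append(toBits(r, c - 1));          // truncated binary: short codeword
        } else {
            code.append(toBits(r + cutoff, c));     // long codeword
        }
        return code.toString();
    }

    private static String toBits(int v, int width) {
        StringBuilder s = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) s.append((v >> i) & 1);
        return s.toString();
    }
}

With b = 3, the gaps 1, 2, 3, 4 encode as 00, 010, 011, 100: small gaps get the shortest codes.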

Retrieval

Boolean Retrieval

Query is formed using boolean operators (AND, OR, NOT)

Term at a time

For each query term, fetch its postings as a set of documents, then combine that set with the running result => only two sets are held in memory at a time.

Document at a time

Check whether each document satisfies the query. Since postings lists are sorted by document ID, this becomes a modified merge across the lists.
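A minimal sketch of the merge for an AND of two terms over sorted postings lists (illustrative, not the course’s exact code):

import java.util.ArrayList;
import java.util.List;

public class BooleanAnd {
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }  // document contains both terms
            else if (da < db) i++;                       // advance the smaller pointer
            else j++;
        }
        return result;
    }
}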

Ranked Retrieval

Requires a relevance function => score each document and sort by relevance

One way to define relevance

  • terms that occur many times in one document should have high weights
  • terms that occur in many documents across the collection should have low weights

We need:

  • term frequency (in document)
  • document frequency (the number of documents containing the term)

TF-IDF (Term Frequency and Inverse Document Frequency)
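In its standard form (notation mine), the weight of term t in document d combines both signals:

w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is the count of t in d, df_t is the number of documents containing t, and N is the total number of documents. Rare terms get a large log factor; terms appearing in every document get weight 0.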

Document at a time

For each document:

  • score each query term, and add them all up
  • accumulate best k hits (min-heap)

Pros: time is O(n log k), space is O(k)
Cons: cannot terminate early
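A minimal sketch of the best-k accumulator using a min-heap (names are my own):

import java.util.PriorityQueue;

public class TopK {
    // min-heap on score: the root is always the weakest of the current best k
    private final PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
    private final int k;

    public TopK(int k) { this.k = k; }

    public void accumulate(double score, int docId) {
        if (heap.size() < k) {
            heap.offer(new double[] { score, docId });
        } else if (score > heap.peek()[0]) {
            heap.poll();                               // evict the weakest hit
            heap.offer(new double[] { score, docId });
        }
    }
}

Each insertion is O(log k) and the heap never holds more than k entries, which is where the O(n log k) time and O(k) space come from.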

Term at a time

  • Collect the hits and scores for the rarest term into an accumulator (sketch below)
  • for each other term in the query:
    • if a document in the accumulator does not contain that term - remove it from the accumulator
    • otherwise, add that term’s score to the document’s running score

Pros: can terminate early
Cons: uses a lot of memory (the accumulator holds an entry per candidate document)
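A minimal sketch of that loop with conjunctive semantics (the postings and weights maps are my own stand-ins for index access):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermAtATime {
    public static Map<Integer, Double> rank(List<String> termsRarestFirst,
                                            Map<String, Set<Integer>> postings,
                                            Map<String, Map<Integer, Double>> weights) {
        Map<Integer, Double> acc = new HashMap<>();
        String rarest = termsRarestFirst.get(0);
        for (int docId : postings.get(rarest)) {
            acc.put(docId, weights.get(rarest).get(docId));      // seed from rarest term
        }
        for (String t : termsRarestFirst.subList(1, termsRarestFirst.size())) {
            acc.keySet().retainAll(postings.get(t));             // drop docs missing t
            acc.replaceAll((docId, s) -> s + weights.get(t).get(docId));
        }
        return acc;   // one entry per surviving document - this map is the memory cost
    }
}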