#cs451

Probabilistic Model

By the chain rule:

P(“I saw a van”) = P(“I”) × P(“saw” | “I”) × P(“a” | “I saw”) × P(“van” | “I saw a”)

Estimating these probabilities directly is not possible because sentence length is unbounded: most long word histories never appear in any training corpus.

Smaller Limit: N-Gram

The probability of the next word depends only on the previous N-1 words (the Markov assumption).

Unigram: P(w_i) = count(w_i) / total number of tokens
Bigram: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})

If a single n-gram in the sentence has never been seen, its estimated probability is 0, so the probability of the entire sentence becomes 0. One unusual word can turn a sentence from “likely” into “impossible”.

Solution: we can remove the 0s - Laplace Smoothing (add-one smoothing)

  • start each count at 1

P(w_i | w_{i-1}) = (count(w_{i-1} w_i) + 1) / (count(w_{i-1}) + V), where V is the vocabulary size. The denominator gains +V because every possible next word received a +1.
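A minimal sketch of bigram counting with add-one smoothing, matching the formula above (class and method names are my own, not from the course):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LaplaceBigram {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    public void observe(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            vocab.add(tokens[i]);
            unigramCounts.merge(tokens[i], 1, Integer::sum);
            if (i > 0) {
                bigramCounts.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
            }
        }
    }

    // P(w | prev) = (count(prev w) + 1) / (count(prev) + V)
    public double prob(String prev, String w) {
        int bigram = bigramCounts.getOrDefault(prev + " " + w, 0);
        int unigram = unigramCounts.getOrDefault(prev, 0);
        return (bigram + 1.0) / (unigram + vocab.size());
    }
}

Because of the +1, prob() never returns 0, even for unseen pairs.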

Other Smoothing Techniques

  • Good-Turing
  • Katz backoff
  • Jelinek-Mercer
  • Dirichlet Smoothing
  • Kneser-Ney

Searching - Information Retrieval

Inverted Index

Maps content (terms) to the documents that contain them.

Ex:

blue -> 2
cat -> 3
fish -> 1 -> 2

The key is a term; the value is the list of IDs of the documents it appears in (its postings list).
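A minimal in-memory sketch of building and querying such an index (names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // index a document's tokens; call in increasing docId order so postings stay sorted
    public void add(int docId, String[] tokens) {
        for (String term : tokens) {
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);   // record each document at most once per term
            }
        }
    }

    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}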

Zipf’s Law

  • A few elements occur very frequently
  • Many elements occur very infrequently
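In its common mathematical form (notation mine): the frequency f of the term with frequency rank r falls off roughly as a power law,

f(r) \propto \frac{1}{r^s}, \quad s \approx 1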

Delta Compression

  • if a term is rare => it has few postings, so its list is short anyway
  • if a term is common => its postings are dense, so the average delta between consecutive document IDs is small

Hence we can use delta encoding (gap encoding)

Ex:

1, 6, 11, 15, 22 // sequence
1, 5, 5, 4, 7 // gaps

This saves space because small gaps fit in smaller encodings than fixed 4-byte ints.

VInt (variable-width integer type) uses 1-5 bytes to represent an int.

Also, use BytesWritable instead of ArrayWritable of VIntWritable, because serializing an array of Writable objects carries heavy per-element overhead.
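A minimal sketch of gap encoding followed by a variable-length byte encoding (this uses the 7-bits-per-byte varint format for illustration; Hadoop’s actual VInt wire format differs, but the space-saving idea is the same):

import java.io.ByteArrayOutputStream;

public class GapEncoder {
    // docIds must be strictly increasing
    public static byte[] encode(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : docIds) {
            int gap = id - prev;   // common terms => dense postings => small gaps
            prev = id;
            while ((gap & ~0x7F) != 0) {         // more than 7 bits remain:
                out.write((gap & 0x7F) | 0x80);  // emit 7 bits with continuation bit set
                gap >>>= 7;
            }
            out.write(gap);                      // final byte, continuation bit clear
        }
        return out.toByteArray();
    }
}

For the sequence above, encode(new int[] {1, 6, 11, 15, 22}) needs one byte per gap: 5 bytes instead of 20 as raw ints.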

WORD Level: Simple-9 - Simple-14 (different ways of storing a variable number of values in a single word)
BYTE Level: VInt - VLong
BIT Level: Elias codes

Elias Gamma Code

Assumptions:

  • Natural numbers with no upper bound
  • Small numbers are more common than large numbers
  • Values are distributed by a power law
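A minimal sketch of the Elias gamma code: a value n >= 1 is written as ⌊log₂ n⌋ zeros followed by the binary representation of n, so small (frequent) values get short codes. Bit-string output is used here for readability:

public class EliasGamma {
    // n must be >= 1
    public static String encode(int n) {
        String binary = Integer.toBinaryString(n);        // e.g. 9 -> "1001"
        return "0".repeat(binary.length() - 1) + binary;  // 9 -> "0001001"
    }
}

encode(1) is just "1"; encode(9) is "0001001", 7 bits instead of a 32-bit int.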

Golomb Code

Each term gets its own encoding parameter b, tuned to its gap distribution (a common choice is b ≈ 0.69 × the average gap), so each term effectively has a custom encoding scheme.

Useful when we only care about whether a document contains a term, not how many times the term appears in it.
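A minimal sketch of Golomb-encoding one gap with parameter b: the quotient (gap - 1) / b is written in unary, the remainder in truncated binary. Bit-string output again for readability:

public class Golomb {
    public static String encode(int gap, int b) {   // gap >= 1
        int q = (gap - 1) / b;
        int r = (gap - 1) % b;
        StringBuilder code = new StringBuilder("1".repeat(q)).append('0');  // unary quotient
        int c = 32 - Integer.numberOfLeadingZeros(b - 1);   // c = ceil(log2 b)
        int cutoff = (1 << c) - b;
        if (r < cutoff) {
            code.append(toBits(r, c - 1));          // truncated binary: short codeword
        } else {
            code.append(toBits(r + cutoff, c));     // long codeword
        }
        return code.toString();
    }

    private static String toBits(int v, int width) {
        StringBuilder s = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) s.append((v >> i) & 1);
        return s.toString();
    }
}

With b = 3, the gaps 1, 2, 3, 4 encode as 00, 010, 011, 100: small gaps get the shortest codes.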

Retrieval

Boolean Retrieval

Query is formed using boolean operators (AND, OR, NOT)

Term at a time

For each query term, fetch its postings as a set of documents, then combine that set with the running result => only two sets are held in memory at a time.

Document at a time

Check whether each document satisfies the query. Since postings lists are sorted by document ID, this becomes a modified merge across the lists.
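A minimal sketch of the merge for an AND of two terms over sorted postings lists (illustrative, not the course’s exact code):

import java.util.ArrayList;
import java.util.List;

public class BooleanAnd {
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }  // document contains both terms
            else if (da < db) i++;                       // advance the smaller pointer
            else j++;
        }
        return result;
    }
}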

Ranked Retrieval

Requires a relevance function => score each document and sort by relevance

One way to define relevance

  • terms that occur many times in one document should have high weights
  • terms that occur in many documents across the collection should have low weights

We need:

  • term frequency (in document)
  • document frequency (the number of documents containing the term)

TF-IDF (Term Frequency and Inverse Document Frequency)
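In its standard form (notation mine), the weight of term t in document d combines both signals:

w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is the count of t in d, df_t is the number of documents containing t, and N is the total number of documents. Rare terms get a large log factor; terms appearing in every document get weight 0.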

Document at a time

For each document:

  • score each query term, and add them all up
  • accumulate best k hits (min-heap)

Pros: time is O(n log k), space is O(k)
Cons: cannot terminate early
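A minimal sketch of the best-k accumulator using a min-heap (names are my own):

import java.util.PriorityQueue;

public class TopK {
    // min-heap on score: the root is always the weakest of the current best k
    private final PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
    private final int k;

    public TopK(int k) { this.k = k; }

    public void accumulate(double score, int docId) {
        if (heap.size() < k) {
            heap.offer(new double[] { score, docId });
        } else if (score > heap.peek()[0]) {
            heap.poll();                               // evict the weakest hit
            heap.offer(new double[] { score, docId });
        }
    }
}

Each insertion is O(log k) and the heap never holds more than k entries, which is where the O(n log k) time and O(k) space come from.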

Term at a time

  • Collect the hits and scores for the rarest term into an accumulator (sketch below)
  • for each other term in the query:
    • if a document in the accumulator does not contain that term - remove it from the accumulator
    • otherwise, add that term’s score to the document’s running score

Pros: can terminate early
Cons: uses a lot of memory (the accumulator holds an entry per candidate document)
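A minimal sketch of that loop with conjunctive semantics (the postings and weights maps are my own stand-ins for index access):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermAtATime {
    public static Map<Integer, Double> rank(List<String> termsRarestFirst,
                                            Map<String, Set<Integer>> postings,
                                            Map<String, Map<Integer, Double>> weights) {
        Map<Integer, Double> acc = new HashMap<>();
        String rarest = termsRarestFirst.get(0);
        for (int docId : postings.get(rarest)) {
            acc.put(docId, weights.get(rarest).get(docId));      // seed from rarest term
        }
        for (String t : termsRarestFirst.subList(1, termsRarestFirst.size())) {
            acc.keySet().retainAll(postings.get(t));             // drop docs missing t
            acc.replaceAll((docId, s) -> s + weights.get(t).get(docId));
        }
        return acc;   // one entry per surviving document - this map is the memory cost
    }
}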