Friday, November 7, 2008

web search engines, parts 1 and 2

Part 1:
--went from the belief that webpages couldn't be indexed (1995) to very reliable search engines such as Google, Yahoo, etc.
--generic search engine infrastructure: multiple, geographically distributed data centers
--crawling algorithms work through a queue of URLs: request the page at the front, add any newly discovered links to the back, and continue until the queue is empty
--real crawlers must address: speed, politeness, excluded content, continuous crawling, spam rejection, and duplicate content
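
The queue-based crawl from Part 1 can be sketched roughly as below. This is only a toy version of the basic algorithm: `TOY_WEB` and `toy_fetch_links` are made-up stand-ins for real HTTP requests, and none of the real-crawler concerns listed above (politeness, spam rejection, etc.) are handled.

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    """Visit every URL reachable from the seeds, each exactly once."""
    queue = deque()
    seen = set()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue:                       # continue until the queue is empty
        url = queue.popleft()          # take the URL at the front
        visited.append(url)
        for link in fetch_links(url):  # "request" the page, read its links
            if link not in seen:       # skip URLs we have already queued
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical four-page web used in place of real network requests.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/about", "c.example/"],
    "c.example/": [],
}

def toy_fetch_links(url):
    """Return the outgoing links of a page in the toy web."""
    return TOY_WEB.get(url, [])

print(crawl(["a.example/"], toy_fetch_links))
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other.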

Part 2:
Indexing Algorithms:
--uses an inverted file, built in a two-step process: 1) scanning and 2) inversion
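
A minimal sketch of the two steps, assuming whitespace tokenization and three made-up documents (`DOCS`). The function names `scan` and `invert` are my own labels for the steps, not anything from the lecture.

```python
from collections import defaultdict

def scan(docs):
    """Step 1 (scanning): emit one (term, doc_id, position) entry per word."""
    entries = []
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            entries.append((term, doc_id, position))
    return entries

def invert(entries):
    """Step 2 (inversion): sort entries by term and group them into postings lists."""
    index = defaultdict(list)
    for term, doc_id, position in sorted(entries):
        index[term].append((doc_id, position))
    return dict(index)

# Hypothetical documents, keyed by document number.
DOCS = {
    1: "web search engines",
    2: "search the web",
    3: "engines index the web",
}

index = invert(scan(DOCS))
print(index["web"])     # [(1, 0), (2, 2), (3, 3)]
print(index["search"])  # [(1, 1), (2, 0)]
```

After inversion, looking up a term gives every document (and word position) where it occurs, which is what a query needs.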
Issues with real indexers:
--scaling up: simply too many entries
--term lookup: search terms extend beyond the basic English dictionary to include numbers, characters, email addresses, etc.
--compression
--phrases
--anchor text
--link popularity score
--query-independent score
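
To illustrate the compression item: my notes don't record which scheme was covered, so this is just one common approach, assumed for illustration. Store the gaps between sorted document numbers instead of the numbers themselves, then pack each gap into as few bytes as possible (variable-byte coding). All function names here are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace each sorted doc ID with its distance from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Undo to_gaps by keeping a running total."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Pack each number into 7-bit chunks; the high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # final byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Rebuild the numbers from the byte stream produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:           # last byte: finish the number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

postings = [1000, 1003, 1010, 5000]
encoded = vbyte_encode(to_gaps(postings))
print(len(encoded))                                    # 6
print(from_gaps(vbyte_decode(encoded)) == postings)    # True
```

The four document numbers fit in 6 bytes because small gaps (3 and 7) take a single byte each.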

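Query processing algorithms:
--most common type: queries that don't include operator words
Speeding up queries:
--skipping: jump over items in a postings list that can't match
--early termination: keep postings ordered by decreasing score so processing can stop once enough good results are found
--caching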
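
A rough sketch of the skipping idea for a query with no operator words, assuming the engine returns only documents that contain every query word. Real engines store skip information inside the postings lists; here a binary search (`bisect_left`) stands in for that, and the tiny `INDEX` is made up.

```python
from bisect import bisect_left

def intersect_with_skips(short, long):
    """Return doc IDs found in both sorted lists, skipping ahead in the longer one."""
    result, start = [], 0
    for doc_id in short:
        # Jump straight to the first entry >= doc_id instead of stepping one by one.
        start = bisect_left(long, doc_id, start)
        if start == len(long):
            break                      # ran off the end: no more matches possible
        if long[start] == doc_id:
            result.append(doc_id)
    return result

def and_query(index, terms):
    """Documents containing every term; shortest postings lists are merged first."""
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect_with_skips(result, other)
    return result

# Hypothetical index: term -> sorted list of document numbers.
INDEX = {
    "web": [1, 2, 3, 5, 8, 13],
    "search": [2, 3, 8, 21],
    "engines": [3, 8, 9],
}

print(and_query(INDEX, ["web", "search", "engines"]))  # [3, 8]
```

Starting with the shortest list keeps the intermediate result small, so the longer lists are mostly skipped over rather than read in full.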
