Web Information Retrieval

Information Retrieval (IR) in the Web requires us to consider both content and linking. A well-known application of Web IR is a web search engine. The heart of a web search engine is the web crawler - a software agent that traverses the hypertext structure of the Web automatically, starting from an initial hyper-document or a set of starting points (seeds) and recursively retrieving all documents referenced by that document. The visiting strategy of the crawler characterises the purpose of the search engine. Contrary to large-scale, generalized engines (Google, Teoma, etc.) which try to index all the Web, vertical search engines only cater for topical resource discovery and they offer better precision.

Search Engines
Generalized Engines Specialized Engines
Examples Google, Teoma. Krugle, Simply Hired.
Purpose General purpose. Try to cover all the Web. Targeted, focused, topical. Index a small portion of the Web.
Requirements Large scale. Vast requirements in storage, resources. Small scale. Need "smart" retrieval and management.
Maintain Full update is difficult. Freshness? Easy to update. Small index.
Quality Good recall. Bad precision. Noise. Good precision. Bad recall.
Visiting Strategy BReadth First Search (BRFS). BeSt First Search (BSFS).


Vertical search engines typically employ a focused or topic-driven crawler, that analyses its crawl boundary in order to find the links that are likely to be most relevant for the crawl while avoiding irrelevant regions of the Web. A focused crawler maintains two queues of URLs:

  • One containing the already visited links (AF)
  • and another having the references of the first queue which is called Crawl Frontier (CF)

The challenging task is ordering the links in the Crawl Frontier efficiently. The importance metrics for the crawling can be either interest driven where the classifier for document similarity checks the text content and popularity/location driven where the importance of a page depends on the hyperlink structure of the crawled document.

Topical web IR has various issues:

  • How is it possible to select the most promising canditate links from the CF when these have not been parsed/distilled yet? We do not have information from their text content and we only have partial hyperlink information. We know some of their ancestors (from AF) but none of their children (outlinks).
  • How do we consider a document in CF as "good"? Do relevant documents point to relevant links?
  • Is it possible to use previous (historic) knowledge (e.g. either text or link information from a previous crawl) to assist the crawling procedure?
  • etc

Our Method (Hyper Content Latent Analysis HCLA)

We have developed a novel Latent Semantic Indexing (LSI)- based classifier that combines link- and text- analysis in order to retrieve and index domain specific web documents. We assert that it is easier to build an offline text document dataset than an offline web graph since this way we avoid data freshness issues. Furhermore, the vetical search engine's topic can be easily expressed using a text query (a few keywords).

In our method, HCLA, we extend the usual "bag-of-words" vector space model representation by considering both word terms and links for document relevance. In the new space (expanded term-document matrix C), each web document is represented by both the terms it contains and its hyperlinks - it is the concatenation of its terms and its outlinks. Text documents (and queries) are similariy expanded. This permits us to rank candidates in CF using full information from AF and text information from an offline text corpus against the user's text query input using LSI. SVD-updating is used as an inexpensive method of handling newly inserted documents. The CF reordering procedure occurs every N web documents fetched (BSFSN strategy).

 

Small-scale experiments attest that HCLA allows to overcome the limitations of the initial training data while maintaining a high recall/precision ratio. Evaluation has been done on WebKB and Cora datasets comparing HCLA against well-known methods such as BRFS, BackLink count (BL), Shark Search (SS1 & SS2) and PageRank (PR).

Related research in the lab also includes methods such as Probabilistic Latent Semantic Indexing (PLSI) for both text and web retrieval (PHITS). A novel incremental updating scheme for PLSI (Recursive PLSI, RPLSI) has been developed. Its main benefits are:

  • It does not require to train the model from the beginning
  • It updates not only the probabilities of the new-document as folding-in does, but also all the probabilities of terms and documents

First experimental results are promising - RPLSI is attested to have lower mean square error than PLSI folding-in.

 

Relevant Publications

G. Almpanidis and C. Kotropoulos, "Combined text and link analysis for focused crawling", in Proc. Int. Conf. Advances in Pattern Recognition (ICAPR 2005), vol. LNCS 3686, part I, pp. 278-287, Bath, U.K., August, 2005.

G. Almpanidis, C. Kotropoulos and I. Pitas, "Focused crawling using latent semantic indexing- An application for vertical search engines", in Proc. 2005 European Conf. Digital Libraries (ECDL 2005), vol. LNCS 3652, pp. 402-413, Vienna, September, 2005.

G. Almpanidis, C. Kotropoulos, and I. Pitas, "Combining Text and Link Analysis for Focused Crawling - An Application for Vertical Search Engines", Information Systems, accepted for publication.

C. Kotropoulos and A. Papaioannou, "A novel-updating scheme for probabilistic latent semantic indexing", in Proc. 4th Panhellenic Artificial Intelligence Conf. (SETN-06), vol. LNAI 3966, pp. 137-147, Heraklion, Greece, May 19-20, 2006.

Research Projects

Herakleitos - (Operational programme for Education and Initial Vocational Training - 3rd Community Support Framework): Processing of Multimedia Signals.

 

© 2006