Information Retrieval (IR) on the Web requires us to consider both content and link structure. A well-known application of Web IR is the web search engine. At the heart of a web search engine is the web crawler, a software agent that automatically traverses the hypertext structure of the Web, starting from an initial hyper-document or a set of starting points (seeds) and recursively retrieving all documents referenced by the documents it has already fetched. The visiting strategy of the crawler characterises the purpose of the search engine. In contrast to large-scale, general-purpose engines (Google, Teoma, etc.), which try to index the whole Web, vertical search engines cater only for topical resource discovery and offer better precision.
The challenging task is ordering the links in the Crawl Frontier (CF) efficiently. The importance metric used for crawling can be either interest driven, where a classifier for document similarity checks the text content, or popularity/location driven, where the importance of a page depends on the hyperlink structure of the crawled document.
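As an illustration of how an importance metric orders the Crawl Frontier, the following minimal sketch implements a generic best-first crawl over a toy in-memory link graph. The names `best_first_crawl`, `TOY_WEB` and `TOY_SCORES` are hypothetical, and the fixed scores merely stand in for whatever classifier or link-based metric a real crawler would use.

```python
import heapq
from typing import Callable, Iterable, List


def best_first_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit pages in descending score order, starting from the seeds."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy link graph and made-up importance scores (assumptions for the example).
TOY_WEB = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

order = best_first_crawl(
    ["seed"],
    lambda url: TOY_WEB.get(url, []),
    lambda url: TOY_SCORES.get(url, 0.0),
    max_pages=5,
)
print(order)
```

In a focused crawler the `score` callable is where an interest-driven or popularity-driven metric would be plugged in; here it is a plain dictionary lookup.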
Topical web IR has various issues:
Our Method: Hyper Content Latent Analysis (HCLA)
We have developed a novel Latent Semantic Indexing (LSI)-based classifier that combines link and text analysis in order to retrieve and index domain-specific web documents. We assert that it is easier to build an offline text document dataset than an offline web graph, since this way we avoid data freshness issues. Furthermore, the vertical search engine's topic can be easily expressed using a text query (a few keywords).
In our method, HCLA, we extend the usual "bag-of-words" vector space model representation by considering both word terms and links for document relevance. In the new space (expanded term-document matrix C), each web document is represented by both the terms it contains and its hyperlinks: it is the concatenation of its terms and its outlinks. Text documents (and queries) are similarly expanded. This permits us to use LSI to rank candidates in CF against the user's text query input, using full information from AF and text information from an offline text corpus. SVD-updating is used as an inexpensive method of handling newly inserted documents. The CF is reordered after every N fetched web documents (BSFSN strategy).
Small-scale experiments indicate that HCLA makes it possible to overcome the limitations of the initial training data while maintaining a high recall/precision ratio. HCLA has been evaluated on the WebKB and Cora datasets against well-known methods such as BRFS, BackLink count (BL), Shark Search (SS1 & SS2) and PageRank (PR).
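The expanded representation can be illustrated with a minimal sketch in which each document becomes a single sparse vector holding both its terms and its outlinks. This is not the HCLA implementation: it uses raw counts, omits the LSI/SVD projection entirely, and the `t:`/`l:` key prefixes, the function names and the example URLs are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Iterable


def expanded_vector(terms: Iterable[str], outlinks: Iterable[str]) -> Counter:
    """Concatenate term counts and outlink counts into one sparse vector."""
    vec = Counter("t:" + term.lower() for term in terms)
    vec.update("l:" + url for url in outlinks)
    return vec


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# A text-only query and three toy pages (all data is made up).
query = expanded_vector(["latent", "semantic"], [])
page_a = expanded_vector(["latent", "semantic", "indexing"], ["http://example.org/lsi"])
page_b = expanded_vector(["football", "scores"], ["http://example.org/sport"])
page_c = expanded_vector(["svd"], ["http://example.org/lsi"])

print(cosine(query, page_a))   # shares two terms with the query
print(cosine(query, page_b))   # shares nothing with the query
print(cosine(page_a, page_c))  # no common terms, but a common outlink
```

The last comparison shows what the link half of the vector contributes: two pages with no terms in common still obtain a non-zero similarity because they point to the same URL.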
Related research in the lab also includes methods such as Probabilistic Latent Semantic Indexing (PLSI) for both text and web retrieval (PHITS). A novel incremental updating scheme for PLSI (Recursive PLSI, RPLSI) has been developed. Its main benefits are:
First experimental results are promising: RPLSI achieves a lower mean square error than PLSI folding-in.
G. Almpanidis and C. Kotropoulos, "Combined text and link analysis for focused crawling", in Proc. Int. Conf. Advances in Pattern Recognition (ICAPR 2005), vol. LNCS 3686, part I, pp. 278-287, Bath, U.K., August, 2005.
Herakleitos - (Operational programme for Education and Initial Vocational Training - 3rd Community Support Framework): Processing of Multimedia Signals.
Cluster analysis (or clustering), which is widely used in many computer science areas, such as data mining, information retrieval, machine learning and image segmentation, studies algorithms and methods for grouping or classifying "similar" objects into clusters. The notion of "similarity" varies depending on the purpose of the study, the domain-specific assumptions and the prior knowledge of the problem. The objects (individuals, cases, or subjects) which comprise the raw data in cluster analysis are described either by a set of measurements or by relationships between them.
The unsupervised clustering methods are classified into:
Word Clustering aims to produce groups of words with similar characteristics, such as meaning, syntactical and grammatical information.
Several hierarchical clustering algorithms have been implemented and applied in the field of word clustering:
Our novelty lies in the use of long-distance bigram models in word clustering by minimizing the loss of mutual information, together with an improvement of the clustering algorithm that uses robust estimates of the mean and covariance matrix.
Experiments have been conducted on a subset of the Reuters corpus collection.
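To make the two ingredients concrete, the sketch below counts distance-d bigrams (word pairs separated by d positions) and computes the average mutual information of the resulting pair distribution. It is a generic illustration, not the formulation used in the cited papers; the function names and the toy token sequence are assumptions.

```python
import math
from collections import Counter
from typing import List


def distanced_bigrams(tokens: List[str], distance: int) -> Counter:
    """Count pairs (w_i, w_{i+distance}); distance=1 gives ordinary bigrams."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return Counter(zip(tokens, tokens[distance:]))


def average_mutual_information(pair_counts: Counter) -> float:
    """Sum over pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))), in bits."""
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0
    left: Counter = Counter()   # marginal counts of the first word
    right: Counter = Counter()  # marginal counts of the second word
    for (x, y), count in pair_counts.items():
        left[x] += count
        right[y] += count
    ami = 0.0
    for (x, y), count in pair_counts.items():
        p_xy = count / total
        ami += p_xy * math.log2(p_xy / ((left[x] / total) * (right[y] / total)))
    return ami


tokens = "a b a b a b".split()
pairs_d2 = distanced_bigrams(tokens, 2)
print(pairs_d2)
print(average_mutual_information(pairs_d2))
```

A clustering procedure driven by mutual information would merge the pair of word classes whose merger reduces this quantity the least; that merging loop is left out here.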
Cluster validity assessment has been performed by using:
while the performance of the language models employed has been tested by means of perplexity.
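For reference, perplexity is the inverse geometric mean of the probabilities a language model assigns to the tokens of a test sequence. The following minimal sketch assumes the per-token probabilities have already been obtained from some model; the `perplexity` function name and the example values are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Return 2 ** (-(1/N) * sum(log2 p_i)) for per-token probabilities p_i."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / len(token_probs))


# A model that assigns every token probability 1/4 is as uncertain as a
# uniform choice among four words.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform_four)
```

Lower values mean the model finds the test text less surprising, which is why perplexity is commonly used to compare language models on the same data.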
N. Bassiou and C. Kotropoulos, "Interpolated Distanced Bigram Language Models for Robust Word Clustering", in Proc. of IEEE Int. Workshop on Nonlinear Signal and Image Processing (NSIP 2005), Sapporo, Japan, 18-20 May, 2005.
N. Bassiou, C. Kotropoulos and I. Pitas, "Hierarchical word clustering for relevance judgements in information retrieval", in Proc. of Workshop on Pattern Recognition in Information Systems (PRIS'01), pp. 139-148, Setubal, Portugal, 6-7 July, 2001.
MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)