Computing Reviews
Unsupervised domain ranking in large-scale web crawls
Cui Y., Sparkman C., Lee H., Loguinov D. ACM Transactions on the Web 12(4): 1-29, 2018. Type: Article
Date Reviewed: Apr 15 2019

The already enormous amount of web content continues to increase. This poses a problem for web crawlers, the tools used by search engines to find content worth indexing. Web crawlers have to find and rank web content using limited resources, that is, central processing unit (CPU) cycles, bandwidth, and so on. Moreover, some of this content is spam, which consumes resources but adds no value to search results. Thus, there is a constant need to improve the algorithms used for selecting and ranking web content, so that the available resources are allocated to valuable content while preserving accuracy and keeping overhead low enough for real-time execution.

This paper presents a technique for crawling the web with reasonable resource use and real-time spam filtering, and explains the methodology used to build it. Its distinguishing characteristic, compared with previous algorithms, is that it ranks domains rather than individual web pages, which is the classical approach. Each domain, rather than each page, is allocated a budget, which is used for ranking. One of its objectives is to prevent domains that produce large amounts of spam from reaching good ranking positions.
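To make the domain-level idea concrete, here is a minimal sketch (my illustration, not the paper's implementation) of collapsing a page-level link list into a domain-level graph, the granularity at which budgets and ranks would then be assigned; the URLs and function names are invented for the example, and a production crawler would use pay-level domains rather than raw hostnames:

```python
from urllib.parse import urlparse
from collections import defaultdict

def domain_of(url):
    # Illustrative: take the raw hostname; real crawlers normalize
    # to the pay-level domain (e.g., "example.co.uk").
    return urlparse(url).netloc

def domain_graph(page_links):
    # Collapse page-level edges [(src_url, dst_url), ...] into a
    # domain-level adjacency map, dropping intra-domain links.
    g = defaultdict(set)
    for src, dst in page_links:
        s, d = domain_of(src), domain_of(dst)
        if s != d:
            g[s].add(d)
    return g

links = [
    ("http://a.com/p1", "http://b.com/x"),
    ("http://a.com/p2", "http://b.com/y"),   # duplicate at domain level
    ("http://b.com/x", "http://c.com/"),
    ("http://a.com/p1", "http://a.com/p3"),  # intra-domain, ignored
]
g = domain_graph(links)

# In-degree at domain granularity (one vote per linking domain).
indeg = defaultdict(int)
for s in g:
    for d in g[s]:
        indeg[d] += 1
print(dict(indeg))  # {'b.com': 1, 'c.com': 1}
```

Note how the two page-level links from a.com to b.com collapse into a single domain-level edge, which is what makes per-domain budgets resistant to a single site mass-producing pages.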

The proposed methodology is designed around a comparison of four ranking algorithms; the new algorithm is based on the conclusions drawn from this comparison. The authors present the elements of the methodology, including the algorithms, the datasets, the comparison parameters, and the steps followed.

The four algorithms are PageRank, OPIC, in-degree (IN) (a previous proposal from these authors), and level 2 supporters (SUPP). Two large academic datasets were used to test the algorithms: IRLbot (6.3B pages) and ClueWeb09 (1B pages). Overhead and accuracy were the parameters used for the comparison, together with the percentage of spam that was able to evade the algorithms' filters and reach good ranking positions.
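The difference between IN and SUPP can be sketched in a few lines. Assuming a simple reading of "level 2 supporters" as the number of distinct domains within in-link distance at most two (the paper's exact definition may differ), and with invented domain names for the example:

```python
def level2_supporters(in_links, node):
    # in_links maps each node to the list of nodes linking TO it.
    # Count distinct supporters within in-link distance <= 2,
    # deduplicating with sets (the cost SUPP pays for accuracy).
    level1 = set(in_links.get(node, ())) - {node}
    level2 = set()
    for u in level1:
        level2.update(in_links.get(u, ()))
    return len((level1 | level2) - {node})

in_links = {
    "t.com": ["a.com", "b.com"],          # direct (level-1) supporters
    "a.com": ["c.com"],
    "b.com": ["c.com", "d.com"],          # c.com supports via two paths
}

print(len(in_links["t.com"]))             # IN score: 2
print(level2_supporters(in_links, "t.com"))  # SUPP-style score: 4
```

The set unions are what deduplicate c.com, which reaches t.com through both a.com and b.com; in-degree alone would miss everything beyond the first level.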

The initial phase of the comparison used Google Toolbar Ranks (GTRs) and public spam lists to assess the quality of web content. This work was done manually, that is, the authors verified the quality of each domain so that they could compare the rankings produced by the algorithms against a reliable list. The results of this first comparison point to SUPP, a breadth-first search (BFS)-based algorithm, as the winner. The second phase, which includes the authors' new proposal, compared the algorithms in terms of their ability to place the most valuable domains in top positions (which is not the same as avoiding spam).

The newly proposed method, called top supporters estimation (TSE), is an evolution of SUPP, the winner of the first comparison. With TSE, the authors overcome SUPP's limitations: the need to be “incorporated into a high performance web crawler,” “the enormous amount of CPU processing” required to eliminate duplicates in BFS, and “a huge number of random [memory] accesses.”
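The review does not reproduce TSE's internals, but the general idea of estimating a distinct-supporter count without exact BFS deduplication can be illustrated with a standard K-minimum-values sketch; this is a generic cardinality-estimation technique offered purely as an illustration of the trade-off, not the paper's algorithm:

```python
import hashlib

def kmv_estimate(items, k=64):
    # K-minimum-values distinct-count estimate: hash each item to [0, 1)
    # and keep only the k smallest hashes, so memory is O(k) instead of
    # O(distinct items). Generic sketch, NOT the paper's TSE algorithm.
    seen = set()
    for it in items:
        h = int.from_bytes(
            hashlib.blake2b(str(it).encode(), digest_size=8).digest(), "big")
        seen.add(h / 2**64)
        if len(seen) > k:
            seen.remove(max(seen))
    if len(seen) < k:
        return len(seen)          # fewer than k distinct: (near-)exact
    return int((k - 1) / max(seen))
```

For small supporter sets the estimate is exact; for large ones it trades a bounded relative error for constant memory and no per-node duplicate bookkeeping, which is the kind of saving the quoted SUPP limitations call for.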

This paper is aimed at the community working on web crawlers and search engines. Every decision is justified, even those that may seem evident at first glance. It is also useful for experts who want to evaluate whether this approach suits their needs. The methodology and the results obtained, essential for reproducing the experiments or designing an evolution of the proposed approach, are also valuable. Despite the paper's academic orientation, its results highlight issues that will guide the evolution of web crawlers, which should interest professionals and companies working in the search engine area.

Reviewer: Mercedes Martínez González | Review #: CR146532 (1907-0283)
World Wide Web (WWW) (H.3.4 ...)
Search Process (H.3.3 ...)
Web (I.7.1 ...)
