Computing Reviews
Phrase mining from massive text and its applications
Liu J., Shang J., Koutra D., Morgan & Claypool Publishers, San Rafael, CA, 2017. 90 pp. Type: Book (978-1-627058-98-8)
Date Reviewed: Jan 4 2018

This monograph addresses the extraction of what the authors call “quality phrases” from text. To get an idea of what is meant by “quality” for a phrase, consider the three-word phrase “relational database system.” A simple system that extracts short phrases from a text containing this phrase will also extract “relational database” and “database system.” Moreover, if one simply counts occurrences, these two-word phrases will be weighted at least as heavily as the three-word phrase that includes them both. Thus, it is desirable to find a principled way of weighting the longer phrase more heavily. The monograph is devoted to ways in which this can be done.
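
The counting problem described above can be seen in a few lines of Python. This is only an illustrative sketch: the function name `ngram_counts` and the two-sentence toy corpus are assumptions made for this example and are not taken from the monograph.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Toy corpus (hypothetical), already lowercased and tokenized on whitespace.
text = ("a relational database system stores tables . "
        "the relational database system answers queries .")
counts = ngram_counts(text.split())

for phrase in ("relational database system",
               "relational database",
               "database system"):
    print(phrase, counts[phrase])
```

Every occurrence of the three-word phrase also contributes one occurrence to each of its two-word sub-phrases, so raw counts alone cannot favor the longer phrase.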

To select quality phrases, some measure of quality is needed; in separate chapters, the authors describe two approaches to obtaining one. In the first approach, human intervention is used to supervise the mining process. The authors call this algorithm SegPhrase+. The quality of a phrase is measured by concordance among the subunits of a phrase; informativeness, which eliminates stop phrases and uses inverse document frequency; and completeness. An incomplete phrase would be “NP hard in the strong” as opposed to “NP hard in the strong sense” (recall that NP stands for nondeterministic polynomial time, and that an NP-hard problem is at least as hard as every problem in NP). The method first generates frequent phrase candidates, and then estimates their quality using the labeled good and bad phrases provided by the user. The third step is to estimate the rectified frequency; segmentation-based features are added, and the second and third steps are repeated. Finally, the phrases are filtered to remove ones with low frequency.
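
To make the idea of concordance concrete, the sketch below scores two-word candidates with pointwise mutual information (PMI), a common collocation statistic. The review does not say which statistic the monograph uses, so PMI is an assumption here, as are the function name `pmi` and the toy corpus; the sketch also ignores the other criteria (informativeness, completeness) and the user-supplied labels.

```python
import math
from collections import Counter

def pmi(phrase, unigram, bigram, total_tokens):
    """Pointwise mutual information of a two-word phrase (left, right).

    Probabilities are simple relative frequencies; the token count is used
    as the normalizer for bigrams as well, which is an approximation.
    """
    left, right = phrase
    p_xy = bigram[phrase] / total_tokens
    p_x = unigram[left] / total_tokens
    p_y = unigram[right] / total_tokens
    return math.log(p_xy / (p_x * p_y))

# Toy corpus (hypothetical), tokenized on whitespace.
tokens = ("support vector machines work . support vector machines scale . "
          "this paper is long . this method is new . this paper is new").split()

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

score_sv = pmi(("support", "vector"), unigram, bigram, total)
score_tp = pmi(("this", "paper"), unigram, bigram, total)
print(score_sv, score_tp)
```

In this toy data "support" never appears without "vector", while "this" also precedes other words, so "support vector" receives the higher score even though both candidates occur twice.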

The second approach, an algorithm called AutoPhrase+, obtains the initial quality phrases from texts such as wiki texts, where certain phrases are already highlighted in some way, for example by being headers or serving as links. Since this technique will recover single-word phrases, concordance cannot be used as a quality measure for them, so independence is used instead. This has the effect that “united” is discounted because it usually occurs within a phrase like “United States.” AutoPhrase+ makes use of part-of-speech information in phrase construction, but no use is made of ontologies. AutoPhrase+ requires only a knowledge base (Wikipedia was used in the case study), a tokenizer, and part-of-speech information, so it can be built for many languages.
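
The independence idea for single words can be illustrated with a simple ratio: the fraction of a word's occurrences that fall outside any known multi-word phrase. This is a hypothetical stand-in chosen for illustration, not the measure defined in the monograph; the function name `standalone_ratio`, the sentences, and the phrase list are all assumptions.

```python
def standalone_ratio(word, sentences, phrases):
    """Fraction of occurrences of `word` not covered by a known phrase."""
    total = 0
    inside = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        covered = [False] * len(tokens)
        # Mark every token position that lies inside a known phrase.
        for phrase in phrases:
            parts = phrase.lower().split()
            for i in range(len(tokens) - len(parts) + 1):
                if tokens[i:i + len(parts)] == parts:
                    for j in range(i, i + len(parts)):
                        covered[j] = True
        for token, is_covered in zip(tokens, covered):
            if token == word:
                total += 1
                inside += int(is_covered)
    return (total - inside) / total if total else 0.0

# Hypothetical data: "united" mostly appears inside "united states".
sentences = [
    "the united states signed the treaty",
    "united states exports grew",
    "the team stayed united",
    "a treaty was signed",
]
phrases = ["united states"]

united_score = standalone_ratio("united", sentences, phrases)
treaty_score = standalone_ratio("treaty", sentences, phrases)
print(united_score, treaty_score)
```

Here "united" stands alone in only one of its three occurrences, so it scores lower than "treaty", which never appears inside a longer known phrase.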

For each of these approaches, the authors describe the algorithms needed to set up the system, along with variations that serve to improve each system. In each case, the efficiency of the systems is discussed. Both types of system achieve linear time performance in terms of the size of the texts. The effects of the variations are also quantified. Example experimental results are given for each technique. Indeed, the authors provide extensive examples to illustrate the ideas behind each step in the systems.

The final part of the monograph describes some applications of the two systems and compares their effectiveness. The first application is latent key phrase inference, which handles the issue raised by the use of synonyms, so that “doctor” is found to be as significant as “physician.” A second application considers topic exploration for a document collection, and thus applies the authors’ ideas to multiple texts. A third application is knowledge base construction. The monograph closes with a discussion of future research.

The monograph provides a promising template for retrieval systems, particularly AutoPhrase+ for its use of existing texts to weight phrases. It is recommended for graduate students and others interested in seeing examples of mining for phrases that can characterize the contents of documents.

Reviewer:  J. P. E. Hodgson Review #: CR145750 (1803-0129)
Data Mining (H.2.8 ... )
Text Analysis (I.2.7 ... )
Knowledge Representation Formalisms And Methods (I.2.4)
 

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®