Computing Reviews

Code Mining
Holzmann G. IEEE Software 36(2): 25-29, 2019. Type: Article
Date Reviewed: 11/12/19

Can machine learning techniques be applied to software analysis to find bugs? That is, could a machine learning system be trained on examples of good and bad programs and then asked to classify new code? The article begins with this question. Though admittedly not a machine learning expert, the author thinks this is unlikely to work; however, I find his reasons unconvincing. Holzmann argues that although programs are full of patterns that are obvious to humans, a learning system would not find them easily. His example is the goto statement: a person could notice that gotos are rare in good programs, but a learning program might not. He adds that even if the program could detect this, the result would not be very useful, because a compiler could do the same. Holzmann does point out that while patterns that are present may be detected, it is much harder to detect patterns that are absent.

But arguing against machine learning is not the major point of the paper. Rather, the focus is on which patterns are worth mining and how to find them. The example given is the for statement: if the termination condition contains < or <=, the increment portion will typically include a + or ++ operator. Conversely, if the termination condition contains > or >=, the increment portion will typically include - or --. To test this pattern, 14.9 million lines of C from the Linux 4.3 distribution are analyzed. Not surprisingly, ~52000 for statements with < or <= have increments containing + or ++, while only 38 have increments containing - or --. Some 2000 for statements with > or >= have increments containing - or --, while only ~250 have increments containing + or ++. In both cases, some or most of the exceptions are possible bugs. Because Holzmann is not sure machine learning can discover these sorts of patterns, he asks how they might be discovered.
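To make the for-statement pattern concrete, here is a minimal sketch in Python of the kind of tally described above. It is not Cobra and not the author's method: it uses a plain regular expression on source text rather than tokens, and it assumes simple loop headers with no nested parentheses. The names FOR_HEADER and classify_for_loops and the sample code are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: headers look like "for (init; cond; step)" with no nested
# parentheses or extra semicolons. Real code needs a token-aware tool.
FOR_HEADER = re.compile(r"\bfor\s*\(([^;()]*);([^;()]*);([^;()]*)\)")

LESS = re.compile(r"(?<!<)<(?!<)")        # "<" or "<=", but not "<<"
GREATER = re.compile(r"(?<![->])>(?!>)")  # ">" or ">=", but not "->" or ">>"
PLUS = re.compile(r"\+")                  # "+", "++" or "+="
MINUS = re.compile(r"-(?!>)")             # "-", "--" or "-=", but not "->"


def classify_for_loops(source: str) -> Counter:
    """Count (condition operator, step operator) pairs in for headers."""
    counts = Counter()
    for match in FOR_HEADER.finditer(source):
        cond, step = match.group(2), match.group(3)
        has_less, has_greater = bool(LESS.search(cond)), bool(GREATER.search(cond))
        has_plus, has_minus = bool(PLUS.search(step)), bool(MINUS.search(step))
        # Skip loops whose direction cannot be read off unambiguously.
        if has_less == has_greater or has_plus == has_minus:
            continue
        counts[("<" if has_less else ">", "+" if has_plus else "-")] += 1
    return counts


sample = """
for (i = 0; i < n; i++) a[i] = 0;
for (j = n - 1; j >= 0; j--) b[j] = 0;
for (k = 0; k < n; k--) c[k] = 0;
for (p = head; p; p = p->next) visit(p);
"""

counts = classify_for_loops(sample)
for pair, total in sorted(counts.items()):
    print(pair, total)
```

In this sample, the third loop lands in the ("<", "-") bucket, the kind of exception the article flags as a possible bug. The pointer-walking loop is skipped because its condition has no relational operator.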

The remainder of the article introduces Cobra, a code browsing tool (http://spinroot.com/cobra), and shows how it was used to calculate the data on for statements. Cobra can be used interactively, and it works on tokens rather than on raw strings or characters. Cobra is touted as better than static code analyzers in two respects: new types of queries are much easier to define, and queries run much faster over large code bases. A further example shows how to find all the places in the Linux 4.3 distribution that use pointer arithmetic.

Readers may or may not agree about applying machine learning to software code. The real value of this article is its use of the Cobra tool to conduct code mining. Anyone interested in improving programs should find this a good introduction to a new tool for analyzing code and finding bugs.

Reviewer:  Andrew R. Huber Review #: CR146769 (2003-0054)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™