We are developing a blackboard-based system to assist software engineers with the difficult tasks of program recognition and fault localization. In this paper we describe a system knowledge source that identifies candidate chunks in a code module. We introduce the concept of a chunk signature which captures the major identifying characteristics of a candidate chunk. The application of the signature to program plan retrieval from a plan knowledge base is also described.
Previous work in our research area shows: 1) that automation of program recognition/fault localization is facilitated by partitioning the code into simpler units, 2) a knowledge base of program plans that represent stereotypical program patterns is essential to achieve program recognition/fault localization goals, and 3) program plan retrieval can be very inefficient when many plans are stored in a system knowledge base [3]. Our research is directed at addressing these issues.
In this paper we report on a code partitioning knowledge source called the Chunker that produces a set of candidate chunks from the target code. We also introduce the concept of a chunk signature. It is derived from a subset of attributes of the candidate chunks, and represents their major identifying characteristics. The signature is intended for use in a signature matching process, M(sig), to retrieve program plans with similar characteristics from a plan library. Fuzzy reasoning techniques are proposed to evaluate the closeness of the signature match. The retrieved plans that are most similar to the target code will be applied in a subsequent procedure involving the entire set of chunk and plan attributes. This matching, called M(plan), is essential for deep reasoning about the program's intentions and for fault identification.
We begin a discussion of the chunking process by giving a definition to the term chunk. Although the term is used extensively in the literature it is not precisely defined. It has been compared to a cliche, a plan, a schema, a slice and a prime [4]. An intuitive definition of a program chunk is [5]:
Definition 1: A chunk is a sequence of software instructions that achieves a coherent purpose and that can be understood outside of the context in which it is used.
The key issues in this definition are that the chunk has a purpose and that it can be understood independently from the program in which it is found. The so-called purpose is extracted from the symbolic representation in the code through a mapping to the problem domain.
A more formal definition of a chunk is stated below [5]. It is based on language semantics and the concept of a programming prime. Program statements can be classified as one of three prime types: sequential, iterative and conditional. For example, an assignment statement is a sequential prime, a "while" loop is an iterative prime, and an "if" statement. is a conditional prime.
Definition 2: A chunk is either a prime, including all primes contained within it, or a sequence of primes that exist within the same programming scope, where for each pair of primes either one prime is data dependent on the other, or both primes are data dependent on a third prime within the sequence .
The property of cohesiveness in the intuitive definition 1 is supported by control and data dependency attributes in the less abstract definition 2. Definition 2 provides the groundwork for designing an automated process to partition a piece of code and identify its chunks. Before one can use this definition another important chunk property must be considered; its hierarchical nature. A chunk may represent a simple domain or programming concept, but it may be nested inside of another chunk that represents a more abstract concept, and it may also have simpler chunks within it.
Our approach to applying this definition has been to group statements together within the same control level and to analyze their data dependencies. We cannot guarantee that the Chunker will find all the logical chunks, but one would assume from the intuitive and formal definitions of a chunk that multiple statements in the same level of program scope working on a common set of variables probably implement a concept or achieve some intended goal that relates to the data. We can build a chunk hierarchy from the set of scopes and the chunks we find within them.
G = (N,D,A)where:
N is a set of nodes corresponding to the primes in the programEach arc is an ordered pair consisting of a node n and a data element d. The direction of the data flow is indicated by the order of the pair. Arcs written as (n,d) indicate node n writes the data element d, and (d,n) indicates node n reads data element d. Nodes are connected only through data. The chunking algorithm implemented in the knowledge source called the Chunker uses data dependency graphs to generate candidate chunks as described below [5]. A tool that generates the dependency graphs has been developed by Professor Bogdan Korel at IIT. It has been made available to us, and has facilitated the implementation of the Chunker.
D is the set of data used by the nodes
A is the set of arcs representing the relationships between the nodes and data.
System rules based on chunk definition 2, allow any single node in the graph to be considered a potential chunk. Any sub graph within the reduced graph is also a potential chunk, provided that each node in the sub graph is related to at least one other node in the graph via common data. During processing, the potential chunks are listed along with the data elements they read and write. Each potential chunk is viewed as a function with the input parameters being those variables read, and the output parameters being those variables written.
Once the potential chunks are generated heuristics are used to select the most likely candidate chunks from the list. The heuristics are useful for refining the choice of chunk boundaries so that the resulting candidate chunks resemble those most likely to be chosen by a programmer. More importantly, the refinement process makes it more likely that the candidate chunk will be found in our library of program plans. If no nominees are found in the plan library, we have the opportunity to generate alternative chunk configurations using the list of potential chunks and the set of heuristics. For the code shown in figure 1 the Chunker generates the candidate chunks shown in figure 2:
Source Code Data Access
real std(int a[],int n) { a:L,n:L
real average, deviation, sqrDev
A int i = 0 i: W,L
B real sum = 0 sum: W,L
C while(i < n) { i:R,n:R
C1 sum = sum + a[i] sum:R,W,a:R,i:R
C2 i = i+1 i:R,W
C } i:D
D average = sum/n average:W,L,
sum:R,D,n:R
E i = 0 i:WL
F sum = 0 sum:WL
G while(i < n) { i:R, n:R
G1 deviation = a[i] - average deviation:WL,a:R,
average:R
G2 sqrDev = deviation*deviation sqrDev:WL,
deviation:R,D
G3 sum = sum + sqrDev sum:W,sqrDev:R,D
G4 i = i+1 i:RW
G } i:D,average:D
H average = sum/n average:WL,
sum:RD,n:R
I deviation = sqrt(average) deviation:WL,
average:RD
J return(deviation) deviation:RD
} a:R,n:R,result:W
Figure 1: Tagged Source Code with Data Access
For this example we have applied the heuristic, "candidate chunks are the smallest potential chunks that produce a single output variable" [5]. It partitions the code in a functional way; a chunk is a unit that produces a single result. To apply this heuristic the Chunker finds all output variables produced by the potential chunks. The result is two candidate chunks that produce "sum", two candidate chunks that produce "average", one that produces "deviation" and one that produces a result. These candidate chunks are shown in Figure 2. We can also use the same techniques to generate the lower level dependency graphs to find the candidate chunks at the lower levels of control. In this way we build a candidate chunk hierarchy [5].
Candidate Chunk Input Variables Output Variables
{D} sum,n average
{H} sum',n average'
{I} average' deviation
{J} deviation result
{A,B,C} a,n sum
{E,F,G} a,n,average sum'
Figure 2. Candidates Chunk at Top Level of Standard Deviation
We are currently running the Chunker over a large variety of programs to help us to gain experience in tuning the graphs and in selecting appropriate heuristics. We also want to learn if other transformations could be applied to the graph to simplify the analysis.
We introduce the concept of a chunk signature to address a major
problem associated with recognition/fault localization systems: efficient
retrieval of relevant plans from a plan library. The library might conceivably
consist of thousands of plan instances that could be matched against the
target code in order to recognize its intentions [3].
We want to match only
those with similar characteristics. We say that a chunk signature consists of
a subset of attributes that represent a chunks' most recognizable qualities,
and argue that it can be used in an efficient manner to retrieve a pertinent
subset of plans from a plan library. We say, if (CS) is a chunk signature,
and (PS) is a plan signature, then signature matching:
M(sig)(chunk,plan) = (CS) R (PS)where R can be an equivalence or implication relation. Using M(sig), a set of knowledge objects could be retrieved whose key attributes are similar to those of the candidate chunks. These would then be used for the computationally intense recognition and debugging processes prescribed by M(plan).
The signature approach for plan retrieval is supported by work in code reuse where signatures have been used to retrieve components from a reuse library [6]. Support also comes from the work of Quilici in program recognition [7]. He describes a similar concept for plan retrieval that he calls an index. We believe his work needs significant modification, since domain vocabulary items were not included in the index computations. We include these in our chunk signature and argue below that vocabulary items are essential for efficient retrieval of relevant plans from the many problem domains that may exist in a practical plan library. We also add a "closeness" measure to evaluate the results of the signature match, M(sig).
We now describe the requirements for a chunk signature:
To illustrate our approach we use the standard deviation code in Figure 1. We generate the signature components below for this chunk which at the top level consists of the entire module. The components below are listed from general to specific. Note that the signature is based on two principal types of control structures, iterators and selectors, as well as levels of nesting, and domain vocabulary. Since we are handling buggy code, we make the reasonable assumption that a "competent" programmer will have the overall control structure close to correct.
Signature for Standard Deviation:
Possible search cases could be: a) search on 4, and 5; these are likely candidates, b) if case "a" fails or subsequent M(plan) matching is not successful, search on 4; perhaps the programmer did not use meaningful vocabulary, c) if case "b" fails, search on 5 and 3; look for KO's that contain the domain vocabulary, and have two iterators, which may not be on level 0. Continue the search at lower levels of detail until a suitable plan(s) is found. It is also possible that the search may terminate since a suitable plan may not exist.
It is also likely that signature matches may be partial since the code we are processing is buggy and may also be idiosyncratic. For example in case "a" above, there may be a complete match on control structures, but a partial match on vocabulary. This points out the need for a factor to evaluate the closeness of the match and to help define the nature of implication aspect of the relation R. We are working on a modified fuzzy reasoning approach to address this issue. The procedure below illustrates how a closeness factor could be generated and applied.
[2] I. Burnstein, S. Mannina. Developing a Conceptual Model for the Software Fault Localization Process, Proc. Fourth Midwest AI and Cognitive Science Soc. Conf., Utica IL, May 1992, pp. 21-25.
[3] C. Rich, R. Waters. The Programmers Apprentice, Addison Wesley, Reading MA, 1990.
[4] N. Pennington, B. Grabowski. The Tasks of Programming, in Psychology of Programming, J. Hoc, T. Greene, D. Gilmore, eds., London, Academic Press, 1990.
[5] K. Roberson. Identifying Candidate Chunks for Program Recognition and Debugging, Technical Report, Illinois Institute of Technology, September 1995.
[6] E. Ostertag, J. Hendler, R. Diaz, C. Braun. Computing Similarity in a Reuse Library System: An AI-Based Approach, ACM Transactions on Software Eng. and Methodology, Vol. 1, No. 3, 1992, pp. 205-228.
[7] A. Quilici. A Memory-Based Approach to Recognizing Program Plans, CACM, Vol. 37, No. 5, 1994, pp. 84-93.