Generating Chunks and Chunk Signatures
for Program Recognition and Fault Localization

Ilene Burnstein
Floyd Saner
Katherine Roberson

Computer Science Department
Illinois Institute of Technology
Chicago, IL 60616
csburnstein@minna.iit.edu

We are developing a blackboard-based system to assist software engineers with the difficult tasks of program recognition and fault localization. In this paper we describe a system knowledge source that identifies candidate chunks in a code module. We introduce the concept of a chunk signature which captures the major identifying characteristics of a candidate chunk. The application of the signature to program plan retrieval from a plan knowledge base is also described.

1.0 Introduction

We are developing an automated system for program recognition/fault localization called BUG-DOCTOR [1]. To build this system we have borrowed techniques from the artificial intelligence community. For example, BUG-DOCTOR's reasoning processes are based on cognitive models for program understanding/fault localization [2]. The system architecture is based on the blackboard model [1], and fuzzy reasoning techniques are proposed to evaluate the results of a matching process required for program plan retrieval.

Previous work in our research area shows: 1) that automation of program recognition/fault localization is facilitated by partitioning the code into simpler units, 2) a knowledge base of program plans that represent stereotypical program patterns is essential to achieve program recognition/fault localization goals, and 3) program plan retrieval can be very inefficient when many plans are stored in a system knowledge base [3]. Our research is directed at addressing these issues.

In this paper we report on a code partitioning knowledge source called the Chunker that produces a set of candidate chunks from the target code. We also introduce the concept of a chunk signature. It is derived from a subset of attributes of the candidate chunks, and represents their major identifying characteristics. The signature is intended for use in a signature matching process, M(sig), to retrieve program plans with similar characteristics from a plan library. Fuzzy reasoning techniques are proposed to evaluate the closeness of the signature match. The retrieved plans that are most similar to the target code will be applied in a subsequent procedure involving the entire set of chunk and plan attributes. This matching, called M(plan), is essential for deep reasoning about the program's intentions and for fault identification.

2.0 Program Recognition and Fault Localization Processes

Our conceptual model for program recognition/fault localization provides the underpinnings for the design of the BUG-DOCTOR system [1,2]. The model identifies a chunking process which is used by programmers for simplification and localization. The process produces identifiable code partitions that can be stored, used as building blocks for higher level concepts, and analyzed for faults. The partitions usually represent units that are simple functional entities; for example, "swap values", "calculate average" [4]. Recognition triggers such as variable and function names play a principal role. They help to partition the code, and stimulate the retrieval of stereotypical patterns known as program plans or cliches stored in the programmer's memory [4]. A synthesis process takes place during recognition, that maps the partitions to elements in the plan to build an internal model of the target code that represents its intent. An analysis process also takes place during debugging. Its goal is to identify the differences between the internal code model and the plan. These differences, represent potential faults. BUG-DOCTOR's knowledge sources model the processes described in this section. In the following sections we describe a chunk preprocessor knowledge source (the Chunker), that partitions the code into candidate chunks. We then introduce the concept of a chunk signature and describe its use as an effective mechanism for program plan retrieval.

3.0 Chunk Preprocessing

The rationale for developing an automated chunking process lies in models of code concept recognition [2,4]. In spite of the wide spread use of the term chunk, researchers in this field have not provided a clear definition for a chunk, nor has the chunking process been successfully automated. We have undertaken this challenging task with the goal of partitioning the code into units that are simpler to identify and debug. The chunk preprocessor knowledge source, the Chunker, identifies what we call candidate chunks by use of a bottom up analysis of the code. The candidate chunks are then used by BUG-DOCTOR for two matching processes:
  1. A chunk signature derived from a subset of attributes of the candidate chunks is matched against program plan signatures (also a subset of attributes), to retrieve those plans whose signatures indicate they are closely related to the target chunks. We call this matching process M(sig).
  2. The full set of attributes from both the candidate chunks and the retrieved programming plans is matched to identify the code intentions (concept recognition) and to localize faults. We call this matching process M(plan).

We begin a discussion of the chunking process by giving a definition to the term chunk. Although the term is used extensively in the literature it is not precisely defined. It has been compared to a cliche, a plan, a schema, a slice and a prime [4]. An intuitive definition of a program chunk is [5]:

Definition 1: A chunk is a sequence of software instructions that achieves a coherent purpose and that can be understood outside of the context in which it is used.

The key issues in this definition are that the chunk has a purpose and that it can be understood independently from the program in which it is found. The so-called purpose is extracted from the symbolic representation in the code through a mapping to the problem domain.

A more formal definition of a chunk is stated below [5]. It is based on language semantics and the concept of a programming prime. Program statements can be classified as one of three prime types: sequential, iterative and conditional. For example, an assignment statement is a sequential prime, a "while" loop is an iterative prime, and an "if" statement. is a conditional prime.

Definition 2: A chunk is either a prime, including all primes contained within it, or a sequence of primes that exist within the same programming scope, where for each pair of primes either one prime is data dependent on the other, or both primes are data dependent on a third prime within the sequence .

The property of cohesiveness in the intuitive definition 1 is supported by control and data dependency attributes in the less abstract definition 2. Definition 2 provides the groundwork for designing an automated process to partition a piece of code and identify its chunks. Before one can use this definition another important chunk property must be considered; its hierarchical nature. A chunk may represent a simple domain or programming concept, but it may be nested inside of another chunk that represents a more abstract concept, and it may also have simpler chunks within it.

Our approach to applying this definition has been to group statements together within the same control level and to analyze their data dependencies. We cannot guarantee that the Chunker will find all the logical chunks, but one would assume from the intuitive and formal definitions of a chunk that multiple statements in the same level of program scope working on a common set of variables probably implement a concept or achieve some intended goal that relates to the data. We can build a chunk hierarchy from the set of scopes and the chunks we find within them.

4.0 The Chunking Process

The intuitive definition of a chunk states that all the primes within a chunk work on the same data elements. One would want to mark chunk boundaries at locations where the fewest data elements are shared; the statement grouping would be where many data elements are shared. A good model for supporting these concepts is the data dependency graph [5]. It can be described as:
G = (N,D,A)
where:
N is a set of nodes corresponding to the primes in the program
D is the set of data used by the nodes
A is the set of arcs representing the relationships between the nodes and data.
Each arc is an ordered pair consisting of a node n and a data element d. The direction of the data flow is indicated by the order of the pair. Arcs written as (n,d) indicate node n writes the data element d, and (d,n) indicates node n reads data element d. Nodes are connected only through data. The chunking algorithm implemented in the knowledge source called the Chunker uses data dependency graphs to generate candidate chunks as described below [5]. A tool that generates the dependency graphs has been developed by Professor Bogdan Korel at IIT. It has been made available to us, and has facilitated the implementation of the Chunker.

4.1 The Chunking Algorithm

  1. Tag each statement with a unique label indicating its nesting level. Proceed iteratively until all the statements are tagged. Each tag corresponds to a prime or node in the graph, as shown in the sample code in Figure 1.
  2. Data flow analysis. Identify live (L) variables, and dead variables (D); those written to (W), those that are read, (R). When variables go out of scope they are marked undefined (U).
  3. Generate the top level dependency graph and the sub graphs for each level. Build the graphs; G = (N,D,A), that include the set of nodes for representing the statements at a control level. Any data read or modified by these statements is in the set of data, and the relationships between the nodes become arcs. Variables that are reused according to our criteria are included in the graph as separate instances.
  4. Apply transformations to reduce the complexity of the graphs. One transformation that can be applied is to collapse nodes that do not read data, but write data that is only used by one other node, into the node that uses the data. For example statements that initialize counters and accumulators for loops can be combined into the same logical chunks as the loop.
  5. Identify potential chunks from the reduced dependency graphs and use a set of heuristics to prune them and generate candidate chunks. There is an iterative component to this process; if the candidate chunks produced from the application of a heuristic do not have any counterparts in the plan library, alternative candidate chunks can be chosen by applying another heuristics to the list of potential chunks.

System rules based on chunk definition 2, allow any single node in the graph to be considered a potential chunk. Any sub graph within the reduced graph is also a potential chunk, provided that each node in the sub graph is related to at least one other node in the graph via common data. During processing, the potential chunks are listed along with the data elements they read and write. Each potential chunk is viewed as a function with the input parameters being those variables read, and the output parameters being those variables written.

Once the potential chunks are generated heuristics are used to select the most likely candidate chunks from the list. The heuristics are useful for refining the choice of chunk boundaries so that the resulting candidate chunks resemble those most likely to be chosen by a programmer. More importantly, the refinement process makes it more likely that the candidate chunk will be found in our library of program plans. If no nominees are found in the plan library, we have the opportunity to generate alternative chunk configurations using the list of potential chunks and the set of heuristics. For the code shown in figure 1 the Chunker generates the candidate chunks shown in figure 2:

        Source Code                             Data Access

        real std(int a[],int n) {               a:L,n:L
        real average, deviation, sqrDev
A       int i = 0                               i: W,L
B       real sum = 0                            sum: W,L
C       while(i < n) {                          i:R,n:R
C1              sum = sum + a[i]                sum:R,W,a:R,i:R
C2              i = i+1                         i:R,W
C               }                               i:D
D       average = sum/n                         average:W,L,
                                                sum:R,D,n:R
E       i = 0                                   i:WL
F       sum = 0                                 sum:WL
G       while(i < n) {                          i:R, n:R
G1              deviation = a[i] - average      deviation:WL,a:R,
                                                average:R
G2              sqrDev = deviation*deviation    sqrDev:WL,
                                                deviation:R,D
G3              sum = sum + sqrDev              sum:W,sqrDev:R,D
G4              i = i+1                         i:RW
G               }                               i:D,average:D
H       average = sum/n                         average:WL,
                                                sum:RD,n:R
I       deviation = sqrt(average)               deviation:WL,
                                                average:RD
J       return(deviation)                       deviation:RD
        }                                       a:R,n:R,result:W

Figure 1:  Tagged Source Code with Data Access

For this example we have applied the heuristic, "candidate chunks are the smallest potential chunks that produce a single output variable" [5]. It partitions the code in a functional way; a chunk is a unit that produces a single result. To apply this heuristic the Chunker finds all output variables produced by the potential chunks. The result is two candidate chunks that produce "sum", two candidate chunks that produce "average", one that produces "deviation" and one that produces a result. These candidate chunks are shown in Figure 2. We can also use the same techniques to generate the lower level dependency graphs to find the candidate chunks at the lower levels of control. In this way we build a candidate chunk hierarchy [5].

 Candidate Chunk        Input Variables         Output Variables

{D}                     sum,n                   average
{H}                     sum',n                  average'
{I}                     average'                deviation
{J}                     deviation               result
{A,B,C}                 a,n                     sum
{E,F,G}                 a,n,average             sum'

Figure 2.  Candidates Chunk at Top Level of Standard Deviation

We are currently running the Chunker over a large variety of programs to help us to gain experience in tuning the graphs and in selecting appropriate heuristics. We also want to learn if other transformations could be applied to the graph to simplify the analysis.

5.0 The Role of the Candidate Chunks and Chunk Signatures in the Matching Process

In order to recognize high level concepts in the target code and localize faults, BUG-DOCTOR must use pattern matching techniques. Pattern matching requires a knowledge base of program plans that represent stereotypical code patterns. BUG-DOCTOR currently has a very small knowledge base called a programming library that consists of plans for basic techniques such as sorting, averaging, and standard deviation. Each plan, called a knowledge object (KO), is associated with a set of attributes.

We introduce the concept of a chunk signature to address a major problem associated with recognition/fault localization systems: efficient retrieval of relevant plans from a plan library. The library might conceivably consist of thousands of plan instances that could be matched against the target code in order to recognize its intentions [3]. We want to match only those with similar characteristics. We say that a chunk signature consists of a subset of attributes that represent a chunks' most recognizable qualities, and argue that it can be used in an efficient manner to retrieve a pertinent subset of plans from a plan library. We say, if (CS) is a chunk signature, and (PS) is a plan signature, then signature matching:

M(sig)(chunk,plan) = (CS) R (PS)
where R can be an equivalence or implication relation. Using M(sig), a set of knowledge objects could be retrieved whose key attributes are similar to those of the candidate chunks. These would then be used for the computationally intense recognition and debugging processes prescribed by M(plan).

The signature approach for plan retrieval is supported by work in code reuse where signatures have been used to retrieve components from a reuse library [6]. Support also comes from the work of Quilici in program recognition [7]. He describes a similar concept for plan retrieval that he calls an index. We believe his work needs significant modification, since domain vocabulary items were not included in the index computations. We include these in our chunk signature and argue below that vocabulary items are essential for efficient retrieval of relevant plans from the many problem domains that may exist in a practical plan library. We also add a "closeness" measure to evaluate the results of the signature match, M(sig).

We now describe the requirements for a chunk signature:

  1. It must contain a subset of attributes that most readily identify the problem and programming task domain of the chunk.
  2. It must be implementable at different levels of detail.
  3. For input code consisting of many chunks, a top level signature must be expressible as some combination of the component signatures of all its included chunks.
The use of recognition triggers as identified in empirical studies suggests the attribute group shown below [4]:

To illustrate our approach we use the standard deviation code in Figure 1. We generate the signature components below for this chunk which at the top level consists of the entire module. The components below are listed from general to specific. Note that the signature is based on two principal types of control structures, iterators and selectors, as well as levels of nesting, and domain vocabulary. Since we are handling buggy code, we make the reasonable assumption that a "competent" programmer will have the overall control structure close to correct.

Signature for Standard Deviation:

  1. Deepest level of nesting regardless of control structure type = {0}.
    (All iterators and selectors in this code example occur at the outermost level)
  2. Type of control structure present = {iterator}
  3. Number of instances of control structures = {iterator, 2}
  4. Complete control structure description = {iterator, level 0}, {iterator, level 0}
  5. Domain vocabulary = {sum, deviation, average, sqrDev, sqrt, real}

Possible search cases could be: a) search on 4, and 5; these are likely candidates, b) if case "a" fails or subsequent M(plan) matching is not successful, search on 4; perhaps the programmer did not use meaningful vocabulary, c) if case "b" fails, search on 5 and 3; look for KO's that contain the domain vocabulary, and have two iterators, which may not be on level 0. Continue the search at lower levels of detail until a suitable plan(s) is found. It is also possible that the search may terminate since a suitable plan may not exist.

It is also likely that signature matches may be partial since the code we are processing is buggy and may also be idiosyncratic. For example in case "a" above, there may be a complete match on control structures, but a partial match on vocabulary. This points out the need for a factor to evaluate the closeness of the match and to help define the nature of implication aspect of the relation R. We are working on a modified fuzzy reasoning approach to address this issue. The procedure below illustrates how a closeness factor could be generated and applied.

Step 1.
Generate set of signatures for chunks of target code.
Step 2.
Use the signatures and M(sig) to retrieve nominee plans from the programming library.
Step 3.
For each nominee, generate a set of preliminary factors that indicates the closeness of the match between the target code and the nominee for each attribute in the signature.
Step 4.
Convert the factors to fuzzy values and submit them to a fuzzy reasoning system.
Step 5.
Return a composite fuzzy matching value for each nominee that better indicates the closeness of the match.
Step 6.
Proceed to match (as in M(plan)), the target code with plans that have the best fuzzy valued matching factor.

6.0 Summary

Our research addresses some of the difficult problems associated with knowledge based program recognition/fault localization. To address some of these issues we provide a definition for a program chunk, and describe a tool that identifies candidate chunks in a code module. We introduce the concept of a chunk signature, and show how it can be used for efficient retrieval of program plans from a plan library. A fuzzy reasoning approach is described to evaluate the closeness of the retrieved plans to the target code chunks. We are currently working on a simple implementation of M(sig) to test the utility of our ideas.


References

[1] I. Burnstein, N Jani, S. Mannina, J. Tamsevicious, M. Goldshteyn, L. Lendi. The Development of a Knowledge Based Software Fault Localization Tool, Proc. Internat'l. Conf. on Systems, Man and Cybernetics, Chicago IL, Oct. 1992, pp. 317-322.

[2] I. Burnstein, S. Mannina. Developing a Conceptual Model for the Software Fault Localization Process, Proc. Fourth Midwest AI and Cognitive Science Soc. Conf., Utica IL, May 1992, pp. 21-25.

[3] C. Rich, R. Waters. The Programmers Apprentice, Addison Wesley, Reading MA, 1990.

[4] N. Pennington, B. Grabowski. The Tasks of Programming, in Psychology of Programming, J. Hoc, T. Greene, D. Gilmore, eds., London, Academic Press, 1990.

[5] K. Roberson. Identifying Candidate Chunks for Program Recognition and Debugging, Technical Report, Illinois Institute of Technology, September 1995.

[6] E. Ostertag, J. Hendler, R. Diaz, C. Braun. Computing Similarity in a Reuse Library System: An AI-Based Approach, ACM Transactions on Software Eng. and Methodology, Vol. 1, No. 3, 1992, pp. 205-228.

[7] A. Quilici. A Memory-Based Approach to Recognizing Program Plans, CACM, Vol. 37, No. 5, 1994, pp. 84-93.