It is increasingly recognised that the ability of the user to exploit the underlying technology of IR systems is critically determined by the user interface provided. The most powerful algorithms for searching and ranking output are of little value if the user is unable to convert their information needs into reliable and valid queries. Improving user interfaces to IR systems however can only result from analyses and tests of real users. In the present paper, we report results of a detailed investigation of several variables that reflect different approaches to IR interface design.
It is clear that the construction of logical and effective search queries is a learned skill that few users of Web environments will ever acquire. From the user-centered perspective of HCI, this provides a stimulus to researchers to devise search tool interfaces that support the psychological tendencies of typical searchers. However, to date, there has been little study of the important issue of user query formulation to guide designers or other researchers, and what is there remains confusing.
While some researchers (e.g., Sein 1993) have examined cognitive style dimension of the user and ability to employ various interfaces effectively, the most important issues affecting user performance seems to be the form of query expression or language employed. For example, Greene et al (1990) studied the difference between queries produced using SQL and queries produced using a language called Truth-table Exemplar-Based Interface (TEBI), which is similar to a Query By Example (QBE) interface. They found a significant improvement in the speed and accuracy of query production using TEBI. Specifically, they found that subjects produced the most accurate queries when they only involved the operator "AND." They also report that "OR alone" queries were harder than "AND alone" queries. "NEGATION" queries were harder than "AND alone" or "OR alone" queries. "AND+OR" queries were harder than all other query types. By verifying subjects' comprehension of the English queries by requiring them to pick out the objects that would be returned by each query from a sample database, they found operator type and language condition to be a weak predictor of verification performance. Hence, they argue that it was not the variance in complexity of query logic alone (at the levels of complexity they studied) that accounted for the variability in subjects' query production performance. Rather, it was how the query logic mapped onto the query language that resulted in the differences.
One specific mapping issue for query languages is procedurality. Greenblatt and Waxman (1978) varied procedurality by testing the ability of subjects to formulate queries using three different query systems: Query By Example (QBE), Structured English Query Language (SEQUEL, a predecessor to SQL) and an algebra-based language. The algebra-based language was more procedural than both QBE and SEQUEL. They found that subjects using QBE formulated the most correct queries in the least time, that subjects using the algebra-based language formulated the most incorrect queries in the longest time, and that the results for SQL fell in between. These results indicate that the non-procedural query languages allow users to best formulate correct queries.
Welty and Stemple (1981) compared the ability of subjects to write queries in two different relational database query languages (SQL and TABLET) that differed primarily with respect to their procedurality. Their study was motivated by the previously unsubstantiated belief that "there is generally a point in the complexity of any class of problems beyond which procedural specification is simpler than nonprocedural" (p. 627).
Welty and Stemple assert, "In general, of two specifications for the same computation, the one contains more operations and variable bindings, more strictly ordered, is the more procedural" (p. 632). They formalize this intuition by proposing the following procedurality metric:
number of variable bindings number of operations
PM = --------------------------- + ---------------------
number of permissible number of permissible
variable binding orderings operation orderings
Using this procedurality metric, Welty and Stemple show that the queries in their experiment were more procedural in TABLET than queries in SQL.
Classifying queries as either "easy" or "hard," Welty and Stemple tested subjects' ability to produce queries with a final test immediately after training in the query language, and with a retention test three weeks after the final. There were no statistically significant differences in the number of correct queries produced using TABLET and SQL on either the final test or rention test for the easy problems.
However, on hard problems, with subjects who had no previous computer programming experience, they found subjects using TABLET, the more procedural language, were significantly better at producing accurate queries than subjects using SQL, on the final test immediately after training. The superior performance of TABLET subjects who were inexperienced in programming suggests that the more procedural language was easier to learn.
Manipulating subject experience, in a separate experiment, they tested subjects who had taken an introductory programming course in BASIC or FORTRAN. In general, these subjects performed better than the subjects inexperienced in programming. On the hard problems, they did not find a statistically significant difference between subjects using TABLET and SQL on the final test, but did find that TABLET subjects produced significantly more accurate queries on the retention test. This suggests that prior procedural programming experience aided in the retention of the procedural language.
However, prodedurality is not the only important variable in designing usable query language interfaces for IR. Michard (1982) criticized the Welty & Stemple study, suggesting that the two compared languages "had quite difference syntatic [sic] forms and that the score difference in a query writing task observed for complex queries could be interpreted as having another source than proceduality degree" (p. 280). Michard suggested that an important design principle for query languages should be non-procedurality, and hence compared two query systems, both non-procedural. However, one was textual and the other graphical. The graphical system was called "graphical query language" (GQL). This system was based on Venn diagrams, which are often used to represent sets. This Venn-based system was compared to an textual ad hoc query language called TEST, in which users described set operations by using Boolean operators rather than indicating regions on Venn diagrams.
Michard found that subjects made four times as many errors using TEST than when using GQL. In analyzing errors, Michard found:
Thus, Michard argued that a Venn-based graphical query system is superior to one based on Boolean connectives.
Young and Shneiderman (1993) also created a graphical system of representation for Boolean queries they called "filter/flow" and compared it to a textual system. In the filter/flow system, Boolean terms are represented by nodes (i.e. "filters") on a directed graph. Candidate members of the solution set picked out by a given query are imagined to flow through the directed graph, and pass through filters. Such filters only allow members of the solution set to pass through. The OR connective is represented by placing filters in parallel. The AND connective is represented by placing filters in series.
They compared the filter/flow interface to an SQL interface. The filter/flow interface is also more procedural than SQL. They tested subjects with both a comprehension task, in which subjects were required to indicate the results of a query expressed using either filter/flow or SQL, and a composition task, in which subjects were required to translate a query from English into one of the two query languages.
Young and Shneiderman found a significant difference between the filter/flow and SQL conditions in the number of correct responses in both the comprehension and composition tasks. In both tasks, subjects formulated more correct answers in the filter/flow condition than in the SQL condition. They also report that subjects were more subjectively satisified with the filter/flow interface than with the SQL interface.
In the comprehension task, they found that subjects had little difficulty with relatively simple queries (e.g. three conjunctions, or a query similar to one of those in the training set) in either interface, but were better able to comprehend more complex queries expressed using filter/flow. In SQL, subjects often failed to apply rules of precedence and misinterpreted parentheses.
In the composition task, they reported that the simplest queries to form in both interfaces was the combination of three intersection operators and the negation of one of the terms. As in the comprehension task, subjects made the most errors in SQL in the use of parentheses. Another common error was for subjects to use "AND" instead of "OR." Young and Shneiderman thus argue that their graphical filter/flow system is superior to a textual SQL system.
The empirical findings to date, though far from conclusive, suggest that several task and interface variables are important determinants of performance: procedurality, form, and graphics vs. text. In the present paper, we provide the first experimental examination of all three variables in an information retrieval task.
All of the studies that empirically tested users' ability to formulate queries were based on the relational data model, i.e. they compared various query systems to SQL or a similar system. However, the more unstructured data in the World Wide Web has become increasing relevant as a target of information seeking, leading to the emergence of Web search engines as a starting point for many searches. There is little published work on the topic of formulating queries for the Web.
Because of the difficulty that inexperienced searchers often have formulating Boolean queries, other query systems have been developed, particularly for the Web, where many users do not have formal training in information retrieval. For example, using a vector model (Korfage 1997), retrieval results can be rank ordered by similarity with a query consisting of a simple series of search terms. Weights may be assigned to both search terms (terms mentioned first in the query are often assigned higher weights) and to documents (common terms such as prepositions might be assigned low weights). These heuristic weights are often assigned without the user's control. We believe there is still a place for explicit Boolean queries, even on the Web, if they can be expressed in such a way that naive users can formulate them. Further ordering the results of such Boolean queries would likely be beneficial, especially when the number of results returned is very high.
Hence, we attempt to extend the work comparing various types of Boolean query systems into forms that could be implemented on the Web. Like Welty & Stemple, we compare procedural and non-procedural query systems. Because such downloadable executable code such as Java applets allow graphical query systems to be used as well as textual ones, we also compare both graphical and textual systems. Finally, Greene et al, Welty & Stemple, and Michard all suggest that there are some significant differences in the ability of users to generate queries of different Boolean forms. We study the ability of users to formulate queries with varying degrees of Boolean complexity.
24 paid undergraduate subjects were recruited from telecommunications classes at Indiana University - Bloomington (17 female and 7 male). Subjects were required not to have studied Boolean logic, set theory, or Venn diagrams. They were also required not have had any computer programming experience. 3 male and 9 female subjects were assigned to the text group. 4 male and 8 female subjects were assigned to the graphical group.
A three-factor (procedurality, form, text/graphics) experimental design with partial repeated measures was employed. Subjects were assigned randomly to either a text group or a graphical group. Each subject was then tested in both a procedural condition and a nonprocedural condition, formulating 16 queries in each condition. Each of the queries has one of eight underlying Boolean forms, e.g. A AND B AND C. The following matrix illustrates the four conditions under which users were tested, separated by the dimensions of procedurality and text/graphics.
| Procedural | Non-procedural | |
|---|---|---|
| Text | If-Then-Else | Boolean |
| Graphics | Filter-Flow | Venn |
If-Then-Else is a procedural, textual system for expressing Boolean queries. It is similar to a part of many programming languages. In this system, queries take the following form:
IF condition THEN result1 ELSE result2
In an actual query, condition is replaced with a word that might be found on a Web page. result1 and result2 are either replaced with "ACCEPT," "REJECT," or another IF-THEN-ELSE statement. A query basically takes the form of a program written to examine each Web page in turn, and respond with an "ACCEPT" if the query should return the page, or a "REJECT" if the query should not return the page. For example, the following query would return all of the Web pages that mentioned either "dog" or "cat" but not both (i.e., exclusive OR):
IF dog
THEN IF cat
THEN REJECT
ELSE ACCEPT
ELSE IF cat
THEN ACCEPT
ELSE REJECT
The system using Boolean connectives is non-procedural and textual. Queries in this system are formed from the connectives AND, OR, and NOT, as well as parentheses (Without parentheses, AND was considered to have higher precedence than OR.). For example, the following query is one that returns pages with either "dog" or "cat" but not both:
(dog OR cat) AND NOT (dog AND cat)
The Filter-Flow system used in this experiment is a procedural, graphical system modeled on the system described in Young & Shneiderman. A filter in this system is represented by a box with a word in it. Such a filter permits only those pages to pass through it that contain the word in the box. These filters are connected by lines with directional arrows. A filter may also be modified by the addition of a circle with an "X" through it, creating a "negative filter." Such a filter will only allow pages to pass through it that do not contain the word in the box. For example, the following query returns pages containing either "dog" or "cat" but not both:
The Venn query system used in this experiment is a non-procedural, graphical system based on Venn diagrams. In this system, the set of all Web pages that contain a particular word is represented by a closed figure labelled by that word. Each such closed figure included in a query must overlap all of the other subregions created by other such closed figures. Regions in which two closed figures overlap represent Web pages that contain both of the words associated with the overlapping closed figures. Regions that are shaded by the subject represent those Web pages that will be returned by the query. For example, the following query will return the exclusive OR of pages containing "dog" and pages containing "cat":
While it is quite simple to draw a well-formed Venn diagram for one, two, or three terms, it is quite difficult to draw a Venn diagram for four terms. The number of subregions in a Venn diagram increases exponentially as a function of the number of terms (2^n-1 subregions for n terms). Hence, a Venn diagram with four terms has 15 subregions:
As a result of this difficulty, in this condition, subjects were provided with unlabelled diagrams containing the appropriate number of closed figures for the given queries. In a working on-line system, the computer would generate such diagrams. Subjects were required to label the closed figures, and shade in subregions.
Each subject was tested in both procedural condition and a nonprocedural condition. These conditions were counterbalanced across subjects by order. For each condition, the subject was given a training session in which the query system was introduced. Subjects were shown how to express conjunction (AND), disjunction (OR), and negation (NOT). In the case of Boolean operators, precedence and parentheses were also introduced. Finally, a pre-test that required the subject to express an exclusive OR query was given. XOR was chosen because it requires the use of conjunction, disjunction, and negation. The subject was permitted, during training, to ask the experimenter for aid in formulating the query. After correctly completing the pre-test, and positively expressing an understanding of the query system, the actual trials began.
For each condition, subjects were required to formulate 16 queries. Each query was described in English at the top of a piece of paper, and space was provided underneath for the subject to write/draw the query using the system in which they had just been trained. The English query descriptions were formulated not use either of the words "and" nor "or." Each of the 16 queries took one of eight Boolean forms (each Boolean form occurred twice). The order of these forms was randomized for each subject. The forms used are listed below:
Form 1 was used in both Young & Shneiderman and the Michard studies, as well as being described in Greene et al (AND alone). Form 2 is described in Greene et al (OR alone). Form 3 is designed to correspond to NEGATION in Greene. Forms 4, 5 and 6 were used in both the Michard and Young & Shneiderman studies. These forms correspond to the combinations of AND+OR in Greene. Form 7 was used in the Michard study. Form 8 can be found in the Young & Shneiderman study.
Immediately following the completion of the 16 problems, the subject was given a satisfaction survey including five questions.
In the following section, we analyze, by independent variable, the percentage of queries correctly formulated by subjects.
Using a multivariate analysis of variance (adjusted for non-sphericity), significant main effects were found for procedurality and form (both within-subjects factors). Significant interactions were observed for procedurality * text/graphics, procedurality * form, and the three-way interaction procedurality * form * text/graphics:
| Source of Within-Subjects Effects | Type III Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Procedurality | 14.109 | 1.000 | 14.109 | 13.954 | .001 |
| Procedurality * Text/Graphics | 13.246 | 1.000 | 13.246 | 13.100 | .002 |
| Error(Procedurality) | 20.223 | 20.000 | 1.011 | ||
| Form | 80.278 | 5.080 | 15.824 | 28.206 | .000 |
| Form * Text/Graphics | 5.924 | 5.080 | 1.166 | 2.079 | .073 |
| Error(Form) | 56.994 | 101.593 | .561 | ||
| Procedurality * Form | 11.391 | 6.409 | 1.777 | 4.465 | .000 |
| Procedurality * Form * Text/Graphics | 8.345 | 6.409 | 1.302 | 3.271 | .004 |
| Error(Procedurality * Form) | 51.027 | 128.176 | .398 |
No significant main effect was found for the between-subjects factor of text vs. graphics.
| Source of Between-Subjects Effects | Type III Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Intercept | 315.746 | 1 | 315.746 | 117.546 | .000 |
| Text/Graphics | 4.837 | 1 | 4.837 | 1.801 | .195 | Error | 53.723 | 20 | 2.686 |
Overall, accuracy was significantly better in the non-procedural
conditions than in the procedural conditions:
Based on the existing literature on query formulation, we hypothesized that query accuracy would monotonically decrease for Boolean forms starting with Form 1 and ending with Form 8. A rather striking exception to this hypothesized result came in Form 5 (A OR (B AND C)), for which subjects made many more errors than in formulating queries of Form 6 ((A OR B) AND NOT C).
In fact, subject performance was much lower on Forms 5, 7, and 8 than on the other Boolean forms. All three of these forms exhibit a "global OR" structure, i.e. if these forms were diagrammed on a grammar tree, the top node would be an OR. Many subjects generated queries that suggested they viewed these queries as having a global AND (conjunctive) structure, rather than a global disjunctive structure.
Performance was best using Boolean connectives (nonprocedural text). The two graphical conditions were close to the same performance. Performance was worst in the procedural text condition:
While performance was generally better in the non-procedural conditions than in the procedural conditions, there is a significant interaction that appears in forms 4 and 5:
Once again, the three-way interaction seems to appear mostly in forms 4 and 5. Note the peaking on Form 4 for both procedural conditions. The procedural text condition seems to have a very different profile than any of the others. However, the rank-ordering of Boolean, then Venn, Filter-Flow, and If-Then-Else is robust for the first few forms and the last few forms:
Significant effects for satisfaction were found for both procedurality and the interaction term procedurality * text/graphics:
| Source of Within-Subjects Effects | Type III Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Procedurality | 270.750 | 1 | 270.750 | 21.330 | .000 |
| Procedurality * Text/Graphics | 192.000 | 1 | 192.000 | 15.126 | .001 |
| Error(Procedurality) | 279.250 | 22 | 12.693 |
A marginally significant effect was found for the between-subjects factor of text/graphics:
| Source of Between-Subjects Effects | Type III Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Intercept | 10680.333 | 1 | 10680.333 | 470.327 | .000 |
| Text/Graphics | 102.083 | 1 | 102.083 | 4.495 | .046 | Error | 499.583 | 22 | 22.708 |
Subjects seemed to prefer the non-procedural conditions to the procedural conditions, and to prefer the graphical conditions over the textual conditions:
Satisfaction scores were highest for the Boolean condition, followed by Venn and then Filter-Flow. If-Then-Else received much lower satisfaction scores than any of the other conditions.
These data reveal several interesting issues. Despite their comparative performance, subjects generally have difficulty expressing Boolean queries, incorrectly expressing almost 40% of their queries in this condition. Performance was highly affected by the underlying Boolean forms of the queries. As previously noted, queries with greater numbers of terms were more difficult to correctly formulate. Conjunction is easier than disjunction. Similarly, negation makes queries more difficult to express.
However, subjects' performance on Boolean forms with a "global OR" structure was unexpectedly low. It would be interesting to try to isolate the source of these errors. Unlike Greene et al, we did not test the reading comprehension of the subjects with regard to the English query descriptions. What we observed might simply be the result of a difficult wording of the English queries or a difficulty in reading comprehension. However, they might also reflect a deeper level of difficulty in human processing of disjunctive information. Johnson-Laird (1983) postulates that human processing of logical syllogisms is limited by the number of alternate models that can be simultaneously maintained in working memory. Disjunctive queries might require subjects to entertain several "models" of the required results e.g. pages that contain one term, or the other, or both, thus taxing the working memory capacity of searchers. These issues are ripe for further study.
On a more practical note, as Sewell & Teitelbaum (1986) observe, subjects seeking information could achieve implied OR's by repeated queries, rather than using a disjunctive formulation of a single query. Sewell & Teitelbaum found that AND was the only Boolean operator used in over half of the searches they studied during an eleven year period. Miller, Kirby, and Templeton (1988) studied queries to a CD-ROM version of MEDLINE at a medical library, and found that AND was used in 58 percent of search statements, compared to OR being used in 2 percent of searches. Users appear to be uncomfortable using OR in some actual search situations.
Indeed, user satisfaction and performance appear to be highly correlated. Subject performance was highest for the Boolean condition, followed by Venn, Filter-Flow, and then, with much lower scores, If-Then-Else. The same pattern was found in the satisfaction ratings.
While subjects were recruited with the instruction that they should not sign up for the experiment if they had studied Boolean logic, all of the subjects were American undergraduate university students, and given that the teaching of Venn diagrams is common in high-school and first-year level mathematics classes, it is likely that many of the subjects had been exposed to this formalism. Further, because of the presence of Boolean connectives in such widely-used system as on-line library cataloguing systems, it is also likely that many of the subjects had been exposed to this type of system. Because subjects performed best in the Boolean and Venn conditions, it is not clear to what extent prior experience had on their higher performance. This variable must be better controlled in future studies.
What are the implications for interface design? At this point, the traditional Boolean connectives still appear to be the best alternative as a system for expressing Boolean queries. However, user performance with this system is still far from optimal. Further, subjects using Boolean connectives did not have the highest accuracy scores in three of the eight Boolean forms we studied. In these cases, user performance was superior using graphical query systems. This study reveals several other interesting effects (low performance on global OR's structure, interactions in forms 4 and 5) that might point the way toward designing systems that could improve the ability of naive users to formulate specific types of Boolean queries.
Firstly, it would be interesting to implement some of these systems in an on-line Web format, rather than just in paper-and-pencil form. Some of the difficulties with syntax could be obviated by interface designs that only allow well-formed queries to be generated.
Also, we did not directly determine the degree to which comprehension of the English queries affected subjects' performance. In particular, was subject performance on the "global OR" forms influenced by a lack of understanding of the logical forms of the queries themselves, rather than a difficulty in using particular query systems? Using some kind of verification measure, such as that used by Greene et al, would help tease out some of these issues.
In sum, the underlying cognitive issues that affect user performance with any search interface are not well understood. It is clear that by attending to issues such as the level of procedurality, form, and graphical representation of queries, we may impact the performance of an IR system significantly. In our view, understanding such variables is likely to lead to substantive gains in performance.