Research on perception of these simple figures has led to alternate views on the underlying representations and mechanisms. All models of vision have some representation, in the form of the input buffer if nowhere else, which is simply a two-dimensional brightness map. This two-dimensional image may be pre-processed with smoothing and line thinning to eliminate noise, yielding another two-dimensional image, easier to process than the input. This is often a preliminary step in Optical Character Recognition systems [Kahan, et al.], [Mantas]. While such image refinement, in people and computer systems, may aid in accurate perception, it is simply pre-processing, and does not in and of itself perceive the figure, nor does it render a qualitatively different kind of representation from the original two-dimensional input.
Two-dimensional maps, whether simply the original input or a pre-processed version, make up the first of three broad types of representation. Two-dimensional brightness maps will henceforth be referred to as syntactic representations. Any of several methods can directly produce category judgements based on syntactic representations [Townsend], [Bouma]. Townsend compares several categorization methods (i.e., General Activation, All-or-None, and Luce's Theory of Choice) based on how well their theoretical prediction matrices correlate with human error-making in categorization of upper-case letters presented tachistoscopically. These models are all alike in positing no higher-level representation, however, and use only the image map. Syntactic representations are certainly part of any visual system, but need not be the representation consulted in any final categorization process.
The second broad type of representation consists of those which calculate higher-order functions from the syntactic input map. These are generally known as features [Gibson]. Three types of feature representations can be distinguished: those which detect the presence or absence of some local image detail, such as a tip; those which detect the presence or absence of a global property of the entire image, such as symmetry; and those which are themselves two-dimensional maps, but denote in each location the presence or absence of an image detail in that specific locality. Both psychologists [Gibson], [Keren and Baggen] and OCR researchers [Mantas] have invoked feature representations. While there is compelling evidence to suggest their use in perception, feature representations will not be given careful consideration in the following discussion, due to the absence of an agreed-upon set of features used in perception of lower-case letters, which will be the standard in comparisons made here.
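The three kinds of feature representation distinguished above can be illustrated with a minimal sketch. It assumes a thresholded (binary) image map, and the particular detectors (`tip_map`, `has_tip`, `is_mirror_symmetric`) and the toy figures `BAR` and `ELL` are hypothetical choices made only for illustration; they are not a feature set proposed by any of the cited authors.

```python
from typing import List

Image = List[List[int]]  # 1 = ink, 0 = background (assumed thresholded brightness map)


def ink_neighbours(img: Image, r: int, c: int) -> int:
    """Count ink pixels among the (up to) eight neighbours of cell (r, c)."""
    rows, cols = len(img), len(img[0])
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += img[rr][cc]
    return total


def tip_map(img: Image) -> Image:
    """Feature map: 1 wherever an ink pixel has exactly one ink neighbour."""
    return [
        [1 if img[r][c] and ink_neighbours(img, r, c) == 1 else 0
         for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def has_tip(img: Image) -> bool:
    """Local-detail feature: is a tip present anywhere in the image?"""
    return any(any(row) for row in tip_map(img))


def is_mirror_symmetric(img: Image) -> bool:
    """Global feature: is the image symmetric about its vertical axis?"""
    return all(row == row[::-1] for row in img)


# Hypothetical toy figures: a vertical bar and an "L"-shaped figure.
BAR: Image = [[0, 1, 0],
              [0, 1, 0],
              [0, 1, 0]]
ELL: Image = [[1, 0, 0],
              [1, 0, 0],
              [1, 1, 1]]
```

Here `has_tip` and `is_mirror_symmetric` each reduce the image to a single present-or-absent judgement, while `tip_map` retains the two-dimensional layout and records where the detail occurs.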
The third type of representation comprises structured representations, which parse the image into parts. A level of representation higher than the image map contains two-dimensional images of only a portion of the original figure. Multiple images can represent each of the parts of the figure [Palmer], [Biederman]. Finally, relations between the parts can also be included [McGraw]. For example, a structured representation of a lowercase 'f' might consist of two images: one depicting the crossbar alone, and the other showing the curved vertical staff, plus the information about how and where they are connected. These figures may then activate established semantic entities, or roles (such as "wing" or "tail of a 'q'"), the activation of which influences a yet-higher level, the category representations, or may be manipulated as novel figures for non-categorization tasks. A related taxonomy of representation styles can be found in [McGraw].
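The lowercase 'f' example can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names (`Part`, `Relation`, `StructuredFigure`), the 5-by-3 grid, and the relation label "crosses" are hypothetical and are not taken from [McGraw] or the other cited models.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink, 0 = background


@dataclass
class Part:
    name: str     # hypothetical label for the part
    image: Image  # two-dimensional image showing this part alone


@dataclass
class Relation:
    first: str                 # name of one part
    second: str                # name of the other part
    kind: str                  # how the parts are connected
    location: Tuple[int, int]  # (row, column) where they are connected


@dataclass
class StructuredFigure:
    parts: List[Part]
    relations: List[Relation] = field(default_factory=list)

    def composite(self) -> Image:
        """Overlay the part images to recover the whole-figure image map."""
        rows = len(self.parts[0].image)
        cols = len(self.parts[0].image[0])
        return [
            [int(any(p.image[r][c] for p in self.parts)) for c in range(cols)]
            for r in range(rows)
        ]


# A crude 'f': a curved vertical staff plus a crossbar, joined at row 1, column 1.
staff = Part("staff", [[0, 1, 1],
                       [0, 1, 0],
                       [0, 1, 0],
                       [0, 1, 0],
                       [0, 1, 0]])
crossbar = Part("crossbar", [[0, 0, 0],
                             [1, 1, 1],
                             [0, 0, 0],
                             [0, 0, 0],
                             [0, 0, 0]])
lowercase_f = StructuredFigure(
    parts=[staff, crossbar],
    relations=[Relation("crossbar", "staff", "crosses", (1, 1))],
)
```

The syntactic image map remains recoverable through `composite()`, but the parse into parts and the relation between them are information that the flat map does not make explicit.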
Thus, there is evidence for three different types of representation in perception of line figures, and each experimental study cited above shows the primacy of one type or another. This apparent paradox is resolved by noting that each model explains human data collected under some experimental conditions, but that (at least) two mutually exclusive cases exist. The studies suggesting syntactic representations all involve degraded stimulus presentation, limited either by brief presentations or low acuity of the stimulus (achieved either by making the stimulus very small or moving it radially from the fovea). All studies finding that structured representations are used allow plenty of presentation time and a clear, foveal view of the stimulus. It seems that the extra time is needed to construct a structured representation. By studying parallel data on error-making in the task of letter perception one can see that both syntactic and structured representations are involved in categorization tasks. This rejects the implicit assumption in each of the above papers that one model can explain all perceptual events for line figures.
Survey of Literature- Breakdown by Method
Researchers often seek insight into perceptual processes by studying
error-making, and adjusting exposure durations, or the overall clarity
of the stimulus image, are two ways to obtain a suitable error
rate. Using the lower-case roman alphabet as a stimulus set,
experimenters have calibrated stimulus exposure durations or image
acuity to derive error matrices with error-making fixed at 50%
[Townsend], [Gilmore, et al.], [Geyer]. The previously-cited authors
adjust stimulus exposure alone, while [Bouma] employs three different
methods in producing error matrices in order to "get an idea of the
extent to which the method itself is of influence". In one of Bouma's
experiments, the alphabetic stimulus is displayed briefly, and outside
of the foveal area, laterally displaced from the center of the visual
field. In the second, the stimulus is displayed at a great distance
from the subject, subtending only 7' of arc. As we will soon see, the
perception involved in all of these methods points to fundamentally the
same sorts of representations.
A different representation is suggested by studies which allow ample time and acuity. The goal of producing errors, in spite of these improved viewing conditions, has been met by using unusual and stylistically-varied letters as stimuli [McGraw, et al.] or by combining variety of stimulus with a unit degradation of deletion or addition of segments as in [Sanocki]. Additionally, perception of letter-like figures in [Palmer], [Palmer] has been probed with parsing and goodness-of-part grading tasks involving no temporal or acuity impairment. All of this latter group of studies have concluded that structured representations are invoked.
If one considers only perception of two-dimensional figures, not meant to represent three-dimensional objects, and studies which propose either syntactic or structured representations, there is a double dissociation along these lines, with no research based on brief exposures or low acuity suggesting a structured approach, nor any research allowing plenty of time and acuity recommending a syntactic representation. See Table 1.
| | Syntactic | Structured |
| Limited Time or Acuity | Townsend; Gilmore, et al.; Bouma; Geyer | |
| Free Time, High Acuity | | Palmer; Sanocki; McGraw, et al. |
At this point, it is important to identify all possible explanations behind the double dissociation between experimental method and experimenter's conclusion. The first possibility is that syntactic and structured representations each predict human data equally well in all experimental paradigms. [McGraw, et al.] dispel this possibility by showing that their data is predicted much better by structured representations than by syntactic ones. The second possibility is that one form of representation always predicts human data best, regardless of experimental paradigm. In light of the findings of [McGraw, et al.], that would more specifically state that structured representations are always superior. The third possibility is that different experimental paradigms engage different representations. In order to rule out the second possibility, and establish that syntactic representations are not simply inferior predictors of human behavior in all cases, the next section compares datasets collected with short stimulus exposures or limited acuity to both syntactic and structured theoretical prediction matrices.
Analysis of Error Matrices- Geyer, Bouma vs. McGraw, et al.
While many studies use uppercase letters in deriving alphabetic error
matrices, the availability of theoretical prediction matrices for
lowercase letters, as well as sufficient variety in method of
presentation, leads this section to focus on four lowercase confusion
matrices. [Geyer] provides one error matrix based on the lowercase
roman alphabet obtained with short enough exposures to fix for each
subject, in a preliminary trial block, correct identification at
50%. The interval itself is not reported, although a similar design
using uppercase letters conducted by [Gilmore, et al.] found that
stimulus durations ranging widely from 10 to 70ms were
appropriate. [Bouma] produced a similar lowercase alphabetic confusion
matrix (not including 'y') in which durations of 100ms were used, with
acuity restricted by moving the stimuli outside the fovea. The
lateral displacement varied for each subject, and was designed to fix
error rates at 50%. These exposure durations resonate with the finding
of [Treisman and Gelade] that the presence of an item distinguished
from its field by a single feature can be identified correctly at an
80% level given 65ms mean exposure durations, but the same task
involving a conjunction of features requires 414ms. If one can
conclude that a similar decision procedure is used in both cases, then
the difference in reaction time is due to greater latency in forming
representations needed for the second task. This suggests a connection
between structured representations and the representations Treisman
and Gelade infer as the result of focused attention. We also consider
Bouma's matrix produced by limiting acuity with visually extremely
small stimuli, which also prohibits the formation of structured
representations, since the necessary visual information is not
available at all.
Correlating an error matrix of human data with a theoretical prediction matrix based on a certain representation type demonstrates the similarity between that representation type and that used by humans in the categorization task. A theoretical prediction matrix predicts (mis)identification of tokens of category A as tokens of category B as being proportional to a constant between 0 and 1 raised to the power of the distance between categories A and B for the given representation type. Adjusting the constant allows one to fix the percent error of the theoretical matrix to that of the empirically-derived matrix, to maximize fit. [McGraw, et al.] have calculated inter-category distances for the lowercase roman letters based on both syntactic and structured ("proto-role") representations. In order to investigate the predictive value of each representation type in each experimental paradigm, we can examine correlations between the experimental error matrices of [Geyer], [Bouma], and [McGraw, et al.] and syntactic and structured theoretical matrices. Each correlation uses the off-diagonal (error) responses of each experimental matrix and the theoretical matrix tuned to the same percent error (50% for Geyer and Bouma, 84% for McGraw, et al.). These correlations are reported in Table 2, and in Figure 1.
| Pearson R-values | Syntactic | Structured |
| Bouma- Eccentric, Short Exposure | 0.492 | 0.316 |
| Bouma- Low Acuity | 0.575 | 0.428 |
| Geyer | 0.477 | 0.371 |
| McGraw, et al. | 0.368 | 0.884 |
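The procedure described above, in which a theoretical prediction matrix is built from inter-category distances, tuned to a target percent error, and then correlated with human data over the off-diagonal cells, can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used to produce Table 2: normalising each row to sum to one, tuning the constant by bisection, and all function names are assumptions, and the three-category `toy_distances` and `toy_observed` values are invented purely to exercise the code.

```python
import math
from typing import List

Matrix = List[List[float]]


def prediction_matrix(dist: Matrix, c: float) -> Matrix:
    """Row i holds predicted response probabilities for stimuli of category i.

    Each cell is proportional to c ** distance; rows are normalised to sum
    to 1 (an assumption of this sketch).
    """
    out = []
    for row in dist:
        weights = [c ** d for d in row]
        total = sum(weights)
        out.append([w / total for w in weights])
    return out


def error_rate(pred: Matrix) -> float:
    """Mean probability of an off-diagonal (incorrect) response."""
    n = len(pred)
    return sum(1.0 - pred[i][i] for i in range(n)) / n


def tune_constant(dist: Matrix, target_error: float, tol: float = 1e-9) -> float:
    """Bisect on c in (0, 1); the error rate rises monotonically with c."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if error_rate(prediction_matrix(dist, mid)) < target_error:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def off_diagonal(m: Matrix) -> List[float]:
    """Flatten the error cells, skipping correct responses on the diagonal."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented toy data for three categories (not real letter distances or errors).
toy_distances: Matrix = [[0.0, 1.0, 3.0],
                         [1.0, 0.0, 2.0],
                         [3.0, 2.0, 0.0]]
toy_observed: Matrix = [[0.50, 0.35, 0.15],
                        [0.30, 0.50, 0.20],
                        [0.10, 0.40, 0.50]]

c = tune_constant(toy_distances, target_error=0.5)
predicted = prediction_matrix(toy_distances, c)
r = pearson_r(off_diagonal(toy_observed), off_diagonal(predicted))
```

Note that with n categories the error rate of such a matrix cannot exceed (n - 1)/n, reached as the constant approaches 1, so a target of 50% or 84% is attainable for the 26 lowercase letters.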

Thus, the effects found are rather striking, and the clear implication is that the perceptual system can categorize using either of two types of representation (syntactic and structured), depending upon experimental circumstances. In particular, the subject produces a structured representation if permitted to do so, but otherwise must rely upon a less-informed syntactic representation. OCR models based on syntactic representations have been shown to make errors that human viewers never would. (The problems with syntactic representations are discussed at length in [McGraw].) The fact that structured representations are higher-order, and require additional computation to derive, jibes well with the finding that these are only seen to be involved when the time and information needed to perform such computation are available. [McGraw, et al.] report that subjects take at least 500ms to respond to their stimuli, and as several varied experimental paradigms have shown that simple responses can be made in a fraction of that time, we can conclude that several hundred milliseconds are needed to form structured representations of lowercase letters. That resonates, again, with [Treisman and Gelade], who show that the use of attention to detect conjunctions of features requires about that much time.
The findings of this paper enrich the views of the cited researchers, all of whom either fell into the category error that there is but one representation type and process used in visual perception [Townsend], [Geyer], [Keren and Baggen], [McGraw, et al.], or postulated multiple layers of representation, but held that only the highest-level representation influences behavior [Marr], [Biederman]. Finally, [Bouma] used different experimental paradigms in an attempt to explore possible differences, but happened to choose two paradigms which both led to the use of syntactic representations. It is not hard to imagine that other cognitive tasks now bearing one unitary label in the literature are actually performed in multiple ways, involving multiple types of representation, with the representation(s) actually used in shaping behavior varying from instance to instance, depending on situational variables.
It is easy to see why at least two modes of control over behavior, one consulting syntactic and the other structured representations, would exist. Dual-control structures are common in the nervous system. For example, the speed of breathing may be controlled by conscious effort, but matches the needs of the body based upon physical activity with no conscious thought at all. This allows breathing to proceed continuously during normal activity, yet to be modulated beneficially during unusual events (such as swimming). In the case of visual perception, the ideal would be a rich representation, produced very quickly, which distinguishes all characteristics of possible relevance. Since this is not possible, two representations exist. The syntactic operates very quickly (on the order of 100ms or less), and gives the organism the best information available in such a short time. Meanwhile, structured representations are formed, and while this is slower (on the order of 400ms), and may not be available if the exposure to the visual stimulus was too brief, it will in those instances where it is available inform the organism better than the syntactic representation. Thus, by choosing the best representation available, the organism is better informed than would be possible if only one type existed.
An analogy may be made to the hierarchy of computer memory, in which very small, very fast memories exist in a central processing unit. A larger, and not-quite-so-fast memory is found in the cache. Still larger and slower memory exists in the random access memory, and massive but ponderous storage on disk drives and other devices looms furthest from the processor. This hierarchy of memory provides a trade-off suitable for each situation, and the computer can use a fast memory when that is possible, and a large memory when that is necessary. Again, the overall function is vastly better than if only one type were available. (A single memory with all the virtues is technically impossible.) In the case of visual perception, this paper shows that at least two places along the tradeoff continuum are used. It would be desirable to investigate the use of other possible representation types. Feature representations could be explored given a fixed list of features for discriminating lowercase letters that could be used in generating inter-category distances for a theoretical prediction matrix. Feature lists in the literature [Keren and Baggen], [Gibson], however, are all either given only for one specific stimulus set (in no case the lowercase letters), or, if intended to be general, lack the discriminatory ability over the lowercase letters to be convincingly general.
Another interesting question is what makes the decision as to which of two representations is consulted in producing a given behavior. Might more than one be involved in a single act? Does the structured representation always win, when available? Is this the sort of decision that makes up the basis of decision-making in general?
Finally, it is interesting to wonder if the mechanisms associated with
consciousness are the ones making decisions here. Or is all cognition
leading to structured representations conscious, while syntactic
representations are produced by low-level processes, requiring no more
thought than a heartbeat? More detailed studies of the perceptual
architecture might discover processes and structures that mediate
consciousness. If a letter-recognition experiment involving subjects
with blindsight showed that they cannot form structured
representations, even given unlimited time, then the reason for that
difficulty may be narrowed either to poor acuity, which could be
probed for experimentally, or the necessity of conscious, rational
thought in making such perceptual representations. Certainly comparing
results of perceptual experiments with the predictions of multiple
types of representation will provide greater insight into the
perceptual architecture than assumptions that there is only one type
of representation ever can.
References
Biederman, I. (1987). Recognition by components: A theory of human
image understanding. Psychological Review, 94(2):115-147.
Bouma, H. (1971). Visual recognition of isolated lower-case
letters. Vision Research, 11:459-474.
Geyer, L. (1977). Recognition and confusion of the lowercase alphabet.
Perception & Psychophysics, 22(5):487-490.
Gibson, E. (1971). Perceptual learning and the theory of word
perception. Cognitive Psychology, 2:351-368.
Gilmore, G., Hersh, H., Caramazza, A., and Griffin,
J. (1979). Multidimensional letter similarity derived from recognition
errors. Perception & Psychophysics, 25(5):425-431.
Kahan, S., Pavlidis, T., and Baird, H. (1987). On the recognition of
printed characters of any font and size. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-9(2):274-288.
Keren, G. and Baggen, S. (1981). Recognition models of alphanumeric
characters. Perception & Psychophysics, 29(3):234-246.
Mantas, J. (1986). An overview of character recognition
methodologies. Pattern Recognition, 20(1):1-6.
McGraw, G., Rehling, J., and Goldstone, R. (1994). Roles in letter
perception: Human data and computer models. Technical report 90,
Center for Research on Concepts and Cognition, 510 North Fess,
Bloomington, IN, 47405.
McGraw, G. (1995). Letter Spirit (part one): Emergent high-level
perception of letters using fluid concepts. PhD thesis, Indiana
University, Department of Computer Science and the Cognitive Science
Program, Bloomington, Indiana.
Marr, D. (1982). Vision. San Francisco: Freeman.
Palmer, S. (1977). Hierarchical structure in perceptual representation.
Cognitive Psychology, 9:441-474.
Palmer, S. (1978). Structural aspects of visual similarity. Memory
& Cognition, 6(2):91-97.
Podgorny, P. and Garner, W. (1979). Reaction time as a measure of
inter- and intra-object visual similarity: Letters of the
alphabet. Perception & Psychophysics, 26(1):37-52.
Sanocki, T. (1987). Visual knowledge underlying letter perception:
Font-specific, schematic tuning. Journal of Experimental
Psychology: Human Perception and Performance, 13(2):267-278.
Townsend, J. (1971). Alphabetic confusion: A test of models for
individuals. Perception & Psychophysics, 9(6):449-454.
Treisman, A. and Gelade, G. (1980). A feature-integration theory of
attention. Cognitive Psychology, 12(1):97-136.