Structural Analysis of Computer Vocabulary: How Do Users Conceptualize the Computer?

Adity Mutsuddi

 

Introduction

 

Computers of all forms and sizes are used by people with different levels of computer expertise. Many of these users have no technical knowledge of the computer; they use it to “get something done”. What do these users think about the computer? Which words do they use most frequently? This paper attempts to discover the concepts users have about the computer through statistical analyses of the words they used when asked to complete some computer tasks and to describe specific technical terms they might have encountered when using computers in their daily lives.

 

Words used to tag online resources such as videos, papers and photos [1, 4] have been investigated to understand the structure of social tagging systems on the Internet. Words used to describe interests in user profiles of online journals have been analyzed to understand the structure of user interests [5] and the changes in those interests [3]. In this paper, the words participants used during the computer tasks and during the exit interview, in which they described computer-related terms, are analyzed to understand user concepts of the computer.

 

Data collection

 

This paper is part of a larger project [2] that applied design principles to create secure and usable interfaces for users to administer their computers. User studies were performed to examine both the current Windows XP interfaces and the newly designed ones. The participants were asked to complete four tasks based on a scenario involving four family members: 1) create accounts for the family members, 2) log in to visit a website that would show a certificate, 3) log in to install software and 4) configure the system. Joint-exploration[1] [6] was used to elicit the thought process of the participants during the tasks. Upon completion of the tasks, the participants had an exit interview in which they were asked to describe technical terms (e.g. operating system, firewall, network, administrator) and to answer questions related to their tasks.

 

The vocabularies for this paper were compiled from videotapes of the discussions participants had with the joint-explorer and of the exit interviews. There were twenty-two participants (n=22), aged between 25 and 65, who used computers regularly and did not hold a degree in a computing-related field. Any technical word used by the participants during the tasks and the interview was included in the data. Words like “login”, “install”, “user”, “administrator”, and “computer” that appeared in the text describing the tasks were excluded. Words that appeared in the exit-interview questions were also excluded from the answers to those questions. For example, if the answer to the question “What is an anti-virus?” contained the word “virus”, that use was not included; if the word was mentioned in other situations, it was included. Similarly, if the task was to comment on the certificate that showed up when they visited a website, then the word “certificate” was not included. Furthermore, words like “things” and “stuff” were not included.

 

Technique for Analysis

 

The analytical technique applied in this paper has been used successfully in [3, 4, 5]. The data is first analyzed using Principal Component Analysis (PCA) and then using Hierarchical Cluster Analysis (HCA).

 

PCA characterizes the major dimensions of variation in the data through one of its results, the principal component (PC) scores. Each dimension explains part of the variance in the initial data. By analyzing the space of the PC scores, it is possible to choose a reduced number of dimensions that describes the variation observed in the data. The initial data can then be projected onto the space formed by the reduced number of dimensions. As a result of this reduction, interpretation of the data becomes simpler [4].
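The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine used in the study: it finds the leading components by power iteration on the correlation matrix (the covariance of the scaled data), a simpler method swapped in here for readability, and the small 0/1 matrix and all function names are hypothetical.

```python
import math

def standardize(rows):
    """Center every column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    # A column with zero variance cannot be scaled and must be dropped beforehand.
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def covariance(z):
    """Sample covariance matrix of rows that are already centered."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def principal_components(rows, k, iters=1000):
    """Return (sdev, loadings, scores) for the first k principal components."""
    z = standardize(rows)
    cov = covariance(z)
    p = len(cov)
    sdev, loadings = [], []
    for _component in range(k):
        # Arbitrary fixed start vector, normalized to unit length.
        start_norm = math.sqrt(sum((j + 1) ** 2 for j in range(p)))
        v = [(j + 1) / start_norm for j in range(p)]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: the variance explained along direction v.
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(p) for j in range(p))
        sdev.append(math.sqrt(max(eigenvalue, 0.0)))
        loadings.append(v)
        # Deflation: remove the component just found before looking for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    # PC scores: projection of each standardized row onto each component.
    scores = [[sum(r[j] * v[j] for j in range(p)) for v in loadings] for r in z]
    return sdev, loadings, scores

# Hypothetical word-by-user indicator matrix (rows are words, columns are users).
example = [[1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
example_sdev, example_loadings, example_scores = principal_components(example, k=2)
print(example_sdev)
```

Each row of `example_scores` holds the coordinates of one word in the reduced two-dimensional space. The sign of a component is arbitrary, so scores may come out mirrored relative to another implementation.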

 

To view the structurally similar groups of data, Hierarchical Cluster Analysis is employed. To do so, the distances between pairs of data points in the reduced PC space (the PC distances) are calculated, which approximate how strongly the data points are related: closely related data points lie close together in the PC space. HCA is then applied to the PC distances to group the data points that are structurally similar [4].
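As a rough sketch of this step, the following Python code groups points in a PC space by agglomerative clustering with Ward's minimum-variance criterion on squared Euclidean distances. It is a naive illustration, not the routine used in the study; the function name and the example coordinates are hypothetical.

```python
def ward_clusters(points, k):
    """Merge clusters until k remain, always choosing the pair whose merge
    least increases the within-cluster sum of squares (Ward's criterion)."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                # Squared Euclidean distance between the two centroids.
                d2 = sum((x - y) ** 2 for x, y in zip(ca["centroid"], cb["centroid"]))
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            # Centroid of the merged cluster is the size-weighted mean.
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(ca["centroid"], cb["centroid"])],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return sorted(sorted(c["members"]) for c in clusters)

# Hypothetical two-dimensional PC scores for five data points.
pc_scores = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(ward_clusters(pc_scores, k=3))
```

The function returns the row indices of the points in each cluster. Recording the merge cost at each step would give the heights needed to draw a dendrogram.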

 

In this paper, the PC distances show the similarity of the users and applying the HCA on the distances will reveal the groups of users sharing the usage of similar words.  The clusters of words may identify some coherent groups of users indicating the concepts users have about the computer.

 

Data Description

 

The vocabulary data consists of 83 unique words with 33 words that were used in more than one context (the technical term or the task that was being described by the word). In total there are 175 data points. The data consists of the context in which a word was used (Context), a reduced context (Context_Redefined) that grouped together similar types of contexts and a further reduction that categorized the words into 6 categories (Category). The categories are: accounts, configuration, general_computer, network, operating system and security. The numerical information consists of the total number of times a word was used (Sum), the total number of users who used the word (num_ppl) and the number of times each user (S1, S2 and so on) used a word. Table 1 shows the layout of the data.

 

For analysis, the number of times a word was used was discarded, to normalize the effect of a word that a user might have used many times, mostly to explain repeated ideas. In addition, if a word was used many times by only one user, then the word was not as widely used as the raw counts might suggest. Instead, “1” was used to denote that a user used a word under a particular context and “0” otherwise.
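The recoding described above amounts to replacing each count with a presence indicator. A minimal Python sketch follows, using two rows taken from Table 1; the dictionary layout and the function name are assumptions made for illustration.

```python
# Usage counts per (word, context) pair and user, as in two rows of Table 1.
counts = {
    ("control", "account type"): {"S2": 0, "S5": 0, "S6": 2, "S7": 0},
    ("connected", "Network"): {"S2": 0, "S5": 0, "S6": 1, "S7": 1},
}

def to_indicator(counts):
    """Replace every count with 1 if the user used the word at all, else 0."""
    return {
        key: {user: int(n > 0) for user, n in per_user.items()}
        for key, per_user in counts.items()
    }

indicators = to_indicator(counts)
print(indicators[("control", "account type")])
```

After this step, a user who repeated “control” twice contributes the same weight as a user who said it once.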

 

 

Words      Context             Context_Redefined   Category          Sum  num_ppl  S2  S5  S6  S7

connected  Network             network             network           10   8        0   0   1   1

control    account type        account type        accounts          2    1        0   0   2   0

control    access to computer  access to computer  general computer  2    1        0   0   2   0

cookies    Configuration       configuration       configuration     3    3        0   0   0   0

cookies    Adware              malicious objects   security          1    1        0   0   0   0

cookies    Spyware             malicious objects   security          2    2        0   1   0   0

customize  account type        account type        accounts          1    1        1   0   0   0

Table 1: The initial layout of the vocabulary data

 

After the initial analysis, it was discovered that keeping repeated words in different contexts failed to reveal any structure in vocabulary usage. Multiple clusters contained the same word and there were no observable coherent clusters. This could be an indication that even though a word was used to describe different terms or tasks, the essential meaning of the word was still the same. For example, the word “files” used to describe a virus means the same as when it was used to describe a download or a process: the user was talking about data that could be destroyed by the virus, downloaded or executed on the computer. The word “malicious” means “bad” whether it was used to describe bad software or a bad web site. The most important information is the usage of the word and not its context. As a result, the layout of the data was further reduced to contain only the words, the number of users who used a word (num_ppl) and, for each user (S1, S2 and so on), an indicator of whether that user used the word. Table 2 shows the final structure of the data.

 

Words          num_ppl  S2  S4  S5  S6  S7

access         17       0   0   1   1   1

affect         1        0   0   0   0   0

applets        1        0   0   0   0   0

applications   3        0   0   0   0   0

attachment     1        0   0   0   0   0

authenticity   2        0   0   0   0   0

authorization  1        1   0   0   0   0

block          5        0   0   1   0   0

break          1        0   0   0   0   0

browser        1        1   0   0   0   0

code           2        0   0   0   1   0

Table 2: The final layout of the vocabulary data
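One way to picture the reduction from the per-context layout to the per-word layout is to merge all rows that share a word, counting a user once if they used the word in any context. The Python sketch below is an assumption about how such a merge could be done, not the procedure used in the study. The example rows are already recoded to 0/1 and cover only the four users shown in Table 1, so the resulting counts are not the full num_ppl values.

```python
def collapse_contexts(rows):
    """Merge indicator rows that share a word, ignoring the context."""
    merged = {}
    for word, indicator in rows:
        target = merged.setdefault(word, {user: 0 for user in indicator})
        for user, used in indicator.items():
            # A user counts once if they used the word in any context.
            target[user] = max(target.get(user, 0), used)
    return merged

def count_users(indicator):
    """Number of users who used the word at least once."""
    return sum(indicator.values())

# 0/1 indicator rows; only users S2, S5, S6 and S7 are shown.
rows = [
    ("cookies", {"S2": 0, "S5": 0, "S6": 0, "S7": 0}),    # context: Configuration
    ("cookies", {"S2": 0, "S5": 0, "S6": 0, "S7": 0}),    # context: Adware
    ("cookies", {"S2": 0, "S5": 1, "S6": 0, "S7": 0}),    # context: Spyware
    ("customize", {"S2": 1, "S5": 0, "S6": 0, "S7": 0}),  # context: account type
]

by_word = collapse_contexts(rows)
for word, indicator in by_word.items():
    print(word, count_users(indicator), indicator)
```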

 

 

S2        S4        S5        S6        S7        S8        S9        S10      S11      … S23

0          0          1          1          1          1          1          1          1          … 1

0          0          0          0          0          0          0          0          1          … 0

0          0          0          0          0          0          0          0          0          … 0

Table 3: Layout of the data used for analysis

 

Results and Analysis

 

The statistical programming language “R” [7] was used for the analyses. The matrix of words and users (Table 3) was passed to the PCA function “prcomp” with scaling enabled. A plot of the standard deviations of the principal components, which prcomp returns as sdev (Figure 1), helps decide how many dimensions to analyze. The “elbow” of the plot is usually the point beyond which the dimensions do not contribute significantly to the variation in the data [4]. In this plot, the “elbow” is not clearly defined. However, after the first two dimensions there is a slight dip in the graph. The third dimension was explored but was found to be insignificant. Hence, the HCA explores the first two dimensions only.
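When an elbow is hard to see, it can help to convert the standard deviations into the share of variance each dimension explains. The Python sketch below shows that arithmetic; the sdev values are made up for illustration and are not the values from this study.

```python
def variance_explained(sdev):
    """Proportion of total variance explained by each principal component."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]

def cumulative(proportions):
    """Running total of the explained proportions."""
    running, out = 0.0, []
    for p in proportions:
        running += p
        out.append(running)
    return out

# Hypothetical standard deviations of the first five components.
illustrative_sdev = [2.0, 1.5, 1.0, 0.9, 0.8]
shares = variance_explained(illustrative_sdev)
for index, (share, total) in enumerate(zip(shares, cumulative(shares)), start=1):
    print(f"PC{index}: {share:.3f} (cumulative {total:.3f})")
```

A sharp drop in the per-component share followed by a flat tail corresponds to the elbow in a plot of sdev.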

 

Ward’s hierarchical cluster analysis method was applied to the first two dimensions of the PC scores, using Euclidean distance as the measure of dissimilarity between words. The result of the cluster analysis is presented as a dendrogram in Figure 2. The dendrogram shows that a cut into three clusters (height of about 30) yields one huge cluster and two very small clusters of sizes four and eight. The large cluster is composed of four moderate-size clusters and the smaller cluster splits into several very small clusters. As the cut height is lowered, the clusters break into ever smaller clusters. A cut into eight results in clusters of sizes 1 and 2, which are not useful in suggesting concepts that users might have. Therefore, a cut into five clusters (height of about 11) is selected, resulting in one big cluster (45 members) on the left, two small clusters (15 and 10 members) split from the third top-level cluster, and two smaller clusters (4 and 8 members), which are the same as the two small clusters in the cut into three.

 

Figure 1: Plot of sdev from PCA

Figure 2: Dendrogram plot of HCA clusters

 

 

Red Cluster (1): access 17, program 17, software 15, virus 15

Green Cluster (2): affect 1, applets 1, attachment 1, authenticity 2, break 1, complex tasks 1, complexity scale 1, components 2, configure 2, conflict 1, control 1, device 1, domains 1, DOS 1

Blue Cluster (3): applications 3, connected 8, firewall 3, preferences 4, protection 10, secure 7, security 12, share 6

Magenta Cluster (4): authorization 1, browser 1, customize 1, data 1, documents 1, download 3, firewire 1, folders 1, information 2, internet 2, key strokes 2, processor 1, update 2, upload 1, web address 2

Black Cluster (5): block 5, code 2, cookies 6, files 3, harddrive 4, interface 4, malicious 4, microsoft 7, OS 3, privileges 6

 

Table 4: List of words in each cluster

The words in each cluster were then sorted and are presented in Table 4 (only the first fifteen words of the Green cluster are shown; the cluster contains 45 words, which is too many to present in the table). Before interpreting the results, it is necessary to check that using only two dimensions was a good choice. A plot of the first three principal components (Figure 3) clearly shows that PC1 separates the Red cluster and PC2 separates the Blue cluster. PC3 does not seem to have any distinct effect on the clusters, so the initial decision to explore only the first two dimensions was sound. Next, the locations of the clusters were plotted in the principal component space, labeled with “Words” (Figure 4) and “num_ppl” (Figure 5).

Figure 3: Plot of the first three principal components

 

 

Figure 4: Principal Components plot showing the five clusters of Words

Figure 5: Principal Components plot  showing the five clusters of Words with num_ppl

 

Figures 4 and 5 clearly show that PC1 represents the number of people who used a word and PC2 most probably represents the degree to which pairs of words were used together. The Red cluster (1) contains words that were used by many of the users in the study. The Green cluster (2) and the Magenta cluster (4) consist of many words that were used by one or two users. The Blue cluster (3) and the Black cluster (5) contain words that were used by three or more users, with “code” as the only word that was used by fewer than three users.

 

The Red cluster, consisting of software, program, virus and access, is a group of words that users used to describe most of the tasks and technical terms. This group can be characterized as the “most frequently used”. At this point, an analysis of the contexts in which the words were used may explain the coherence of this group. “Program” and “software” were used in all six categories (see Table 1 for Category). “Access” was used in three of the categories, indicating access to the computer, website, network, files on the network, software and accounts. This suggests that most users think of the computer in terms of software, programs and access to different things on the computer. “Virus” was used in four of the categories to describe anything that was bad for the computer: a virus can exist in software and on a website, a virus can be downloaded from the internet, a firewall can protect against a virus (though this is an incorrect notion) and a virus can affect the computer if it is not configured appropriately. This is positive evidence that “virus” was used to describe anything that could be harmful to the computer.

 

The Magenta cluster, although consisting of words used by few users, has a distinct characteristic. Many of the words in this group are data- and web-oriented: data, information, documents, folders, internet, browser, upload, download, update and web address. Since the web is composed of data, this group can be labeled the “data” group, which signifies that some users think of the computer in terms of data. However, the words customize, authorization, processor, key strokes and firewire (invented by a user to describe a network port) do not fit the “data” theme and thus cannot be labeled.

The Green cluster has no theme. It contains many different kinds of words that cannot be characterized. Both the Blue and the Black clusters mostly contain words that are security-related. Some of the words in the Blue group are network-related (connected, firewall and share), whereas the Black group includes some hardware-related words: OS (Operating System), hard drive and interface (used to describe OS and hardware).

 

Since these two groups are difficult to characterize, an examination of the contexts of these words might shed some light on their character. Context analysis shows that some users are more concerned about security than others. Although everyone used words related to security, many users did not use security words unless they faced a security-related task or term. Those who used words like “block” (in the Black cluster) for limiting the children’s account type and “secure” (in the Blue cluster) for secure passwords or computer name, none of which were explicitly mentioned in the task or the exit interview, are probably more concerned about security than others. This assessment, however, is not conclusive, because both clusters also contain some non-security words, which makes them difficult to characterize.

 

Conclusion

The statistical analysis together with the analysis of the contexts under which the words were used shows the following:

1) The five most frequently used words are software, program, access, virus and security

2) For most users, the computer is all about software, program and access to different things (e.g. computer, network, website, software)

3) Anything threatening to the computer is a virus

4) Some people think of the computer in terms of data: data, documents, internet, information, upload and download

 

Most users conceptualize the computer in terms of software and access to different things on the computer. Hence, when helping general users with computer-related problems, technical terms should not be used without explaining their meaning in words that the user would understand. Furthermore, when helping users to protect their computers from worms or malware, users should be made aware of how these differ from a virus and of their security implications. The most interesting result was the existence of a user group that conceptualized the computer in terms of data with an Internet framework.

 

References

 

[1] Marlow, C., Naaman, M., Boyd, D., and Davis, M. 2006. HT06, tagging paper, taxonomy, Flickr, academic article, to read. Proceedings of the 17th Conference on Hypertext and Hypermedia, 31-40.

 

[2] Mutsuddi, A., Jacob, B., Sun, Y., Connelly, K., and Gupta, M. 2007. A Case Study in using Design Principles for Secure Operating Systems Interfaces. Under submission.

 

[3] Paolillo, J., Mercure, S., and Wright, E. 2005. The Social Semantics of LiveJournal FOAF: Structure and Change from 2004 to 2005. Proceedings of the International Semantic Web Conference, Workshop on Semantic Network Analysis. G. Stumme, B. Hoser, C. Schmitz, and H. Alani, eds.

 

[4] Paolillo, J. and Penumarthy, S. 2007. The Social Structure of Tagging Internet Video on del.icio.us. 40th Annual Hawaii International Conference on System Sciences. Los Alamitos, CA.

 

[5] Paolillo, J. and Wright, E. 2007. Social Network Analysis on the Semantic Web: Techniques and Challenges for Visualizing FOAF. In V. Geroimenko and S. Chen, eds., Visualizing the Semantic Web. Berlin: Springer Verlag.

 

[6] Sasse, M. A. 1997. Eliciting and describing user’s models of computer systems. PhD thesis. School of Computer Science. University of Birmingham.

 

[7] "R" Statistical Programming Language. http://www.r-project.org.

 


 

[1] Joint-exploration utilizes deception to have the participant complete tasks with another “participant”, who is actually a researcher. [2]