C212 Assignment 5

Data Mining

Due: Thursday, February 17th, 9:00 AM

Objective: Gain experience with loops, string searching, and command argument handling -- and all this in the context of learning how to do simple HTML data mining. There is a world of data out there, but you have much more power to use it if you can automatically distill from it just what you are looking for.

Pair programming with the partner of your choice (in your lab), or individual work if you prefer.

In lab

  1. Create a directory for this assignment as usual and download to it GetText.java and EbertRecommends.java.
     
  2. Study the comments and main method of the GetText class. Ignore the code in the other methods for now: it involves I/O programming that you will understand later in the course. There are some new things going on in the main method we will learn about now.

    Recall from assignment 1 that a command executed from an operating system shell (command-line interpreter) may be passed arguments. For example, in the command jar -xf a1.jar the strings -xf and a1.jar are argument to the jar command. The arguments are separated by whitespace (one or more spaces or tabs). This is similar to passing arguments to a function, except that in this case the arguments are always strings, and their number may vary. Commands with optional arguments are very common. You also learned in assignment 1 how to invoke a Java class that contains a public static void main(String[] args) method as an application, but in that case no arguments were passed to the Java application.
     
  3. The GetText class may be invoked as an application with an optional argument, which is a web URL, as indicated by the "Usage" line of its documentation. Start a shell, change to your new assignment directory, and compile the classes you downloaded above using the javac command. (Refer back to assignment 1 if you need a refresher on how to do such things with the shell.) Then execute the command java GetText. It prints the HTML text at the URL http://www.cs.indiana.edu/~chaynes/test.html.
     
  4. Next click http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS05. The EbertRecommends program extracts the recommended film titles from this page. This is an example of data mining.
     
  5. To see the HTML text that generated this command you could use the source code viewing command of your browser, or in an operating system shell (not DrJava) invoke the command

    java GetText rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS05


    Try it. Here the GetText command is taking the URL as an argument (less the optional http:// prefix).
     

  6. Look again at the code of the method GetText.main. It uses args.length to obtain the length of the argument array in order check whether the optional URL argument is present. If it is, the argument is retrieved as the value of the expression args[0]. You will shortly learn all about arrays, but for now you just need to know that you can get the length of the array as if it had a public field named length and retrieve the value of a given element of the array using the square bracket syntax with an index number. Array indexing, like string indexing, is zero based, so the index values range from zero to one less than the length.

    In general, the parameter of the main method of a Java application is an array of strings containing the arguments passed to the application when it is invoked from a shell. These arguments appear after the class name when it is invoked using the java command.
     
  7. Now invoke the EbertRecommends class as an application with no arguments. Note that it prints a list of the names of Ebert's recommended movies, extracted from the HTML text you have just viewed in its raw form as well as formatted by your browser.
     
  8. To see how this works, it helps to inspect the HTML text in an editor. To create a file containing the HTML text, execute the java command in step 5 above, but with "> recommends.html" added to the end of the command. In Windows and Unix shells the greater-than operator is used to redirect standard output to the file named after it. If you do not specify the directory into which the file goes, it will be the current directory of the shell. So if you are not already there, you will want to cd to your assignment directory before executing this command.
     
  9. Look at the main method of EbertRecommends.java. First it gets the HTML text of the web page in a string using the GetText.fromUrl method and issues an error message it fails for some reason. Note that it prints the message to Standard.err, not Standard.out. The standard error and standard output streams are usually the same, but not if the standard output stream is redirected, as you have just learned to do. In such cases you do not want error messages to appear in the output file where they might not be noticed, so generally standard error should be used to print error messages.
     
  10. Next it searches for the string Ebert Recommends in the text. To see the effect of this, open the file recommends.html in DrJava (or another plain-text editor) and search for this string using the Find/Replace command. (If your text editor's search command is not case sensitive, it will first find a lower-case version of this string, in which case you need to search again.) Note the line number. The program stores this position in the text string in the variable begin. Next the program finds and records the position of </td> in the text and save it in the variable endRecs. Note that these positions are shortly before and after the text that contains the movie recommendations. You do not need to understand HTML to be able to find things to search for that bracket point of interest in the text. You can confirm that these points surround all the movies of interest by looking at the page in your browser, finding the names of the first and last movies in the list, and searching for them in the HTML text.
     
  11. Next the program enters a loop that searches for the next movie in the list, isolates its name in the HTML markups, and prints each movie title found before repeating the loop. When the search for the next title fails, the program breaks out of the loop. Work through this process by hand for one or two titles by simulating the program searches in your recommends.html editor window.

    You do not need to know much of anything about HTML to do this sort of data mining, though it helps to know that it involves the use of tags surrounded by angle brackets, which you probably knew already.

    By working patiently through all the steps so far, you hopefully have learned the context you need for this assignment. Now it's your turn to do a bit of programming of the sort you will need for this assignment.
     
  12. Notice in your browser view of this HTML file that there are stars before the movie titles indicating the reviewer's rating of the movie (these are favorites, so they are all rated at least 3, but some are 3 1/2 or 4 star movies). We would like the summary movie list to print after teach title the number of stars the movie was awarded. Note the stars are displayed as images with the img tag markup. What are the names of the full and half-star images?
     
  13. You would probably first think of adding to the main method a loop that counts the stars, but this is a special case of a very general string operation. It is common to need to know how many times a given substring appears in another string. So instead write a (static) method named count that takes two strings as arguments and returns the number of non-overlapping occurrences there are of the second string in the first. Test your method in the DrJava interaction window. For example:
    > EbertRecommends.count("The cat in the hat", "at")
    2
    > EbertRecommends.count("ababa", "aba") // overlapping test
    1
    
  14. Now in the main method, pull out a substring containing all the stars for a given movie. It does not have to be exactly the stars section; it just cannot contain stars for another movie. Then use your new count method to determine the number of stars. You can also use it to determine if there are zero or one half-star images. Then print the number of stars after the movie title. The output should start with
    The Polar Express 4.0
    Dr. Strangelove (Restored Version) 4.0
    Aliens of the Deep 3.0 
    
  15. If you have time, add to this application optional arguments to specify that the text of this HTML page is to be loaded from an indicated file, instead of from the web. With this option added, the usage summary for the application would be: Usage: java EbertRecommends [-f file]. As usual in syntax notation, the square brackets indicate optional content. Thus since the text for this HTML page has been stored in recommends.html, you could invoke the application with java EbertRecommends -f recommends.html and get the same results as you did without the optional arguments (assuming the web page has not changed since you captured its text in the file. The method GetText.fromFile.
     
  16. If the arguments do not conform with this syntax (there should be zero or two arguments, and if two, the first should be -f), print an error message to standard error.
     
  17. When you have completed this, or there are less then 15 minutes left in the lab, add as always if you have not done so already your first-line comment with both your usernames, and submit your EbertRecommends.java file via Vincent as lab5.

Assignment

Write an application named DataMine that performs at least two different data mining operations. You may choose any web data mining operations you like, subject to the following constraints:

  1. Do not use the same web page as the lab exercise above.
  2. Include at least three loops involving string searches in your program, not including the count method, which you may wish to borrow from your lab work or the posted lab solution.
  3. There is some plausible interest in the data you mine. Avoid trivial tasks such as extracting the title from a page.
  4. Adhere to customary norms of decency!

If you have any doubts about the appropriateness of a given mining task, ask your lab instructor or the course instructor.

Hopefully you will have fun mining for information that you have some interest in. Automatic data mining like this makes sense when there is a large amount of data to be extracted, when the data changes often and you want to get the latest information regularly, or when data from multiple sources are to be combined automatically in some intelligent fashion. (For example, one might want to be alerted on a weekly basis of any movies that your favorite reviewer highly recommends that are playing in local theaters. You don't know quite enough Java yet to do this kind of combining of information, but you will soon, and of course such combining is not part of this assignment.)

Print the results of your data mining to standard output (using System.out.println and/or System.out.print). The result may be a single line or many lines.

Your application should have command-line arguments that specify which mining operation to perform and optionally indicate a file to read the text from, instead of by default reading it from the web. Describe what your application does in the main method comment, including a usage summary. It is customary for arguments that identify options to start with a dash in Unix systems (and a slash in Windows). There is also a tradition of extreme brevity in command line option names, often limiting them to a single letter, as in -f, though more recently the trend is for longer, more helpful, names. Of course these names cannot contain spaces, for then they would not all be in one argument. For example, if the Ebert recommendations program were extended to also mine for his recent reviews, its usage line might become

Usage: java EbertRecommends {-recommends | -recent} [-f file]

Here in addition to the square brackets, we are using the traditional syntax notation of braces for grouping and the vertical bar to indicate a choice (read "or"). So this means the -recommends or -recent argument must be included, optionally followed by -f and a file name.

As in the GetText class and the final version of the EbertRecommends class developed in lab, print to standard error a usage error message if the arguments do not conform to the usage syntax.

To determine your mining operations, start by browsing the web for something of interest. When you have found a page you want to mine, use the GetText class as an application to save the text to a file and examine it. (You can cut and paste the URL.) The text should correspond to the web page you were viewing. In some rare cases it may not, in which case you have to find another page. (This would probably be due to automatic browser forwarding of a kind that the java.net.HttpURLConnection class does not know about.)

Do not forget to include your names and usernames on the first-line comment. (This is the last time you will get this reminder, and you will lose credit if you forget it in future assignments, as well as this one.)

Submit your DataMine.java file in a jar named a5.jar as a5 in Vincent. Your jar file should also include html text files for each of your options. The html file names should be the name of the corresponding option, without the dash, followed by .html. Thus the files for the EbertRecommends program with usage displayed above would be recommends.html and recent.html. Then assuming these are the only .html files in your assignment directory, you could use the command jar -cf a5.jar DataMine.java *.html.

Note: Having these files will allow your program to be tested even if the corresponding web page has changed in a way that breaks your program. Mining for data by simple string searching as in this assignment is particularly prone to such fragility (breakage to do changes in the program's environment). Somewhat more robust web mining can be achieved by software that parses web files according to their html structure, but this approach is considerably harder to use in many cases, and is still rather fragile.

There is a movement called the semantic web to develop standardized ways of representing information on the web that will eventually make a lot of data mining much more stabile and meaningful. In a nut shell, the html web we know today was designed for human consumption, while the semantic web is designed for machine consumption.

There is much research going on at IU and elsewhere to develop standards and tools for the semantic web, but it is still in a very early stage of development. Many think that within a decade it will be apparent that the semantic web is an even more important development than the original html web. Did you ever wonder how exciting it must have been to be involved in the first few years of the development of the web we know today, when only a small number of people knew it was going to be the next big thing? That's how some folks feel about the semantic web today.