A202 / I211 Assignment 11

Data mining

Due Friday, December 3rd, 11:55pm
Progress review November 19th, in lab

You may use pair programming in this assignment if you like. You may choose any partner in your lab, or work on your own if you prefer.

In lab

Part 1: November 12th

  1. Download the http.py module presented in class this week.
  2. In IDLE, create a population.py file that begins (after the usual submission comment) by importing the http module.
  3. Visit http://www.census.gov/cgi-bin/ipc/popclockw in a web browser and view the source text for the page it is displaying (in IE, use the View > Source command).
  4. Locate in the source text the world population number displayed prominently near the top of the population page and identify surrounding context that you can search for in order to extract the substring of the text containing only the population number.
  5. Add to your population.py file a population URL variable containing the above URL, and a getPopulation method that, given the population URL, returns the current world population number as a substring of the response from the population server at that URL.
  6. Add a main method that tests your program by printing out the population number. (Do not use an assert statement, since the population is constantly changing.)
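
Steps 4 through 6 might be sketched as follows. This is only a sketch under stated assumptions: the interface of the http.py module from class is not reproduced here, so the page text is obtained through a caller-supplied function (fetch is a hypothetical parameter, as are before and after), and you must replace the marker strings with the context you actually find in the page source.

```python
def extract_between(text, before, after):
    """Return the substring of text between the markers before and after.

    Returns None if either marker cannot be found.
    """
    start = text.find(before)
    if start == -1:
        return None
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None
    return text[start:end].strip()


def getPopulation(url, fetch, before, after):
    """Fetch the page at url and return the population substring.

    Assumption: fetch stands in for whatever function in http.py returns
    the source text of a page; before and after are the context strings
    identified in step 4.
    """
    return extract_between(fetch(url), before, after)
```

Because the search logic is separate from the web access, you can try extract_between on a small string you type in by hand before connecting to the server.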

When you have completed the last exercise above, or 15 minutes before the lab ends, whichever comes first, submit your population.py file as lab 11 in Vincent. Also, before the end of the lab, take time to read the assignment below and ask questions if, after some reflection, it is not clear what is required.

If you have remaining time, use it to start the assignment below. It will be very helpful if you can agree with your assignment partner on a URL to use for your data mining and confirm that you can retrieve its source text using Python. If you have questions about the appropriateness of a given data mining problem or have difficulty getting the source text in Python you can get the help of your lab instructor before leaving.

Part 2: November 19th

Copy your project.py file to guiProject.py. In the new file, develop a very simple GUI interface for your project that has at least one button that causes the web connection to be made and at least the first line of the mined data to be displayed. Submit your guiProject.py file as lab 12 using Vincent.

Assignment

Part 1

Develop a Python application named project.py that when invoked does some web data mining and, in part 1 of this assignment, prints the results. In part 2, beginning next week, you will add a GUI interface to your application. This project spans three weeks, one of which includes the Thanksgiving break, so it counts twice as much as other assignments in the course.

Your application may take additional arguments if you would like to parameterize your search, and it should in any case have an optional initial -s switch. If this switch is present, then instead of searching text material obtained from the web when the application is run, it searches a local file named source.html containing the same text obtained from the web in a successful test of your program. (This switch supports an off-line testing mode that may be necessary if your program at first appears broken at the time it is graded because of an unanticipated change in the web host you are using.)
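
One way to handle the -s switch is sketched below. The function names, the fetch parameter, and the path default are hypothetical; fetch again stands in for whatever call you use to retrieve the page text from the web.

```python
def parse_switch(argv):
    """Split argv (without the program name) into (use_saved, other_args)."""
    if argv and argv[0] == "-s":
        return True, argv[1:]
    return False, argv


def get_source(use_saved, url, fetch, path="source.html"):
    """Return the text to mine, from the local file or from the web.

    Assumption: fetch is a function that returns the page text for url.
    """
    if use_saved:
        with open(path, encoding="utf-8") as f:
            return f.read()
    return fetch(url)
```

In your main method you would pass sys.argv[1:] to parse_switch and hand the resulting flag to get_source, so the rest of the program does not need to know where the text came from.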

You choose what data you will be mining for, within reasonable limits. Do not choose something that is too difficult to complete with a reasonable amount of effort. (Don't get so excited about working on this project that your other course work suffers, including time to review and practice with a broader range of material from this course.)

Also do not choose a project that is too easy, such as the population exercise above, or the slightly harder temperature program presented in class. The favorite movies example from class is representative of the level of data-mining difficulty expected. In almost all cases you should be extracting multiple pieces of data from different locations in the source file using at least one loop. Otherwise your program may be considered to be so simple that it is worth at most a C. In every case your data mining should require several string searches. The best three submissions from the class will be posted on the class web and receive extra credit equivalent to one grade level above A.
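
As an illustration of extracting multiple pieces of data with a loop, the sketch below collects every substring found between a pair of markers. It is not the favorite movies example from class; extract_all and its marker parameters are hypothetical, and a real project would usually need several different marker pairs.

```python
def extract_all(text, before, after):
    """Return every substring of text found between before and after, in order."""
    results = []
    pos = 0
    while True:
        start = text.find(before, pos)
        if start == -1:
            break  # no further occurrence of the opening marker
        start += len(before)
        end = text.find(after, start)
        if end == -1:
            break  # opening marker without a closing marker
        results.append(text[start:end].strip())
        pos = end + len(after)  # resume searching after this match
    return results
```

Each pass through the loop performs two string searches and then continues from the end of the previous match, so the same text is never matched twice.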

It is a good idea to ask your lab instructor if you are in doubt whether your project plan is too hard or too easy.

Hopefully you will have fun with the flexibility this provides and perhaps make a program that's genuinely useful to you now. If not, you will know how to do this sort of thing should you find it useful in the future. This flexibility does, however, present a couple of difficulties. First, you may run into difficulties that cannot be anticipated. So be sure to leave enough time to get help from one of the course instructors if need be. Also, pay attention to the potential difficulties that have been anticipated and mentioned in class and the class notes on the course web. Second, it is not possible to provide as clear an indication of the grading standard as in other assignments.

No two teams may use the same host in the URL used for data mining. When you have found a URL you would like to use, search the course message board for a message with the host name in its subject line. This search is easily automated by using the browser's find (on this page) command: ctrl-F in IE. If no other team has claimed it, post a message to claim it with the subject beginning with the word host, followed by your chosen host name (the message does not need text in the body). If you should later decide not to use that host, post a reply to the message saying so. If you choose to use multiple URLs, only one must be unique in this way, but mention overlap with others in your code documentation.

Your project.py file should contain project documentation after the initial identifying comment line required in all assignments and before the statement(s) importing http and any other modules you use. In the first paragraph of this documentation, provide what the user of your application would need to know: what it is for, how to interpret the results if that is not obvious, and any additional information a user would need to know, such as the meaning of command arguments and non-obvious GUI interactions. In a subsequent paragraph or two, describe in general terms how the data mining is accomplished: specifically, what does the program look for and how it is located.

In lab on November 19th your lab instructor will ask for a printed listing of your project file and to see a demonstration of it. Although you may make further refinements to the data mining part of your project after the 19th, you should at that time have a working application of appropriate difficulty. Your instructor will make a note of your progress at that time, which will contribute about 25% to your final grade for the project. This review of your work is analogous to a project milestone review in the business world. Even if you succeed by much last-minute effort in completing a project on time, if your progress at a milestone review is such that your supervisor becomes very worried that the project will not be completed on time, your supervisor will not be entirely pleased with your performance. (Unlike earlier assignments in this course, for which one instructor has evaluated all submissions in the course, in this assignment each lab instructor will evaluate the projects of those teams in their lab.)

Part 2

Develop a GUI interface for the application you developed in part 1 using the techniques presented in class (and possibly others if you choose to explore additional possibilities on your own). Your project.py application then should still take the -s switch, as before, and possibly additional arguments, but rather than printing out its data mining results, a GUI frame is created. The GUI should have a button which causes the application's data mining web access to be repeated, with display of the result in the GUI frame.
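
The web access button might be wired up roughly as shown below. This is a sketch, not the approach from the lecture notes: mine and show are hypothetical callables, and the module is spelled tkinter as in Python 3 (older Python versions spell it Tkinter).

```python
def make_refresher(mine, show):
    """Return a button handler that reruns mine() and passes the result to show.

    Assumptions: mine performs the web access and returns a list of result
    strings; show displays a single string.
    """
    def refresh():
        try:
            lines = mine()
        except Exception as err:  # for example, the network is unavailable
            show("Error: %s" % err)
            return
        show("\n".join(lines) if lines else "(no results)")
    return refresh


def build_gui(mine):
    """Create a window with a results label and a button that repeats the mining."""
    # Imported here so the handler above can be tried without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Data mining")
    result = tk.StringVar(value="Press Refresh to fetch data")
    tk.Label(root, textvariable=result, justify="left").pack(padx=10, pady=10)
    handler = make_refresher(mine, result.set)
    tk.Button(root, text="Refresh", command=handler).pack(pady=5)
    return root
```

To run it, call build_gui with your own mining function and then call mainloop() on the window it returns. Keeping the handler separate from the widgets lets you check it with an ordinary function call before any window exists.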

In addition to the web access button, your GUI should have at least three other widgets with associated handlers, at least one of which should be an entry widget. Suggestions include buttons to page up and down through a long list if you mine too much data to fit on the screen. The required entry widget could be used to parameterize the data mining within some limits you know your data source allows (e.g. indicate the year for which you are retrieving weather statistics). You could also use an entry widget to indicate a file to which the mined data is to be saved, or to which the source text from the web is to be saved, or read from.
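
The logic behind the suggested widgets can be kept in small functions that the handlers call. The sketch below assumes a page size of 10 lines and a year range supplied by the caller; all three function names are hypothetical.

```python
def page_of(lines, page, size=10):
    """Return the slice of lines shown on the given zero-based page."""
    start = page * size
    return lines[start:start + size]


def clamp_page(page, total, size=10):
    """Keep page within the valid range for a list of total lines."""
    last = max(0, (total - 1) // size)
    return min(max(page, 0), last)


def parse_year(text, low, high):
    """Return the year typed into an entry widget, or None if it is invalid."""
    try:
        year = int(text.strip())
    except ValueError:
        return None
    return year if low <= year <= high else None
```

A page down handler would call clamp_page(page + 1, len(lines)) and redisplay page_of(lines, page); the entry handler would pass the string returned by the entry widget's get() method to parse_year and show a message when the result is None.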

The GUI techniques in the lecture notes that are not on slides with "Advanced" in the title are adequate for this assignment. If you wish, you are of course encouraged to use the advanced features in the notes and even explore other possibilities of the Tkinter library using additional references, such as those linked from the course web's resources page.

When you are done, submit your final project.py file as a11 using Vincent. Also submit as a11 source using Vincent a file named source.html containing text that may be used with the -s application switch.