Assignment 9
Data Mining
You may work individually or in pairs on this assignment.
In Lab
Begin by downloading to your assignment directory the module file http.py.
Visit http://www.weather.com, enter a zip code in the Local Weather text box near the top, and click the go button. For IU's 47405 zip code, the new URL is
http://www.weather.com/weather/local/47405?whatprefs=&what=WeatherLocalUndeclared&lswe=&lswa=&from=whatwhere&where=47405&wxGoButton=GO
Use this URL (or one for another zip code if you like) to create, in a file named feellike.py, a program that prints what the current temperature feels like, using the population.py data-mining program developed in class as a guide. Submit your program before the end of the lab as usual.
Then start work on the assignment below.
Assignment
Develop a Python application named datamine.py that when invoked does some web data mining and prints the results.
You choose what data you will be mining for, within reasonable limits. Do not choose something that is too difficult to complete with a reasonable amount of effort. (Don't get so excited about working on this project that your other course work suffers, including time to review and practice with a broader range of material from this course.) Also do not choose a project that is too easy, such as the population exercise above, or the slightly harder temperature program presented in class. The favorite movies example from class is representative of the level of data-mining difficulty expected.
In almost all cases you should be extracting multiple pieces of data from different locations in a single source file using at least one necessary loop. Otherwise your program may be considered to be so simple that it is worth at most a C. In every case your data mining should require several string searches. One or more of the best submissions from the class will be posted in the Oncourse Assignment Solutions and receive extra credit.
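As a rough illustration of extracting multiple pieces of data with a loop and several string searches, here is a minimal sketch in Python 3. The HTML in sample is made up for this example, and the function name mine_titles_and_years is hypothetical; a real page will need different markers.

```python
# Made-up HTML standing in for the text returned by a web server.
sample = (
    "<ul>"
    "<li><b>Casablanca</b> (1942)</li>"
    "<li><b>Vertigo</b> (1958)</li>"
    "<li><b>Alien</b> (1979)</li>"
    "</ul>"
)

def mine_titles_and_years(source):
    """Return a list of (title, year) pairs found in source.

    Strategy: each title sits between <b> and </b>, and its year
    sits between the next '(' and ')'. Repeat until no <b> remains.
    """
    results = []
    position = 0
    while True:
        # Search 1: start of the next title.
        title_start = source.find("<b>", position)
        if title_start == -1:
            break
        title_start += len("<b>")
        # Search 2: end of that title.
        title_end = source.find("</b>", title_start)
        if title_end == -1:
            break
        # Searches 3 and 4: the parentheses around the year.
        year_start = source.find("(", title_end)
        if year_start == -1:
            break
        year_end = source.find(")", year_start)
        if year_end == -1:
            break
        results.append((source[title_start:title_end],
                        source[year_start + 1:year_end]))
        # Continue searching after the item just mined.
        position = year_end + 1
    return results

for title, year in mine_titles_and_years(sample):
    print(title, year)
```

Each pass through the loop resumes the search where the previous one ended, so the same markers locate every item in turn.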
It is a good idea to ask your lab instructor if you are in doubt whether your project plan is too hard or too easy.
Hopefully you will have fun with the flexibility this provides, and perhaps make a program that's genuinely useful to you now. If not, you will know how to do this sort of thing should you find it useful in the future. This flexibility may, however, present difficulties you cannot anticipate. So be sure to leave enough time to get help from one of the course instructors if need be. Also, pay attention to the potential difficulties that have been anticipated and mentioned in class and the class notes.
No two teams may use the same host name in the URL used for data mining. When you have found a URL you would like to use, click on the A201 Oncourse Forums tool, then select the Assignment 8 forum and the Host names topic. Check for a thread with the host name in its title line. The host name is the part of the URL between http:// and the following /. This search is easily done with the browser's Edit > Find command (ctrl-F in Firefox and IE). If no other team has claimed it, click Post new thread and put the host name in the title to claim it (the message does not need text in the body). If you should later decide not to use that host, delete the thread.
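The host name rule above (the text between http:// and the following /) can be checked with two string searches. This is a small sketch in Python 3; the function name host_name is hypothetical and is not part of the course's http module.

```python
def host_name(url):
    """Return the part of url between 'http://' and the following '/'.

    Returns None if url does not contain 'http://'.
    """
    prefix = "http://"
    start = url.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = url.find("/", start)
    if end == -1:
        # No path after the host, so the host runs to the end of the URL.
        end = len(url)
    return url[start:end]

print(host_name("http://www.weather.com/weather/local/47405"))
# prints www.weather.com
```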
Your datamine.py file should contain a documentation string after the initial identifying comment line required in all assignments and before the statement importing the http module. Briefly state in these comments what your program does (not the details of how it does it).
Elsewhere in your program, use comments to describe in general terms how the data mining is accomplished: specifically, what the program looks for and how it is located. See the ebert_movies.py program from class as an example of suitable documentation.
Use a web browser to find some information you would like to mine. Cut and paste the URL of the web page with the desired information into a simple Python program that just prints the text returned by a server for this URL, as in the first temperature data program presented in class. This text should be the same as you see if you view the HTML source with a browser command. If so, check that the URL host has not been claimed and claim it via an Oncourse message as above. If not, you have run into one of the limitations noted in this week's class notes and have to use another URL.
Next, find the data you want to mine by searching for some of its text in the HTML source with a text editor. Develop a strategy for finding the strings of interest, document it, and extend the simple program above that printed all the source text so it only prints the text you are mining.
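One common strategy is to search for a label that appears near the data, then take the text between two markers that follow it. The sketch below (Python 3) illustrates the idea; the helper text_between and the HTML in sample are invented for this example and will not match any real page.

```python
def text_between(source, before, after, start=0):
    """Return the text between the markers before and after.

    Searching begins at index start. Returns None if either
    marker is not found.
    """
    begin = source.find(before, start)
    if begin == -1:
        return None
    begin += len(before)
    end = source.find(after, begin)
    if end == -1:
        return None
    return source[begin:end]

# Made-up HTML standing in for a downloaded page.
sample = '<td>Feels Like</td><td class="temp">72&deg;F</td>'

# Step 1: find a label that appears just before the data.
label = sample.find("Feels Like")
if label != -1:
    # Step 2: take the text between the markers that follow the label.
    print(text_between(sample, 'class="temp">', "&deg;", label))
    # prints 72
```

Anchoring the search on a nearby label helps when the same markers appear elsewhere on the page.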
When you have completed your program, temporarily add back to it the statement that prints all the HTML source. Cut and paste this source into a plain-text editor and save it to a file named source.html. Then remove this print statement before submitting your work.
Submit the final version of your datamine.py and source.html files via Oncourse as usual.
The reason for submitting the source file is that if your program does not work when your lab instructor is grading it, due to a change in the web server, your instructor will modify the program to read source.html instead of using http.get.
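To make that substitution easy, it may help to keep the mining code separate from the code that obtains the text. A minimal Python 3 sketch of reading the saved file follows; the function name read_saved_source and the UTF-8 encoding are assumptions for illustration, not part of the course's http module.

```python
def read_saved_source(path="source.html"):
    """Return the full text of a saved HTML source file.

    Assumes the file was saved as UTF-8 plain text.
    """
    with open(path, encoding="utf-8") as saved:
        return saved.read()

# The mining code can then work on the returned string exactly as it
# would on text returned by the web server, for example:
#     source = read_saved_source()
```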