Week 11
Internet Data Mining

Indiana University

Computer Science A202 / A598

and Informatics I211

 

This week’s success strategy

- Preparing for the final and quizzes
  - the final will be similar to three quizzes in a row in duration, content, and contribution to your grade
  - knowledge of the syntax and semantics of basic Python features is required
    - most are summarized in Appendix A of your text
    - if a question requires knowledge of functions and methods not in Appendix A, you will be given them
  - knowing how to use these features creatively to solve simple problems
    - for this there is no substitute for practice, obtained both through assignments and additional programming exercises in the book
  - some additional information infrastructure concepts emphasised in class

 

This week

- Plan for remaining classes and assignments
- URLs and related Internet basics
- Network connections
- HTTP server requests
  - retrieving web pages as Python strings
- Elementary web data mining
  - recognizing patterns in HTML
  - extracting information from HTML text

 

 

Uniform Resource Locators (URLs)

- Uniform Resource Locators (URLs) identify resources on the Internet
  - they can be used for much more than identifying resources provided by web servers
  - examples include URL schemes for file transfer (ftp:), email (mailto:), and newsgroups (news:), as well as web access (http:)
- URLs are an instance of the more general notion of a Uniform Resource Identifier (URI)
  - URIs may identify anything!
  - they may be location independent (need not be tied to any server)
  - URIs are the building blocks of the Resource Description Framework (RDF), which is poised to become the foundation of the semantic web
  - the semantic web aims to make Internet content understandable to computers, as the existing hypertext web has made it understandable by humans

Web URLs

- Simplified syntax: http://<host>[:<port>][/<query>]
- Examples
  - oncourse.iu.edu
  - http://www.cs.indiana.edu/~chaynes/test.txt
  - http://www.weather.com/outlook/driving/local/47408?lwsa=WeatherLocalUndeclared&lswe=47408
- http:// indicates that the URL is using the Hypertext Transfer Protocol (HTTP), used to exchange HTML documents on the web
- <host> identifies a server (or cluster of servers) with a dot-separated sequence of domain names
  - not case sensitive
- The <port> number identifies the application on the server that is to respond
  - default for web servers is 80
- The <query> indicates any additional information the server needs to locate the resource
  - may be a file system path, database query, or anything else (with standard characters and no whitespace) that the server understands
  - may be case sensitive
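
The pieces of a web URL can also be pulled apart programmatically. The following is a minimal sketch, assuming Python 3 and its standard urllib.parse module (the course code itself targets Python 2). The helper name url_parts is hypothetical. Note that the standard library splits what this slide calls <query> into a path and a query string.

```python
# Sketch for Python 3 (an assumption; the course code uses Python 2).
from urllib.parse import urlsplit

def url_parts(url):
    """Return (scheme, host, port, path, query) for a web URL.

    The port falls back to 80, the default for web servers,
    when the URL does not name one explicitly.
    """
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80
    # .hostname is lower-cased, since host names are not case sensitive;
    # the path and query keep their case, since they may be case sensitive.
    return parts.scheme, parts.hostname, port, parts.path, parts.query

if __name__ == '__main__':
    print(url_parts('http://www.weather.com/outlook/driving/local/47408'
                    '?lwsa=WeatherLocalUndeclared&lswe=47408'))
```

With the weather URL above, the host is www.weather.com, the port defaults to 80, and everything after the host is reported as a path plus a query string.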

HTML

- A markup language provides a way of adding notations to text for some purpose
- The Hyper-Text Markup Language (HTML) is used to add markups to web documents
- HTML is closely related to the more general Extensible Markup Language (XML): both descend from SGML, and XHTML is HTML reformulated in XML
  - XML has become the basis for many data exchange formats
    - RDF also has a standard XML-based syntax
  - XML markups use tags delimited by matching angle brackets, <…>
- HTML's most important use is to support hyper-text, the linking of words or phrases of one document to another document
  - sample syntax: <a href="url">linked text</a>
- HTML also provides a variety of formatting tags to indicate document appearance and some content identification markups
  - examples: <b> to begin bold font, <p> to begin a paragraph

Data mining

- Data mining refers to the process of extracting useful information from vast amounts of data
- The web is an incredible supply of data to mine
- Web data mining often uses specialized software for parsing and navigating data in XML format
- Some useful data mining can be accomplished using the simple string manipulation techniques we have learned
  - this approach is more fragile than using XML tools: more likely to stop working if the format of the web page is changed a bit
  - most HTML data mining is fragile even if XML-based because HTML was intended for data display, not automated searching
    - RDF is intended to solve this problem
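
To make the fragility point concrete, here is a rough sketch, assuming Python 3 and its standard html.parser module (not one of the tools used in this course). It extracts links twice: once with a parser and once with plain string splitting. The two sample pages and all function names are invented for illustration.

```python
# Sketch for Python 3 (an assumption; the course code uses Python 2).
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, linked text) pairs from <a href="...">...</a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

def links(html):
    """Parser-based extraction: tolerant of tag case and extra attributes."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

def links_by_splitting(html):
    """String-based extraction: assumes links look exactly like <a href="..."""
    found = []
    for piece in html.split('<a href="')[1:]:
        url, rest = piece.split('"', 1)
        text = rest.split('>', 1)[1].split('<', 1)[0]
        found.append((url, text))
    return found

# Hypothetical page before and after a small change in its formatting.
OLD = '<p><a href="a.html">First</a> <a href="b.html">Second</a></p>'
NEW = '<p><a class="x" href="a.html">First</a> <A HREF="b.html">Second</A></p>'

if __name__ == '__main__':
    print(links(OLD), links(NEW))
    print(links_by_splitting(OLD), links_by_splitting(NEW))
```

Both functions agree on OLD, but only the parser-based one still finds the links in NEW; the string-based one silently returns an empty list.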

Internet server connections

- Many Internet protocols, including HTTP, are connection-oriented
- A network connection is a relationship between a server and a client for use by an individual application
  - it is analogous to a telephone connection in that it is two-way (read and write, vs talk and listen) and temporary (open and close, vs dial and hang up)
  - but network connections are logical: managed by software using network resources that are not connection-oriented
    - telephone connections are physical, using switches (when analog telephone service is used)
- The provided http.get(url) function returns the server's response for the (web) url as a string
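
The telephone analogy can be made concrete with a raw socket. This is a rough sketch, assuming Python 3 (the course code targets Python 2). The tiny local server, HelloHandler, fetch_raw, and demo are hypothetical stand-ins so that the example runs without Internet access. It is not how the provided http.get is implemented.

```python
# Sketch for Python 3 (an assumption; the course code uses Python 2).
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Stand-in for a real web server: answers every GET with 'testing\\n'."""
    def do_GET(self):
        body = b'testing\n'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet

def fetch_raw(host, port, path):
    """Open a connection, write a request, read the whole reply, close."""
    with socket.create_connection((host, port), timeout=5) as conn:  # open
        request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)
        conn.sendall(request.encode('ascii'))                        # write
        chunks = []
        while True:
            data = conn.recv(4096)                                   # read
            if not data:      # the server closed its end: reply complete
                break
            chunks.append(data)
    return b''.join(chunks).decode('latin-1')                        # closed

def demo():
    """Serve one request from a local server and return the raw reply."""
    server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_raw('127.0.0.1', server.server_port, '/test.txt')
    finally:
        server.shutdown()
        server.server_close()

if __name__ == '__main__':
    print(demo())
```

The reply contains a status line and headers followed by a blank line and the page text, so a function that returns only the page text has to deal with those parts separately.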

http module

# http.py, by chaynes@indiana.edu

import urllib

def get(url):
    '''Reads and returns as a string the response for url
    of the form http://HOST[:PORT][/QUERY]. PORT defaults to
    80, and QUERY defaults to empty.'''
    while True:
        connection = urllib.FancyURLopener({}).open(url)
        text = connection.read()
        connection.close()
        if text.count('<meta http-equiv="Refresh"') == 1:
            # the page has meta-moved: retry with the URL it names
            _, rest = text.split('URL=', 1)
            url, _ = rest.split('"', 1)
        else: # the page has not meta-moved
            break
    return text

http module test

def main():
    # the URL must include the http:// prefix
    assert get('http://www.cs.indiana.edu/~chaynes/test.txt') \
           == 'testing\none two\n'
    print 'OK'

if __name__ == '__main__':
    main()
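
The redirect-following step inside get can be tried on its own, without a network connection. This is a minimal sketch, assuming Python 3 (the module above is Python 2). The function name refresh_target and the MOVED sample page are invented for illustration.

```python
# Sketch for Python 3 (an assumption; http.py itself is Python 2).
def refresh_target(text):
    """Return the URL named by a meta-refresh tag in text, or None.

    Mirrors the test used in get(): a page counts as "meta-moved" only
    if it contains exactly one '<meta http-equiv="Refresh"' marker.
    """
    if text.count('<meta http-equiv="Refresh"') != 1:
        return None
    _, rest = text.split('URL=', 1)   # everything after the first 'URL='
    url, _ = rest.split('"', 1)       # up to the closing double quote
    return url

# Hypothetical page that forwards the browser to another address.
MOVED = ('<html><head><meta http-equiv="Refresh" '
         'content="0; URL=http://example.org/new.html"></head></html>')

if __name__ == '__main__':
    print(refresh_target(MOVED))
    print(refresh_target('<html>plain page</html>'))
```

Like the original, this sketch matches the marker with exact spelling and case, so a page that writes the tag differently would be treated as not moved.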

 

Getting the Bloomington temperature, part 1

- Visit www.weather.com and enter zip code 47405 to get the temperature
- Next (assuming IE is our browser) we run the View > Source command
- Then we copy the URL from the IE address window into the following program skeleton, run it, and compare the printed text with the IE source (they should be the same)

# temp.py, by chaynes@indiana.edu

import http

tempUrl = 'http://www.weather.com/outlook/driving/local\
/47408?lwsa=WeatherLocalUndeclared&lswe=47408'

def getTemp():
    text = http.get(tempUrl)
    return text

def main():
    print getTemp()

if __name__ == '__main__':
    main()

temp.py program, part 2

- How can we locate the temperature?
- One possible answer: between obsTempTextA> and the following &
- How can we modify our program to extract the temperature substring?

# temp.py, by chaynes@indiana.edu

import http

tempUrl = 'http://www.weather.com/outlook/driving/local\
/47408?lwsa=WeatherLocalUndeclared&lswe=47408'

def getTemp():
    text = http.get(tempUrl)
    # split once only, so a repeated marker cannot break the unpacking
    _, rest = text.split('obsTempTextA>', 1)
    temp, _ = rest.split('&', 1)
    return temp

def main():
    print getTemp()

if __name__ == '__main__':
    main()
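
The same "text between two markers" idea can be packaged as a small helper and tried offline. This is a sketch, assuming Python 3 (temp.py itself is Python 2). The helper name between is hypothetical, and SAMPLE is an invented fragment, not real weather.com markup.

```python
# Sketch for Python 3 (an assumption; temp.py itself is Python 2).
def between(text, start, end):
    """Return the part of text after the first start marker and before
    the next end marker, or None if either marker is missing."""
    _, found, rest = text.partition(start)
    if not found:
        return None
    value, found, _ = rest.partition(end)
    if not found:
        return None
    return value

# Invented fragment that merely resembles the pattern described above.
SAMPLE = '<b class=obsTempTextA>54&deg;F</b>'

if __name__ == '__main__':
    print(between(SAMPLE, 'obsTempTextA>', '&'))
```

Returning None for a missing marker, instead of raising an error as tuple unpacking of split would, is a design choice of this sketch; it lets the caller notice that the page format has changed.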


Ebert's recommended movies

- View Roger Ebert's recommended movie list at http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS05
- How can we locate the movie list in the source text?
- One possible answer
  - they are between Ebert Recommends and </td>
- How can we locate each successive movie title?
- One possible answer
  - each movie name is between > and < characters following href
- How can we program this?

ebertMovies.py

# ebertMovies.py, by chaynes@indiana.edu

import http

ebertRecommendsUrl = 'http://rogerebert.suntimes.com/apps/pbcs.dll/\
section?category=REVIEWS05'

def recommendations(): # documentation omitted for slide formatting
    text = http.get(ebertRecommendsUrl)
    begin = text.index('Ebert Recommends')
    endRecs = text.index('</td>', begin)
    movieNames = []
    while True:
        i = text.find('href', begin)
        if i == -1 or i > endRecs: break
        begin = text.find('>', i) + 1
        end = text.find('<', begin)
        name = text[begin : end]
        movieNames.append(name)
    return movieNames

def main(): print '\n'.join(recommendations())

main()
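
The scanning loop can be exercised without fetching the live page. This is a sketch, assuming Python 3 (ebertMovies.py itself is Python 2). The function name movie_names and the SAMPLE text are invented for illustration and are not the real page source.

```python
# Sketch for Python 3 (an assumption; ebertMovies.py itself is Python 2).
def movie_names(text):
    """Return the link texts found between 'Ebert Recommends' and the
    next '</td>', using the same scan as recommendations()."""
    begin = text.index('Ebert Recommends')
    end_recs = text.index('</td>', begin)
    names = []
    while True:
        i = text.find('href', begin)
        if i == -1 or i > end_recs:
            break                          # no more links inside the cell
        begin = text.find('>', i) + 1      # just past the end of the <a ...> tag
        end = text.find('<', begin)        # start of the closing </a>
        names.append(text[begin:end])
    return names

# Invented page fragment with two recommended titles and one unrelated link.
SAMPLE = (
    '<td><b>Ebert Recommends</b><br>'
    '<a href="/r/1">Movie One</a><br>'
    '<a href="/r/2">Movie Two</a></td>'
    '<td><a href="/other">Not a movie</a></td>'
)

if __name__ == '__main__':
    print('\n'.join(movie_names(SAMPLE)))
```

Passing the page text in as a parameter, instead of fetching it inside the function, is what makes the loop testable on a fixed sample.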

Some limitations of this data mining methodology

- Browser source and program connection text might not be the same in exceptional circumstances, such as a forwarding technique not known to http.get
  - email me the URL if you find this
  - you can probably find a redirection URL in the text and use it
- Avoid web sites that require you to log in
- Avoid web sites that use frames (such as this course's web)
  - the text will be brief with several mentions of frame tags
- Avoid web sites that display the information you want via client-side programs
  - example using JavaScript: http://math.berkeley.edu/~galen/popclk.html
  - example using an applet: http://www.ibiblio.org/lunarbin/worldpop
  - client-side programs are ok if they are not generating your information
    - example: Ebert's web uses JavaScript, but not to generate movie names
  - information generated by server-side programs is fine
    - example: http://www.census.gov/cgi-bin/ipc/popclockw
- Fragility, as mentioned earlier
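
One of these warning signs, a page built from frames, is easy to check for before writing any extraction code. This is a small sketch, assuming Python 3. The helper name frame_tag_count and the FRAMESET sample are invented for illustration.

```python
# Sketch for Python 3 (an assumption; the course code uses Python 2).
def frame_tag_count(text):
    """Count opening <frame ...> and <frameset ...> tags, ignoring case."""
    return text.lower().count('<frame')

# Invented example of the brief, frame-only text such a site returns.
FRAMESET = ('<html><frameset cols="20%,80%">'
            '<frame src="menu.html"><frame src="main.html">'
            '</frameset></html>')

if __name__ == '__main__':
    print(frame_tag_count(FRAMESET))
    print(frame_tag_count('<p>no frames here</p>'))
```

A nonzero count suggests the information you want lives in one of the framed documents, not in the text returned for the top-level URL.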

The End

 

ebertMovies.py

# ebertMovies.py, by chaynes@indiana.edu

import http

ebertRecommendsUrl = 'http://rogerebert.suntimes.com/apps/pbcs.dll/\
section?category=REVIEWS05'

def recommendations():
    '''Get the text of Roger Ebert's recommendations page and return
    a list of the recommended movie titles. This is done by searching
    between the first occurrence of 'Ebert Recommends' and the next
    '</td>' for movie names. Names start after the first '>' after
    'href' and end with the following '<'.'''
    text = http.get(ebertRecommendsUrl)
    begin = text.index('Ebert Recommends')
    endRecs = text.index('</td>', begin)
    movieNames = []
    while True:
        i = text.find('href', begin)
        if i == -1 or i > endRecs: break
        begin = text.find('>', i) + 1
        end = text.find('<', begin)
        name = text[begin : end]
        movieNames.append(name)
    return movieNames

def main():
    print '\n'.join(recommendations())

if __name__ == '__main__':
    main()

http.py

# http.py, by chaynes@indiana.edu

import urllib

def get(url):
    '''Reads and returns as a string the response for url
    of the form http://HOST[:PORT][/QUERY]. PORT defaults to
    80, and QUERY defaults to empty.'''
    while True:
        connection = urllib.FancyURLopener({}).open(url)
        text = connection.read()
        connection.close()
        if text.count('<meta http-equiv="Refresh"') == 1:
            _, rest = text.split('URL=', 1)
            url, _ = rest.split('"', 1)
        else: # the page has not meta-moved
            break
    return text

def main():
    assert get('http://www.cs.indiana.edu/~chaynes/test.txt') \
           == 'testing\none two\n'
    print 'OK'

if __name__ == '__main__':
    main()

temp.py

# temp.py, by chaynes@indiana.edu

import http

tempUrl = 'http://www.weather.com/outlook/driving/local/47408\
?lwsa=WeatherLocalUndeclared&lswe=47408'

def getTemp():
    text = http.get(tempUrl)
    # split once only, so a repeated marker cannot break the unpacking
    _, rest = text.split('obsTempTextA>', 1)
    temp, _ = rest.split('&', 1)
    return temp

def main():
    print getTemp()

if __name__ == '__main__':
    main()