Week 11
Internet Data Mining
Computer Science A202 / A598
and Informatics I211
This week’s success strategy
- Preparing for the final and quizzes
  - the final will be similar to three quizzes in a row in duration, content, and contribution to your grade
  - knowledge of the syntax and semantics of basic Python features is required
    - most are summarized in Appendix A of your text
    - if a question requires knowledge of functions and methods not in Appendix A, you will be given them
  - knowing how to use these features creatively to solve simple problems
    - for this there is no substitute for practice, obtained both through assignments and additional programming exercises in the book
  - some additional information infrastructure concepts emphasised in class
This week
- Plan for remaining classes and assignments
- URLs and related Internet basics
- Network connections
- HTTP server requests
  - retrieving web pages as Python strings
- Elementary web data mining
  - recognizing patterns in HTML
  - extracting information from HTML text
Uniform Resource Locators (URLs)
- Uniform Resource Locators (URLs) identify resources on the Internet
  - they can be used for much more than identifying resources provided by Internet servers
  - examples include resources for file transfer, email, and newsgroups, as well as web access
- URLs are an instance of the more general notion of a Uniform Resource Identifier (URI)
  - URIs may indicate anything!
  - they are location independent (may not be identified with any server)
  - URIs are the building blocks of the Resource Description Framework (RDF), which is poised to become the foundation of the semantic web
  - the semantic web aims to make Internet content understandable to computers, as the existing hypertext web has made it understandable by humans
Web URLs
- Simplified syntax: http://<host>[:<port>][/<query>]
- Examples
  - oncourse.iu.edu
  - http://www.cs.indiana.edu/~chaynes/test.txt
  - http://www.weather.com/outlook/driving/local/47408?lwsa=WeatherLocalUndeclared&lswe=47408
- http:// indicates that the URL is using the hyper-text transfer protocol used to exchange HTML documents on the web
- <host> identifies a server (or cluster of servers) with a dot-separated sequence of domain names
  - not case sensitive
- The <port> number identifies the application on the server that is to respond
  - default for web servers is 80
- The <query> indicates any additional information the server needs to locate the resource
  - may be a file system path, database query, or anything else (with standard characters and no whitespace) that the server understands
  - may be case sensitive
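The simplified syntax above can be explored with Python's standard urllib.parse module. The sketch below is written for Python 3 (the course code on these slides is Python 2). The helper name split_web_url and the example.com URL are made up for illustration, and the helper folds the standard path and query string together to match the slide's broader use of <query>.

```python
from urllib.parse import urlsplit

def split_web_url(url):
    """Split a web URL into the slide's (host, port, query) parts.

    Hypothetical helper, for illustration only."""
    parts = urlsplit(url)
    host = parts.hostname          # hostname is reported in lower case
    port = parts.port or 80        # no explicit port: assume the web default
    query = parts.path.lstrip('/')
    if parts.query:
        query += '?' + parts.query
    return host, port, query

# The host is not case sensitive, so it comes back lower-cased.
print(split_web_url('http://WWW.CS.Indiana.EDU/~chaynes/test.txt'))
# The rest of the URL may be case sensitive, so it is left untouched.
print(split_web_url('http://example.com:8080/a/B?x=1'))
```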
HTML
- A markup language provides a way of adding notations to text for some purpose
- The Hyper-Text Markup Language (HTML) is used to add markups to web documents
- HTML is closely related to the more general Extensible Markup Language (XML): both descend from SGML, and XHTML is HTML reformulated as XML
  - XML has become the basis for many data exchange formats
    - RDF is also based on XML
  - XML markups use tags delimited by matching angle brackets, <…>
- HTML's most important use is to support hyper-text, the linking of words or phrases of one document to another document
  - sample syntax: <a href="url">linked text</a>
- HTML also provides a variety of formatting tags to indicate document appearance and some content identification markups
  - examples: <b> to begin bold font, <p> to begin a paragraph
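To make the sample link syntax concrete, here is a small Python 3 sketch that pulls the URL and the linked text out of one anchor tag using ordinary string methods. The function name first_link and the sample HTML are invented for illustration. Real pages vary in quoting and attribute order, so this is deliberately naive.

```python
def first_link(html):
    """Return (url, linked_text) for the first <a href="..."> in html,
    or None if there is none. Assumes a double-quoted href that
    directly follows the tag name."""
    opener = '<a href="'
    start = html.find(opener)
    if start == -1:
        return None
    url_begin = start + len(opener)
    url_end = html.index('"', url_begin)          # closing quote of the URL
    text_begin = html.index('>', url_end) + 1     # end of the opening tag
    text_end = html.index('</a>', text_begin)     # start of the closing tag
    return html[url_begin:url_end], html[text_begin:text_end]

sample = '<p>See the <a href="http://www.cs.indiana.edu/">CS department</a> page.</p>'
print(first_link(sample))
```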
Data mining
- Data mining refers to the process of extracting useful information from vast amounts of data
- The web is an incredible supply of data to mine
- Web data mining often uses specialized software for parsing and navigating data in XML format
- Some useful data mining can be accomplished using the simple string manipulation techniques we have learned
  - this approach is more fragile than using XML tools: more likely to stop working if the format of the web page is changed a bit
  - most HTML data mining is fragile even if XML-based because HTML was intended for data display, not automated searching
    - RDF is intended to solve this problem
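As a rough illustration of the parser-based alternative, the Python 3 sketch below uses the standard library's html.parser module as a stand-in for the XML tools mentioned above (it is an HTML parser, not an XML one). The class name LinkCollector, the function collect_links, and the sample markup are all invented for this example. Because the parser normalizes tag case and quoting, small changes in page formatting are less likely to break it than raw string searches.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None      # href of the link being read, if any
        self._text = []        # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':         # tag names arrive lower-cased
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None

def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Mixed case, single quotes, and nested formatting tags are all handled.
sample = "<td><A HREF='/a.html' class=x>First <b>Movie</b></A> <a href=\"/b.html\">Second</a></td>"
print(collect_links(sample))
```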
Internet server connections
- Many Internet protocols, including HTTP, are connection-oriented
- A network connection is a relationship between a server and a client for use by an individual application
  - it is analogous to a telephone connection in that it is two-way (read and write, vs talk and listen) and temporary (open and close, vs dial and hang up)
  - but network connections are logical: managed by software using network resources that are not connection-oriented
    - telephone connections are physical, using switches (when analog telephone service is used)
- The provided http.get(url) function returns in a string the server's response for the (web) url
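The telephone analogy can be tried out on one machine. The Python 3 sketch below is not part of the course's http module: it starts a throwaway server on the local loopback address and then opens, writes to, reads from, and closes a connection to it. The function names serve_once and talk_to_local_server and the reply format are invented for illustration.

```python
import socket
import threading

def serve_once(listener):
    """Accept one connection, read a request, write a reply, and close."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b'you said: ' + request)

def talk_to_local_server(message):
    # Port 0 asks the operating system for any free port on this machine.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    client = socket.create_connection(('127.0.0.1', port))  # open ("dial")
    client.sendall(message)                                  # write ("talk")
    reply = b''
    while True:                                              # read ("listen")
        chunk = client.recv(1024)
        if not chunk:          # empty read: the server closed its end
            break
        reply += chunk
    client.close()                                           # close ("hang up")
    server.join()
    listener.close()
    return reply

print(talk_to_local_server(b'hello'))
```

A web request follows the same open, write, read, close pattern, with the request and reply formatted according to HTTP.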
http module
    # http.py, by chaynes@indiana.edu
    import urllib

    def get(url):
        '''Reads and returns as a string the response for url
        of the form http://HOST[:PORT][/QUERY]. PORT defaults to
        80, and QUERY defaults to empty.'''
        while True:
            # open a connection, read the whole response, and close
            connection = urllib.FancyURLopener({}).open(url)
            text = connection.read()
            connection.close()
            if text.count('<meta http-equiv="Refresh"') == 1:
                # the page has meta-moved: fetch the URL it names instead
                _, rest = text.split('URL=', 1)
                url, _ = rest.split('"', 1)
            else:  # the page has not meta-moved
                break
        return text
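The redirect-following step in get can be studied without a network connection by isolating it as a pure function. The Python 3 sketch below mirrors the same string operations. The name refresh_target and the sample pages are invented for illustration, and like the original it assumes the tag is written exactly as '<meta http-equiv="Refresh"' with a URL= part ending at a double quote.

```python
def refresh_target(text):
    """Return the URL named by a single meta Refresh tag in text,
    or None if the page has not meta-moved."""
    if text.count('<meta http-equiv="Refresh"') != 1:
        return None
    _, rest = text.split('URL=', 1)     # everything after 'URL='
    url, _ = rest.split('"', 1)         # up to the closing quote
    return url

moved = ('<html><head><meta http-equiv="Refresh" '
         'content="0; URL=http://example.com/new.html"></head></html>')
print(refresh_target(moved))
print(refresh_target('<html><body>no redirect here</body></html>'))
```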
http module test
    def main():
        assert get('http://www.cs.indiana.edu/~chaynes/test.txt') \
               == 'testing\none two\n'
        print 'OK'

    if __name__ == '__main__':
        main()
Getting the Bloomington temperature, part 1
- Visit www.weather.com and enter zip code 47405 to get the temperature
- Next (assuming IE is our browser) we run the View > Source command
- Then we copy the URL from the IE address window into the following program skeleton, run it, and compare the printed text with the IE source (they should be the same)
    # temp.py, by chaynes@indiana.edu
    import http

    # adjacent string literals are joined, so the URL can span two lines
    tempUrl = ('http://www.weather.com/outlook/driving/local'
               '/47408?lwsa=WeatherLocalUndeclared&lswe=47408')

    def getTemp():
        text = http.get(tempUrl)
        return text

    def main():
        print getTemp()

    if __name__ == '__main__':
        main()
temp.py
- How can we locate the temperature?
- One possible answer: between obsTempTextA> and the following &
- How can we modify our program to extract the temperature substring?
    # temp.py, by chaynes@indiana.edu
    import http

    # adjacent string literals are joined, so the URL can span two lines
    tempUrl = ('http://www.weather.com/outlook/driving/local'
               '/47408?lwsa=WeatherLocalUndeclared&lswe=47408')

    def getTemp():
        text = http.get(tempUrl)
        # split once, so a repeated marker does not break the unpacking
        _, rest = text.split('obsTempTextA>', 1)
        temp, _ = rest.split('&', 1)
        return temp

    def main():
        print getTemp()

    if __name__ == '__main__':
        main()
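The "between two markers" idea used by getTemp can be packaged as a reusable helper and tried on a small string. This is a Python 3 sketch: the function name between is invented, and the HTML fragment is made up, since the real weather.com markup is not shown on the slides.

```python
def between(text, start_marker, end_marker):
    """Return the text after the first start_marker and before the
    next end_marker. Raises ValueError if either marker is missing."""
    begin = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, begin)
    return text[begin:end]

# Invented fragment in the spirit of the page described above.
sample = '<b class=obsTempTextA>57&deg;F</b>'
print(between(sample, 'obsTempTextA>', '&'))
```

Using index rather than find means a changed page format produces an immediate error instead of a silently wrong answer.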
Ebert's recommended movies
- View Roger Ebert's recommended movie list at http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS05
- How can we locate the movie list in the source text?
- One possible answer
  - they are between Ebert Recommends and </td>
- How can we locate each successive movie title?
- One possible answer
  - each movie name is between > and < characters following href
- How can we program this?
ebertMovies.py
    # ebertMovies.py, by chaynes@indiana.edu
    import http

    # adjacent string literals are joined, so the URL can span two lines
    ebertRecommendsUrl = ('http://rogerebert.suntimes.com/apps/pbcs.dll/'
                          'section?category=REVIEWS05')

    def recommendations():  # documentation omitted for slide formatting
        text = http.get(ebertRecommendsUrl)
        begin = text.index('Ebert Recommends')
        endRecs = text.index('</td>', begin)
        movieNames = []
        while True:
            i = text.find('href', begin)
            if i == -1 or i > endRecs: break
            begin = text.find('>', i) + 1
            end = text.find('<', begin)
            name = text[begin : end]
            movieNames.append(name)
        return movieNames

    def main():
        print '\n'.join(recommendations())

    if __name__ == '__main__':
        main()
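The search loop in recommendations can be exercised offline by passing in the page text and the two markers. The Python 3 sketch below follows the same steps. The function name link_texts, the sample markup, and the placeholder titles are invented for illustration and do not reproduce the real page.

```python
def link_texts(text, start_marker, end_marker):
    """Return the text following each href that lies between
    start_marker and the next end_marker."""
    begin = text.index(start_marker)
    limit = text.index(end_marker, begin)
    names = []
    while True:
        i = text.find('href', begin)
        if i == -1 or i > limit:        # no more links in the region
            break
        begin = text.find('>', i) + 1   # just past the opening tag
        end = text.find('<', begin)     # start of the closing tag
        names.append(text[begin:end])
    return names

sample = ('<a href="/home">Home</a>'
          '<td><b>Ebert Recommends</b><br>'
          '<a href="/r/1">First Title</a><br>'
          '<a href="/r/2">Second Title</a></td>'
          '<td><a href="/other">Not a movie</a></td>')
print(link_texts(sample, 'Ebert Recommends', '</td>'))
```

Links before the start marker and after the end marker are skipped, which is what keeps navigation links out of the movie list.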
Some limitations of this data mining methodology
- Browser source and program connection text might not be the same in exceptional circumstances such as a forwarding technique not known to http.get
  - email me the URL if you find this
  - you can probably find a redirection URL in the text and use it
- Avoid web sites that require you to log in
- Avoid web sites that use frames (such as this course's web)
  - the text will be brief with several mentions of frame tags
- Avoid web sites that display the information you want via client-side programs
  - example using Javascript: http://math.berkeley.edu/~galen/popclk.html
  - example using an applet: http://www.ibiblio.org/lunarbin/worldpop
  - client-side programs are ok if they are not generating your information
    - example: Ebert's web uses Javascript, but not to generate movie names
  - information generated by server-side programs is fine
    - example: http://www.census.gov/cgi-bin/ipc/popclockw
- Fragility, as mentioned earlier
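Some of these limitations leave visible traces in the retrieved text, so a quick check before mining can save time. The Python 3 sketch below is only a heuristic under stated assumptions: the function name mining_warnings, the tag patterns it looks for, and the sample pages are all invented for illustration. A warning means "look more closely", not "give up", since client-side programs are fine when they do not generate the information you want.

```python
def mining_warnings(text):
    """Return a list of reasons the page text may be hard to mine."""
    lowered = text.lower()      # HTML tag names are not case sensitive
    warnings = []
    if '<frameset' in lowered or '<frame ' in lowered or '<iframe' in lowered:
        warnings.append('uses frames')
    if '<script' in lowered or '<applet' in lowered:
        warnings.append('has client-side programs')
    if 'http-equiv="refresh"' in lowered:
        warnings.append('has a meta refresh')
    return warnings

framed = ('<html><frameset cols="20%,80%"><frame src="menu.html">'
          '<frame src="main.html"></frameset></html>')
print(mining_warnings(framed))
print(mining_warnings('<html><body><p>plain page</p></body></html>'))
```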
The End

# ebertMovies.py, by chaynes@indiana.edu
import http

ebertRecommendsUrl = ('http://rogerebert.suntimes.com/apps/pbcs.dll/'
                      'section?category=REVIEWS05')

def recommendations():
    '''Get the text of Roger Ebert's recommendations page and return
    a list of the recommended movie titles. This is done by searching
    between the first occurrence of 'Ebert Recommends' and the next
    '</td>' for movie names. Names start after the first '>' after
    'href' and end with the following '<'.'''
    text = http.get(ebertRecommendsUrl)
    begin = text.index('Ebert Recommends')
    endRecs = text.index('</td>', begin)
    movieNames = []
    while True:
        i = text.find('href', begin)
        if i == -1 or i > endRecs: break
        begin = text.find('>', i) + 1
        end = text.find('<', begin)
        name = text[begin : end]
        movieNames.append(name)
    return movieNames

def main():
    print '\n'.join(recommendations())

if __name__ == '__main__':
    main()

# http.py, by chaynes@indiana.edu
import urllib

def get(url):
    '''Reads and returns as a string the response for url
    of the form http://HOST[:PORT][/QUERY]. PORT defaults to
    80, and QUERY defaults to empty.'''
    while True:
        connection = urllib.FancyURLopener({}).open(url)
        text = connection.read()
        connection.close()
        if text.count('<meta http-equiv="Refresh"') == 1:
            _, rest = text.split('URL=', 1)
            url, _ = rest.split('"', 1)
        else:  # the page has not meta-moved
            break
    return text

def main():
    assert get('http://www.cs.indiana.edu/~chaynes/test.txt') \
           == 'testing\none two\n'
    print 'OK'

if __name__ == '__main__':
    main()

# temp.py, by chaynes@indiana.edu
import http

tempUrl = ('http://www.weather.com/outlook/driving/local/47408'
           '?lwsa=WeatherLocalUndeclared&lswe=47408')

def getTemp():
    text = http.get(tempUrl)
    # split once, so a repeated marker does not break the unpacking
    _, rest = text.split('obsTempTextA>', 1)
    temp, _ = rest.split('&', 1)
    return temp

def main():
    print getTemp()

if __name__ == '__main__':
    main()