
Lab 10: PageRank

Introduction

In this lab, you’ll be working towards implementing a version of Google’s famous PageRank algorithm, which revolutionized the way people find things on the internet. At its core, PageRank is a process that makes it possible to calculate a kind of "authority" rating for web pages based only on the information about which pages link to which other pages.

If you want to see a good overview of how the PageRank algorithm works, now would be a good time to take a look at the book "Nine Algorithms That Changed the Future". Chapter 3 is all about the PageRank algorithm. As described in the book, PageRank measures the authority of web pages by imagining a "random surfer" who moves from web page to web page by following links randomly. The more often the random surfer lands on a web page, the more authority it gets.

While the PageRank algorithm only cares about which pages link to which other pages, the structure of a web page itself is considerably more complex than just a list of links, and we’ll also be talking about how to represent that structure using mutually recursive data definitions and how to extract the information we need in order to use PageRank.

A Data Definition for Web Pages

The representation we use here is a considerably simplified version of true web pages. It’s simplified even more than the presentation that appears in How To Design Programs. But it’s got enough of the important features to be interesting.

Exercise 1 Open a simple web page (like our syllabus) in a browser. Look for a command with a name like "Developer Tools" or "Inspect Element" and use it to see how a web page is made of recursive parts. What is a link made of?

For our purposes, the content of a web page will consist of a list of nodes. We’ll consider three types of nodes: strings (such as 8/21), simple elements with no attributes (such as <h3>Syllabus</h3>), and link elements with a URL (e.g., <a href="https://www.youtube.com/watch?v=eeyjTatp_fI">DrRacket</a>).

We’ll use the following structure and data definitions:

(define-struct elem [tag content])
;A Simple-Element is: (make-elem String [ListOf Node])
;Interpretation: The string tag represents the type of HTML tag (e.g., "h1", "p",
;"em", ...), and content represents the list of nodes that occur under
;the tag.
 
(define-struct link [url content])
;A Link-Element is: (make-link String [ListOf Node])
;Interpretation: The string url represents the web address of the page
;linked to, and the content represents the list of nodes that occur under
;the link.
 
;A Node is one of:
; - String
; - (make-elem String [ListOf Node])  [Simple-Element]
; - (make-link String [ListOf Node])  [Link-Element]

For example, the content

There is no ethical consumption under late capitalism.

is generated by the HTML

There is <em>no</em>
<a href="https://en.wikipedia.org/wiki/Ethical_consumerism">ethical consumption</a>
under late capitalism.

and represented by the [ListOf Node]
(list "There is "
      (make-elem "em"
                 (list "no"))
      " "
      (make-link "https://en.wikipedia.org/wiki/Ethical_consumerism"
                 (list "ethical consumption"))
      " under late capitalism.")

Exercise 2 Write template(s) for processing a Node and for processing a [ListOf Node]. (Hint: These are mutually recursive data types, so in the body of the function that processes a Node, you’ll need to call the function that processes a [ListOf Node] and vice versa.)
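
As a rough sketch of where the hint leads (the names node-temp and lon-temp are just placeholders, not required names), the mutual recursion in the data definitions shows up as mutual recursion in the templates:

; node-temp : Node -> ???
(define (node-temp n)
  (cond
    [(string? n) ...]
    [(elem? n) (... (elem-tag n) ... (lon-temp (elem-content n)) ...)]
    [(link? n) (... (link-url n) ... (lon-temp (link-content n)) ...)]))
 
; lon-temp : [ListOf Node] -> ???
(define (lon-temp lon)
  (cond
    [(empty? lon) ...]
    [(cons? lon) (... (node-temp (first lon)) ... (lon-temp (rest lon)) ...)]))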

Exercise 3 A WebPage consists of a URL (a string representing the address of the web page) and some content (a list of Nodes). Give a structure and data definition for a WebPage.

Exercise 4 Define an example of a WebPage that has the URL "http://example.com/index.html" and has the following content:

<h1>My Page</h1>
<p>Go
   <a href="http://example.com/links.html">here</a>
   for my
   <em>favorite</em>
   links.</p>

Exercise 5 Define an example of a WebPage that has the URL "http://example.com/links.html" and has the following content:

<p>This is my
   <a href="http://example.com/links.html">links page.</a>
   Back to the
   <a href="http://example.com/index.html">main page?</a></p>

Extracting Links

Exercise 6 Write a function list-outlinks that takes a [ListOf Node] and returns a list of all the URLs that are linked to within it. (Hint: Since [ListOf Node] is one part of a mutually recursive data type, you’ll actually need to write two mutually recursive functions.)
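
For example, using the [ListOf Node] shown earlier (which contains exactly one link), a test might look like this:

(check-expect
 (list-outlinks (list "There is "
                      (make-elem "em" (list "no"))
                      " "
                      (make-link "https://en.wikipedia.org/wiki/Ethical_consumerism"
                                 (list "ethical consumption"))
                      " under late capitalism."))
 (list "https://en.wikipedia.org/wiki/Ethical_consumerism"))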

Exercise 7 Write a function page-outlinks that takes a WebPage and returns a list of all the URLs that are linked to within the content of that web page. You should use the function you wrote in the previous problem to make this easier.

Exercise 8 Write a function get-page that takes a String representing a URL, a [ListOf WebPage], and a default WebPage. It should return the first web page in the list that matches the given URL. If no web page in the list matches the URL, it should return the given default web page.
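
For instance, assuming your Exercise 3 definition gives you a constructor make-webpage whose first argument is the URL (your field names and order may differ), get-page should behave roughly like this:

(define PAGE-A (make-webpage "http://a.example/index.html" (list "A")))
(define PAGE-B (make-webpage "http://b.example/index.html" (list "B")))
 
(check-expect (get-page "http://b.example/index.html" (list PAGE-A PAGE-B) PAGE-A) PAGE-B)
(check-expect (get-page "http://no-such.example/" (list PAGE-A PAGE-B) PAGE-A) PAGE-A)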

Random Surfing

An important step in the PageRank algorithm involves randomly following links from a web page. We’ll need to be able to randomly select items from a list for this, so let’s take a brief detour to write some functions to help us do that.

Exercise 9 Design a function called list-reference that takes a [NEListOf X] and a Natural Number i (which must be smaller than the length of the list) and returns the element of the list at the ith position, counting from 0. Your function should pass the following tests:

(check-expect (list-reference (list 1 4 9 16 25) 3) 16)
(check-expect (list-reference (list "one") 0) "one")

(Note: There’s a built-in function called list-ref that already does this. You may not use that function for this problem. Hint: Your second argument is a Natural Number. Write down the recursive data definition for Natural Numbers and use the corresponding template.)
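
As a reminder (this is standard material from the textbook, not a new requirement), a Natural Number is either 0 or one more than another Natural Number, which leads to a template along these lines (nat-temp is just a placeholder name):

;A Natural Number is one of:
; - 0
; - (add1 Natural Number)
 
; nat-temp : Natural Number -> ???
(define (nat-temp n)
  (cond
    [(zero? n) ...]
    [(positive? n) (... (nat-temp (sub1 n)) ...)]))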

Exercise 10 Using list-reference, design a function that takes a [NEListOf X] and returns a random item from that list.
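
Because the result is random, an ordinary check-expect won’t work here. One option (the name pick-random below is just a placeholder for whatever you call your function) is check-member-of, which passes as long as the actual result equals one of the listed values:

(check-member-of (pick-random (list "red" "green" "blue")) "red" "green" "blue")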

Exercise 11 Write a function random-surf that takes a WebPage and a [NEListOf WebPage] (to be used as a master list, which must include all of the web pages that are linked to in the given web page). It should return another WebPage, randomly selected from the web pages that are linked to in the given web page. If there are no links contained in the given web page, it should return a web page randomly selected from the master list. (Make sure your signatures match your definitions! This function should return a WebPage, not a URL string.)