The CGI (Common Gateway Interface) protocol governs a variety of communication between application software and information servers. Here we are concerned with its most common use: a web browser, the client, sends a URL (Uniform Resource Locator) to a web server, and the server sends back a page of characters in response. The server may be a single machine, or a cluster of machines offering a set of network services. Each server machine runs a set of processes that respond to client requests. Enhancing possibilities for confusion, such processes are also known as "servers."
We confine our attention to URLs of the most common variety, using HTTP (HyperText Transfer Protocol), with syntax of the form
http:// host [ : port ] / path [ ? query ]
As usual, optional syntax is enclosed in brackets. The URL elements may contain letters, numbers, and certain punctuation, but no whitespace.
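To make the URL syntax concrete, here is a small sketch that splits a URL into the host, port, path, and query elements named above. It is written for Python 3 (the listings in this tutorial use Python 2), and the port number 8000 in the sample URL is just an assumption for illustration.

```python
from urllib.parse import urlsplit

# Split a sample URL into the elements of the syntax shown above.
parts = urlsplit('http://localhost:8000/cgi-bin/query.py?some+query')

host = parts.hostname    # 'localhost'
port = parts.port        # 8000 (None when the optional port is omitted)
path = parts.path        # '/cgi-bin/query.py'
query = parts.query      # 'some+query' ('' when the optional query is omitted)
```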
In the simplest case there is no URL query and the URL path specifies a file containing a page of text with HTML formatting and hyperlink markups. The server responds by sending the file contents back to the client (browser), prefixed with the string
Content-type: text/html\n\n
where \n represents a newline character. Since servers may return many kinds of data, this prefix is necessary to indicate that what follows is plain text that is to be displayed with interpretation of HTML markups.
The web began with such web pages, which are called static content, but today more often than not web servers are delivering dynamic content. That means the server is returning information that is generated at the time of each request, often using database information.
Dynamic content may be requested via hyperlinks, as with static content, but it is also frequently generated in response to HTML form requests (sent by clicking a submit button). Such requests may also be generated by client-side programs, typically written in JavaScript.
Traditionally CGI programs always returned a complete page for the browser to display. The latest revolution in web technology is Ajax (Asynchronous JAvaScript and XML), which allows client-side programs to request small bits of data from the server for dynamic modification of an existing page. This is the primary enabling technology of so-called Web 2.0 applications.
We will see how to use Python CGI for responding to simple form requests. Python libraries for server-side Ajax exist, and client-side Ajax using Python is on the horizon, but that's beyond this tutorial.
The basic CGI mechanism for dynamic content generation is for the URL path to indicate a CGI program, rather than a static web page file. The server creates a process that runs the program, and passes it the URL query string in the operating system environment variable named QUERY_STRING. The client in some cases sends additional data following the URL which the server forwards to the standard input of the CGI process. Then the server sends back to the client whatever the CGI program sends to its standard output, as with a print statement.
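The hand-off described above can be imitated in a few lines. The following is only a sketch of the idea, not how any real server is implemented: it is written for Python 3, and the names SCRIPT and run_cgi are made up for this illustration. It starts a new process, passes the query in the QUERY_STRING environment variable, feeds any additional request data to standard input, and collects whatever the program writes to standard output.

```python
import os
import subprocess
import sys

# A stand-in for a CGI program: it reads the query from the environment
# and writes its response to standard output.
SCRIPT = r'''
import os
print('Content-type: text/html\n')
print(os.environ.get('QUERY_STRING', ''))
'''

def run_cgi(script_source, query_string, posted_data=b''):
    """Run a CGI-style program in a new process and return its output."""
    env = dict(os.environ, QUERY_STRING=query_string)
    result = subprocess.run(
        [sys.executable, '-c', script_source],
        input=posted_data,        # forwarded to the program's standard input
        env=env,                  # carries QUERY_STRING to the program
        stdout=subprocess.PIPE,   # captured, as a server would, for the client
        check=True)
    return result.stdout.decode('utf-8')

response = run_cgi(SCRIPT, 'some+query')
```

Here response holds the content-type line, a blank line, and then the query string, which is what the server would send back to the client.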
Of course reality does not always conform to the basic model. For example, the server sometimes runs the program directly, rather than creating a new process, for greater efficiency.
The server needs to know which URL paths indicate CGI programs, and for security reasons it is important to group CGI programs in one (or a few) places. A common convention is to place them in a directory named cgi-bin in the server root directory.
Frequently CGI programs are written in scripting (dynamically typed) languages, in which case they are called CGI scripts. Perl has been the most common language for such scripts, but Python is rapidly gaining in popularity for this purpose, for which it is very well suited. Let's see how to write and test some Python CGI scripts.
Since CGI scripts run on a server, writing CGI scripts is server-side programming. For some purposes programs must run on the client side of the client/server divide, or at least their performance is much better if they are running in the client. At present the only client-side programming languages that are widely supported are JavaScript, Java, and (less often) Flash. Fortunately it seems that will change soon, thanks to Microsoft's new Silverlight plug-in. It will run on all major browsers and supports a wide variety of .NET languages, including IronPython.
Here is a very simple Python CGI program:
print 'Content-type: text/html\n'
print '<strong>Python</strong> powered!'
All our CGI scripts will be returning HTML pages for display, so they all must print the content-type string before anything else. Don't forget the extra newline character at the end of the string. The print statement supplies the second required newline.
To try out this program, it needs to be installed where a running server process can find it. Server processes usually run on machines dedicated to this purpose, for reasons of security, reliability, and efficiency. Security in particular must be a major concern whenever you are inviting the rest of the world to request services from a machine.
We will shortly see how to install our CGI programs on such a server. But it would be a real hassle if we had to install our program on a public server every time we want to test a change, and it might not be good for security to make a CGI program public before thoroughly testing it.
Fortunately most operating systems on which programs are developed recognize the domain name localhost, and its associated IP address 127.0.0.1, as referring to the development machine itself. Local server processes can respond to requests directed to this domain from clients on the same machine, without any of the complexity and security issues involved in responding to requests from outside of the machine.
Python comes with modules that can be used as the foundation for custom HTTP servers. These modules can also be used to easily implement a minimal server that just handles local Python CGI scripts. Here is our simple server program:
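"""
Runs Python scripts in ./cgi-bin/<name>.py in response to URLs
of the form http://localhost/cgi-bin/<name>.py[?<query>]
"""
import CGIHTTPServer, BaseHTTPServer
# Port 80 is the HTTP default, so URLs need no explicit port.
# (Binding to it usually requires administrator privileges on Unix-like systems.)
port = 80
# This handler runs scripts found in the cgi-bin subdirectory.
handler = CGIHTTPServer.CGIHTTPRequestHandler
# Listen only on localhost, so only clients on this machine are served.
httpd = BaseHTTPServer.HTTPServer(('localhost', port), handler)
httpd.serve_forever()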
Store this program in a directory with a subdirectory named cgi-bin. Here are a few things about this server program to keep in mind.
It must be started from its own directory.
Assuming it is stored in a file named cgiserver.py, it can be started by opening an operating system shell, changing to the directory in which it is stored, and then running the command:
python cgiserver.py
On Windows systems with the right association for .py file types, it can be started by double-clicking its file.
It sends output to its console (shell or double-click window) that records every access and displays error messages.
It offers its service indefinitely. To stop it you have to kill it, as by closing its console window.
It only runs Python scripts, which must be in the cgi-bin subdirectory and have file names with the extension .py.
It works on STC and LH115 machines, but not on PCs in which Python has been installed in a directory that has a space in its path (such as C:\Program Files\).
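For readers using Python 3, where the CGIHTTPServer and BaseHTTPServer modules were merged into http.server, a roughly equivalent server might look like the sketch below. The function name make_cgi_server and the default port 8000 are assumptions made for this illustration, and note that CGIHTTPRequestHandler has been deprecated since Python 3.13.

```python
from http.server import CGIHTTPRequestHandler, HTTPServer

def make_cgi_server(port=8000):
    """Build (but do not start) a local server that runs ./cgi-bin scripts."""
    # Listening on 'localhost' keeps the server private to this machine.
    return HTTPServer(('localhost', port), CGIHTTPRequestHandler)
```

Calling make_cgi_server().serve_forever() from the directory that contains cgi-bin starts the server. With a port other than 80, the port has to appear in the URL, as in http://localhost:8000/cgi-bin/python_powered.py.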
Let's give it a try! Store the script at the beginning of this section in a file named python_powered.py in the cgi-bin directory and visit
http://localhost/cgi-bin/python_powered.py
with your browser. It should display "Python powered!".
Before going further, it's good to know what happens when things go wrong, since they will.
Introduce a syntax error in the script, such as misspelling one of the print keywords, and try it again. Notice the error message in the server window. The browser gets back an empty reply, which it may display as a blank page, or some stranger behavior (Firefox acts like you are trying to open the script itself, but if you use the dialog to save it, the file is blank; whatever...).
Next introduce a runtime error, such as an undeclared variable reference, before the content-type print statement. The browser response is probably the same as for a syntax error, and the server window just indicates the script exit status was 0x1, rather than OK. That's not much help!
Now move the bogus piece of code after the content-type print statement, and notice that a traceback message appears in the browser. That's better, though the message is hard to read since HTML interpretation normally treats all whitespace as a single space. The moral of this story: print the content type information before anything has a chance to go wrong (other than syntax errors).
If we end up needing to parse complicated tracebacks, it would help to fix the whitespace interpretation problem. We also might like to catch errors so we can provide something informative (or at least less frightening) to our client, and record the gory details in the server error log.
The error log of our simple development server is the same as its normal log: the server console. As we shall see, production servers have log files that can be inspected when clients complain of script problems. So this is where you want error messages to go when others are using your script.
Try this script:
print 'Content-type: text/html\n'
try:
    print '<strong>Python</strong> powered!'
    bogus
except:
    print '<p><strong>Something went wrong!</strong></p>'
    import traceback
    traceback.print_exc()
Exceptions are caught as usual with a try statement. In response to an exception, we let the user know something's amiss, and print the exception's traceback so it shows up on the server error log.
We could also have the traceback show up in the browser, as before, but nicely formatted this time, by replacing the except clause with:
except:
    print '<pre>'
    import traceback, sys
    traceback.print_exc(file=sys.stdout)
    print '</pre>'
You might prefer this for development purposes. Give it a try.
Here the HTML pre (pre-formatted) markup takes care of the whitespace interpretation. We also need to redirect the traceback printing to sys.stdout, the "standard output", for it to be sent to the browser. By default, the print_exc function sends it to the "standard error" output (the error log), as in our first exception handling code.
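The two ideas in this section, printing the content type first and catching everything after it, can be packaged into one reusable helper. This is a minimal sketch for Python 3, not a standard library facility: run_page and its parameters are names invented here, and the out and log arguments exist only so the behavior can be tried without a server.

```python
import io
import sys
import traceback

def run_page(make_body, out=sys.stdout, log=sys.stderr):
    """Send the content type first, then the page; report failures politely."""
    # The content type goes out before anything has a chance to go wrong.
    out.write('Content-type: text/html\n\n')
    try:
        out.write(make_body())
    except Exception:
        # The client sees a short message; the details go to the error log.
        out.write('<p><strong>Something went wrong!</strong></p>\n')
        traceback.print_exc(file=log)

def broken_page():
    return bogus   # deliberately undefined: raises NameError when called

# Try it with in-memory streams standing in for the browser and the log.
browser, error_log = io.StringIO(), io.StringIO()
run_page(broken_page, browser, error_log)
```

After this runs, browser holds the content type followed by the friendly message, and error_log holds the NameError traceback.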
In our introduction to CGI we mentioned that the query part of the URL is stored in the operating system environment variable named QUERY_STRING when a script is invoked. If the optional query part of the URL is omitted, this string is empty.
Python's os module has a number of methods and other attributes for dealing with the operating system. One is the variable environ, which contains a dictionary with a binding for each variable in the operating system environment.
To take advantage of this, store the script
import os
print 'Content-type: text/html\n'
print os.environ['QUERY_STRING']
in query.py and visit the URL
http://localhost/cgi-bin/query.py?some+query
You'll see the output
some+query
Spaces are not allowed in URLs, so the convention is to replace them with + in a query.
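A script usually wants the original text back rather than the encoded form. In Python 3 the standard library's urllib.parse module handles both directions, as in this short sketch (the sample strings are just illustrations):

```python
from urllib.parse import quote_plus, unquote_plus

# Decode a query as received in QUERY_STRING.
decoded = unquote_plus('some+query')     # 'some query'

# Encode text for use in a query; a literal + must itself be escaped.
encoded = quote_plus('some query')       # 'some+query'
escaped_plus = quote_plus('1+1')         # '1%2B1'
```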
HTML supports forms with text box, text area, check box, radio button, and menu elements. Form requests are only sent to the server when a submit button is pressed. There are two kinds of form requests. A get request sends all its information in a URL query string, while a post request sends its information following the URL, which the CGI program reads from its standard input.
Get requests are easier to program, but post requests can handle more kinds of data and are sometimes preferable for other reasons. Detailed consideration of HTML form specification is beyond this tutorial. Here we only illustrate the simplest form of Python programming technique to support both kinds of requests.
Let's try the following HTML page, which includes both a get form with a text box and a post form with a text area.
<html>
<head>
  <title>Form CGI Test</title>
</head>
<body>
  <h1>Form CGI Test</h1>
  <form method="get" enctype="multipart/form-data" action="http://localhost/cgi-bin/get.py">
    <input type="text" name="text input" value="text to get" size="40"/>
    <br/>
    <input type="submit" value="submit via GET"/>
  </form>
  <form method="post" enctype="multipart/form-data" action="http://localhost/cgi-bin/post.py">
    <textarea name="text area" rows="5" cols="35">
post this line
and this one
</textarea>
    <br/>
    <input type="submit" value="submit via POST"/>
  </form>
</body>
</html>
It indicates that in response to clicking the "submit via GET" button the localhost script cgi-bin/get.py is to be invoked. We test this with the script:
# get.py, responds to GET request by printing the query string
import os
print 'Content-type: text/html\n'
print os.environ['QUERY_STRING']
If we haven't modified the initial contents of the text box, text to get, the browser displays:
text+input=text+to+get
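The query string pairs each form element's name with its value. Rather than splitting it by hand, a Python 3 script can let urllib.parse.parse_qs do the work, as in this sketch using the query shown above:

```python
from urllib.parse import parse_qs

# Each name maps to a list, since a form may send a name more than once.
fields = parse_qs('text+input=text+to+get')
text = fields['text input'][0]    # 'text to get'
```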
Similarly, in response to clicking the "submit via POST" button, the localhost script cgi-bin/post.py is invoked. We test it with:
# post.py, responds to POST request by displaying stdin lines in the browser
import sys
print "Content-type: text/html\n"
for line in sys.stdin:
    print line + '<br>'
Standard input is read a line at a time, and each line is printed with a following break markup. With the initial text area contents, the browser displays something like:
-----------------------------11765213935332
Content-Disposition: form-data; name="text area"

post this line
and this one
-----------------------------11765213935332--
The first and last lines uniquely bracket the response text, so the server knows when the form data ends.
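Picking the field values out of such posted data by hand is error-prone. One possible approach in Python 3, sketched below, is to borrow the standard email package, since the multipart layout is the same one used for mail attachments. The function name parse_multipart and the boundary XyZ123 are made up for this illustration; in a real script the boundary arrives as part of the CONTENT_TYPE environment variable.

```python
import email

def parse_multipart(body, boundary):
    """Return a dict mapping form field names to their text values."""
    # Prepend the header that tells the email parser where parts begin and end.
    header = (b'Content-Type: multipart/form-data; boundary='
              + boundary + b'\r\n\r\n')
    message = email.message_from_bytes(header + body)
    fields = {}
    for part in message.get_payload():
        name = part.get_param('name', header='content-disposition')
        fields[name] = part.get_payload(decode=True).decode('utf-8')
    return fields

# Sample posted data, laid out like the output shown above.
BODY = (b'--XyZ123\r\n'
        b'Content-Disposition: form-data; name="text area"\r\n'
        b'\r\n'
        b'post this line\r\n'
        b'and this one\r\n'
        b'--XyZ123--\r\n')

fields = parse_multipart(BODY, b'XyZ123')
lines = fields['text area'].splitlines()
```

Here lines holds the two lines typed into the text area, without the boundary and header lines.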
You've seen enough now to be able to write a variety of CGI scripts and test them locally with a little server designed for Python scripts. Yet without taking care of one more detail, your scripts will fail with a mysterious message on most production servers. The problem is that the server is prepared to run scripts written in many languages, and has no idea your program is written in Python! A .py file name extension does not help, since most servers ignore the extension. (Some even require that all scripts, including those in Python, have the extension .cgi.)
The standard way of telling the server that your script is written in Python is for the first line to be
#!/usr/bin/env python
This is just a comment as far as Python and other popular scripting languages are concerned. But when the server sees that a script's first line begins with #!, it uses the rest of the line to start the program that is to interpret the script.
Unix system shells do likewise, so you can write Python shell command scripts in any Unix system with Python installed. Unfortunately this is one brilliant Unix invention that Microsoft never adopted.
Here's more of the story, if you're interested. The server runs the text following the #! as an operating system command, with an appended argument that is the path to the script file itself. The /usr/bin/env command runs its arguments as a command. The point of letting the env program start things, rather than pointing directly to the python program, is that python may be installed in different places on different systems, whereas env has a standard location and will find the python program if it is in the Unix command path (stored in the environment variable PATH).
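The rule just described is simple enough to write down. The sketch below (Python 3) is a simplification for illustration only: shebang_command is a made-up name, the script path is hypothetical, and real systems differ in how they split the text after #! into arguments.

```python
def shebang_command(first_line, script_path):
    """Return the command that would be run for a script, or None."""
    if not first_line.startswith('#!'):
        return None   # no #! line: nothing says how to interpret the script
    # The text after #! is the command; the script's own path is appended.
    return first_line[2:].split() + [script_path]

command = shebang_command('#!/usr/bin/env python', '/var/www/cgi-bin/query.py')
# command == ['/usr/bin/env', 'python', '/var/www/cgi-bin/query.py']
```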
Though a variety of useful scripts may be implemented using the basic technology presented here, it is important to have some notion of its limitations.
Web programming is often highly complex and tricky. We haven't mentioned support for pervasive technologies such as cookies and essential concepts such as session management, without which any but very simple web applications break down. One of the most important reasons for this is that servers are stateless: they do not (using the basic technology introduced here) remember anything about a client from one request to the next.
Another source of complication is that advanced web systems must negotiate a maze of rapidly evolving and sometimes conflicting standards, whose implementations are invariably incomplete and buggy.
Finally, and most importantly, there are a great many ways in which security can be compromised if one is not very careful. The critical concern for the simple techniques presented here is that all request data be thoroughly validated before it is used. For example, it may be intended that a script only be invoked in response to submission of a particular HTML form, which highly constrains the form of the response information (query string or following standard input). But there is nothing that prevents a bad guy from sending the script a request containing any kind of data, including data intended to trick the script into doing something very bad.
In response to these and other challenges, a variety of web application frameworks have been developed. These are libraries of code that allow most web application development to be approached at a much higher level of abstraction. It would be folly to develop a complex dynamic web site without using such a framework, or perhaps a combination of several frameworks that address different issues. There are several such frameworks for Python. My favorite is Django, but that's another story.
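As a small illustration of validating request data, the following Python 3 sketch accepts a value only if it matches a strict pattern, and escapes it before placing it in the page. The function safe_greeting, the pattern, and the idea of a "name" field are all assumptions made for this example; what counts as valid depends entirely on the application.

```python
import html
import re

# Accept only a letter followed by up to 39 letters, digits, spaces, _ or -.
NAME_PATTERN = re.compile(r'[A-Za-z][A-Za-z0-9 _-]{0,39}')

def safe_greeting(raw_name):
    """Return an HTML fragment, refusing any value that fails validation."""
    if not NAME_PATTERN.fullmatch(raw_name):
        return '<p>Invalid name.</p>'
    # Escaping is a second line of defense in case the pattern is loosened.
    return '<p>Hello, %s!</p>' % html.escape(raw_name)

greeting = safe_greeting('Guido')
rejected = safe_greeting('<script>alert(1)</script>')
```

The first call produces a normal greeting, while the second is rejected instead of echoing the markup back to the browser.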
Well, that's enough new stuff and warnings for now. Go forth and produce Python CGI scripts! Then we'll see how to deploy them on a production server.