by Steven J. Owens (unless otherwise attributed)
Most people seem to take HTTP for granted, but I've found it's a really, really good idea to get a packet sniffer or a logging proxy and actually watch what the browser sends up to your server, and what your server sends back, and learn about HTTP.
This is especially true when you're doing something clever with the browser. It's even more true when you're trying to figure out why your cleverness isn't working! But I've found in general that being able to watch the actual protocol has given me a lot more of a feel for what's going on. I don't like "magic".
Mozilla Firebird has this nifty plug-in, called Live Http Headers, which displays the header content of the HTTP traffic as you use the browser. It's simpler and easier to use than tcpdump, but I still recommend using tcpdump or tcpflow to really watch the entire connection. Unfortunately the raw capture gets a little ugly when your browser is fetching binary objects, like GIFs or JPEGs.
Note: I've recently come across two HTTP proxy logging tools that appear to be the sort of thing I like to use: Nettools (http://neilja.net/nettool/index.html) and ParosProxy (http://www.parosproxy.org/index.shtml). Both are implemented in Java, so they're truly multiplatform and will run wherever you want to use them.
Most programmers have a basic idea of what HTTP is. It's stateless; the browser opens a new connection for each request, closes the connection when it's received the response, and renders the results for the user. All action is initiated from the browser to the server. There is no persistent connection with the server, there is no way for the server to get a handle on the browser, there is no way for the server to initiate any action to the browser.
This gets complicated a bit by cookies and HTTP 1.1 persistent connections, but the fundamental nature is still there.
When your browser makes an HTTP request, it opens a TCP/IP connection to port 80 on the webserver (or port 443 if it's an SSL server, with a URL that begins with "https:" instead of "http:"). A TCP/IP connection is like a stripped down telnet connection (telnet adds a very small set of escape characters on top of a regular TCP/IP connection). You can fake an HTTP request easily by telnetting to port 80 on a webserver and typing the appropriate HTTP commands in. In the example below, the only things I actually typed are the telnet command and the GET line. Special keys, like the enter key, are shown in <angle brackets>:
00:39:07, puff@darksleep:/var/www/htdocs/notablog/article> telnet www.darksleep.com 80<enter>
Trying 184.108.40.206...
Connected to darksleep.com.
Escape character is '^]'.
GET / HTTP/1.0<enter>
<enter>
HTTP/1.1 200 OK
Date: Thu, 26 Feb 2004 05:39:21 GMT
Server: Apache/1.3.27 (Unix) Debian GNU/Linux PHP/4.2.3
Last-Modified: Fri, 30 Jan 2004 23:19:18 GMT
ETag: "c8695-796-401ae676"
Accept-Ranges: bytes
Content-Length: 1942
Connection: close
Content-Type: text/html; charset=iso-8859-1

<html>
<head>
<title>DarkSleep</title>
</head>
<body BGCOLOR="white">
<hr>
<h2>Welcome to darksleep.com <a HREF="beehive"><img ALIGN="center" ALT="Beehive" BORDER=0 SRC="beehive.jpg" width="75" height="50"></a></h2>
<p>Someday there might even be a website here.</p>
<p><center><img ALT="Give me coffee and nobody gets hurt" SRC="constructioncoffee.gif"></center></p>
<p>Meanwhile, here's some <a HREF="/puff/">stuff</a>, mostly random articles and essays I've written over the years.</p>
<hr>
Here's a nice <a HREF="everyossucks.mp3">song</a>.
</body>
</html>
Connection closed by foreign host.
00:39:15, puff@darksleep:/var/www/htdocs/notablog/article>
A couple of things to note here:
First, note that I typed <enter> twice after my first command. A blank line indicates the break between the head and body of a request or response. Most requests (except POSTs) have no body, so that blank line is the end of the request.
Second, the HTTP request in this example is very simple; I just typed the HTTP GET command itself, with the URL "/" and the mandatory HTTP version argument. I didn't fake any headers along with the request.
Normally the browser will send along a half-dozen or more headers with each request. Cookies, for example, are sent along with each request to the location that set the cookie, whether the request has anything to do with the cookie or not.
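The telnet session above can also be reproduced from a script. What follows is only a minimal sketch in Python, not a real HTTP client: the function names build_request and fetch are made up for illustration, and the Host header is my own addition (the telnet example didn't send one, but servers hosting several sites usually need it). Each line ends with a carriage return plus linefeed, which is what telnet sends when you press enter.

```python
import socket


def build_request(host, path="/", version="HTTP/1.0", headers=None):
    """Return the raw bytes of a GET request with no body."""
    lines = [f"GET {path} {version}"]
    # Host is an assumption added here; the telnet example sent no headers.
    all_headers = {"Host": host}
    all_headers.update(headers or {})
    for name, value in all_headers.items():
        lines.append(f"{name}: {value}")
    # The blank line (an empty CRLF-terminated line) ends the request head.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch(host, path="/", port=80):
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_request(host, path))
        received = []
        while True:
            data = conn.recv(4096)
            if not data:  # an empty read means the server closed the connection
                break
            received.append(data)
    return b"".join(received)


# Example (needs network access; the host name is a placeholder):
# print(fetch("www.example.com").decode("iso-8859-1"))
```

Passing something like headers={"Cookie": "session=abc123"} to build_request shows what a browser-style request with a cookie would look like on the wire.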
If any of this sounds familiar, it should, because it's pretty much the format used for internet email. Specifically, it's a MIME-style message, MIME being the more modern standard for internet email format. An excellent, detailed explanation of MIME encoding is in chapters 3 and 4 of the O'Reilly book Programming Internet Email. The short form is you have a set of header lines, a blank line, and then the body. Each header consists of a name, a colon (:) and a value.
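The "headers, blank line, body" layout is simple enough to pull apart by hand. Here is a rough sketch in Python, assuming the whole message is already in memory. The function name parse_message is hypothetical, and the sketch ignores details such as repeated header names and continuation lines.

```python
def parse_message(raw):
    """Split a raw HTTP message into (first line, headers dict, body bytes)."""
    # The first blank line separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    first_line = lines[0]  # e.g. "HTTP/1.1 200 OK" or "GET / HTTP/1.0"
    headers = {}
    for line in lines[1:]:
        # Each header is a name, a colon, and a value.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return first_line, headers, body
```

Only the first colon on a line is treated as the separator, so a value that itself contains colons, such as the time in a Date header, survives intact.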
Like I said above, most HTTP requests don't have bodies. POST has a body. The seldom-implemented PUT has a body. I'm not sure what else does. A good place to find out would be at w3.org.
HTTP responses are also MIME-style messages, and they almost always have a body, which contains the actual data asked for: the HTML tags for the page, the binary data for an image, or whatever.
Speaking of binary data, one thing to note: unlike standard RFC 1521 MIME messages, HTTP messages don't bother to base64-encode the binary data. I guess this makes sense; since the binary data is going straight from the server to the browser (or in rarer situations, like uploads, vice versa) they don't need to worry about some mail server mangling it.
As near as I can tell, nothing inside the body itself marks the end of an HTTP request or response. The receiver knows the body is finished either because it has read the number of bytes promised by the Content-Length header (the response above has one), or because the other side closed the connection. With HTTP 1.1 persistent connections, the connection stays open for further requests, so closing it can't be the signal: either the length is given up front, or the body is sent with "chunked encoding", where the body is split up into chunks, each preceded by its own chunk-size value, and a zero-size chunk marks the end. Modern clients support this, but I don't think many web applications use it.
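To make the chunked format concrete: each chunk goes over the wire as a line holding the chunk size in hexadecimal, then that many bytes of data, then a CRLF, and a chunk of size zero ends the body. The following Python sketch reassembles such a body. The name decode_chunked is made up, it assumes the complete chunked body is already in memory, and it ignores any trailer headers after the final chunk.

```python
def decode_chunked(data):
    """Reassemble a body that was sent with chunked transfer encoding."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line is hexadecimal, optionally followed by ";extensions".
        size_text = data[pos:line_end].split(b";")[0].strip().decode("ascii")
        size = int(size_text, 16)
        pos = line_end + 2
        if size == 0:  # a zero-size chunk marks the end of the body
            break
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return body
```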
Browser-based uploads (officially multipart/form-data, RFC 1867, but everybody seems to just call it upload) are done by the browser sending a POST with a different Content-Type header (Content-Type: multipart/form-data, followed by a boundary string that separates the parts). Instead of the normal query string encoding, the posted data is packaged as something much more like a MIME message with a binary attachment. Once again, the binary data itself isn't base64-encoded.
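Here is a hedged sketch, in Python, of what such an upload body looks like when assembled by hand. The function name build_multipart, the field names, and the file name in the usage note are all invented for illustration. Real browsers pick their own boundary strings and may add other headers to each part.

```python
import uuid


def build_multipart(fields, files, boundary=None):
    """Return (Content-Type header value, body bytes) for a form upload.

    fields: dict of name -> text value
    files:  dict of name -> (filename, content type, raw bytes)
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        ).encode("utf-8"))
    for name, (filename, content_type, data) in files.items():
        part_head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n"
            "\r\n"
        ).encode("utf-8")
        # The file bytes go in raw, with no base64 encoding.
        parts.append(part_head + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("ascii"))  # closing boundary
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)
```

For example, build_multipart({"title": "cat"}, {"pic": ("cat.gif", "image/gif", gif_bytes)}) returns the header value to send as Content-Type and the bytes to send as the POST body.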
Most modern browsers open several simultaneous connections to the server when it's convenient, typically when they grab a page and the page has image tags for several images. When I first heard about HTTP 1.1 and persistent HTTP connections, I remember hearing some talk that browsers would keep the persistent HTTP connection open so they could grab the image data referred to by a page. Browsers that speak HTTP 1.1 generally do this when the server allows it, reusing each of their connections for several requests. The "Connection: close" header in the response above is the server saying it won't keep that particular connection open.
Oddly enough, while HTTP operations (http://www.w3.org/Protocols/HTTP/) and URL encoding (http://www.w3.org/Addressing/URL/uri-spec.html) are well-specified, I can't seem to find any official document that specifies how CGI query strings with parameters should be constructed. Then again, I can't claim to have looked exhaustively.
Parameters are sent by the following process.
First, the parameter values are URL-encoded. URL-encoding and decoding can be done with java.net.URLEncoder and java.net.URLDecoder. For example, a tab becomes the characters %09, linefeed becomes %0A, and so forth. Spaces may be translated to plus (+) characters for some reason (instead of using a %nn code).
Then the parameter name string and value string are concatenated with an equals sign (=) between them, for example name=value.
Then multiple parameters are concatenated with an ampersand (&) between them, for example name1=value1&name2=value2.
I've also read that optionally you can use a semi-colon (;) to separate parameters, but I've never seen that done in practice, and a little quick testing shows that the servlet API, or at least Jakarta Tomcat (which is the reference implementation servlet engine) doesn't recognize semi-colon as a separator.
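The steps above (encode, join name and value with an equals sign, join the pairs with ampersands) can be sketched in a few lines. I've used Python here, where urllib.parse.quote_plus and unquote_plus play roughly the role that java.net.URLEncoder and java.net.URLDecoder play in Java. The function names encode_pair, build_query and parse_query are made up for this example.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_pair(name, value):
    """URL-encode a name and a value and join them with '='."""
    return quote_plus(name) + "=" + quote_plus(value)


def build_query(params):
    """Join (name, value) pairs into one parameter string with '&'."""
    return "&".join(encode_pair(name, value) for name, value in params)


def parse_query(query):
    """Reverse build_query: split on '&', then on '=', then decode."""
    pairs = []
    for piece in query.split("&"):
        name, _, value = piece.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs
```

Encoding the values before joining is what keeps a literal ampersand or equals sign inside a value from being mistaken for a separator.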
With a GET, the parameters are just glued onto the end of the URL, with a question mark (?) to separate the parameter string from the URL, for example /cgi-bin/script?name1=value1&name2=value2 (a made-up path).
With a POST, on the other hand, the parameters are stuck in the body. I'm still working on finding a good example of a POST, but I left my packet sniffer in my other pants.
One thing I'm not entirely sure of is how POST is handled when you have a whole lot of parameter data. Does it never put in a newline? Does it just put a newline at one of the ampersands? Before or after the ampersand? I'll have to look into this.
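In the absence of a captured example, here is a hand-built sketch in Python of the same parameters sent both ways. The host, path and parameter names are invented, and real browsers send more headers than this. As far as I know, a form POST body is the same parameter string as in the GET case, sent as one unbroken run with no newlines added anywhere. The Content-Length header tells the server how many bytes to read.

```python
from urllib.parse import urlencode


def build_get(host, path, params):
    """GET: the parameter string rides on the request line after a '?'."""
    query = urlencode(params)
    return (
        f"GET {path}?{query} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "\r\n"
    ).encode("ascii")


def build_post(host, path, params):
    """POST: the same parameter string goes in the body, after the blank line."""
    body = urlencode(params).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body
```

Because any newline inside a value is URL-encoded as %0A, the body never contains a raw line break, however much parameter data there is.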
The classic advice about GET and POST with HTML forms is to use GET for scripts or server-side actions that are idempotent, which means repeated invocations should have the same effect as a single invocation. (For example, pushing the "on" switch, versus pushing the switch once to turn it on, again to turn it off, etc).
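The switch analogy translates directly into code. This is a toy sketch in Python with a made-up Switch class, nothing to do with any real web framework: turn_on is the kind of action that is safe behind a GET, since repeating it changes nothing, while toggle is the kind that belongs behind a POST.

```python
class Switch:
    """A light switch with one idempotent and one non-idempotent action."""

    def __init__(self):
        self.on = False

    def turn_on(self):
        # Idempotent: calling this once or five times leaves the same state.
        self.on = True

    def toggle(self):
        # Not idempotent: every repeat flips the state again.
        self.on = not self.on
```

If a user hits reload, a turn_on request does no harm, but a repeated toggle request silently undoes the first one.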
GET parameter strings can end up in bookmarks. This tends to make security-conscious folks a bit paranoid, since one could bookmark a username and password parameter string, for example. This is a good thing to avoid with secure data. But then people go a little overboard and decide that GET is the anti-christ. Bear in mind that from a network-level point of view, GET and POST are nearly the same thing; the only difference is a few newline characters.
GET was also widely rumored, in the early days of the web, to be fraught with peril if you had a large amount of parameter data, due to a widely circulated bug in some of the C text libraries. POST, on the other hand, is supposed to be able to handle large quantities of information (megabytes, in the case of multipart/form-data POSTs).