To figure out how proxies work without screwing up your browser proxy settings, you can use curl:
Note that in a request sent directly to a regular HTTP server, the request line carries only the path (e.g., /foo); the host part of http://xyz.com/ goes in the Host: header.
Parameters for GET are in the URI. For POST, they come as data after the headers and a blank line.
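As a rough illustration of the two layouts, here is a small Python sketch that builds the raw bytes of a GET and a POST request. The function names, the /foo path, and the form-encoded body are assumptions made for the example; the notes do not prescribe them.

```python
from urllib.parse import urlencode


def build_get(host, path, params):
    """Return the raw bytes of a GET request; parameters ride in the URI."""
    lines = [
        f"GET {path}?{urlencode(params)} HTTP/1.0",
        f"Host: {host}",
        "",  # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def build_post(host, path, params):
    """Return the raw bytes of a POST request; parameters follow the blank line."""
    body = urlencode(params).encode("ascii")
    lines = [
        f"POST {path} HTTP/1.0",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",
        "",  # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii") + body
```

Printing build_get("xyz.com", "/foo", {"a": "1"}).decode() and the matching build_post call side by side shows the same parameters moving from the request line to the body.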
Proxies are useful for:
- hiding machines; the proxy makes all of the outside requests
- speed: proxies can cache commonly requested resources, which cuts response time and saves outside bandwidth
- track and/or block user access
- sniff/scan data going out or coming in
- circumvent government web restrictions; I once had to ssh to my box from Guangzhou to access a website. All you need is a machine outside of the firewall. Set it up as a proxy to the outside world and then point your browser at it.
The proxy must collect all of the headers obtained from your browser and pass those along in its request to the remote server. Similarly, the headers received by the proxy from the remote server should be sent back to the browser as part of the response.
Here is what I get back from antlr.org:
$ telnet www.antlr.org 80
Connected to www.antlr.org.
Escape character is '^]'.
GET / HTTP/1.1
HTTP/1.1 200 OK
Date: Tue, 30 Aug 2011 17:53:02 GMT
Set-Cookie: JSESSIONID=901C40246D69C9129E4AF7376B4553E1; Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/html; charset=UTF-8
To process a single request, the basic algorithm was like this:
- get first line from browser, split into the command, the URI, and the HTTP version;
e.g., "GET http://xyz.com/foo HTTP/1.0"
- read in headers until we see a blank line; force the header names to lowercase and put the name-value pairs into a map
- strip out user-agent, referer, and proxy-connection headers
- strip out connection header if its value is keep-alive
- parse the URI to strip out the host and get the "file" name like /foo
- open a socket at port 80 at the remote host
- send it the same HTTP command that you got from the browser, with two changes: make the version 1.0 rather than 1.1 so we don't have to worry about chunking, and use the file name rather than the entire URI
- then send the remote host all of the headers we got from the browser minus the ones we deleted
- if POST, get the content-length header from the browser and copy the data following the browser headers to the socket to the remote host
- read the response line from the remote host like "HTTP/1.0 200 OK"
- send it back to the browser
- strip out connection header from the remote host response headers if its value is keep-alive
- send remote host response headers back to the browser
- read remote data and pass it back to the browser
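The request half of the steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the original solution: the function names are made up for illustration, the browser side is assumed to be a binary file-like object (for example, the result of socket.makefile("rb")), and falling back to port 80 only when the URI names no port is an assumption beyond the list above.

```python
from urllib.parse import urlsplit

# Headers the proxy drops before forwarding (names already lowercased).
DROPPED = {"user-agent", "referer", "proxy-connection"}


def read_headers(stream):
    """Read 'Name: value' lines up to the blank line; keys are lowercased."""
    headers = {}
    while True:
        line = stream.readline().decode("iso-8859-1").rstrip("\r\n")
        if not line:  # blank line (or end of stream) ends the headers
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def filter_headers(headers):
    """Drop the unwanted headers, plus 'connection' when it asks for keep-alive."""
    kept = {k: v for k, v in headers.items() if k not in DROPPED}
    if kept.get("connection", "").lower() == "keep-alive":
        del kept["connection"]
    return kept


def rewrite_request(browser):
    """Return (host, port, request_bytes) to send to the remote server."""
    method, uri, _version = browser.readline().decode("iso-8859-1").split()
    headers = filter_headers(read_headers(browser))
    parts = urlsplit(uri)
    path = parts.path or "/"  # the "file" name, e.g. /foo
    if parts.query:
        path += "?" + parts.query
    out = [f"{method} {path} HTTP/1.0"]  # force 1.0 to avoid chunking
    out += [f"{name}: {value}" for name, value in headers.items()]
    request = ("\r\n".join(out) + "\r\n\r\n").encode("iso-8859-1")
    if method == "POST":
        # Copy exactly content-length bytes of POST data after the headers.
        request += browser.read(int(headers.get("content-length", 0)))
    return parts.hostname, parts.port or 80, request
```

A caller would then open the connection with socket.create_connection((host, port)) and push the bytes with sendall(request) before reading the response.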
Servers and browsers capitalize header names differently, so normalize your header names to lowercase. I did this in my solution and it seems to work.
This algorithm does not deal with keep-alive connections. It forces browsers and servers to open one socket connection per request. Despite buffering the input and output streams, browsing through my proxy is noticeably slower than browsing without it.
This mechanism handles redirects and caching with no problem, because the proxy passes status lines and headers through without interpreting them. For example, in response to the request
GET http://pagead2.googlesyndication.com/pagead/show_ads.js HTTP/1.1
The remote server might return
HTTP/1.0 304 Not Modified
which we can send directly back to the browser.
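A pass-through relay of this kind might look like the Python sketch below. It is only an illustration under assumptions: relay_response is a hypothetical name, both arguments are assumed to be binary file-like objects wrapping the two sockets, and header lines are forwarded exactly as received instead of being lowercased first.

```python
def relay_response(remote, browser):
    """Copy the status line, filtered headers, and body from remote to browser."""
    # Status line goes back untouched, e.g. "HTTP/1.0 304 Not Modified".
    browser.write(remote.readline())

    # Forward headers until the blank line, dropping "Connection: keep-alive".
    while True:
        line = remote.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("iso-8859-1").partition(":")
        if name.strip().lower() == "connection" and value.strip().lower() == "keep-alive":
            continue
        browser.write(line)
    browser.write(b"\r\n")

    # With HTTP/1.0 and no keep-alive, the body simply runs until the server
    # closes the connection; a 304 has no body, so this loop exits immediately.
    while True:
        chunk = remote.read(4096)
        if not chunk:
            break
        browser.write(chunk)
```

Because the function never inspects the status code, a 304 or a 302 takes the same path as a 200.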
A reverse proxy often acts like a load balancer and lets clients outside the company's firewall access multiple resources behind it. A request to the reverse proxy triggers an internal request to the company's servers; that data is returned to the client. These kinds of proxies can also do filtering and so on.
Reverse proxies can also have SSL acceleration hardware that removes the burden from the actual Web servers behind them.
Network address translation vs Proxies
NAT is what your Internet cable modem router does to convert between an outside public address and internal private addresses like 10.0.0.1 or 192.168.0.5. All internal addresses map to a single external address for outgoing traffic. That happens at the network layer, whereas proxies work at the application layer: literally, an application on the server decides to forward a request to another server.
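To make the many-to-one mapping concrete, here is a toy Python model of a translation table. It assumes the common port-based flavor of NAT, where the router tells internal machines apart by the public port it assigns; the text above does not spell that detail out. ToyNat, the starting port 40000, and the public address 203.0.113.7 are all invented for the example.

```python
class ToyNat:
    """Toy model: many private (ip, port) pairs share one public IP address."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def outgoing(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet to the shared public address."""
        key = (private_ip, private_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out[key]

    def incoming(self, public_port):
        """Find which internal machine a reply belongs to, or None if unknown."""
        return self.back.get(public_port)


nat = ToyNat("203.0.113.7")
print(nat.outgoing("10.0.0.1", 5000))
print(nat.outgoing("192.168.0.5", 5000))
print(nat.incoming(40001))
```

Nothing in this table looks at HTTP at all, which is the contrast with a proxy: the proxy reads the request and decides where to send it.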