Protocol pondering intensifies, Pt II

In the previous post in this series[1] I compared request formats for

gopher and HTTP and thought a bit about what a good anonymous document

system actually needs. I ended up deciding that the answer was

nothing more than gopher already provides. In this post I'll continue

that discussion, focussing instead on the response format.

Recall that a gopher server's response to a request consists of

nothing more than the content. What does HTTP look like? Here's a

quite light real world example, obtained by requesting the /index.html

path from grex.org:

HTTP/1.0 200 OK

Server: nginx

Date: Fri, 14 Jun 2019 19:16:18 GMT

Content-Type: text/html

Content-Length: 45

Last-Modified: Sat, 21 Apr 2018 12:23:32 GMT

Connection: close

ETag: "5adb2d44-2d"

Accept-Ranges: bytes

The very first part, HTTP/1.0, is of course the protocol version.

Notice that this was the last component of the request, but it's the

first component of the response. What's all that about? Anyhow, in

general I think it makes a heck of a lot of sense for a response to

a request to use the same protocol version as the request, which the

client of course is already aware of, so this is dead weight. The

next part, "200", is a status code, indicating whether or not the

request was successful or triggered an error. It's followed by a

human-friendly version of the machine-friendly status code, in this

case simply "OK". There are lots and lots of status codes in HTTP[2]!

Then we have a bunch of headers, which look just like the request

headers from last post. There's a lot of dead weight in here, for

simple purposes. The "Date:" is in there for cache-related reasons. In

a protocol without caching, this is useless. Specifying the "Server:"

software and version serves no useful purpose, and many webserver

admins actually disable this feature to avoid giving away hints about

which vulnerabilities might be applicable to their server. The

"Content-Length:" is useful for when a single TCP/IP connection is used

for multiple request/response pairs. There's some overhead involved

in setting up and tearing down these connections, and as webpages

started to trigger more and more requests - to fetch stylesheets, and

scripts, and images - this overhead added up to a non-trivial part of

total time for a website to render. Re-using connections is one

solution to this, and it means that the client needs to know when the

server is done responding, and it can do this by counting bytes until

the entire "Content-Length" has been received. A better solution to

this problem is to Stop Making So Many Damn Requests, which means the

server can signal the end of the content by just closing the

connection, rendering this header useless.
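
To make the "just close the connection" approach concrete, here is a minimal sketch in Python of a client reading a response with no Content-Length at all. The function name and chunk size are my own choices for illustration, not part of any protocol.

```python
import socket


def read_until_close(sock: socket.socket, chunk_size: int = 4096) -> bytes:
    """Collect bytes until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # an empty read means the other end has closed
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

This is how a gopher client already knows a response is finished: no byte counting, one request per connection.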

Is there anything of value in here? I think so! The status code is

interesting. Did you know that gopher has no real way to signal an

error? You might be thinking "Hey, what about item type 3?", but the

thing about item type 3 is, well, it's an item type. When do we see

item types? In gopher menus, and in gopher menus only. If a gopher

client sends a request for what it thinks should be a text file, but

it's followed a misspelled selector and the file doesn't exist, the

client isn't going to try to parse the response as a menu, so it's not

going to have any way to recognise the error item type. Indeed, if

you request a non-existent selector from a gopher server, it'll say

something to you like "Error: File or directory not found!" (this is

what Gophernicus will say), but it's only you, as a human, who

hopefully reads English, who can recognise this as an error. A simple

script has no way to distinguish this situation from a totally

successful transaction. Because of this, it's e.g. impossible to

write a script to crawl a gopherhole looking for broken links. Well,

maybe not impossible, but certainly non-trivial: you could figure out

the particular server's idiosyncratic choice of error message by

requesting a couple of randomly-generated long selectors which are

highly unlikely to be in actual use, and use the most common response

as the "404 equivalent string". Needless to say, this is not exactly

simple. It's perhaps not the biggest problem in the world, but

it's certainly a shortcoming of gopher which could very easily be

avoided.
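
The "404 equivalent string" heuristic described above could be sketched roughly as follows. This is only an illustration: `fetch` is a hypothetical callable that takes a selector and returns the raw response bytes, and the probe count and selector length are arbitrary.

```python
import random
import string
from collections import Counter
from typing import Callable


def guess_error_response(fetch: Callable[[str], bytes],
                         probes: int = 5, length: int = 40) -> bytes:
    """Return the most common response to selectors that almost surely don't exist."""
    alphabet = string.ascii_lowercase + string.digits
    responses = Counter()
    for _ in range(probes):
        # A long random selector is highly unlikely to be in actual use.
        selector = "/" + "".join(random.choices(alphabet, k=length))
        responses[fetch(selector)] += 1
    return responses.most_common(1)[0][0]
```

Even this breaks down if a server echoes the requested selector back in its error message, since every probe would then produce a different response. That fragility is rather the point.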

But much more interesting and important is the "Content-Type" header.

Gopher, frankly, sucks at signalling content type. If you've arrived

at a document via a gopher menu, then you know its item type. What if

you want to request a document directly, not by following an item in a

menu? Maybe your friend has told you the selector in an email or via

XMPP. Maybe you bookmarked it last month. You can request that

document by just sending the path and a CRLF, but how do you know

what kind of content you're getting back? If you don't somehow know

it in advance, you need to figure it out for yourself by looking at it

hard ("you" here is a gopher client, not an end user). This is the

reason that gopher has its very own unique URL scheme with its own RFC

(RFC4266), where the itemtype is introduced as an extra component of

the path. You need to write

gopher://zaibatsu.circumlunar.space/1/~solderpunk instead of just

gopher://zaibatsu.circumlunar.space/~solderpunk because with the latter

option your client would have no idea whether or not it should try to

parse what comes back as a menu, display it as text or save it as a

binary file. This problem is also the reason that if you write a

gopher client with bookmark support, you need to store the item type

along with host, port and selector. Neither of these things is

terribly hard, but they are examples of small, inelegant extra hoops

which have to be jumped through because gopher, in this respect, is

*too* simple. It's too simple to straightforwardly handle a

perfectly reasonable situation like "I'd like to fetch this document

from this server but I've never seen it appear in a menu because my

friend just emailed me the link". To me it makes a lot of sense

that the only piece of information you should need to request and

then make use of a resource is that resource's path. That seems,

well, simple.
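
As an illustration of the extra hoop, here is a rough sketch of how a client might pull the item type out of an RFC4266-style URL. The function name is made up, and the handling of an empty path (treating it as a menu, item type 1) is my reading of the RFC rather than something every client necessarily does.

```python
from urllib.parse import unquote, urlsplit


def parse_gopher_url(url: str) -> tuple:
    """Split a gopher URL into (host, port, item type, selector)."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 70  # 70 is gopher's default port
    path = parts.path
    if len(path) < 2:
        # No item type given: assume the root menu.
        return host, port, "1", ""
    # The first character after the slash is the item type,
    # everything after it is the selector.
    return host, port, path[1], unquote(path[2:])
```

Feed it gopher://zaibatsu.circumlunar.space/~solderpunk and it will dutifully report item type "~" and selector "solderpunk", which is exactly the kind of nonsense you get when the item type is left out.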

This problem in gopher is more widespread than just not knowing what

item type a document is. Even if you know that a path points to an

item type 0 text file, you can have problems. One of the earliest bug

reports I got after releasing VF-1 turned out to be the result of

floodgap.com using iso-8859-1 text encoding to support accented

characters in some of their content. VF-1 had just assumed that

everything on gopher was ASCII, which turned out to be very wrong.

There are a lot of encodings out in the wild on gopher. Standard

gopher has no way of telling you what they are. The only way to write

a client which can Just Go Anyway is to use some kind of third

party library to try to "sniff" the encoding (VF-1 uses Chardet[3] for

this). That's a hard problem, which is never guaranteed to be

solvable, and is only possible using a big slab of natural language

corpus statistics. This requirement massively flies in the face of

the RFC1436-enshrined philosophy that "intelligence is held by the

server". When all a protocol does is shovel a bunch of bytes down

your throat and say "you figure out what this is and what to do with

it!", you need a very intelligent client for it to really work out

in all conditions. I don't think it makes much sense to have every

client repeat exactly the same set of expensive computations

after requesting a document in order to figure out information that

the server already knows, but didn't share.
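
To show what guessing looks like without a statistical library, here is a deliberately crude sketch that just tries a few encodings in order. This is not what VF-1 does (it uses Chardet); the function name and the candidate list are assumptions for illustration only.

```python
def decode_with_fallbacks(raw: bytes,
                          candidates=("ascii", "utf-8", "iso-8859-1")):
    """Try each candidate encoding in turn; return (text, encoding used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding can't represent these bytes, try the next
    raise ValueError("none of the candidate encodings fit")
```

Note that iso-8859-1 accepts any byte sequence whatsoever, so the last candidate never fails. It just silently hands back garbage when the guess is wrong, which is why real sniffing needs all those corpus statistics.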

There's a saner alternative to this, and it's for the server to tell

the client, succinctly, what it's actually getting. This can be

implemented with a very small increase in protocol complexity, which

can result in a very large decrease in client complexity.

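Consider the following as a response format, in a hypothetical

protocol which retains gopher's bare bones request format: a single

header line carrying a status code, a MIME type and a text encoding,

followed by the content itself. A concrete example: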

200 text/plain utf-8

Hello, world!

The text encoding could be optional for non-text MIME types. We could

get away from having to specify an encoding at all if this protocol

specified "Thou shalt use UTF-8 and no other encoding shalt thou use",

saving us ~5 bytes, but I dunno if that's too authoritarian. Yes, you

can represent any language you like in UTF-8, but some languages can

be represented more compactly in other encodings, and it seems like a

good thing to provide the ability to minimise the number of bytes sent

over the network. Isn't that also part of the spirit of a minimalist

protocol? A compromise: if you use UTF-8, it's valid to leave off the

third component of the response header. UTF-8 is the implicit

default, but other encodings are possible for a tiny extra cost.
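
A client-side parser for a header like this could be as small as the following sketch. It assumes the components are separated by whitespace (the exact delimiter isn't pinned down here), and the function name is hypothetical.

```python
def parse_header(line: str) -> tuple:
    """Parse a one-line response header into (status, MIME type, encoding)."""
    fields = line.split()  # assumption: whitespace-separated components
    if len(fields) not in (2, 3):
        raise ValueError("expected 2 or 3 header components, got %d" % len(fields))
    status, mime_type = fields[0], fields[1]
    # UTF-8 is the implicit default when the third component is left off.
    encoding = fields[2] if len(fields) == 3 else "utf-8"
    return status, mime_type, encoding
```

For non-text MIME types the returned encoding is meaningless and a client would simply ignore it.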

For the sake of fully specifying a system, including a navigation

solution, without any further discussion or design, let's keep

gopher's menu system as is, and introduce a new pseudo-MIME type for

it, like text/menu or something. I'm not saying this is a great idea,

it just provides a complete concrete example to talk about for the

rest of this post.

If we give gopher a complexity score of 1 and full-blown HTTP a

complexity score of 100, I don't see how this new protocol can be

reasonably scored higher than 10. It's still absolutely trivial to

write a client for this protocol, a nice little weekend project. You

can memorise the protocol easily so you don't need to look up a

complicated RFC to remind yourself of some detail while coding. You

can still cobble together a client out of standard unix utilities:

the response header is guaranteed to be one line long, so you can

just pipe what you get from the network through tail -n +2 to cut it

off. I'm not sure if that would work for binary files, admittedly,

but for something vaguely gopher-like that's an edge case anyway. You

could even still use telnet as a client for this protocol if you

wanted to. Yes, you would see one short line of noise at the top of

each file, but that's a heck of a lot better than seeing a full set of

HTTP headers and I guarantee you'd get used to it and stop even

consciously seeing them after a day of practice. None of the extra

information in this header represents any threat to a user's privacy.

The network overhead is around 20 bytes per request, which is less

than 1% of the size of a typical phlog post.
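
On the binary file worry: in a real client it's easy to cut the header off without touching the content at all, by splitting the raw bytes at the first newline only. A minimal sketch, assuming the header line ends in LF or CRLF and is plain ASCII (both assumptions of mine):

```python
def split_response(raw: bytes) -> tuple:
    """Separate the one-line header from the content, leaving content bytes untouched."""
    header, separator, body = raw.partition(b"\n")
    if not separator:
        raise ValueError("no header line found")
    # Tolerate CRLF as well as bare LF at the end of the header.
    return header.rstrip(b"\r").decode("ascii"), body
```

Only the first newline is treated specially, so any newlines or other odd bytes inside an image or archive come through intact. For the example above, the header "200 text/plain utf-8" plus a CRLF comes to 22 bytes.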

Compared to gopher, this protocol can:

* Use standard URLs without embedded item types, without any

  ambiguity.

* Serve plain text in any encoding under the sun, without ambiguity

  that would otherwise force the client to waste computational effort

  trying to identify the encoding.

* Serve any kind of non-text content under the sun, without ambiguity

  that would otherwise force the client to waste computational effort

  trying to identify the binary file format, and without being forced

  to categorise the content as one of a small number of pre-defined

  item types which are either hopelessly vague or, in 2019, just kind of

  whacky (e.g. gopher item type 5, "PC-DOS binary file of some sort").

* Precisely indicate error conditions in a machine-readable way. In

  the example above I just copied HTTP's "200" status for

  "everything's fine", but in reality HTTP's three digit status codes

  are surely overkill for anything vaguely gopher-like. Status codes

  could probably be a single character. I haven't thought too much

  about applications of these. We could go nuts, implementing

  redirects and all sorts, but I'm not really keen. From time to time

  there are complaints on the gopher mailing list about badly behaved

  crawlers making too many requests per second and overloading

  servers, so a "too many requests, try again later" error code would

  seem a practical thing. I'm not imagining any situation where

  99.9% of requests result in more than 3 or 4 statuses. It should be

  possible to learn all the status codes by heart easily.
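
With a machine-readable status, the broken link checker that is so awkward on gopher shrinks to a comparison. A sketch under loud assumptions: `fetch_status` is a hypothetical callable returning the status component of the response header for a selector, and "200" as the success value is just borrowed from the example above.

```python
from typing import Callable, Iterable, List


def find_broken_links(selectors: Iterable[str],
                      fetch_status: Callable[[str], str],
                      ok_status: str = "200") -> List[str]:
    """Return the selectors whose response status is anything other than success."""
    return [s for s in selectors if fetch_status(s) != ok_status]
```

No probing with random selectors, no guessing at a server's favourite error message.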

This protocol is not as simple as gopher, but I would argue its power

to weight ratio is substantially greater. It's still very simple, and

it's still totally harmless. Crucially, it's non-extensible: the

response header is not open ended, like HTTP's is, so people can't

just add in whatever extra junk they like. I don't want to say that

extensibility is a bad thing, it's often a very smart engineering

solution to some particular problem, but I think I do want to say that

extensibility is the enemy of intentionally brutal simplicity.

Optional extra cruft will inevitably accumulate and then become a de

facto requirement.

In the third and final post in this series, I'll address

possible solutions to the problem of navigation.

[1] gopher://zaibatsu.circumlunar.space:70/0/~solderpunk/phlog/protocol-pondering-intensifies.txt

[2] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

[3] https://chardet.github.io/
