More questions on gemtext parsing

I'm rewriting my gemtext parser and I've landed on questions that I decided to ignore the first time around, but now I really want to solve:

  1. All lines beginning with the two characters "=>" are link lines, says the spec, but <URL> is also mandatory, so how do we handle a line that is simply "=>\r\n"?

  2. What if <URL> is not a valid URL or it doesn't have reserved characters encoded?

  3. Should single '\n' and '\r' characters be ignored, replaced by a single whitespace, or what?

=> Posted in: s/Gemini | 🛰️ lufte

Jan 08 · 11 days ago

12 Comments ↓

=> 🚀 stack · Jan 08 at 02:12:

As a common-sense response (not based on anything in particular): 1 and 2 are simply malformed URLs, and 3 should be processed the same way as you process any LF or CR, and should at least terminate the line...

=> 🛰️ lufte [OP] · Jan 08 at 02:37:

In the case of a malformed URL I am now choosing to still show the link to the user, and it will become apparent only if clicked. The spec doesn't specify what to do though.
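
A minimal sketch of that lazy approach, in Python for illustration only: the parser keeps whatever target it found, and a check runs only when the link is activated. The helper name `check_link_target` and the two checks it performs are assumptions made for this example; the spec does not prescribe them.

```python
from typing import Optional
from urllib.parse import urlsplit


def check_link_target(raw_url: str) -> Optional[str]:
    """Return an error message for an unusable target, or None if it looks usable.

    Hypothetical helper: meant to be called when the user clicks a link,
    not while parsing, so malformed links are still displayed.
    """
    try:
        parts = urlsplit(raw_url)
    except ValueError as exc:
        # urlsplit raises ValueError for some malformed inputs,
        # such as an unbalanced "[" in the host part.
        return f"malformed URL: {exc}"
    if parts.scheme == "gemini" and not parts.netloc:
        return "gemini URL without a host"
    # Relative references and other schemes are left for the client to resolve.
    return None
```

This only catches a couple of obvious problems; percent-encoding of reserved characters is not checked here.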

Regarding LF and CR, the spec states that line breaks are only represented by CRLF, implying that something else must be done with those characters if they are separate. Again, it doesn't specify what.

=> 🚀 stack · Jan 08 at 04:16:

huh... In spellbinding and my other CGI games I just use printf with a \n... I wonder if the server fixes it. Are you sure the data has to have \r\n?

Too lazy to search with my little phone

=> 🕹️ skyjake [mod...] · Jan 08 at 05:15:

  1. The BNF says that link lines must contain a URI, and I don't think an empty string qualifies as a valid URI, so a line with just a => should fall back to being treated as the default text line type.

  2. The client gets to choose how to treat invalid URIs. A specific behavior is not mandated.

  3. I recommend just stripping/ignoring all CR characters. A solitary CR should not be treated as a line break. The spec says that both LF and CRLF are acceptable line breaks. You'll typically only find CRLF line breaks in text produced on Microsoft Windows. (Or MS-DOS, heaven forbid.)
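
The recommendation in the third point can be sketched roughly as follows. This is an illustrative Python snippet, not code from any existing client, and the function name `split_gemtext_lines` is made up for the example.

```python
def split_gemtext_lines(body: str) -> list[str]:
    """Split a gemtext body into lines, accepting both LF and CRLF.

    Every CR is dropped first, so a solitary CR never acts as a line break.
    """
    without_cr = body.replace("\r", "")
    lines = without_cr.split("\n")
    # A body ending in a line break yields one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

With this approach, "a\r\nb\nc\rd\n" yields the three lines "a", "b" and "cd": the stray CR between "c" and "d" simply disappears.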

=> 🚀 sy · Jan 08 at 06:26:

Actually for gemtext, CRLF is explicitly required:

Gemtext is specified here in its so-called "canonical form" and, since text/gemini is a subtype of MIME type "text", line breaks are therefore represented by the sequence CRLF. Note however that the Gemini network protocol specification allows any subtype of "text" to be transmitted with line breaks represented by LF alone.

=> — Gemtext specification (cf. STD68 / RFC5234)

But I agree that requiring just an LF would be better – and would cover CRLF cases, too.

People ignore it and just use LF anyway. I’ve noticed only one capsule that has CRLF.

=> 👻 mediocregopher [...] · Jan 08 at 08:01:

@sy that excerpt reads to me like LF is acceptable for subtypes of text, and explicitly says that gemtext is a subtype of text, therefore LF is acceptable for gemtext.

=> 🚀 sy · Jan 08 at 08:40:

@mediocregopher: I actually had excerpts from the ABNF definition below it, but couldn't submit a comment that long, and trimmed:

text-line        = *(WSP / VCHAR) CRLF
; CRLF          from [STD68]

With the formal ABNF, I read it as: without CR it is not (strict/canonical) gemtext but a mere text subtype that can be sent via the Gemini network protocol.

=> 🕹️ skyjake [mod...] · Jan 08 at 08:44:

@sy You can submit longer comments as drafts, or by editing and appending more text, or by merging multiple consecutive comments together.

The BNF is inaccurate there, or shall we say the "canonical" form is not exactly the one used in the wild. Gemtext does not require CRLF line endings. As the beginning of the document states, both LF and CRLF are acceptable and should be handled gracefully by the parser.

IIRC, @solderpunk made some remarks about the canonical CRLF usage somewhere, perhaps in the Project Gemini official news feed. However, I don't think there are many people that specifically choose to use CRLF line endings when they write some .gmi files.

=> 🚀 stack · Jan 08 at 13:38:

As usual, every platform just had to do it in a different way:

Unix: newline

Apple: cr

Microsoft: both

The Unix way is of course the right way.

To make it worse, the tty can and often does substitute one for the other, both on input and output.

=> 🚀 sy · Jan 08 at 14:40:

@skyjake Gemini news talks about empty documents and documents that lack a new line on the last line, and has this:

clarified that Gemtext is specified in its "canonical form" (and therefore uses CRLF everywhere).

=> — geminiprotocol.net/news/2024_08_28.gmi

And the wire protocol has this:

When in canonical form, media subtypes of the "text" type use CRLF as the text line break. Gemini relaxes this requirement and allows the transport of text media with plain LF alone (but NOT a plain CR alone) representing a line break when it is done consistently for an entire response body. Gemini clients MUST accept CRLF and bare LF as being representative of a line break in text media received via Gemini.

=> — geminiprotocol.net/docs/protocol-specification.gmi

Reading the document and wire specifications together, I would understand that rather than stripping the CR, a network client should add the CR if it doesn’t exist on the wire :)
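
For illustration, that "add the CR" reading could look something like this in Python. The function name `to_canonical_crlf` is hypothetical, and the sketch assumes the body has already been decoded to a string.

```python
import re


def to_canonical_crlf(body: str) -> str:
    """Rewrite bare LF line breaks as CRLF, leaving existing CRLF pairs alone.

    A solitary CR is not a line break on the wire, so it is left untouched.
    """
    # Match only an LF that is not already preceded by a CR.
    return re.sub(r"(?<!\r)\n", "\r\n", body)
```

A parser that then splits strictly on CRLF would see the same lines whether the capsule sent LF or CRLF.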

As for an actual answer to @lufte's questions: I would suggest that, if you cannot unambiguously ‘fix’ the inconsistencies when parsing, you can always output the line as is, as ordinary text.

=> 🛰️ lufte [OP] · Jan 08 at 15:26:

@mediocregopher I like your interpretation: if any subtype of text (text/gemini being one) can be transmitted with line breaks represented by LF alone, then it means they should be parsed as line breaks as well, and not merely "transmitted".

The excerpts of the network protocol provided by @sy also support this interpretation.

As for "=>\n" lines, defaulting to text may be the most sensible choice.
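
A rough sketch of that fallback, in Python and covering only link and text lines (headings, lists, quotes and preformatted blocks are left out). The `Line` type and the `parse_line` name are invented for this example.

```python
from typing import NamedTuple, Optional


class Line(NamedTuple):
    kind: str                  # "link" or "text"
    text: str                  # link label, or the raw line for text
    url: Optional[str] = None  # set only for link lines


def parse_line(line: str) -> Line:
    """Classify one line (without its line break) as a link or as plain text."""
    if line.startswith("=>"):
        rest = line[2:].strip()
        if rest:
            # First whitespace-separated token is the URL, the remainder is the label.
            parts = rest.split(maxsplit=1)
            url = parts[0]
            label = parts[1] if len(parts) > 1 else url
            return Line("link", label, url)
    # Anything else, including a bare "=>", falls back to a text line.
    return Line("text", line)
```

Here `parse_line("=>")` comes back as a text line containing "=>", so the user still sees what the author wrote.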

=> 🚀 stack · Jan 08 at 16:20:

I am willing to publicly shame those not using Linux or maybe BSD! Or ignore their backwards proprietary ways.

Proxy Information
Original URL
gemini://bbs.geminispace.org/s/Gemini/23546