This page permanently redirects to gemini://dkalak.de/software/gmi2html/.

gmi2html - convert Gemtext to HTML

The script I present here reads Gemtext from stdin, converts it to HTML, and writes it to stdout. It also checks the input for prettiness. If it encounters ugly parts, it writes warnings to stderr. It exits 0 if nothing was written to stderr and 1 otherwise.
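
The exit status rule can be sketched as follows. This is only an illustration of the contract, not the script itself: the function name convert, the warning text, and the pass-through "conversion" are assumptions made for the example.

```python
import io

def convert(lines, out, err):
    """Write every line to out, warnings to err, and return the exit status."""
    warned = False
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line != line.rstrip():
            # Hypothetical warning wording; the real script's messages may differ.
            err.write(f"line {number}: trailing whitespace\n")
            warned = True
        out.write(line + "\n")  # placeholder for the actual HTML conversion
    # 0 if nothing was written to err, 1 otherwise
    return 1 if warned else 0

out, err = io.StringIO(), io.StringIO()
status = convert(["Hello \n", "world\n"], out, err)
print(status, repr(err.getvalue()))
```

In a real command-line tool, the same function would be called with sys.stdin, sys.stdout, and sys.stderr, and its return value passed to sys.exit.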

The script is written in Python. Maybe I will rewrite it in C once I am proficient enough.

Motivation

There are many Gemtext to HTML conversion tools. gemini.circumlunar.space lists some.

=> Gemini software

I have tried the following 2:

=> huntingb’s gemtext-html-converter (in Python)
=> Nicholas Johnson’s gemini2html (in C)

Both render adjacent text lines in the Gemtext input as separate p elements in HTML, and empty lines in the Gemtext input either as empty lines in HTML (which are insignificant) or as br tags outside of p elements. (I have also encountered other issues, like unescaped > characters and some files failing to be converted at all.) The web version of gemini.circumlunar.space seems to have separate p elements for adjacent text lines too, and empty p elements for empty lines.

I think that these approaches are semantically incorrect. Instead, I think that:

Adjacent text lines in the Gemtext input semantically represent a single paragraph. As such, they should be part of the same p element in HTML, and only separated by br tags inside that p element.
Single empty lines between such paragraphs (and/or other blocks) in the Gemtext input are aptly represented by the spacing that HTML renderers insert between p elements (and/or other block elements) by default. They shouldn’t be inserted as additional br tags, and the default inter-element spacing shouldn’t be removed with CSS.
Multiple empty lines (or no empty lines at all) between paragraphs (and/or other blocks) in the Gemtext input – or empty lines at the beginning or end of the document (or a quote block) – change the markup, but not the internal structure of the text. They should be represented by margin statements in CSS, not as br tags.
In general, br tags in HTML should be surrounded by text on both sides, br tags should be inside a p element, and p elements should not be empty.
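
To make the paragraph rule concrete, here is a minimal sketch that handles text lines and empty lines only. The function name paragraphs_to_html is made up for this example, and other line kinds (headers, lists, links, quotes, code) are assumed not to occur in the input.

```python
import html

def paragraphs_to_html(lines):
    """Join adjacent text lines into one p element, separated by br tags."""
    output = []
    in_paragraph = False
    for line in lines:
        if line.strip() == "":
            # An empty line ends the paragraph; it becomes an (insignificant)
            # empty line in the HTML, not a br tag or an empty p element.
            in_paragraph = False
            output.append("")
        elif in_paragraph:
            output.append("<br>" + html.escape(line))
        else:
            # The optional closing p tag is left out.
            output.append("<p>" + html.escape(line))
            in_paragraph = True
    return "\n".join(output)

print(paragraphs_to_html(["Roses are red,", "violets are blue.", "", "1 > 0"]))
```

Every br tag produced this way sits inside a p element with text on both sides, and no p element is empty.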

Definitions

By “empty line”, I mean a line that contains either no characters or only whitespace characters. By “block” in Gemtext, I mean:

adjacent text lines (paragraphs),
adjacent * lines (lists),
adjacent => lines (link lists),
adjacent > lines (quote blocks),
2 ``` lines and the lines in between (code blocks),
a #, ##, or ### line, regardless of its surroundings (headers).

Every block in the Gemtext input is represented by a block element in HTML (p, ul, blockquote, pre, h1, h2, h3). (I don’t count li elements as block elements here, although they technically are.)

A quote block, stripped of the > character in each line, in turn contains paragraphs. They are represented by p elements inside the blockquote element in the HTML output.

Implementation

The script reads 1 line from stdin into memory at a time, determines its kind, and writes the corresponding output to stdout and stderr. It not only distinguishes between lines belonging to different kinds of blocks, but also between:

the lines inside a code block and the ``` lines that surround them,
empty lines and text lines inside of quote blocks,
empty lines and text lines outside of quote blocks.
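
A line classifier along these lines might look like the sketch below. The labels and the function names classify and classify_all are assumptions made for illustration, and the prefix matching is simplified. The only state carried from one line to the next is whether a code block is currently open.

```python
def classify(line, in_code_block=False):
    """Return a label for the kind of a single Gemtext line."""
    if line.startswith("```"):
        return "code fence"
    if in_code_block:
        return "code"
    if line.startswith("=>"):
        return "link"
    if line.startswith("###"):
        return "h3"
    if line.startswith("##"):
        return "h2"
    if line.startswith("#"):
        return "h1"
    if line.startswith("* "):
        return "list item"
    if line.startswith(">"):
        # Inside a quote block, empty and text lines are told apart.
        return "quote empty" if line[1:].strip() == "" else "quote text"
    return "empty" if line.strip() == "" else "text"

def classify_all(lines):
    """Classify lines one at a time, toggling the code block state at fences."""
    kinds = []
    in_code_block = False
    for line in lines:
        kind = classify(line, in_code_block)
        if kind == "code fence":
            in_code_block = not in_code_block
        kinds.append(kind)
    return kinds

print(classify_all(["# Title", "", "Some text", "```", "# code", "```", "> quote", ">"]))
```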

Because only 1 line is buffered, the script doesn’t insert CSS margin statements in cases where Gemtext blocks aren’t separated by exactly 1 empty line, or where a quote block or the document begins or ends with an empty line. While margin-top statements are feasible with a single-line buffer, margin-bottom statements (e.g. if a paragraph is followed by empty lines at the end of a quote block) require that the entire block in question is buffered before the number and status (i.e. whether they are at the end of a quote block or the document) of the following empty lines can be determined. There is also no obvious way to treat quote blocks of arbitrary length that only contain empty lines.

So instead of trying to handle those cases, the script assumes that the input is formatted in a “pretty” way that doesn’t require such handling, and writes warnings to stderr (without doing anything special) when the input is “ugly” (i.e. not pretty). Pretty input needs to meet the following criteria:

There are no adjacent empty lines outside a quote block.
There are no adjacent empty lines inside a quote block.
The first and last line of a quote block aren’t empty. (This implies that there is no quote block that only contains empty lines.)
The first and last line of the entire document aren’t empty.
Blocks (whether outside or inside a quote block) are always separated by a single empty line.
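
Two of these criteria (no adjacent empty lines, and no empty line at the beginning or end of the document) can be checked while remembering only whether the previous line was empty. The following is a sketch under simplifying assumptions: quote blocks and code blocks get no special treatment, and the function name and warning texts are invented for the example.

```python
def empty_line_warnings(lines):
    """Return (line number, message) pairs for misplaced empty lines."""
    warnings = []
    previous_empty = False
    last_number = 0
    for number, line in enumerate(lines, start=1):
        empty = line.strip() == ""
        if empty and number == 1:
            warnings.append((number, "document begins with an empty line"))
        elif empty and previous_empty:
            warnings.append((number, "adjacent empty lines"))
        previous_empty = empty
        last_number = number
    # Only after the last line has been read is it known to be the last one.
    if previous_empty:
        warnings.append((last_number, "document ends with an empty line"))
    return warnings

print(empty_line_warnings(["First paragraph", "", "", "Second paragraph", ""]))
```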

The script also issues ugliness warnings if any of the following criteria aren’t met:

Code blocks need to contain at least 1 line between the ``` lines.
The ``` lines mustn’t contain any text after the ```.
*, =>, #, ##, and ### lines need to contain text.
Lines mustn’t contain any trailing whitespace (even inside code blocks).
=> and > lines need a single space after the => or > (just like *, #, ##, and ###).
Other than the single space after the *, =>, >, #, ##, or ###, text mustn’t contain any leading whitespace (except in code blocks).
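
Several of these per-line checks can be expressed compactly. The sketch below assumes that the line lies outside a code block; the function name line_warnings and the warning texts are hypothetical, and prefix detection is simplified (any line starting with one of the markers counts as that kind of line).

```python
import re

# Longer prefixes first, so that "###" is not mistaken for "#".
PREFIXES = ("###", "##", "#", "=>", "*", ">")

def line_warnings(line):
    """Return a list of ugliness warnings for one line outside a code block."""
    warnings = []
    if line != line.rstrip():
        warnings.append("trailing whitespace")
    for prefix in PREFIXES:
        if line.startswith(prefix):
            rest = line[len(prefix):]
            if rest.strip() == "":
                # An empty > line is allowed; the other kinds need text.
                if prefix != ">":
                    warnings.append("missing text")
            elif not re.match(r" \S", rest):
                warnings.append("expected a single space after " + prefix)
            break
    else:
        # No prefix matched: a plain text line or an empty line.
        if line.strip() and line != line.lstrip():
            warnings.append("leading whitespace")
    return warnings

for sample in ["* item", "=>gemini://example.org", "#  Title ", "*"]:
    print(repr(sample), line_warnings(sample))
```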

These last criteria aren’t related to the question of CSS margins. Some of them are more or less matters of personal taste, enforced so that they are applied consistently. While those warnings don’t necessarily mean that the HTML output is broken (=> and > lines without a space after the => or > work just fine, for example), the script also doesn’t strip away the leading or trailing whitespace that it warns against, which might look ugly.

Output

The script writes lists as ul elements and link lists as p elements with the different a elements separated by br tags.

The script doesn’t write optional closing tags (</p>, </li>). Opening and closing ul, blockquote, and pre tags are written on their own lines. Other opening and self-closing tags (p, br, li, h1, h2, h3) are written at the beginning of a line, other closing tags (h1, h2, h3) at the end of a line.
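
As an illustration of this output layout, the following sketch renders one link list and one list. The function names are invented, each => line is assumed to contain a URL, and using the URL as link text when no label is given is an assumption of the example, not something stated above.

```python
import html

def link_list_to_html(lines):
    """Render adjacent => lines as one p element with br-separated a elements."""
    output = []
    for index, line in enumerate(lines):
        parts = line[2:].split(maxsplit=1)
        url = parts[0]
        # Assumption: fall back to the URL when the line has no label.
        label = parts[1] if len(parts) > 1 else url
        opener = "<p>" if index == 0 else "<br>"
        output.append(f'{opener}<a href="{html.escape(url)}">{html.escape(label)}</a>')
    return "\n".join(output)

def list_to_html(lines):
    """Render adjacent * lines as a ul element without closing li tags."""
    items = ["<li>" + html.escape(line[2:]) for line in lines]
    # The ul tags go on their own lines; li tags start a line.
    return "\n".join(["<ul>"] + items + ["</ul>"])

print(link_list_to_html(["=> gemini://a.example First", "=> gemini://b.example"]))
print(list_to_html(["* one", "* two & three"]))
```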

The script writes newlines in the HTML output for every input line it reads (even empty ones), so block elements are separated by (insignificant, but pretty) empty lines in the HTML output if the input is pretty.

The output covers only the HTML code that you would put inside a body element. The body tags themselves and everything else that is needed for a valid HTML document are not included in the output.

Tips

You can pipe the output to fmt -s to get more consistent line lengths. Make sure that no line in a code block exceeds the maximum and goal line lengths of fmt. You can get the maximum code block line length of the output like so:

cat out.html |
awk '
  /<\/pre>/ { pre = 0 }
  pre == 1  { print   }
  /<pre>/   { pre = 1 }' |
wc -L
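
Since wc -L is not available everywhere, the same measurement can be sketched in Python. The function name is made up, the pre tags are assumed to be on their own lines (as described under Output), and the length is counted in characters, which can differ from the display width that wc -L reports (for tabs or wide characters, for example).

```python
def max_pre_line_length(html_lines):
    """Return the length of the longest line between <pre> and </pre> lines."""
    longest = 0
    in_pre = False
    for line in html_lines:
        line = line.rstrip("\n")
        if "</pre>" in line:
            in_pre = False
        if in_pre:
            longest = max(longest, len(line))
        if "<pre>" in line:
            in_pre = True
    return longest

sample = ["<p>a fairly long paragraph line", "<pre>", "abc", "abcdefg", "</pre>"]
print(max_pre_line_length(sample))
```

To measure a file, open it and pass the file object, which yields one line at a time.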

Correction (2023-02-01)

I read today that my statement above about adjacent text lines semantically representing a single paragraph actually goes against the Gemini specification. Use the software at your discretion.

=> Gemini specification

EOF
