Gemtext to HTML converter
A text/gemini document to text/html (Gemtext to HTML) converter. The idea is to create an HTML document whose code is easy to read, and that follows the specifications[1][2] of a Gemtext document as closely as possible, but in HTML. It should also have a simple basic CSS stylesheet if one isn't supplied.
=> [1] Gemini specification (gemtext) | [2] Gemini specification (HTML)
The idea is to follow the simple line-by-line basis of the gemtext spec, and reflect that within the HTML code as well. The upshot of this is that, for instance, the "Unordered list items":
```
/^\* / && preformat_toggle == false {
    sub(/^\* /, "")
    print body_padding "
```
Are actually represented within HTML not as items grouped together under one HTML unordered list element; instead, each individual list item is contained within its own unordered list element. The only state linked to the parsing of the gemtext document is within the preformatted text sections, which toggle on and off. As can be seen in the "Unordered list items" section, it only interprets the list items when not within a preformatted text section.
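The per-line behaviour can be tried out with a small sketch like the one below. This is an illustration under assumptions, not the converter's own code: the "<ul><li>" markup, the "gemtext_list_items" wrapper name and the minimal "escape_html" stand-in are all hypothetical, and the preformat check is left out.

```sh
# Sketch only: the <ul><li> markup and this escape_html() are assumptions.
gemtext_list_items() {
    awk '
    # Hypothetical stand-in for the escape_html() helper.
    function escape_html(s) {
        gsub(/&/, "\\&amp;", s)
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        return s
    }
    # Each list line becomes a complete one-item list of its own.
    /^\* / {
        sub(/^\* /, "")
        print "<ul><li>" escape_html($0) "</li></ul>"
    }
    '
}

printf '* one\n* two & three\n' | gemtext_list_items
```

Because every line is closed off on its own, no "am I inside a list?" state has to be carried from one line to the next.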
The preformatting toggle lines (code blocks, preformatted text sections) are the only lines that require state. More specifically, they toggle preformatted text output on and off. The toggle is also checked when a line matches the opening and closing delimiter for preformatted text (three backticks "```"), so that it knows whether the line is opening a preformatted text section or closing one.
```
/^```/ && preformat_toggle == false {
    preformat_toggle = true
    preformat_start = true
    sub(/^```[ \t]*/, "")
    sub(/[ \t]+$/,"")
    if ($0 != "") {
        preformat_title = $0
        if (preformat_title == "TOC" && TOC == true) {
            print_toc(TABLE_OF_CONTENTS)
        }
    }
    next
}
/^```/ && preformat_toggle == true {
    preformat_toggle = false
    if (preformat_start == true) {
        preformat_start = false
    } else {
        print ""
        if (preformat_title != "") {
            print body_padding "
```
A couple of things to note: if there are no lines between the start and end of a preformatted text section, then nothing is printed. The toggle lines themselves, starting with three backticks ("```"), are never output as lines, per the spec.
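The toggle logic can be sketched in isolation as follows. This is a simplified illustration rather than the converter itself: the "preformat_demo" wrapper, the bare "<pre>" and "</pre>" output and the "text: " prefix are assumptions, the title and TOC handling are left out, and "true" and "false" are assumed to be set to 1 and 0 in a setup block.

```sh
# Sketch only: the <pre> markup and the "text: " prefix are assumptions.
preformat_demo() {
    awk '
    BEGIN { true = 1; false = 0; preformat_toggle = false; preformat_start = false }
    /^```/ && preformat_toggle == false {
        preformat_toggle = true
        preformat_start = true       # nothing printed yet for this block
        next
    }
    /^```/ && preformat_toggle == true {
        preformat_toggle = false
        if (preformat_start == true) {
            preformat_start = false  # empty block: print nothing at all
        } else {
            print "</pre>"
        }
        next
    }
    preformat_toggle == true {
        if (preformat_start == true) {
            preformat_start = false
            print "<pre>"            # opened on the first content line
        }
        print
        next
    }
    { print "text: " $0 }
    '
}

printf 'a\n```\n```\n```sh\nls\n```\nb\n' | preformat_demo
```

The empty block in the middle of the input produces no output, because the opening tag is only printed once a first content line arrives.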
The other notable part is that the opening toggle line can have a text section after it. The code trims any spaces from the beginning and end of the text, and uses that as the HTML title for the preformatted section. There is a special case where the preformatting text line equals "TOC" and the "TOC" parameter passed in on the command line is "true". In this case a table of contents, created in the setup section, is printed at this point. To aid accessibility in the HTML output, the preformatted text section is wrapped in a "figure" tag, with a figure caption if the text section is included[1]. The ARIA role of "figure" is used[2].
=> [1] Pre tag accessibility | [2] ARIA figure role
The "print_toc" helper function loops over the passed in table of contents array in order. It first works out what level the current heading is, based on the number of "#"s at the beginning of the heading, either 3, 2 or 1. The heading is then trimmed of all its "#"s, and of preceding and trailing spaces. The ID is then created, and the indentation prefix is created. The HTML fragment link is then printed out.
A blank line is printed at the end of the TOC. This lets a preformatted text line labelled TOC (the label that gets a TOC printed) be placed right up against the following text. If the text is converted with no TOC flagged, there won't be an extra blank line in the output; if it is flagged to have a TOC added, the TOC supplies that blank line itself.
```
function print_toc(toc,    _heading, _id, _indent, _size, _i) {
    _size = toc["size"]
    for (_i = 1; _i <= _size; _i++) {
        _heading = toc[_i]
        _indent = ""
        if (_heading ~ /^###/) {
            _indent = ". . . . . . "
        } else if (_heading ~ /^##[ \t]*/) {
            _indent = ". . . "
        } else if (_heading ~ /^#[ \t]*/) {
            _indent = ""
        }
        _heading = create_heading(_heading)
        _id = create_id(_heading)
        print body_padding "" escape_html(_indent _heading) ""
    }
    print body_padding ""
}
```
The heading is created by removing up to the first three leading "#"s, and then any leading and trailing spaces.
```
function create_heading(string) {
    sub(/^(###|##|#)[ \t]*/,"",string)
    sub(/^[ \t]+/,"",string)
    sub(/[ \t]+$/,"",string)
    return string
}
```
The "create_id" helper function creates an ID from the heading text. It first trims all spaces from the start and end of the heading text. After that all spaces are replaced with dashes ("-"), all characters that are neither dashes nor alphanumeric are removed, and the result is lowercased. Something like "This is a title yep really" would become "this-is-a-title-yep-really".
```
function create_id(string) {
    sub(/^[ \t]+/,"",string)
    sub(/[ \t]+$/,"",string)
    gsub(/[ \t]/,"-",string)
    gsub(/[^0-9a-zA-Z\-]/,"",string)
    return tolower(string)
}
```
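To see the two helpers working together, a small wrapper along these lines can be used. The "heading_to_id" shell function is a hypothetical name for this sketch, and the bracket expression is written with the dash last instead of escaped, which is intended to match the same characters.

```sh
# Sketch: run a heading line through create_heading() and then create_id().
heading_to_id() {
    printf '%s\n' "$1" | awk '
    function create_heading(string) {
        sub(/^(###|##|#)[ \t]*/, "", string)
        sub(/^[ \t]+/, "", string)
        sub(/[ \t]+$/, "", string)
        return string
    }
    function create_id(string) {
        sub(/^[ \t]+/, "", string)
        sub(/[ \t]+$/, "", string)
        gsub(/[ \t]/, "-", string)          # spaces become dashes
        gsub(/[^0-9a-zA-Z-]/, "", string)   # drop everything else
        return tolower(string)
    }
    { print create_id(create_heading($0)) }
    '
}

heading_to_id "## This is a title yep really"
heading_to_id "# Hello, World!"
```

The first call prints "this-is-a-title-yep-really" and the second prints "hello-world", since the comma and exclamation mark are removed after the space has already been turned into a dash.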
When the closing toggle line is detected, either nothing is printed (because there were no lines between the opening and closing toggle lines), or the closing HTML "</pre>" tag is printed.
The other line types, as laid out in the gemtext spec, are simpler, like the "Unordered list items", as they have no state, and only apply on a per line basis.
The quote lines just take a line and enclose it in the HTML "<blockquote>" tag. This allows a long quote to be wrapped, and behaves like a quoted paragraph.
```
/^>/ && preformat_toggle == false {
    sub(/^>/, "")
    print body_padding "<blockquote>" escape_html($0) "</blockquote>"
    next
}
```
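Heading lines are matched in reverse order, deepest level first ("###", then "##", then "#"), to simplify the code and regexes: once a deeper heading has matched and moved on with "next", the shorter patterns never see it. Most of the work is done in the "print_heading" function. Like the other simple line types, these don't get activated if they are in a preformatted section.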
```
/^###/ && preformat_toggle == false {
    print_heading("h3", $0)
    next
}
/^##[ \t]*/ && preformat_toggle == false {
    print_heading("h2", $0)
    next
}
/^#[ \t]*/ && preformat_toggle == false {
    print_heading("h1", $0)
    next
}
```
The "print_heading" helper function has two main sections. The first section creates the heading using the helper function, and then an ID from the heading text using another helper function which is incorporated into an HTML "id" attribute.
The creation of the ID is controlled by the "TOC" parameter. If it is passed with the value of "true" on the command line, it triggers the creation of an ID and of a link below each heading to jump to the top of the document. The idea is that the headings can then be linked to via HTML fragment links, perhaps from an included table of contents (TOC) of links at the top of the page.
-v TOC=true
The second section then prints out an HTML heading defined by what heading type was passed in via the "type" variable e.g. "h1", "h2" or "h3".
```
function print_heading(type, heading,    _id) {
    heading = create_heading(heading)
    # Only build an id attribute when a table of contents was requested.
    if (TOC == true) {
        _id = " id=\"" create_id(heading) "\""
    } else {
        _id = ""
    }
    if (TOC == true) {
        print body_padding "<" type _id ">" escape_html(heading) "</" type ">"
    } else {
        print body_padding "<" type _id ">" escape_html(heading) "</" type ">"
    }
}
```
The table of contents is created in the setup stage, and then printed when a preformatted section titled "TOC" is found. It essentially gets passed in the file that is being converted, so that it can loop through and find any headings that aren't in a preformatted section, and store them in the passed in table of contents array.
```
function create_toc(file, toc,    _preformat_toggle, _toc_count) {
    _preformat_toggle = false
    _toc_count = 0
    # Read the file being converted line by line, ahead of the main pass.
    while ((getline < file) > 0) {
        if ($0 ~ /^```/) {
            if (_preformat_toggle == false) {
                _preformat_toggle = true
            } else {
                _preformat_toggle = false
            }
        } else if ($0 ~ /^(###|##|#)[ \t]*/ && _preformat_toggle == false) {
            toc[++_toc_count] = $0
        }
    }
    toc["size"] = _toc_count
    close(file)
}
```
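As a rough way to check which headings end up in the table of contents, the function can be driven from a BEGIN block as in the sketch below. The "list_headings" wrapper, the way the file name is passed in with "-v file=..." and the "true"/"false" setup are assumptions made for this example, not the converter's actual setup section.

```sh
# Sketch: collect the headings of a gemtext file, skipping preformatted text.
list_headings() {
    awk -v file="$1" '
    BEGIN {
        true = 1; false = 0
        create_toc(file, toc)
        for (i = 1; i <= toc["size"]; i++) print toc[i]
    }
    function create_toc(file, toc,    _preformat_toggle, _toc_count) {
        _preformat_toggle = false
        _toc_count = 0
        while ((getline < file) > 0) {
            if ($0 ~ /^```/) {
                # A toggle line flips the preformatted state.
                _preformat_toggle = (_preformat_toggle == false) ? true : false
            } else if ($0 ~ /^(###|##|#)[ \t]*/ && _preformat_toggle == false) {
                toc[++_toc_count] = $0
            }
        }
        toc["size"] = _toc_count
        close(file)
    }
    '
}

tmp="$(mktemp)"
printf '# Title\ntext\n```\n# not a heading\n```\n## Section\n' > "$tmp"
list_headings "$tmp"
rm -f "$tmp"
```

The line "# not a heading" is skipped because it sits between two toggle lines, so only "# Title" and "## Section" are listed.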
Link lines just create an HTML link from the supplied URL and comment. It trims away the link chars "=>" used to denote a link line, and any spaces before the URL. The link and comment are then split matching on any spaces after the URL. If the "INLINE" flag has been set to true from the command line:
-v INLINE=true
Then the URL is checked to see if it is an image. If so, the link is turned into an image tag instead of a link, and the comment becomes the image's alt text. Also, if the "url" and the "link_name" are the same, a custom HTML data attribute called "data-noprint" is added, so that the CSS print media type knows not to add the URL to those links, as they already describe themselves in the text.
```
/^=>[ \t]*/ && preformat_toggle == false {
    sub(/^=>[ \t]*/, "")
    url = ""
    link_name = ""
    if (match($0,/[ \t]+/)) {
        # awk strings are indexed from 1
        url = substr($0,1,RSTART-1)
        link_name = substr($0,RSTART+RLENGTH)
        data_attribute = ""
    } else {
        url = $0
        link_name = $0
        data_attribute = "data-noprint "
    }
    if (INLINE == true && is_image(url) == true) {
        print body_padding ""
    } else {
        print body_padding ""
    }
    next
}
```
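The splitting step on its own can be sketched like this. The "split_link_line" wrapper and the "url=" and "name=" output format are made up for the illustration, and no HTML is produced; the sketch uses a start index of 1 for "substr", which is portable because awk strings are indexed from 1.

```sh
# Sketch: split a gemtext link line into its URL and its comment.
split_link_line() {
    printf '%s\n' "$1" | awk '
    /^=>[ \t]*/ {
        sub(/^=>[ \t]*/, "")
        if (match($0, /[ \t]+/)) {
            # Everything before the first run of spaces is the URL.
            url = substr($0, 1, RSTART - 1)
            link_name = substr($0, RSTART + RLENGTH)
        } else {
            # No comment given: the URL doubles as the link text.
            url = $0
            link_name = $0
        }
        print "url=" url
        print "name=" link_name
    }
    '
}

split_link_line "=> gemini://example.org/docs  Project docs"
```

Only the first run of spaces is used as the separator, so any later spaces stay inside the comment.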
The "is_image" helper function first trims the spaces from the start and end of the URL, and then takes the last four characters of the URL as the suffix, forcing it to lowercase at the same time. The suffix is then checked to see if it matches those used for JPEGs or PNGs (".jpg" or ".png"). If it matches it returns true, otherwise false.
```
function is_image(url,    _suffix) {
    sub(/^[ \t]+/,"",url)
    sub(/[ \t]+$/,"",url)
    _suffix = tolower(substr(url,length(url)-3))
    if (_suffix == ".png" || _suffix == ".jpg") {
        return true
    } else {
        return false
    }
}
```
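A quick way to try the suffix check is a wrapper such as the following. The "check_image" shell function and its "image" and "link" output words are hypothetical, and "true" and "false" are assumed to be defined as 1 and 0 in a setup block.

```sh
# Sketch: report whether a URL would be treated as an inline image.
check_image() {
    printf '%s\n' "$1" | awk '
    BEGIN { true = 1; false = 0 }
    function is_image(url,    _suffix) {
        sub(/^[ \t]+/, "", url)
        sub(/[ \t]+$/, "", url)
        # Last four characters, lowercased, e.g. ".png"
        _suffix = tolower(substr(url, length(url) - 3))
        if (_suffix == ".png" || _suffix == ".jpg") {
            return true
        } else {
            return false
        }
    }
    { print ((is_image($0) == true) ? "image" : "link") }
    '
}

check_image "gemini://example.org/photo.PNG"
check_image "gemini://example.org/photo.jpeg"
```

The first call prints "image" and the second prints "link": only the last four characters are compared, so a ".jpeg" suffix is not treated as an image.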
The last line type is the text line type, and it is the general one: if a line doesn't match any of the other line types, it defaults to the text line type. The text line type has to pay attention to whether it is within a preformatted section or not, as it has to handle those cases slightly differently, hence the two sections of the if statement:
```
{
    if (preformat_toggle == true) {
        if (preformat_start == true) {
            preformat_start = false
            if (preformat_title == "") {
                print body_padding ""
            } else {
                print body_padding "
```