If you are reading this via Gopher and it looks a bit different, that's because I spent the past few hours (months?) working on a new method to render HTML (HyperText Markup Language) into plain text. When I first set this up [1] I used Lynx [2] because it was easy and I didn't feel like writing the code to do so at the time. But I've never been fully satisfied with the results [Yeah, I was never a fan of that either –Editor]. So I finally took the time to tackle the issue (which is one of the reasons I was timing LPEG (Lua Parsing Expression Grammar) expressions [3] [DELETED-the other day-DELETED] [Nope. –Editor][DELETED- … um … the other week-DELETED] [Still nope. –Editor][DELETED- … um … a few years ago?-DELETED] [Last month. –Editor] [Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-19 –Sean] last month).
The first attempt sank in the swamp. I wrote some code to parse the next bit of HTML (it would return either a string, or a Lua table containing the tag information). And that was fine for recent posts, where I bother to close all the tags (taking into account that, of the tags that can appear in the body of the document, P, DT, DD, LI, THEAD, TBODY, TFOOT, TR, TH and TD do not require a closing tag), but earlier posts, say from 1999 through 2002, don't follow that convention. So I was faced with two choices—fix the code to recognize when an optional closing tag was missing, or fix over a thousand posts.
It says something about the code that I started fixing the posts first …
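The "next bit of HTML" approach described above can be sketched as a small tokenizer: each call returns either a run of plain text or a table describing a tag. This is a hypothetical pure-Lua illustration of the idea, not the author's actual code (which used LPEG):

```lua
-- Return the next token starting at pos: either a plain-text string,
-- or a table { tag, closing, rawattrs } for a tag.  Also returns the
-- position just past the token.  Returns nil at end of input (or on a
-- stray "<" that a real parser would have to cope with).
local function next_token(html, pos)
  pos = pos or 1
  if pos > #html then return nil end
  if html:sub(pos, pos) == "<" then
    local close, name, attrs, after = html:match("^<(/?)(%a+)([^>]*)>()", pos)
    if name then
      return { tag = name:lower(), closing = (close == "/"), rawattrs = attrs }, after
    end
  end
  -- a run of plain text, up to the next "<" or end of input
  local text, after = html:match("^([^<]+)()", pos)
  return text, after
end

local tok, pos = next_token("<p>Hello <em>world</em>", 1)
-- tok is { tag = "p", closing = false, ... }; the next call returns "Hello "
```

Looping over `next_token` yields a flat stream of strings and tag tables; the hard part, as noted above, is deciding where a missing optional end tag belongs in that stream.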
I then decided to change my approach and rewrite the HTML parser from scratch. Starting from the DTD (Document Type Definition) for HTML 4.01 strict [4], I used the re module [5] to write the parser, but I'm guessing I hit some form of internal limit, because that one burned down, fell over, and then sank into the swamp.
I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.
It ended up being a bit under 500 lines of LPEG code [6], but it does a wonderful job of being correct (for the most part—there are three posts I've made that aren't HTML 4.01 strict, so I made some allowances for those). It not only handles optional ending tags, but also the one optional opening tag I have to deal with—TBODY (yup—both the opening and closing tags are optional). It enforces that PRE tags cannot contain IMG tags, while preserving whitespace (it's not preserved in other tags). And it checks for the proper attributes on each tag.
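The trick with a missing optional end tag is that the next token tells you where the current element must have ended: a new block-level opening tag implicitly closes an open P. Here is a deliberately tiny pure-Lua sketch of that rule (handling only P and text—nothing like the author's 500-line grammar):

```lua
-- Parse a fragment where </p> is optional: a new <p> implicitly closes
-- any <p> that is still open, matching the HTML 4.01 optional-end-tag rule.
local function parse(html)
  local result, pos, open = {}, 1, nil
  while pos <= #html do
    local close, name, after = html:match("^<(/?)(%a+)[^>]*>()", pos)
    if name then
      if name:lower() == "p" then
        if close == "/" then
          open = nil                                 -- explicit </p>
        else
          open = { tag = "p", attributes = {}, block = true }
          result[#result + 1] = open                 -- implicitly ends the previous <p>
        end
      end
      pos = after
    else
      local text, after2 = html:match("^([^<]+)()", pos)
      if not text then break end                     -- stray "<"; a real parser would cope
      if open then open[#open + 1] = text end
      pos = after2
    end
  end
  return result
end

local doc = parse("<p>one<p>two")   -- no </p> anywhere, yet two paragraphs result
```

In the real grammar the same idea applies per tag class—LI closes LI, DD closes DT or DD, TR closes TR, and so on, as spelled out in the DTD.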
Great! I can now parse something like this:
<p>This is my <a href="http://boston.conman.org/">blog</a>. Is it not <em>nifty?</em>
<p>Yeah, I thought so.
into this:
tag =
{
  [1] =
  {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes = { href = "http://boston.conman.org/" },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is it not ",
    [4] =
    {
      tag = "em",
      attributes = { },
      inline = true,
      [1] = "nifty?",
    },
  },
  [2] =
  {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}
I then began the process of writing the code to render the resulting data into plain text. I took the classifications that the HTML 4.01 strict DTD uses for each tag (you can see the P tag above is of type block and the A and EM tags are of type inline) and used those to write functions to handle the appropriate type of content—P can only have inline content, BLOCKQUOTE only allows block content, and LI can have both; the rendering for inline and block types differs a bit, and handling both types is a bit more complex yet.
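The block/inline split can be sketched as two mutually recursive renderers: inline content concatenates text runs in place, while block content separates its children with blank lines. The table shape follows the parse tree shown above; the dispatch scheme here is an assumption for illustration, not the author's actual renderer:

```lua
-- Render inline content: text runs and nested inline tags concatenate.
local function render_inline(node)
  local out = {}
  for _, child in ipairs(node) do
    if type(child) == "string" then
      out[#out + 1] = child
    else
      out[#out + 1] = render_inline(child)   -- inline tags nest inline content
    end
  end
  return table.concat(out)
end

-- Render block content: each block renders its inline children,
-- and blocks are separated by blank lines.
local function render_block(nodes)
  local blocks = {}
  for _, node in ipairs(nodes) do
    blocks[#blocks + 1] = render_inline(node)
  end
  return table.concat(blocks, "\n\n")
end

local tree = {
  { tag = "p", block = true, "This is my ",
    { tag = "a", inline = true, "blog" }, ". Is it not ",
    { tag = "em", inline = true, "nifty?" } },
  { tag = "p", block = true, "Yeah, I thought so." },
}
print(render_block(tree))
-- This is my blog. Is it not nifty?
--
-- Yeah, I thought so.
```

A tag like LI, which allows both content types, needs a third function that inspects each child's `block`/`inline` flag before dispatching—which is where the extra complexity mentioned above comes from.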
The hard part here is ensuring that the leading characters of BLOCKQUOTE (wherein each line of the rendered text starts with a “| ”) and of the various types of lists (definition, unordered and ordered) are handled correctly—I think there are still a few spots where it isn't quite correct.
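Applying the “| ” prefix itself is simple if done after the quoted content is rendered—prefix the first line, then every line after each newline. A minimal sketch:

```lua
-- Prefix every line of already-rendered text with "| ", as for BLOCKQUOTE.
local function quote(text)
  return "| " .. text:gsub("\n", "\n| ")
end

-- quote("line one\nline two")  -->  "| line one\n| line two"
```

Nesting composes naturally—quoting twice yields “| | ” prefixes. The fiddly part the paragraph above alludes to is interleaving such prefixes with list bullets and numbers while keeping wrapped lines aligned.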
Overall, I'm happy with the text rendering I did, but I was left with one big surprise [7] …
=> /boston/2018/01/09.1 [1]
=> http://lynx.browser.org/ [2]
=> /boston/2020/06/05.1 [3]
=> https://www.w3.org/TR/html4/strict.dtd [4]
=> http://www.inf.puc-rio.br/~roberto/lpeg/re.html [5]
=> /boston/2020/07/04/html.lua [6]
=> /boston/2020/07/04.2 [7]