██           ▄▄                                   ██
                        ▀▀           ██                                   ▀▀
 ▄▄█████▄  ██▄███▄    ████      ▄███▄██   ▄████▄    ██▄████  ██▄███▄    ████      ▄███▄██
 ██▄▄▄▄ ▀  ██▀  ▀██     ██     ██▀  ▀██  ██▄▄▄▄██   ██▀      ██▀  ▀██     ██     ██▀  ▀██
  ▀▀▀▀██▄  ██    ██     ██     ██    ██  ██▀▀▀▀▀▀   ██       ██    ██     ██     ██    ██
 █▄▄▄▄▄██  ███▄▄██▀  ▄▄▄██▄▄▄  ▀██▄▄███  ▀██▄▄▄▄█   ██       ███▄▄██▀  ▄▄▄██▄▄▄  ▀██▄▄███
  ▀▀▀▀▀▀   ██ ▀▀▀    ▀▀▀▀▀▀▀▀    ▀▀▀ ▀▀    ▀▀▀▀▀    ▀▀       ██ ▀▀▀    ▀▀▀▀▀▀▀▀   ▄▀▀▀ ██
           ██                                                ██                   ▀████▀▀

Spiderpig Indexing Service

2022-09-08 | #tilde.wtf #spiderpig #golang #postgresql #search

I'm currently heads down building out the new infrastructure and code for repurposing my tilde.wtf domain into a search engine for the tildeverse. The first component will be the indexing service that's responsible for crawling the tildeverse and storing what it finds in a database that's easily searchable. I can think of no more cromulent name than spiderpig. It just embiggens the soul.

I'm starting from the foundation of Go as the programming language and PostgreSQL as the database backing it. I think Postgres is a bit of overkill here given the size of the tildeverse -- it is referred to as the smolnet for a reason -- but I've grown rather fond of Postgres over the years and enjoy the plethora of features it offers developers. Plus it has legendary reliability that has proven itself over and over again.

Database

I long ago swore off ORMs and write SQL by hand so I can tap into those features directly. To begin, here is the current schema of the database:

-- Table Definition ----------------------------------------------

CREATE TABLE tildes (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    url text NOT NULL UNIQUE,
    title text NOT NULL,
    body text NOT NULL,
    crawled_on timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    searchtext tsvector
);

-- Indices -------------------------------------------------------

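-- The PRIMARY KEY and UNIQUE constraints above already create these two
-- indexes implicitly; they are listed here for reference only.
-- CREATE UNIQUE INDEX tildes_pkey ON tildes(id uuid_ops);
-- CREATE UNIQUE INDEX tildes_url_key ON tildes(url text_ops);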
CREATE INDEX searchtext_gin ON tildes USING GIN (searchtext tsvector_ops);

-- Triggers -------------------------------------------------------

CREATE TRIGGER ts_searchtext
  BEFORE INSERT OR UPDATE ON public.tildes
  FOR EACH ROW
  EXECUTE FUNCTION pg_catalog.tsvector_update_trigger(searchtext, 'pg_catalog.english', title, body);

Pretty standard stuff being captured here with the url, title, and body coming from the source being crawled. The id for the primary key is an auto-generated UUID and crawled_on defaults to the current timestamp. I'm making use of a few advanced features of Postgres here. The first is the searchtext column, which is of type tsvector. From the docs: "The tsvector type represents a document in a form optimized for text search".

For the indexes, I'm using a GIN (generalized inverted index) on the searchtext column. An inverted index is the same kind of structure Lucene uses for full text searching. Below that I also set up a trigger so that any time a row is inserted or updated, the searchtext column is rebuilt automatically from title and body, which lets us make use of the fancy full text search Postgres offers. Unfortunately for now, I am only going to support the English language -- but depending on how much of the crawled tildeverse ends up being non-English, I will reevaluate this decision later.

Crawler

To sanely crawl the HTTP tildeverse, you need to essentially build a scraper that can parse HTML effectively, follow links, and bound itself so it doesn't enter an infinite loop. Luckily, the Go community has a fantastic open source library for exactly this purpose: Colly.

=> https://github.com/gocolly/colly

It is such a simple and clean library that it makes building your own customized scrapers a joy. It also allows you to build an allowlist of domains to prevent ingesting content you don't want. I make heavy use of that with the following restrictions currently:

  "AllowedDomains": [
    "cosmic.voyage",
    "ctrl-c.club",
    "envs.net",
    "hackers.cool",
    "pebble.ink",
    "protocol.club",
    "rawtext.club",
    "remotes.club",
    "rw.rs",
    "heathens.club",
    "tilde.cafe",
    "tilde.guru",
    "trash.town",
    "tilde.green",
    "hextilde.xyz",
    "piepi.art",
    "skylab.org",
    "squiggle.city",
    "tilde.team",
    "tilde.fun",
    "tilde.institute",
    "tilde.pink",
    "tilde.town",
    "tilde.wiki",
    "aussies.space",
    "tildeverse.org",
    "tilde.wtf"
  ],

One side effect, though, is that a tilde will sometimes run a wiki that generates a lot of useless links to repetitive data (history, versions, etc.) for its pages. So I proactively try to get ahead of that and have also built a series of globs to look for in the URL so those pages are not indexed:

  "URLDenyList": [
    "wiki/index.php?title=Special",
    "&diff=",
    "&action=",
    "&oldid=",
    "Special:"
  ],
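As a rough illustration of how the two lists could be applied to a candidate URL, here is a small standard-library sketch. It is not spiderpig's actual code and does not use Colly. The Config type and shouldIndex method are hypothetical names, the host comparison is an exact match (so subdomains are rejected), and the deny list entries are treated as plain substrings, which is an assumption about how the globs are evaluated.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// Config mirrors the two JSON keys shown above; other keys are ignored.
type Config struct {
	AllowedDomains []string `json:"AllowedDomains"`
	URLDenyList    []string `json:"URLDenyList"`
}

// shouldIndex reports whether rawURL is on an allowed host and contains
// none of the deny-listed fragments. Substring matching is an assumption.
func (c Config) shouldIndex(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}

	host := strings.ToLower(u.Hostname())
	allowed := false
	for _, d := range c.AllowedDomains {
		if host == d {
			allowed = true
			break
		}
	}
	if !allowed {
		return false
	}

	for _, frag := range c.URLDenyList {
		if strings.Contains(rawURL, frag) {
			return false
		}
	}
	return true
}

func main() {
	// A shortened copy of the configuration, inlined so the sketch is
	// self-contained.
	raw := `{
	  "AllowedDomains": ["tilde.wiki", "tilde.town"],
	  "URLDenyList": ["&diff=", "&action=", "&oldid=", "Special:"]
	}`

	var cfg Config
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		panic(err)
	}

	candidates := []string{
		"https://tilde.wiki/wiki/Known_tildes",
		"https://tilde.wiki/index.php?title=Known_tildes&oldid=42",
		"https://tilde.wiki/wiki/Special:RecentChanges",
		"https://example.com/",
	}
	for _, c := range candidates {
		fmt.Println(cfg.shouldIndex(c), c)
	}
}
```

Only the first candidate passes: the next two hit the deny list and the last is not on an allowed host.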

System

The infrastructure behind tilde.wtf will be Linux (specifically CentOS due to the heavy legwork they do for SELinux policies out of the box) and as a result will be systemd driven. In the case of spiderpig, it tries to follow the spirit of unix tooling to only do one thing and to do it well. As a result, it's not the component that's meant to be running at all times. It is to be invoked, spider the tildeverse, and then exit and wait until next time.

In the traditional unix world, things like this are handled by cron. You set your command to run at a regularly scheduled interval via a cronjob, and it handles running the command at the right time. It has existed for decades and will continue to exist. However, for this project I went with systemd timers to 1.) get exposure to them and 2.) tighten security through systemd's sandboxing features.

Configuration of systemd is always file based, whether it's a service or a timer. So to create a timer, we need to create a .timer file located in /etc/systemd/system.

Here's spiderpig.timer:

[Unit]
Description=Spiderpig indexing schedule

[Timer]
OnCalendar=*-*-* 6:25:00
Unit=spiderpig.service

[Install]
WantedBy=basic.target

The OnCalendar portion operates similarly to cron; in this case it says we will invoke spiderpig.service every day at 06:25. Times are interpreted in the system's local time zone unless one is given, and this host runs on UTC.

Here's spiderpig.service:

[Unit]
Description=Spiderpig Indexing Service

[Service]
Type=simple
WorkingDirectory=/opt/spiderpig
ExecStart=/opt/spiderpig/spiderpig -config /opt/spiderpig/conf.json
KillMode=process

DynamicUser=true
NoNewPrivileges=yes
PrivateTmp=yes
PrivateDevices=yes
DevicePolicy=closed
ProtectSystem=strict
ProtectHome=yes
ProtectControlGroups=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 AF_NETLINK
RestrictRealtime=yes
RestrictNamespaces=yes
MemoryDenyWriteExecute=yes

[Install]
WantedBy=multi-user.target

Notice the litany of options for restricting the process as it runs. One of my favorites is the DynamicUser option, which has systemd allocate an ad-hoc, ephemeral user to run the binary as and release it when the service stops. Setting ProtectSystem to strict mounts the entire file system hierarchy read-only for the process, apart from the API file systems /dev, /proc, and /sys. Spiderpig is only responsible for crawling the tildeverse and saving those entries to the Postgres database via the localhost network. There's no need for it to write files anywhere, and all logs are emitted to stdout and thus captured by systemd's journal.

This is a much stronger security posture for running something that goes out visiting random pages on the internet.

I know systemd gets a lot of hate around the internet, and I do take issue sometimes with its kitchen sink approach to trying to do everything (why does it have dns and ntp functionality?!). But as a sysadmin/developer, it DOES in fact make my life easier and my services more secure and reliable with minimal effort on my part. It's hard to argue with results.

Current Progress

I do have spiderpig successfully crawling the tildeverse now as I continue to tweak it. Each crawl is rooted at this wiki page:

=> https://tilde.wiki/wiki/Known_tildes

Without any major optimization, each crawl session currently takes about 8 hours to complete:

○ spiderpig.service - Spiderpig Indexing Service
     Loaded: loaded (/etc/systemd/system/spiderpig.service; disabled; vendor preset: disabled)
     Active: inactive (dead) since Thu 2022-09-08 14:21:02 UTC; 7h ago
   Duration: 7h 55min 58.165s
TriggeredBy: ● spiderpig.timer

And the number of pages that actually get indexed is also surprisingly low:

tilde=# select count(*) from tildes;
 count
-------
 56866
(1 row)

I'll continue to tweak this to ensure it's stable and doesn't hammer the tilde servers aggressively. But I'll be moving on to the API component next.

=> gemini://tilde.club/~cyrus/2022-09-08-spiderpig.gmi Original URL