This page permanently redirects to gemini://gemini.bortzmeyer.org/software/lupa/.
Lupa is a Gemini crawler. Starting from a few
given URLs, it retrieves them, analyzes the links in gemtext (Gemini
format) files and adds them to the database of URLs to crawl. It is
not a search engine: it does not store the content of the resources,
just the metadata, for research and statistics.
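The link-analysis step can be sketched as follows. This is an illustration, not Lupa's actual code: the function names `resolve` and `extract_links` are hypothetical, and the only thing it relies on is the gemtext rule that a link line starts with `=>` followed by a URL and an optional label. Python's `urljoin` does not resolve relative references for the gemini scheme, so the sketch temporarily borrows `http` for the join.

```python
from urllib.parse import urljoin, urlsplit

FENCE = "`" * 3  # gemtext marker that toggles preformatted mode


def resolve(base: str, ref: str) -> str:
    """Resolve a link target against a gemini:// base URL (hypothetical helper)."""
    if urlsplit(ref).scheme:
        return ref  # already absolute (gemini, https, mailto...)
    # urljoin() only resolves relative references for schemes it knows,
    # so borrow "http" for the join and restore "gemini" afterwards.
    joined = urljoin("http" + base[len("gemini"):], ref)
    return "gemini" + joined[len("http"):]


def extract_links(base: str, gemtext: str) -> list[str]:
    """Return the absolute URLs of all link lines in a gemtext document."""
    links = []
    preformatted = False
    for line in gemtext.splitlines():
        if line.startswith(FENCE):
            preformatted = not preformatted  # link lines inside are plain text
            continue
        if preformatted or not line.startswith("=>"):
            continue
        fields = line[2:].split(maxsplit=1)  # URL, then optional label
        if fields:  # a bare "=>" carries no URL
            links.append(resolve(base, fields[0]))
    return links
```

A crawler would then keep only the `gemini://` results and add them to its list of URLs to visit.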
The instance of the crawler that I manage currently operates from 2001:41d0:302:2200::180 (and 193.70.85.11 on the old networks).
=> See the current statistics | Logbook of the production crawler | Previous statistics
If you want the list of capsules known to Lupa:
=> As a text file | As a gemtext, with links
If you notice a missing capsule, write me (address at the end of this
page).
If you want the entire content of the database, you'll have to write
me (address at the end of this page) and explain why. I tend to be
liberal with such requests since, after all, it is public data and
anyone could gather it.
Lupa is written in:
=> Python
There is no real installation procedure: you have to get the sources, put them
where you want and set up PYTHONPATH and PATH. Prerequisites (all of
them on PyPI): psycopg2, pyopenssl, scfg, public_suffix_list and
agunua.
(On a Debian machine, the packaged prerequisites are python3-pip,
python3-psycopg2 and python3-openssl; agunua, public_suffix_list and
scfg have to be installed with pip or manually.)
Usage requires a PostgreSQL database to store the URLs and the result of
crawling. Once you've created the database, prepare it with the create.sql file:
createdb lupa
psql -f ./admin-scripts/create.sql lupa
export PYTHONPATH=$(pwd)
./admin-scripts/lupa-insert-url gemini://start.url.example/
./admin-scripts/lupa-insert-url gemini://second-start.url.example/
=> PostgreSQL
At the present time, you need a separate script to retrieve robots.txt
exclusion files. It is not done by the crawler. This script must be
run from time to time, for instance from cron, every two hours:
./admin-scripts/lupa-add-robotstxt
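To show what such exclusion rules mean for a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the code of lupa-add-robotstxt: the helper name `allowed`, the sample rules and the use of "researcher" as the user-agent name are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "researcher") -> bool:
    """Tell whether `agent` may fetch `url` under the given robots.txt content.

    Hypothetical helper; the default agent name is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Only the path of the URL is matched against the rules,
    # so gemini:// URLs work as well as http:// ones.
    return parser.can_fetch(agent, url)


# Sample rules, invented for the example.
RULES = """\
User-agent: researcher
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""
```

With these rules, `allowed(RULES, "gemini://capsule.example/private/diary.gmi")` is false, while the capsule's other pages remain open to the "researcher" agent.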
You run the crawler with ./lupa-crawler. The crawler does not run
forever: you need to start it from cron. Locking is done by the
database, so it is not an issue if two instances run at the same time.
You can get a list of options with --help but, at this time, you
need to read the source to understand them. Some interesting options:
--num: maximum number of URLs to test. It is very low by default, to allow testing, so you may want to set it to a more reasonable value such as 1000.
--among: number of URLs among which the --num URLs above are chosen at random. You typically set it to the size of the database, but it can be smaller.
--sleep: by default, the crawler goes as fast as possible but you can slow it down with this parameter. Between two URLs, the crawler will sleep for a time randomly chosen between 0 and this number of seconds.
--old: the crawler retrieves the URLs that have never been retrieved or were last retrieved more than this number of days ago. Default is 14 days.
--maximum: the crawler has a maximum running time, to be sure it is not stuck forever if there is a blocking operation. It is one hour by default.
--debug: makes the log more talkative.
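The way --num, --among, --old and --sleep interact can be pictured with a small in-memory model. This is only a sketch: the real crawler selects its URLs from PostgreSQL, and the function names, the dictionary layout and the way the pool of --among candidates is built are assumptions, not Lupa's code.

```python
import random
from datetime import datetime, timedelta


def pick_urls(last_retrieved, now, num, among, old=14):
    """Choose up to `num` URLs at random among `among` candidates that are due.

    `last_retrieved` maps each URL to the datetime of its last retrieval,
    or to None if it was never retrieved (hypothetical data layout).
    """
    cutoff = now - timedelta(days=old)
    # A URL is due if it was never retrieved or last retrieved before the cutoff.
    due = [url for url, when in last_retrieved.items()
           if when is None or when < cutoff]
    pool = due[:among]  # assumption: the pool is simply the first `among` due URLs
    return random.sample(pool, min(num, len(pool)))


def pause(sleep):
    """Delay between two URLs: a random duration between 0 and `sleep` seconds."""
    return random.uniform(0, sleep)
```

In this model, setting `among` to the size of the database makes every due URL eligible, while a smaller value restricts the random choice to a subset.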
Also, you can use a configuration file, using the scfg syntax. An
example is in the sources, sample-lupa.conf.
=> scfg
A log file is created in /var/tmp/Lupa.log. It is up to you to
ensure it is replaced from time to time.
Lupa means she-wolf in Latin. It refers to the wolf who took care of
the twins Romulus and Remus. (Many Gemini programs have names related
to twins, gemini in Latin.)
=> On Gemini
Stéphane Bortzmeyer stephane+gemini@bortzmeyer.org