Notes on writing a network crawler.
A network crawler typically works as follows: starting from a list of seed URLs, fetch each page, extract the URLs it links to, and do the same for every newly found URL.
That's typical depth first recursion, here's some pseudocode:
having empty list of seen_urls

function crawl(url)
  add url to seen_urls
  fetch and extract new_urls from page at url
  for new_url in (new_urls not in seen_urls)
    crawl(new_url)

for url in seed_urls
  crawl(url)
Note: we need to keep a list of URLs we've "seen" to avoid going in loops; A -> B -> A etc.
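Here's a minimal sketch of that recursion in Python. The fetch_and_extract helper (a crude urllib-plus-regex link extractor) and the example seed URL are just placeholders for illustration:

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

seen_urls = set()

def fetch_and_extract(url):
    # Crude stand-in for a real fetcher: download the page and pull out
    # href attributes, resolved relative to the current URL.
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        return []
    return [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(url):
    seen_urls.add(url)
    for new_url in fetch_and_extract(url):
        if new_url not in seen_urls:
            crawl(new_url)

seed_urls = ["https://example.com/"]
for url in seed_urls:
    crawl(url)
```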
This program will touch all of the reachable nodes eventually, but what if the network has some really deep chains which might even go on forever? For instance, imagine a resource which renders all possible moves on a chess board; every page shows a board and has links to all possible moves. It's a nice programming exercise, not too hard. The number of positions is huge (look up the Shannon number); when our crawler enters this, it will still be going through chess moves when our sun dies.
Our crawler is not interested in all chess moves and not smart enough to recognize it took a wrong turn. A basic rule we could apply is to go no deeper than some generously chosen number of hops; we'll define a maximum depth. Let's revisit our program and put that in:
having empty list of seen_urls

function crawl(url, depth)
  skip if depth is not below limit
  fetch and extract new_urls from page at url
  add url to seen_urls
  for new_url in (new_urls not in seen_urls)
    crawl(new_url, depth + 1)

for url in seed_urls
  crawl(url, 0)
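Translated to the Python sketch, reusing fetch_and_extract and seed_urls from before; MAX_DEPTH is a made-up value:

```python
MAX_DEPTH = 10   # generously chosen, but see the note on stack size below

seen_urls = set()

def crawl(url, depth):
    if depth >= MAX_DEPTH:
        return                                # skip: not below the limit
    new_urls = fetch_and_extract(url)         # helper from the earlier sketch
    seen_urls.add(url)
    for new_url in new_urls:
        if new_url not in seen_urls:
            crawl(new_url, depth + 1)

for url in seed_urls:
    crawl(url, 0)
```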
Note: the generosity allowed for the maximum depth depends on how deep our programming language allows us to recurse; since this is not a tail call, it depends on the stack size. We could rewrite it to avoid this problem but more on that later.
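In CPython, for example, recursion is capped at a configurable limit (1000 frames by default), so a "generous" maximum depth has to stay well below whatever that limit is:

```python
import sys

print(sys.getrecursionlimit())   # 1000 by default in CPython
sys.setrecursionlimit(10_000)    # can be raised, at the risk of overflowing the C stack
```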
Selecting a value for the maximum depth depends on: the pace of the crawler, the estimated number of pages which will be crawled, the average depth, and in what amount of time it's reasonable for our crawler to finish. That's a lot of parameters... We could try and figure out how to automatically adjust it after every run, but that doesn't seem like a trivial problem.
A downside to limiting depth in this manner is that all exploration of new territory (not part of our seed URLs) will be relatively shallow. How do we fix that?
We could reset the depth counter every time we wander into another domain. This seems like an obvious solution, but consider the chess problem again and imagine implementing it using two domains, one for white moves, the other for black: our crawler is back on its eternal quest. We could try to do some pattern detection, but staying out of traps like this will take serious AI.
We could try and automatically promote some of the new territory to our list of start URLs. This will allow the crawler to traverse the new territory deeper in subsequent runs.
But on what basis should we select URLs for promotion? Root URLs could be good candidates when encountered, but what about all those tilde sites? What's "root" for them, and should we be able to recognize those automatically? There's probably plenty of other kinds of islands on domains which are hard to recognize.
Another issue: will we actually find these "roots"? A lot of pages deeplink to some other page on some new domain. How big are the chances such a new page links to its "root"?
This maximum depth approach seems to have a lot of issues. I feel we should leave it and look for other solutions.
So what if we, instead of putting a limit on depth, put a limit on the number of pages hit per domain per run? That way all domains are treated equally.
Our recursive code isn't very suitable for this (and suffers from the stack issue) so let's flatten it and fix the issue:
having remaining_urls list with seed urls
having empty list of seen_urls

while any remaining_urls
  take url from start of remaining_urls
  fetch and extract new_urls from page at url
  add url to seen_urls
  remove all urls also in seen_urls from new_urls
  remove all urls also in remaining_urls from new_urls
  append new_urls to remaining_urls
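The same thing as a Python sketch, again leaning on the fetch_and_extract placeholder and seed_urls from the first sketch; collections.deque makes taking from the front cheap:

```python
from collections import deque

remaining_urls = deque(seed_urls)
seen_urls = set()

while remaining_urls:
    url = remaining_urls.popleft()
    new_urls = fetch_and_extract(url)
    seen_urls.add(url)
    for new_url in new_urls:
        # drop everything we've already seen or already queued
        if new_url not in seen_urls and new_url not in remaining_urls:
            remaining_urls.append(new_url)
```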
The recursion is replaced by a loop and instead of going depth first, we're going breadth first. An interesting property of crawling breadth first is that we're first hitting all our seed URLs before going deeper and, possibly, getting lost in some wormhole. This seems like a big improvement on our recursive version.
   depth first              breadth first

          1                        1
          |                        |
   ---------------          ---------------
   |      |      |          |      |      |
   2      5      9          2      3      4
   |      |                 |      |
 -----  -----             -----  -----
 |   |  |   |             |   |  |   |
 3   4  6   7             5   6  7   8
            |                        |
            8                        9
This is still, potentially, an endless loop. Next, we'll keep track of the domains we've seen and put a limit on the number of times we'll hit them.
having remaining_urls list with seed urls
having empty list of seen_urls
having domain hit counters

reset domain hit counters

while any remaining_urls having domain hit counter below limit
  take first url of remaining_urls having domain hit counter below limit
  fetch and extract new_urls from page at url
  add url to seen_urls
  increment hit counter for url domain
  remove all urls also in seen_urls from new_urls
  remove all urls also in remaining_urls from new_urls
  append new_urls to remaining_urls
Note: we'll keep collecting URLs on domains for which the domain counter already reached the limit but won't visit them.
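Sketched in Python, with a Counter keyed on the hostname playing the role of the domain hit counters, and still reusing fetch_and_extract and seed_urls from the earlier sketches; HIT_LIMIT is a made-up value:

```python
from collections import Counter, deque
from urllib.parse import urlsplit

HIT_LIMIT = 100          # made-up limit on hits per domain per run

remaining_urls = deque(seed_urls)
seen_urls = set()
hits = Counter()         # domain hit counters

def domain(url):
    return urlsplit(url).hostname

def below_limit(url):
    return hits[domain(url)] < HIT_LIMIT

while any(below_limit(u) for u in remaining_urls):
    # take the first URL whose domain hit counter is still below the limit
    url = next(u for u in remaining_urls if below_limit(u))
    remaining_urls.remove(url)
    new_urls = fetch_and_extract(url)
    seen_urls.add(url)
    hits[domain(url)] += 1
    for new_url in new_urls:
        if new_url not in seen_urls and new_url not in remaining_urls:
            remaining_urls.append(new_url)
```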
Now our program will hit every domain no more than a specified number of times. The URLs remaining are from the domains hosting a lot of pages. But what should we do with those? A couple of extra churns through the remaining URLs after resetting the domain hit counters, until there's only "a couple" of different domains left?
In our first breadth first version we kept taking from the remaining URLs list until it was depleted and called it a day. In our second version we only take URLs when they can still be traversed (domain hit counter still below limit) and we end up with a list of URLs which still need attention.
When we'd like to do a subsequent run to refresh our knowledge of the network, we'd need to run through our seed URLs again, but what about the remaining URLs from the previous run? If we start with the seed URLs we'll probably end up with the same remaining URLs, so those will never be visited. Appending the seed URLs to the remaining URLs works a lot better, but we can't be 100% sure all of our seed URLs will be visited and refreshed in this run.
Let's add aging to the mix. We'll record the necessary timestamps for all the pages we're hitting and use that to determine what's stale and what's not.
having seeded list of known_urls
having domain hit counters

eligible: having domain hit counter below limit and not stale

reset domain hit counters

while any eligible known_urls
  borrow first eligible url from known_urls
  fetch and extract new_urls from page at url
  record timestamp for url
  increment hit counter for url domain
  remove all urls also in known_urls from new_urls
  append new_urls to known_urls
Note: instead of taking (and thus deleting) from our list we now "borrow" (that's the best verb I can come up with). Also note: when the crawler does "record timestamp for url", the URL is no longer stale and thus no longer eligible for a new visit during this run.
On the first run we'll start with the seed URLs. In subsequent runs we'll feed it whatever we've recorded earlier. Every run starts by resetting the domain hit counters. New seeds will be added to the beginning of the list to ensure they'll be picked up as soon as possible.
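A Python sketch of this aging variant, once more reusing fetch_and_extract and seed_urls; known_urls maps each URL to the timestamp of its last visit (None for never visited), and HIT_LIMIT and STALE_AFTER are arbitrary example values:

```python
import time
from collections import Counter
from urllib.parse import urlsplit

HIT_LIMIT = 100                # made-up per-domain limit per run
STALE_AFTER = 7 * 24 * 3600    # made-up: consider a page stale after a week

known_urls = {url: None for url in seed_urls}   # url -> timestamp of last visit
hits = Counter()                                # domain hit counters, reset every run

def domain(url):
    return urlsplit(url).hostname

def stale(url):
    last = known_urls[url]
    return last is None or time.time() - last > STALE_AFTER

def eligible(url):
    return hits[domain(url)] < HIT_LIMIT and stale(url)

while True:
    # "borrow" the first eligible URL; it stays in known_urls
    url = next((u for u in known_urls if eligible(u)), None)
    if url is None:
        break
    new_urls = fetch_and_extract(url)
    known_urls[url] = time.time()        # record timestamp: no longer stale this run
    hits[domain(url)] += 1
    for new_url in new_urls:
        if new_url not in known_urls:
            known_urls[new_url] = None
```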
Domains seem like the most obvious thing to partition on to ensure the crawler finishes at some point, but we can make that a bit smarter. Maybe we'll want to give the tilde sites a bit more space by extending "domain" to include the ~username bit, for instance.
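For instance, the domain function from the earlier sketches could be swapped for a partition key like the one below; the tilde handling is just one possible heuristic:

```python
from urllib.parse import urlsplit

def partition_key(url):
    # Treat http://host/~alice/... as its own "domain" so every tilde site
    # gets its own hit counter; everything else is keyed on the hostname.
    parts = urlsplit(url)
    first_segment = parts.path.lstrip("/").split("/", 1)[0]
    if first_segment.startswith("~"):
        return f"{parts.hostname}/{first_segment}"
    return parts.hostname

partition_key("http://example.org/~alice/page.html")   # -> "example.org/~alice"
partition_key("http://example.org/about.html")         # -> "example.org"
```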
The definition of "stale" allows for some logic like: different rules for different response types, recognizing news pages which change often, increasing "wait time" when a resource doesn't seem to change, etc.
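The stale(url) check from the earlier sketch could then grow into a small policy function; the intervals and the back-off rule below are arbitrary examples:

```python
import time
from dataclasses import dataclass
from typing import Optional

DAY = 24 * 3600

@dataclass
class PageRecord:
    last_visit: Optional[float] = None   # timestamp of the last successful fetch
    unchanged_visits: int = 0            # consecutive fetches without changes
    content_type: str = "text/gemini"

def is_stale(record: PageRecord) -> bool:
    if record.last_visit is None:
        return True                      # never fetched: always eligible
    if record.content_type.startswith("image/"):
        wait = 30 * DAY                  # binary resources rarely change
    else:
        # increase the wait time for every visit that showed no changes
        wait = DAY * 2 ** min(record.unchanged_visits, 6)
    return time.time() - record.last_visit > wait
```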
The known URL list needs to be cleaned up periodically to delete all the pages which became unconnected and are yielding errors when fetched (DNS errors, connection refused, not found, etc). By keeping some stats on pages which disappeared but are still linked to, we can even provide some kind of deadlink detection service.
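A cleanup pass could look roughly like this; the error threshold and the shape of the bookkeeping structures are made up for the sketch:

```python
MAX_ERRORS = 3   # made-up: give up on a URL after this many consecutive failed fetches

def cleanup(known_urls, errors, linked_from):
    # known_urls: url -> timestamp, errors: url -> consecutive error count,
    # linked_from: url -> set of pages linking to it (all hypothetical structures)
    for url, count in list(errors.items()):
        if count >= MAX_ERRORS:
            known_urls.pop(url, None)
            if linked_from.get(url):
                # gone, but still linked to from elsewhere: report as a dead link
                print(f"dead link: {url} (linked from {len(linked_from[url])} pages)")
```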
Are we there yet? No. But I think we made some progress. Going breadth first is a big win. Knowing when to stop is still an issue, but the domain hit counters seem like a nice start to me.
--
📅 2020-11-09
📧 hello@rwv.io
CC BY-NC-SA 4.0