[Netarchivesuite-devel] Host queues instead of domain queues on blogspot.com

sara.aubry at bnf.fr sara.aubry at bnf.fr
Mon Sep 24 14:37:08 CEST 2018


Dear all,

Since last year, some of our broad crawl jobs have been disrupted by
blogspot.com. Heritrix creates one queue per host (aaa.blogspot.com,
bbb.blogspot.com, ...) instead of one queue per domain, even though we are
using dk.netarkivet.harvester.harvesting.DomainnameQueueAssignmentPolicy.

We first thought this problem was introduced by the public_suffix.dat file,
in which blogspot.com and its country versions were listed as TLDs. But the
problem persists even though we removed them from that list.
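
A minimal sketch of why the suffix list was the first suspect, assuming a
domain-based policy derives the queue key as "longest matching suffix plus one
label". This is not the actual DomainnameQueueAssignmentPolicy code; the class
QueueKeySketch, the method queueKey and the in-memory suffix sets are
hypothetical and only illustrate the effect of listing blogspot.com as a suffix.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not NetarchiveSuite or Heritrix code.
public class QueueKeySketch {

    /**
     * Returns the queue key for a host: the longest suffix found in
     * publicSuffixes plus one more label to its left.
     */
    public static String queueKey(String host, Set<String> publicSuffixes) {
        String normalized = host.toLowerCase(Locale.ROOT);
        String[] labels = normalized.split("\\.");
        // Try candidate suffixes from longest to shortest.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (publicSuffixes.contains(candidate)) {
                // If the whole host is itself a suffix, use it unchanged.
                return i == 0 ? candidate : labels[i - 1] + "." + candidate;
            }
        }
        // No suffix matched: fall back to the full host name.
        return normalized;
    }

    public static void main(String[] args) {
        Set<String> withBlogspot = Set.of("com", "blogspot.com");
        Set<String> withoutBlogspot = Set.of("com");

        // blogspot.com listed as a suffix: every blog gets its own queue.
        System.out.println(queueKey("aaa.blogspot.com", withBlogspot));    // aaa.blogspot.com
        // blogspot.com not listed: all blogs share one queue.
        System.out.println(queueKey("aaa.blogspot.com", withoutBlogspot)); // blogspot.com
    }
}
```

Under this assumption, removing blogspot.com from the list should collapse all
blogs into a single blogspot.com queue, which is why per-host queues appearing
after the removal are unexpected.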

For some jobs, the number of discovered blogspot queues grows so large that
it crashes the job.

Does anyone have the same problem?

Sara
