[Netarchivesuite-devel] Host queues instead of domain queues on blogspot.com
sara.aubry at bnf.fr
Mon Sep 24 14:37:08 CEST 2018
Dear all,
Since last year, some of our broad crawl jobs have been disrupted by blogspot.
Heritrix creates queues per host (aaa.blogspot.com, bbb.blogspot.com...)
instead of keeping them per domain (we are using
dk.netarkivet.harvester.harvesting.DomainnameQueueAssignmentPolicy).
We first thought this problem was introduced by the use of the
public_suffix.dat file, where blogspot.com and its country versions
were listed as TLDs. But we still have this problem even though we removed
them from the list.
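To illustrate the hypothesis we started from, here is a minimal sketch of how a public suffix entry can change the queue key. It is not the actual DomainnameQueueAssignmentPolicy code: the class name QueueKeySketch, the method queueKey and the in-memory suffix set are assumptions made for illustration only, and the real policy may compute domains differently. The idea is that the domain is the longest matching public suffix plus one label to its left, so listing blogspot.com as a suffix turns every blog into its own "domain".

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, NOT the NetarchiveSuite implementation.
class QueueKeySketch {

    // Returns the longest matching public suffix plus one label to its left.
    // Falls back to the whole host when no suffix matches.
    static String queueKey(String host, Set<String> publicSuffixes) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        // Candidates run from the longest proper suffix down to the last label.
        for (int i = 1; i < labels.length; i++) {
            String candidate = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (publicSuffixes.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        // Assumed suffix sets, standing in for the contents of public_suffix.dat.
        Set<String> withBlogspot = Set.of("com", "blogspot.com");
        Set<String> withoutBlogspot = Set.of("com");

        // blogspot.com listed as a suffix: one queue per blog.
        System.out.println(queueKey("aaa.blogspot.com", withBlogspot));    // aaa.blogspot.com
        System.out.println(queueKey("bbb.blogspot.com", withBlogspot));    // bbb.blogspot.com

        // blogspot.com not listed: all blogs share a single queue.
        System.out.println(queueKey("aaa.blogspot.com", withoutBlogspot)); // blogspot.com
        System.out.println(queueKey("bbb.blogspot.com", withoutBlogspot)); // blogspot.com
    }
}
```

With this logic, removing blogspot.com from the suffix set should be enough to collapse all blogs into one queue, which is why the behaviour we still see is surprising.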
For some jobs, the number of discovered blogspot queues can grow to the
point where it crashes the job.
Does anyone have the same problem?
Sara