[Netarchivesuite-devel] Seed config limit

sara.aubry at bnf.fr
Wed Jan 29 09:21:39 CET 2020


Hi Andreas,
I'm not sure you ever got an answer to this question.
But I was going through the Heritrix Slack channel and saw this:
https://iipc.slack.com/archives/C2F63EUV7/p1567535683011800
Also, I think our colleagues from KB-DK are using a similar bean for this.
Sara



From:    aponb at gmx.at
To:      "netarchivesuite-devel at ml.sbforge.org" 
<netarchivesuite-devel at ml.sbforge.org>
Date:    10/01/2020 23:23
Subject: [Netarchivesuite-devel] Seed config limit
Sent by: "Netarchivesuite-devel" 
<netarchivesuite-devel-bounces at ml.sbforge.org>



How do you handle domains with many seeds during a crawl? For example, I
am running a crawl of the domain wordpress.com with 100 seeds in my
default seed list (host1.wordpress.com to host100.wordpress.com) and a
limit of 100 MB applied. The crawl starts with all seeds and, of course,
finishes as soon as it reaches the domain-config-limit of 100 MB, so many
seeds are only touched and nothing more. What I really want is a
seed-config-limit of 100 MB. How can I achieve this? How can I enforce a
limit per seed? Do you have any ideas?
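
To make the difference concrete, here is a rough toy model of the two
budgets. It is only a sketch under stated assumptions, not NetarchiveSuite
or Heritrix code: the function name `crawl`, the fixed 1 MB page size and
the round-robin fetch order are all hypothetical and chosen purely for
illustration.

```python
MB = 1024 * 1024


def crawl(seeds, rounds, page_size, domain_limit=None, seed_limit=None):
    """Toy round-robin crawl (hypothetical model, not a real crawler API).

    Each round tries to fetch one page of `page_size` bytes per seed.
    Returns a dict mapping each seed to the number of bytes fetched.
    """
    fetched = {seed: 0 for seed in seeds}
    total = 0
    for _ in range(rounds):
        for seed in seeds:
            # A shared domain budget stops the whole crawl once exhausted.
            if domain_limit is not None and total + page_size > domain_limit:
                return fetched
            # A per-seed budget only stops fetching for that one seed.
            if seed_limit is not None and fetched[seed] + page_size > seed_limit:
                continue
            fetched[seed] += page_size
            total += page_size
    return fetched


seeds = [f"host{i}.wordpress.com" for i in range(1, 101)]

# One 100 MB budget shared by the whole domain (the current behaviour).
shared = crawl(seeds, rounds=200, page_size=MB, domain_limit=100 * MB)

# A separate 100 MB budget for every seed (the behaviour asked for).
per_seed = crawl(seeds, rounds=200, page_size=MB, seed_limit=100 * MB)

print(max(shared.values()) // MB, "MB per seed with a shared domain limit")
print(min(per_seed.values()) // MB, "MB per seed with a per-seed limit")
```

In this model the shared budget leaves each of the 100 seeds with a single
1 MB page, while the per-seed budget lets every seed reach its own 100 MB.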

Regards

a.

_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel

