[Netarchivesuite-users] Number of domains/job recommendation?

Peeter Rahuvarm peeter.rahuvarm at nlib.ee
Fri Jan 24 12:00:22 CET 2020


Hello Peter

I have heard numbers ranging from 1000 to 400000+. There have also been approaches that differentiate sites by size and put fewer large sites, or more small sites, into one job.
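
To make the size-based approach concrete, here is a minimal sketch in Python. It is only an illustration, not how NetarchiveSuite or any particular institution does it: the function name, the size estimates (for example taken from a previous crawl) and the budget values are all hypothetical.

```python
def group_domains_by_size(estimated_sizes, job_budget_bytes, max_domains_per_job):
    """Greedily pack domains into jobs under a per-job size budget.

    estimated_sizes: dict mapping domain name -> estimated size in bytes
    (hypothetical estimates, e.g. from an earlier crawl).
    Returns a list of jobs, each job being a list of domain names.
    """
    jobs = []
    current, current_total = [], 0
    # Largest domains first, so big sites land in jobs with few domains
    # and small sites are packed many to a job.
    ordered = sorted(estimated_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for domain, size in ordered:
        over_budget = bool(current) and current_total + size > job_budget_bytes
        full = len(current) >= max_domains_per_job
        if over_budget or full:
            # Close the current job and start a new one.
            jobs.append(current)
            current, current_total = [], 0
        current.append(domain)
        current_total += size
    if current:
        jobs.append(current)
    return jobs


# Example with made-up numbers: one large site gets its own job,
# the smaller ones share a job.
sizes = {
    "big.example": 900,
    "mid.example": 500,
    "a.example": 100,
    "b.example": 100,
    "c.example": 100,
}
jobs = group_domains_by_size(sizes, job_budget_bytes=1000, max_domains_per_job=10)
```

Setting max_domains_per_job to 1 gives the one-domain-per-job extreme; a very large budget and domain cap gives the other extreme of a few huge jobs.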

When we were about to start our first broad crawl 4 years ago, we were very unsure about this. In the end we settled on 1, and so far we are happy with it. We now run about 100,000 Heritrix jobs a year, mostly with one seed in each job (plus redirect targets that we detect beforehand). We have a 1G size limit per job, but about 90-95% of jobs finish before reaching it.

I don't know if NetarchiveSuite is meant to work like this (we are using a self-developed management system).

--
Peeter Rahuvarm
National Library of Estonia
________________________________
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Peter Svanberg <Peter.Svanberg at kb.se>
Sent: Friday, 24 January 2020 10:14
To: netarchivesuite-users at ml.sbforge.org <netarchivesuite-users at ml.sbforge.org>
Subject: [Netarchivesuite-users] Number of domains/job recommendation?

Another short question: what do you think is a good level (*) for the number of domains per job in a broad crawl?

We started a crawl without realizing that we had a strange value for maxTotalSize. That led to almost 13000 jobs with an average of 37 domains in each, which is not so good. (But it made us learn how to stop a broad crawl.)

What should we choose?

Regards,

Peter Svanberg
National library of Sweden

(*) I would have liked to use the very Swedish word "lagom" here ...
https://en.wikipedia.org/wiki/Lagom
