[Netarchivesuite-users] Job generation/split settings

Peter Svanberg Peter.Svanberg at kb.se
Fri Jul 5 15:39:03 CEST 2019


Hello!

While trying to understand how NAS splits jobs, I found the settings.xml file and analysed the relevant parameters; see the attached file. Has anyone changed any of these values from the distributed defaults? If so, please describe what you changed.

Specifically, when you do your first broad crawl and all configurations are the same, it will (as I interpret the code) split a job either when the number of configurations reaches 10 000 (domainConfigSubsetSize) or when the estimated number of objects reaches 8 000 000 (maxTotalSize). (With a 50 GByte byte limit this would be about 6000.) Is up to 10 000 domains/configurations in one job normal for a broad crawl job?
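
The split rule as interpreted above can be illustrated with a minimal sketch. This is not the NetarchiveSuite implementation: the function name split_into_jobs and the greedy, in-order packing are assumptions made only for illustration. The two constants reuse the values quoted above for domainConfigSubsetSize and maxTotalSize.

```python
from typing import Iterable, List

# Values quoted in the message; the constant names are illustrative only.
DOMAIN_CONFIG_SUBSET_SIZE = 10_000   # max configurations per job
MAX_TOTAL_SIZE = 8_000_000           # max estimated objects per job


def split_into_jobs(
    estimated_objects: Iterable[int],
    max_configs: int = DOMAIN_CONFIG_SUBSET_SIZE,
    max_total: int = MAX_TOTAL_SIZE,
) -> List[List[int]]:
    """Greedily pack per-configuration object estimates into jobs.

    Hypothetical helper: a new job is started when the current one already
    holds max_configs configurations, or when adding the next configuration
    would push the summed estimate above max_total.
    """
    jobs: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for estimate in estimated_objects:
        job_is_full = len(current) >= max_configs
        would_overflow = current_total + estimate > max_total
        if current and (job_is_full or would_overflow):
            jobs.append(current)
            current = []
            current_total = 0
        current.append(estimate)
        current_total += estimate
    if current:
        jobs.append(current)
    return jobs


# Small estimates: the configuration count is the limit that triggers the split.
small = split_into_jobs([100] * 25_000)
# Larger estimates: the total estimated object count triggers the split first.
large = split_into_jobs([2_000] * 25_000)
```

With 25 000 identical configurations estimated at 100 objects each, the sketch splits on the configuration count; at 2 000 objects each, the object total becomes the binding limit and the jobs hold fewer configurations.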

And to Denmark/Tue: Is this another reason to set individual limits on domains, so that domains with equal limits are put in the same job? (And maybe I found a bug in the non-default generator; see the red text in the attached file.)

Regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se



-------------- next part --------------
A non-text attachment was scrubbed...
Name: settings_xml.pdf
Type: application/pdf
Size: 14136 bytes
Desc: settings_xml.pdf
URL: <https://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20190705/00a27caf/attachment-0001.pdf>

