[Netarchivesuite-users] NAS broad crawl questions

Peter Svanberg Peter.Svanberg at kb.se
Fri Sep 13 18:02:53 CEST 2019


This Wednesday at 11:02 we started our first NAS broad crawl, tadaa! (Pär has pictures showing Thomas and me pressing the mouse button, clicking on "Activate".)

It started well, with the job creation process. The first job, though, contained only one domain - maybe because it was special, with lots of non-default seeds. Then there was job two, containing 9999 domains, and then the process continued, with 10000 domains in each job.
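
One way to picture how such a pattern could arise is sketched below. This is only an illustration of the guess above, not NetarchiveSuite's actual job generation code: the function name split_into_jobs, the is_special predicate and the chunking strategy are all assumptions, and the cap of 10000 is simply taken from the job sizes we observed.

```python
from itertools import islice

MAX_DOMAINS_PER_JOB = 10000  # assumed cap, taken from the observed job sizes


def split_into_jobs(domains, is_special, chunk_size=MAX_DOMAINS_PER_JOB):
    """Yield lists of domains, one list per job (hypothetical batching).

    Domains are read in chunks of chunk_size. Within a chunk, every domain
    flagged by is_special gets a job of its own, and the remaining domains
    of that chunk share one job.
    """
    it = iter(domains)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        ordinary = []
        for domain in chunk:
            if is_special(domain):
                yield [domain]  # a special domain is isolated in its own job
            else:
                ordinary.append(domain)
        if ordinary:
            yield ordinary


# Made-up data: one special domain followed by 29999 ordinary ones.
domains = ["special.example"] + [f"d{i}.example" for i in range(29999)]
sizes = [len(job) for job in split_into_jobs(domains, lambda d: d == "special.example")]
print(sizes)  # [1, 9999, 10000, 10000]
```

With one special domain among the first 10000, this toy model gives a job of 1, then a job of 9999, then full jobs of 10000, which matches what we saw. Whether NAS really works this way is a separate question.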

After that, the first snapshot job started running. But after it finished, no more snapshot jobs were started.

Later, our selective harvests started and ran as scheduled. But since they finished, nothing seems to be happening on the job-finishing and job-starting side. The "All Running Jobs" page just contains job rows with a grey dot (crawl finished) and no host name. Job creation, however, is still going on and has now produced almost 100 jobs with 10000 domains each.


1) Do you have any hints on what could have happened? Is the admin host so busy with job creation that it can't handle anything else? If so, that was not the case during the first hours. Where could we look to find out what is wrong? (In the log files, of course, but what should we look for?)

We will let the job creation finish (which should happen on Sunday, sometime after 18:00) and see what happens then.

Then, concerning starting a broad crawl:


2) We were advised to have just one harvester process running when the snapshot harvest is activated, which we did. But when can more processes be started? After the first snapshot job has started? Or should we wait until all jobs have been created?

Regards,

-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se


