[Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits

Peter Svanberg Peter.Svanberg at kb.se
Mon Mar 18 13:38:12 CET 2019


Yes, okay, I did look at such a WARC file last week; I had forgotten that.

Thank you all for your quick answers, we'll do tests right away!
-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Colin Samuel Rosenthal
Sent: 18 March 2019 13:12
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits


Hi Peter,



The crawl logs and all the other logs and reports are packaged up in a metadata WARC file, which is uploaded to the archive along with the harvested data. Once you have found it, the hosts report is usually very informative, because it shows how many objects and bytes were harvested for each host.
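
A minimal sketch of that kind of per-host tally, computed straight from a crawl.log rather than from the hosts report. It assumes the usual Heritrix 3 layout with whitespace-separated fields, where the third field is the document size in bytes (or "-") and the fourth is the URI. The function name tally_by_host is made up for illustration, and the totals it produces are not guaranteed to match what NAS counts against the limits.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def tally_by_host(lines):
    """Return {host: [object_count, byte_count]} from crawl.log-style lines.

    Assumption: field 3 is the size in bytes (or "-"), field 4 is the URI.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        size, uri = fields[2], fields[3]
        # URIs without a network location (e.g. "dns:example.org") have no
        # hostname, so fall back to the raw URI as the key.
        host = urlsplit(uri).hostname or uri
        totals[host][0] += 1
        if size.isdigit():  # size is "-" when nothing was fetched
            totals[host][1] += int(size)
    return dict(totals)
```

Passing an open crawl.log file object works, since the function only iterates over lines. Comparing the per-host byte totals with the configured limit is then a quick way to see which hosts overshoot.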



/Colin


--
Colin Rosenthal PhD
Senior IT Consultant
Royal Danish Library (Aarhus)

________________________________
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Peter Svanberg <Peter.Svanberg at kb.se>
Sent: Monday, March 18, 2019 1:03 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits


Thanks, Tue



The crawl.log file has one line for each URL which Heritrix has tried to fetch, I assume.



But that file (and the surrounding files) seems to disappear when the job is done, right? So you have to monitor during the harvest?



/Peter



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 18 March 2019 11:10
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits



Hello Peter



In the H3 crawl.log.



You can find it during the harvest, e.g. here:

harvester_low_8090/307071_1552863418226/heritrix3/jobs/307071_1552863418226/logs/crawl.log

or in the NetarchiveSuite GUI under the job reports.



It will tell you why :)



Best regards

Tue



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, March 18, 2019 11:01 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits



Hello Tue!



Not much, the ones that I've seen. Which kind of log/filename should I check? Should we change log level? How? What should I look for in the log?

(I've got sysadmin and programming experience, but I'm new to this system. So I'm eager to learn more!)

/Peter



On 18 March 2019 at 10:07, Tue Hejlskov Larsen <tlr at kb.dk> wrote:

Hello Peter



What do your crawl logs tell you?



Best regards

Tue



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, March 18, 2019 12:07 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits



Hello, NAS users and others!



We are experiencing a very strange behavior from NAS/Heritrix (see attached Excel file, with comments):



The harvest reports say "Stopped due to ... byte/object limit reached" at very different levels: sometimes far above the limit (more than five times), sometimes far below it. We fail to see any pattern in this; it seems more or less random.



What are we doing wrong? Is it some error in the harvest template? (Attached below each table.)



Or, if it is some kind of bug, are there workarounds?



We would much appreciate any hints, as this is quite a problem for us, both for the on-going selective harvests and the upcoming big snapshot run!



(We are running version 5.4.2. I hope that doesn't affect this problem, as we can't upgrade now.)



Best regards,

-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

