[Netarchivesuite-users] Timelimit – usage and NAS problems
Peter Svanberg
Peter.Svanberg at kb.se
Fri Oct 20 18:46:19 CEST 2023
Yes, the harvesting stops when the timelimit is reached -- what else did you expect? :-)
The purpose of setting a timelimit is, I suppose, to stop very slow or trapped/looping harvests. Maybe you meant that you don't have any need for that? Or did you expect some other timelimit behaviour?
And as I said (and maybe you alluded to), there is a problem with falsely reporting timelimit for already completed domains. It can be fixed in the database, but you have to gather info from the metadata WARC files first, to know which jobs were actually stopped. I have Python scripts for this. But this could be fixed quite easily in NAS.
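For anyone wanting to do the same gathering, here is a minimal sketch of the extraction step, assuming the warcio library and that the hosts report sits in the metadata WARC as a record whose WARC-Target-URI contains "hosts-report.txt" (that URI match is an assumption; check your own metadata files):

    # Minimal sketch: pull the Heritrix hosts report out of a NAS metadata WARC.
    # Assumes warcio (pip install warcio); the URI match below is an assumption,
    # so verify it against the records in your own metadata files.
    from warcio.archiveiterator import ArchiveIterator

    def hosts_report_text(metadata_warc_path):
        with open(metadata_warc_path, 'rb') as f:
            for record in ArchiveIterator(f):
                uri = record.rec_headers.get_header('WARC-Target-URI') or ''
                if 'hosts-report.txt' in uri:
                    return record.content_stream().read().decode('utf-8', 'replace')
        return None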
Thanks for the update on your current schedule and figures!
Side track: We also observed that NAS reports "Domain completed" even if the domain doesn't exist (DNS error). Maybe that case should be a separate stop reason?
-----
Peter Svanberg, Sweden
-----Original Message-----
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 3 October 2023 17:13
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Timelimit – usage and NAS problems
If we use the timelimit for a broad crawl job, it cuts off the crawling of the domain seed queues even though not all the domains in the job have finished harvesting. You can see this in the Heritrix seeds or hosts reports. We are using a harvesting policy different from BNF's.
So if we used a timelimit for a job we would lose a lot of content, especially in step 2. We have tried it; the job was stopped after a couple of days. We have jobs in step 2 that run for several weeks because of the size of the domains.
We have up to 10K domains in each job (domains are grouped e.g. by order templates, hops, etc.) in our two broad crawl steps.
Each step has an overall maxbytes per domain (even though the maxbytes can be set higher on the domain level): 50 MB for step 1 and 16 GB for step 2. In step 1 all domains in the job DBs (about 3 million) are crawled, whether they are inactive or active. Only the domains which hit the 50 MB limit are included in step 2, with an overall 16 GB maxbytes limit per domain. Lower bytelimits on the domain level have higher priority than the overall step limit.
The two steps schedule about 500-800 jobs. Step 1 runs about 10 days on 110 crawlers in parallel without job curating and harvests about 12-15 TB. Step 2 runs on the same number of harvesters for about 6 weeks with job curating and harvests about 80-100 TB before dedup and compression. A broad crawl is done four times per year, with "maintenance windows" of between 2 weeks and 1 month per quarter.
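Expressed as code, the per-domain limit rule above is roughly the following (just a sketch of the rule as described, not actual NAS code; all names are made up):

    # Sketch of the per-domain byte limit rule described above (not NAS code).
    STEP1_LIMIT = 50 * 1024**2   # 50 MB overall per-domain limit in step 1
    STEP2_LIMIT = 16 * 1024**3   # 16 GB overall per-domain limit in step 2

    def effective_limit(domain_limit, step_limit):
        # A lower domain-level bytelimit overrides the step limit;
        # a higher domain-level value does not.
        if domain_limit is not None and domain_limit < step_limit:
            return domain_limit
        return step_limit

    def goes_to_step2(step1_bytes, domain_limit):
        # Only domains that hit their step-1 limit are re-harvested in step 2.
        return step1_bytes >= effective_limit(domain_limit, STEP1_LIMIT)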
Best regards
Tue
-----Original Message-----
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Tuesday, 3 October 2023 15.35
To: 'netarchivesuite-users at ml.sbforge.org'
Subject: Re: [Netarchivesuite-users] Timelimit – usage and NAS problems
(Continuation after Zoom meeting:)
So BNF uses the timelimit but is not affected by the NAS deficiency, and Denmark does not use it. Does anyone else use it?
And what did you mean, Tue, about the queue? When the time limit is reached, all harvesters are stopped, what was harvested so far is saved in the WARCs, and the job is DONE -- or is it? What was the problem? (Besides the fact that some domains are falsely reported as timelimit-stopped. But you can correct this in the database, which we did in one case.)
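For reference, the correction amounts to updating the stop reason for the domains that had in fact finished. A sketch of the idea follows (not our actual script); the table and column names (historyinfo, stopreason, configurations, domains) are from our reading of the NAS harvest database schema and the numeric stop-reason codes are placeholders, so verify both against your NAS version before running anything:

    # Sketch: reset falsely reported timelimit stops for domains that had
    # actually finished. Table/column names and the numeric StopReason codes
    # are assumptions -- check them against your NAS harvest database schema.
    import psycopg2

    DOWNLOAD_COMPLETE = 0   # placeholder for StopReason.DOWNLOAD_COMPLETE
    TIME_LIMIT = 5          # placeholder for the timelimit stop reason

    def fix_stopreasons(conn, job_id, finished_domains):
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE historyinfo SET stopreason = %s
                   WHERE job_id = %s AND stopreason = %s
                     AND config_id IN (
                         SELECT c.config_id FROM configurations c
                         JOIN domains d ON c.domain_id = d.domain_id
                         WHERE d.name = ANY(%s))""",
                (DOWNLOAD_COMPLETE, job_id, TIME_LIMIT, list(finished_domains)))
        conn.commit()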
---
Peter Svanberg, Sweden
-----Original Message-----
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: 2 October 2023 13:37
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Timelimit – usage and NAS problems
Two things about timelimits.
1) When and how do you use timelimits in harvesting? It's another way to limit the jobs. I suppose it mainly stops slow hosts -- perhaps hosts with politeness figures (crawl delays) in robots.txt, if you allow that to have influence, or hosts with many small objects, each adding delay.
2) NAS has limitations in handling jobs stopped by timelimit. It checks for mentions of "timelimit" on the last line of one of the Heritrix reports and then reports timelimit for all domains which have not already been stopped by data or object limits. Hence the statistics become wrong: in our current broad crawl (pass 3), just 11 % of the domains were actually unfinished when the jobs were stopped by timelimit. Also, if there is another pass, all those falsely timelimit-reported domains are unnecessarily harvested again.
This can be corrected in two ways:
A) NAS could look at the "remaining" column in the hosts-report files to check which domains were actually stopped (see the sketch below the list).
B) We could suggest/contribute a fix to Heritrix to add a line in the log with a new Heritrix code when the queue for a domain becomes empty, and then easily use that in NAS, as is done with the object and data limit codes.
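To illustrate option A, here is a minimal sketch of reading the "remaining" column, assuming the Heritrix 3 hosts-report.txt layout with a bracketed header line (column positions can differ between versions, so check a real report first):

    # Sketch for option A: find the hosts that still had URIs queued when the
    # job was stopped, using the "remaining" column of a Heritrix 3
    # hosts-report.txt. Column positions are read from the bracketed header
    # line, but verify the layout against a report from your Heritrix version.
    def unfinished_hosts(report_text):
        lines = [l for l in report_text.splitlines() if l.strip()]
        header = lines[0].replace('[', '').replace(']', '').split()
        host_col = header.index('host')
        remaining_col = header.index('#remaining')
        unfinished = {}
        for line in lines[1:]:
            fields = line.split()
            if len(fields) <= max(host_col, remaining_col):
                continue
            if int(fields[remaining_col]) > 0:
                unfinished[fields[host_col]] = int(fields[remaining_col])
        return unfinished

Every host not in that dict finished on its own, so only the hosts it returns should keep the timelimit stop reason.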
I would appreciate answers and comments.
Peter Svanberg
Sweden
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users