[Netarchivesuite-users] Timelimit – usage and NAS problems
Peter Svanberg
Peter.Svanberg at kb.se
Tue Oct 3 15:35:05 CEST 2023
(Continuation after Zoom meeting:)
So BNF uses timelimit but is not affected by the NAS shortcoming, and Denmark does not use it. Does anyone else use it?
And what did you mean, Tue, about the queue? At the time limit all harvesters are stopped, what has been harvested so far is saved in the WARC, and the job is DONE, right? What was the problem? (Apart from some domains being falsely reported as stopped by the timelimit. But you can correct this in the database, which we did in one case.)
---
Peter Svanberg, Sweden
-----Original message-----
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On behalf of Peter Svanberg
Sent: 2 October 2023 13:37
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Timelimit – usage and NAS problems
Two things about timelimits.
1) When and how do you use timelimits in harvesting? It’s another way to limit the jobs. I suppose it mainly stops slow hosts, for example those that specify politeness delays in robots.txt (if you allow that to influence the crawl), or hosts with many small objects, each adding a delay.
2) NAS has limitations in handling jobs stopped by timelimit. It checks for a mention of “timelimit” on the last line of some Heritrix report and then reports timelimit for all domains that have not already been stopped by data or object limits. Hence the statistics are wrong. In our current broad crawl (pass 3), only 11 % of the domains were actually unfinished when the jobs were stopped by the timelimit. Also, if there is another pass, all the domains falsely reported as timelimit-stopped are unnecessarily harvested again.
This can be corrected in two ways:
A) NAS could look at the “remaining” column of the hosts-report file to check which domains were actually stopped and which were not.
B) We could propose (or implement) a change in Heritrix that adds a line to the log, with a new Heritrix code, when the queue for a domain becomes empty. NAS could then easily use that, as it does with the object and data limit codes.
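As a rough illustration of option A, here is a minimal Python sketch. It assumes the hosts report is a whitespace-separated table whose header row names a host column and a remaining column (written here as [host] and [#remaining]). The function name and the sample rows are made up for illustration, and mapping hosts to NAS domains is left out.

```python
def unfinished_hosts(report_text):
    """Return the hosts whose 'remaining' count is above zero.

    Assumption: the first non-empty line is a header such as
    "[#urls] [#bytes] [host] [#robots] [#remaining] ...", and every
    following line holds whitespace-separated values in the same order.
    """
    lines = [ln for ln in report_text.splitlines() if ln.strip()]
    if not lines:
        return set()

    # Normalise header tokens: "[#remaining]" -> "remaining", "[host]" -> "host".
    header = [tok.strip("[]#").lower() for tok in lines[0].split()]
    host_col = header.index("host")
    remaining_col = header.index("remaining")

    unfinished = set()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(host_col, remaining_col):
            continue  # skip short or malformed rows
        if int(fields[remaining_col]) > 0:
            # URIs were still queued for this host when the job stopped.
            unfinished.add(fields[host_col])
    return unfinished


# Made-up sample data, not taken from a real crawl.
SAMPLE_REPORT = """[#urls] [#bytes] [host] [#robots] [#remaining]
120 50000 www.example.se 0 0
45 12000 slow.example.org 0 310
"""

print(unfinished_hosts(SAMPLE_REPORT))
```

With such a set, only the domains of hosts that still had queued URIs would be reported as stopped by the timelimit; all other domains in the job could be reported as completed.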
I would appreciate answers and comments.
Peter Svanberg
Sweden