[Netarchivesuite-users] Limit both number of bytes and number of objects per domain

sara.aubry at bnf.fr sara.aubry at bnf.fr
Thu Sep 1 09:13:40 CEST 2022


Hi Peter,

1)      Have you used positive values on objects max?
We used a set of positive values in "Maximum number of objects" defined in 
configurations (usually 50,000, 100,000 or 150,000). 

2)      Have you changed 
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer or is 
it true ?
objectLimitIsSetByQuotaEnforcer is set to false

3)      What is your value on 
settings.harvester.harvesting.harvestReport.class ? BnfHarvestReport or 
LegacyHarvestReport
settings.harvester.harvesting.harvestReport.class is set to 
BnfHarvestReport

Best,

Sara



From:    "Peter Svanberg" <Peter.Svanberg at kb.se>
To:      "netarchivesuite-users at ml.sbforge.org" 
<netarchivesuite-users at ml.sbforge.org>
Date:    31/08/2022 17:11
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and number 
of objects per domain
Sent by: "NetarchiveSuite-users" 
<netarchivesuite-users-bounces at ml.sbforge.org>



Hi Sara,
 
This was interesting! Lots of questions:
 
1)      Have you used positive values on objects max?
2)      Have you changed 
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer or is 
it true ?
3)      What is your value on 
settings.harvester.harvesting.harvestReport.class ? BnfHarvestReport or 
LegacyHarvestReport
 
The template examples in NAS have both frontier and quotaenforcer, but 
with this comment.
 
## Can be used instead of the QuotaEnforcer module. In this case the following line should look
## like: frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
## instead of: frontier.queueTotalBudget=
 
frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
 
(Somewhat later:) Now I see, your statement makes me understand the 
handling in configureQuotaEnforcer(): it makes it possible to have both, 
by setting the value of the one that shouldn’t be used to infinity.
 
But you must have False in (2) and BnfHarvestReport in (3) above, or else 
I’m puzzled again. :-)
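
The "set the unused one to infinity" reading above can be sketched in a few lines. This is only an illustration of that reading, not code taken from NetarchiveSuite: the function and class names are hypothetical, and the use of -1 to mean "no limit" is an assumption.

```python
from dataclasses import dataclass

# Assumption: -1 stands for "no limit" (the "infinity" mentioned above).
NO_LIMIT = -1


@dataclass(frozen=True)
class ObjectLimits:
    """Values that would be substituted into the two template placeholders."""
    # Would fill FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER
    frontier_queue_total_budget: int
    # Would fill QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER
    quota_enforcer_group_max_fetch_successes: int


def choose_object_limits(object_limit_is_set_by_quota_enforcer: bool,
                         max_objects: int) -> ObjectLimits:
    """Give the object limit to one mechanism and disable the other.

    Hypothetical helper: mirrors the behaviour described in this thread,
    where only one of the two settings carries the per-domain object limit.
    """
    if object_limit_is_set_by_quota_enforcer:
        # QuotaEnforcer enforces the limit; the frontier budget is unlimited.
        return ObjectLimits(NO_LIMIT, max_objects)
    # The frontier budget enforces the limit; the QuotaEnforcer is unlimited.
    return ObjectLimits(max_objects, NO_LIMIT)
```

With the flag set to false and a limit of 50,000 objects, `choose_object_limits(False, 50000)` puts 50000 in the frontier budget and leaves the QuotaEnforcer object limit unset.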
 
-----
Peter

 
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> 
On behalf of sara.aubry at bnf.fr
Sent: 30 August 2022 14:02
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and number of 
objects per domain
 
Hi Peter,

I can't technically answer your question but QuotaEnforcer and 
queueTotalBudget are two different processors and have not been 
implemented in Heritrix to work together.

At BnF, we are using queueTotalBudget to manage queues by number of URLs.
Here is what we have in our profiles:

    <!-- FRONTIER (START)
   Record of all URIs discovered and queued-for-collection
   -->
   <bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
       <property name="maxRetries" value="10" />
       <property name="retryDelaySeconds" value="60" />
       <property name="recoveryLogEnabled" value="false" />
       <property name="balanceReplenishAmount" value="1000" />
       <property name="errorPenaltyAmount" value="1" />
       <!-- NETARCHIVESUITE Placeholder FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER -->
       <property name="queueTotalBudget" value="%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}" />
       <property name="snoozeLongMs" value="300000" />
       <property name="extract404s" value="false" />
   </bean>
   <!-- FRONTIER (END) -->

And we have no placeholder for the quotaEnforcer.

Best,

Sara




From:    "Peter Svanberg" <Peter.Svanberg at kb.se>
To:      "netarchivesuite-users at ml.sbforge.org" <
netarchivesuite-users at ml.sbforge.org>
Date:    30/08/2022 13:41
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and 
number of objects per domain
Sent by: "NetarchiveSuite-users" <
netarchivesuite-users-bounces at ml.sbforge.org>




Sorry, I mixed it up, alt. 3 edited below. So I suppose now that alt. 3 is 
true. And that the value of frontier.queueTotalBudget is irrelevant if you 
use quotaenforcer, i.e. if <ref bean="quotaenforcer"/>  is among the 
fetchProcessors.processors. True?
 
But there is a rumour that you should decide between byte and object limit 
– true or false?
 
Regards,
-----
Peter Svanberg

 
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> 
On behalf of Peter Svanberg
Sent: 29 August 2022 14:20
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Limit both number of bytes and number of 
objects per domain
 
Could someone please explain this handling?
 
In a snapshot we want to limit both number of bytes and number of objects 
per domain. If you give positive values in GUI for new snapshot harvest, 
what is recommended?
 
1.       You should not. Why not?
2.       You must change 
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer to 
false and change
settings.harvester.harvesting.harvestReport.class to 
dk.netarkivet.harvester.harvesting.report.BnfHarvestReport (which doesn’t 
assume annotations in crawl log).
3.       You can keep 
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer as true 
and it works …? Even though FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER (and 
hence frontier.queueTotalBudget) is set to infinity?
QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER in template (and hence 
quotaenforcer.groupMaxFetchSuccesses) is set to infinity (in 
configureQuotaEnforcer())?
 
Regards,
 
 


Peter Svanberg
Technical officer 
Aquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit
 
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se

 
 _______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users

Saturday 17 and Sunday 18 September 2022: the BnF celebrates the reopening 
of the Richelieu site, after twelve years of renovation and modernisation 
work, with a visitor trail in the company of artists and actors in the 
afternoon, and events and performances in the evening. 
Before printing, think of the environment.