[Netarchivesuite-users] Limit both number of bytes and number of objects per domain

sara.aubry at bnf.fr sara.aubry at bnf.fr
Tue Aug 30 14:02:29 CEST 2022


Hi Peter,

I can't give a technical answer to your question, but QuotaEnforcer and 
queueTotalBudget are two different mechanisms (a processor and a frontier 
property, respectively) and were not implemented in Heritrix to work 
together.

At BnF, we use queueTotalBudget to limit queues by number of URLs.
Here is what we have in our profiles:

    <!-- FRONTIER (START)
    Record of all URIs discovered and queued-for-collection
    -->
    <bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
        <property name="maxRetries" value="10" />
        <property name="retryDelaySeconds" value="60" />
        <property name="recoveryLogEnabled" value="false" />
        <property name="balanceReplenishAmount" value="1000" />
        <property name="errorPenaltyAmount" value="1" />
        <!-- NETARCHIVESUITE Placeholder FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER -->
        <property name="queueTotalBudget" value="%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}" />
        <property name="snoozeLongMs" value="300000" />
        <property name="extract404s" value="false" />
    </bean>
    <!-- FRONTIER (END) -->

And we have no placeholder for the QuotaEnforcer.
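
For comparison, a minimal sketch of how a QuotaEnforcer bean with a
placeholder could be declared in a Heritrix 3 profile. This is an
illustration and not taken from the BnF profiles: the bean id and
placeholder name are the ones mentioned later in this thread, the class
name is assumed to be the standard Heritrix one, and all of it should be
checked against your own template.

    <!-- Hypothetical sketch, not from the BnF profiles -->
    <bean id="quotaenforcer" class="org.archive.crawler.prefetch.QuotaEnforcer">
        <!-- Limit on successfully fetched URIs per queue group;
             Heritrix is understood to treat -1 as "no limit" -->
        <property name="groupMaxFetchSuccesses" value="%{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER}" />
    </bean>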

Best,

Sara




From:    "Peter Svanberg" <Peter.Svanberg at kb.se>
To:      "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date:    30/08/2022 13:41
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and number of objects per domain
Sent by: "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>



Sorry, I mixed it up; alt. 3 is edited below. So I now suppose that alt. 3 
is true, and that the value of frontier.queueTotalBudget is irrelevant if 
you use quotaenforcer, i.e. if <ref bean="quotaenforcer"/> is among the 
fetchProcessors.processors. True?
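
As a sketch of what "among the fetchProcessors.processors" means: the fetch
chain in a Heritrix 3 profile is a list of bean references. The
neighbouring entries below follow the stock Heritrix fetch chain, and the
position of quotaenforcer is an assumption; check both against your own
template.

    <bean id="fetchProcessors" class="org.archive.modules.FetchChain">
        <property name="processors">
            <list>
                <ref bean="preselector"/>
                <ref bean="preconditions"/>
                <!-- Assumed position; the quota enforcer only sees URIs
                     if this reference is present in the chain -->
                <ref bean="quotaenforcer"/>
                <ref bean="fetchDns"/>
                <ref bean="fetchHttp"/>
                <!-- remaining processors omitted -->
            </list>
        </property>
    </bean>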
 
But there is a rumour that you should choose between a byte limit and an 
object limit – true or false?
 
Regards,
-----
Peter Svanberg

 
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: 29 August 2022 14:20
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Limit both number of bytes and number of objects per domain
 
Could someone please explain how this is handled?
 
In a snapshot harvest we want to limit both the number of bytes and the 
number of objects per domain. If you enter positive values for both in the 
GUI for a new snapshot harvest, what is recommended?
 
1. You should not. Why not?
2. You must change
   settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer
   to false and change
   settings.harvester.harvesting.harvestReport.class to
   dk.netarkivet.harvester.harvesting.report.BnfHarvestReport
   (which doesn’t assume annotations in the crawl log).
3. You can keep
   settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer
   as true and it works …? Even though
   FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER (and hence
   frontier.queueTotalBudget) is set to infinity?
   QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER in the template (and
   hence quotaenforcer.groupMaxFetchSuccesses) is set to infinity (in
   configureQuotaEnforcer())?
 
Regards,
 
 


Peter Svanberg
Technical officer 
Acquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit
 
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se
 
 _______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users

