[Netarchivesuite-users] Limit both number of bytes and number of objects per domain

Peter Svanberg Peter.Svanberg at kb.se
Wed Aug 31 17:11:06 CEST 2022


Hi Sara,

This was interesting! Lots of questions:


1)      Have you used positive values on objects max?

2)      Have you changed settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer, or is it true?

3)      What is your value for settings.harvester.harvesting.harvestReport.class? BnfHarvestReport or LegacyHarvestReport?

The template examples in NAS have both frontier and quotaenforcer, but with this comment.

## Can be used instead of the QuotaEnforcer module. In this case the following line should look
## like: frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
## instead of: frontier.queueTotalBudget=

frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}

(Somewhat later:) Now I see: your statement helps me understand the handling in configureQuotaEnforcer(). It makes it possible to have both in the template, by setting the value of the one that shouldn’t be used to infinity.

But you must have False in (2) and BnfHarvestReport in (3) above, or else I’m puzzled again. ☺
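
To illustrate that reading of configureQuotaEnforcer(), here is a minimal Java sketch. It is not the NetarchiveSuite source: the class LimitSketch, the method resolve, and the use of -1 to stand for "infinity" are assumptions made for this example only. The two placeholder names are the ones quoted in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LimitSketch {
    // Assumption for this sketch: -1 stands for "no limit" (infinity).
    static final long NO_LIMIT = -1L;

    /**
     * Decides which placeholder receives the per-domain object limit.
     * The mechanism that should not be used gets NO_LIMIT, so only one
     * of the two actually enforces the limit.
     */
    static Map<String, Long> resolve(boolean objectLimitIsSetByQuotaEnforcer,
                                     long maxObjects) {
        Map<String, Long> values = new LinkedHashMap<>();
        if (objectLimitIsSetByQuotaEnforcer) {
            // The quota enforcer carries the limit; the frontier budget is unlimited.
            values.put("QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER", maxObjects);
            values.put("FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER", NO_LIMIT);
        } else {
            // The frontier budget carries the limit; the quota enforcer is unlimited.
            values.put("QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER", NO_LIMIT);
            values.put("FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER", maxObjects);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, 5000L));
        System.out.println(resolve(false, 5000L));
    }
}
```

Under these assumptions, whichever way the setting points, exactly one of the two values is finite, which would explain why a template can contain both the frontier budget and the quotaenforcer bean.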

-----
Peter


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On behalf of sara.aubry at bnf.fr
Sent: 30 August 2022 14:02
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and number of objects per domain

Hi Peter,

I can't technically answer your question but QuotaEnforcer and queueTotalBudget are two different processors and have not been implemented in Heritrix to work together.

At BnF, we are using queueTotalBudget to manage queues by number of URLs.
Here is what we have in our profiles:

   <!-- FRONTIER (START)
   Record of all URIs discovered and queued-for-collection
   -->
   <bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
       <property name="maxRetries" value="10" />
       <property name="retryDelaySeconds" value="60" />
       <property name="recoveryLogEnabled" value="false" />
       <property name="balanceReplenishAmount" value="1000" />
       <property name="errorPenaltyAmount" value="1" />
       <!-- NETARCHIVESUITE Placeholder FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER -->
       <property name="queueTotalBudget" value="%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}" />
       <property name="snoozeLongMs" value="300000" />
       <property name="extract404s" value="false" />
   </bean>
   <!-- FRONTIER (END) -->

And we have no placeholder for the quotaEnforcer.
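
For readers unfamiliar with the placeholder mechanism in the profile above, the following is a minimal Java sketch of how a %{...} placeholder could be filled in before the profile is handed to Heritrix. The class PlaceholderSketch, the method fill, and the budget value 2000 are hypothetical and only serve the illustration; this is not how NetarchiveSuite is known to implement it.

```java
class PlaceholderSketch {
    /**
     * Replaces every occurrence of %{placeholder} in the template text
     * with the decimal form of the given value.
     */
    static String fill(String template, String placeholder, long value) {
        return template.replace("%{" + placeholder + "}", Long.toString(value));
    }

    public static void main(String[] args) {
        String line = "<property name=\"queueTotalBudget\" "
                + "value=\"%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}\" />";
        // 2000 is an arbitrary example budget, not a recommended value.
        System.out.println(fill(line, "FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER", 2000L));
    }
}
```

With the example value, the property line would end up as value="2000", i.e. a budget of 2000 URLs per queue.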

Best,

Sara




From:        "Peter Svanberg" <Peter.Svanberg at kb.se>
To:        "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date:        30/08/2022 13:41
Subject:        Re: [Netarchivesuite-users] Limit both number of bytes and number of objects per domain
Sent by:        "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________



Sorry, I mixed it up; alternative 3 is edited below. So I now suppose that alternative 3 is true, and that the value of frontier.queueTotalBudget is irrelevant if you use quotaenforcer, i.e. if <ref bean="quotaenforcer"/> is among the fetchProcessors.processors. True?

But there is a rumour that you should decide between byte and object limit – true or false?

Regards,
-----
Peter Svanberg


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On behalf of Peter Svanberg
Sent: 29 August 2022 14:20
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Limit both number of bytes and number of objects per domain

Could someone please explain this handling?

In a snapshot harvest we want to limit both the number of bytes and the number of objects per domain. If you give positive values for both in the GUI for a new snapshot harvest, what is recommended?

1.       You should not. Why not?
2.       You must change settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer to false and change
settings.harvester.harvesting.harvestReport.class to dk.netarkivet.harvester.harvesting.report.BnfHarvestReport (which doesn’t assume annotations in crawl log).
3.       You can keep settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer as true and it works …? Even though FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER (and hence frontier.queueTotalBudget) is set to infinity? QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER in template (and hence quotaenforcer.groupMaxFetchSuccesses) is set to infinity (in configureQuotaEnforcer())?

Regards,



Peter Svanberg
Technical officer
Acquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit

National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
https://www.kb.se/


