[Netarchivesuite-users] Limit both number of bytes and number of objects per domain
sara.aubry at bnf.fr
Thu Sep 1 09:13:40 CEST 2022
Hi Peter,
1) Have you used positive values on objects max?
We used a set of positive values in "Maximum number of objects" defined in
configurations (usually 50,000, 100,000 or 150,000).
2) Have you changed
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer or is
it true ?
objectLimitIsSetByQuotaEnforcer is set to false
3) What is your value on
settings.harvester.harvesting.harvestReport.class ? BnfHarvestReport or
LegacyHarvestReport
settings.harvester.harvesting.harvestReport.class is set to
BnfHarvestReport
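
For readers who want to reproduce this setup, the two answers above might translate into a settings fragment like the following. This is only a sketch: it assumes that the dotted setting names map onto nested XML elements in the NetarchiveSuite settings file, and it takes the fully qualified class name from Peter's original question further down the thread.

```xml
<!-- Sketch only: element nesting is assumed from the dotted setting names -->
<settings>
  <harvester>
    <scheduler>
      <jobGen>
        <!-- The object limit is not handed to the QuotaEnforcer -->
        <objectLimitIsSetByQuotaEnforcer>false</objectLimitIsSetByQuotaEnforcer>
      </jobGen>
    </scheduler>
    <harvesting>
      <harvestReport>
        <!-- Report class that does not assume annotations in the crawl log -->
        <class>dk.netarkivet.harvester.harvesting.report.BnfHarvestReport</class>
      </harvestReport>
    </harvesting>
  </harvester>
</settings>
```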
Best,
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date: 31/08/2022 17:11
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and number
of objects per domain
Sent by: "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>
Hi Sara,
This was interesting! Lots of questions:
1) Have you used positive values on objects max?
2) Have you changed
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer or is
it true ?
3) What is your value on
settings.harvester.harvesting.harvestReport.class ? BnfHarvestReport or
LegacyHarvestReport
The template examples in NAS have both frontier and quotaenforcer, but
with this comment.
## Can be used instead of the QuotaEnforcer module. In this case the following line should look
## like: frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
## instead of: frontier.queueTotalBudget=
frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
(Somewhat later:) Now I see, your statement makes me understand the
handling in configureQuotaEnforcer(): it makes it possible to have both,
by setting the value of the one that shouldn’t be used to infinity.
But you must have False in (2) and BnfHarvestReport in (3) above, or else
I’m puzzled again. :-)
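
As an illustration of that reading, the overrides in a generated job might end up looking roughly like this when the object limit goes to the frontier. This is a hedged sketch, not output from NetarchiveSuite: it assumes the template uses a Heritrix 3 simpleOverrides bean, that "infinity" is written as -1 (the Heritrix value for "no limit"), and 100000 is just one of the object limits Sara mentions.

```xml
<!-- Sketch only: values after placeholder substitution are assumed -->
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
# Object limit per queue is enforced by the frontier
frontier.queueTotalBudget=100000
# The unused limit is neutralised (-1 is assumed to mean "no limit")
quotaenforcer.groupMaxFetchSuccesses=-1
    </value>
  </property>
</bean>
```

With objectLimitIsSetByQuotaEnforcer set to true, the two values would presumably be swapped.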
-----
Peter
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org>
On behalf of sara.aubry at bnf.fr
Sent: 30 August 2022 14:02
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and number of
objects per domain
Hi Peter,
I can't technically answer your question but QuotaEnforcer and
queueTotalBudget are two different processors and have not been
implemented in Heritrix to work together.
At BnF, we are using queueTotalBudget to manage queues by number of URLs.
Here is what we have in our profiles :
<!-- FRONTIER (START)
Record of all URIs discovered and queued-for-collection
-->
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
<property name="maxRetries" value="10" />
<property name="retryDelaySeconds" value="60" />
<property name="recoveryLogEnabled" value="false" />
<property name="balanceReplenishAmount" value="1000" />
<property name="errorPenaltyAmount" value="1" />
<!-- NETARCHIVESUITE Placeholder
FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER -->
<property name="queueTotalBudget"
value="%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}" />
<property name="snoozeLongMs" value="300000" />
<property name="extract404s" value="false" />
</bean>
<!-- FRONTIER (END) -->
And we have no placeholder for the quotaEnforcer.
Best,
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org" <
netarchivesuite-users at ml.sbforge.org>
Date: 30/08/2022 13:41
Subject: Re: [Netarchivesuite-users] Limit both number of bytes and
number of objects per domain
Sent by: "NetarchiveSuite-users" <
netarchivesuite-users-bounces at ml.sbforge.org>
Sorry, I mixed it up, alt. 3 edited below. So I suppose now that alt. 3 is
true. And that the value of frontier.queueTotalBudget is irrelevant if you
use quotaenforcer, i.e. if <ref bean="quotaenforcer"/> is among the
fetchProcessors.processors. True?
But there is a rumour that you should decide between byte and object limit
– true or false?
Regards,
-----
Peter Svanberg
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org>
On behalf of Peter Svanberg
Sent: 29 August 2022 14:20
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Limit both number of bytes and number of
objects per domain
Could someone please explain this handling?
In a snapshot harvest we want to limit both the number of bytes and the
number of objects per domain. If you give positive values for both in the
GUI for a new snapshot harvest, what is recommended?
1. You should not. Why not?
2. You must change
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer to
false and change
settings.harvester.harvesting.harvestReport.class to
dk.netarkivet.harvester.harvesting.report.BnfHarvestReport (which doesn’t
assume annotations in crawl log).
3. You can keep
settings.harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer as true
and it works …? Even though FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER (and
hence frontier.queueTotalBudget) is set to infinity?
QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER in template (and hence
quotaenforcer.groupMaxFetchSuccesses) is set to infinity (in
configureQuotaEnforcer())?
Regards,
Peter Svanberg
Technical officer
Acquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users
Saturday 17 and Sunday 18 September 2022: the BnF celebrates the reopening
of the Richelieu site, after twelve years of renovation and modernisation
work, with a tour in the company of artists and actors in the afternoon,
and events and performances in the evening.
Before printing, think of the environment.