[Netarchivesuite-users] File to exclude domains?
Peter Svanberg
Peter.Svanberg at kb.se
Wed Nov 15 11:49:57 CET 2023
A follow-up and simple(?) question related to the answer:
When a bean reads a file that is specified by filename only, with no path, how do you distribute that file to each job? I have not found a way to do this.
The example was the SurtPrefixedDecideRule bean, where the file was specified as
<property name="surtsSourceFile" value="exclude.txt" />
How do you distribute this file to each job directory?
In Sara’s example there was an absolute path, /dlweb/data/nas/exclude.txt. Is that on a common file system available on all harvester hosts? We don’t have that at the moment.
Also, I suppose the bean already present in the standard SCOPE sequence, under the comment
<!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
is the correct place for this SurtPrefixedDecideRule bean?
By the way, according to the Heritrix source code, surtsSourceFile is deprecated; you should use surtsSource instead, like this:
<property name="surtsSource">
  <bean class="org.archive.spring.ConfigFile">
    <property name="path" value="exclude.txt" />
  </bean>
</property>
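Combined with the bean from Sara's example quoted below, the result might look like the following sketch. It is untested: the bean id and the other properties are copied from her example, the surtsDumpFile property is left out, and the assumption that a relative path resolves against the job directory is exactly the open question above.

```xml
<!-- Sketch only: Sara's reject rule, using surtsSource instead of the deprecated surtsSourceFile -->
<bean id="rejectExcludedSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <!-- Reject any URI matching a SURT prefix read from the file -->
  <property name="decision" value="REJECT" />
  <property name="surtsSource">
    <bean class="org.archive.spring.ConfigFile">
      <!-- Assumption: a bare filename is resolved relative to the job directory -->
      <property name="path" value="exclude.txt" />
    </bean>
  </property>
  <property name="seedsAsSurtPrefixes" value="false" />
  <property name="alsoCheckVia" value="false" />
</bean>
```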
-----
Peter Svanberg
From: Peter Svanberg
Sent: 20 October 2023 18:54
To: netarchivesuite-users at ml.sbforge.org
Subject: RE: [Netarchivesuite-users] File to exclude domains?
Thank you Sara and Bert!
I was under the impression that it was some special treatment outside of the decideRule system, but this is perfect!
-----
Peter Sv.
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On behalf of sara.aubry at bnf.fr
Sent: 20 October 2023 09:29
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?
Hello Peter,
We use Heritrix's exclude.txt mechanism, which you can activate with the following bean in your profile:
<bean id="rejectExcludedSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <!-- Decision value (ACCEPT, REJECT, NONE) -->
  <property name="decision" value="REJECT" />
  <property name="surtsSourceFile" value="/dlweb/data/nas/exclude.txt" />
  <property name="seedsAsSurtPrefixes" value="false" />
  <property name="alsoCheckVia" value="false" />
  <property name="surtsDumpFile" value="/dlweb/data/nas/exclude.dump" />
</bean>
Best,
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date: 19/10/2023 21:11
Subject: [Netarchivesuite-users] File to exclude domains?
Sent by: "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________
I have a definite recollection of Sara talking about a file you can create containing domain names to be excluded from a snapshot, but I can't find any information about it anywhere. (There is NAS-1725, but I can't see what was done with that.) Can someone remind me?
(I know you can configure with zeros but a list in a file would be easier.)
-----
Peter Svanberg
National Library of Sweden
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users