[Netarchivesuite-users] File to exclude domains?

Peter Svanberg Peter.Svanberg at kb.se
Wed Nov 15 11:49:57 CET 2023


A followup and simple(?) question related to the answer:

When a bean reads a file that is specified by filename only, with no path, how do you distribute that file to each job? I have not found a way to do this.

The example was the SurtPrefixedDecideRule bean, where the file was specified as
<property name="surtsSourceFile" value="exclude.txt" />
How do you distribute this file to each job directory?

In Sara’s example there was an absolute path, /dlweb/data/nas/exclude.txt. Is that on a common file system available on all harvester hosts? We don’t have that at the moment.

Also, I suppose the bean already present in the standard SCOPE sequence, marked by this comment:
      <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
is the correct place for this SurtPrefixedDecideRule bean?

BTW, according to the Heritrix source code, surtsSourceFile is deprecated; you should use surtsSource instead, like this:

<property name="surtsSource">
    <bean class="org.archive.spring.ConfigFile">
        <property name="path" value="exclude.txt" />
    </bean>
</property>
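
Combined with the other properties from Sara's bean, the non-deprecated form might look like the sketch below. It is untested and uses only properties already quoted in this thread. surtsDumpFile is left out, and where the relative path exclude.txt is looked up for each job is still the open question above.

```xml
<!-- Sketch only: Sara's reject rule rewritten to use surtsSource instead of
     the deprecated surtsSourceFile. Intended as an entry in the rules list
     of the scope sequence. -->
<bean id="rejectExcludedSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <!-- Decision value (ACCEPT, REJECT, NONE) -->
    <property name="decision" value="REJECT" />
    <property name="seedsAsSurtPrefixes" value="false" />
    <property name="alsoCheckVia" value="false" />
    <property name="surtsSource">
        <bean class="org.archive.spring.ConfigFile">
            <!-- Relative path taken from the example above -->
            <property name="path" value="exclude.txt" />
        </bean>
    </property>
</bean>
```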


-----
Peter Svanberg


From: Peter Svanberg
Sent: 20 October 2023 18:54
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?

Thank you Sara and Bert!

I was under the impression that it was some special treatment outside of the decideRule system, but this is perfect!

-----
Peter Sv.

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On behalf of sara.aubry at bnf.fr
Sent: 20 October 2023 09:29
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?

Hello Peter,

We use Heritrix's exclude.txt mechanism, which you can activate with the following bean in your profile:

<bean id="rejectExcludedSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <!-- Decision value (ACCEPT, REJECT, NONE) -->
    <property name="decision" value="REJECT" />
    <property name="surtsSourceFile" value="/dlweb/data/nas/exclude.txt" />
    <property name="seedsAsSurtPrefixes" value="false" />
    <property name="alsoCheckVia" value="false" />
    <property name="surtsDumpFile" value="/dlweb/data/nas/exclude.dump" />
</bean>

Best,

Sara




From:        "Peter Svanberg" <Peter.Svanberg at kb.se>
To:          "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date:        19/10/2023 21:11
Subject:     [Netarchivesuite-users] File to exclude domains?
Sent by:     "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________



I have a definite recollection of Sara talking about a file you can create containing domain names to be excluded from a snapshot. But I can't find any info on that anywhere (other than NAS-1725, and I can't tell what was done with that). Can someone remind me?

(I know you can configure with zeros but a list in a file would be easier.)
-----

Peter Svanberg
National Library of Sweden
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users