[Netarchivesuite-users] File to exclude domains?
sara.aubry at bnf.fr
Wed Nov 15 14:53:43 CET 2023
Hi Peter,
> How do you distribute this file to each job directory?
> In Sara’s example there was an absolute path, /dlweb/data/nas/exclude.txt,
> is that a common file system available on all harvester hosts? We don’t
> have that at the moment.
The answer is yes. Our crawlers share common disk spaces and
/dlweb/data/nas/ is one of them.
If you don't have one, copying the updated file to each crawler's file
system will also work. You can change the content of this file while the
crawl is running.
SurtPrefixedDecideRule is a DecideRule, so I guess the right place for it
depends on how your rules are organized.
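To illustrate why the position matters: in a Heritrix DecideRuleSequence the rules are applied in order, and the last rule that returns something other than NONE determines the outcome. A REJECT rule therefore needs to come after the ACCEPT rules it is meant to override. The fragment below is only a shortened sketch, not a complete profile. It assumes the exclusion bean is defined at the top level of the profile under the id rejectExcludedSurts (the id used in the bean quoted further down in this thread) and is pulled into the sequence with a Spring ref; the other rules of a real profile are omitted.

```xml
<!-- Sketch only: a shortened scope sequence; a real profile has more rules. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- Begin by rejecting everything... -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <!-- ...then accept URIs matching the SURT prefixes derived from the seeds... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <!-- (the other rules of your profile would stay here) -->
      <!-- ...but REJECT URIs matching the exclusion list.
           Assumption: a top-level bean with id "rejectExcludedSurts" exists.
           It is placed after the ACCEPT rules so that its REJECT wins. -->
      <ref bean="rejectExcludedSurts" />
    </list>
  </property>
</bean>
```

If your profile already contains a REJECT SurtPrefixedDecideRule at that position, configuring that existing bean instead of adding a second one should have the same effect.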
Thanks for letting us know about the surtsSource.
Best,
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date: 15/11/2023 11:50
Subject: Re: [Netarchivesuite-users] File to exclude domains?
Sent by: "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
A followup and simple(?) question related to the answer:
When you specify files to be read from a bean, specified with just a
filename, no path – how do you distribute this file to each job? I fail to
find a way to do this.
The example was the SurtPrefixedDecideRule bean, where the file was
specified as
<property name="surtsSourceFile" value="exclude.txt" />
How do you distribute this file to each job directory?
In Sara’s example there was an absolute path, /dlweb/data/nas/exclude.txt,
is that a common file system available on all harvester hosts? We don’t
have that at the moment.
Also, I suppose the bean already present in the standard SCOPE sequence,
under the comment
<!-- ...but REJECT those from a configurable (initially empty) set
of REJECT SURTs... -->
is the correct place for this SurtPrefixedDecideRule bean?
BTW, according to Heritrix source code, surtsSourceFile is deprecated, you
should use surtsSource, like this:
<property name="surtsSource">
  <bean class="org.archive.spring.ConfigFile">
    <property name="path" value="exclude.txt" />
  </bean>
</property>
-----
Peter Svanberg
From: Peter Svanberg
Sent: 20 October 2023 18:54
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?
Thank you Sara and Bert!
I was under the impression that it was some special treatment outside of
the decideRule system, but this is perfect!
-----
Peter Sv.
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org>
On behalf of sara.aubry at bnf.fr
Sent: 20 October 2023 09:29
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?
Hello Peter,
We use Heritrix's exclude.txt mechanism, which you can activate with the
following bean in your profile:
<bean id="rejectExcludedSurts"
      class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <!-- Decision value (ACCEPT, REJECT, NONE) -->
  <property name="decision" value="REJECT" />
  <property name="surtsSourceFile" value="/dlweb/data/nas/exclude.txt" />
  <property name="seedsAsSurtPrefixes" value="false" />
  <property name="alsoCheckVia" value="false" />
  <property name="surtsDumpFile" value="/dlweb/data/nas/exclude.dump" />
</bean>
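Since surtsSourceFile is reported later in this thread to be deprecated in favour of surtsSource, the same bean could presumably be written as below. This is an untested sketch that simply combines the two snippets from the thread: the path and the other property values are the ones from the bean above, and the ConfigFile wrapper is the one from the surtsSource example.

```xml
<bean id="rejectExcludedSurts"
      class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <!-- Decision value (ACCEPT, REJECT, NONE) -->
  <property name="decision" value="REJECT" />
  <!-- surtsSource replaces the deprecated surtsSourceFile property -->
  <property name="surtsSource">
    <bean class="org.archive.spring.ConfigFile">
      <!-- Absolute path on a file system visible to every crawler host -->
      <property name="path" value="/dlweb/data/nas/exclude.txt" />
    </bean>
  </property>
  <property name="seedsAsSurtPrefixes" value="false" />
  <property name="alsoCheckVia" value="false" />
  <property name="surtsDumpFile" value="/dlweb/data/nas/exclude.dump" />
</bean>
```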
Best,
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date: 19/10/2023 21:11
Subject: [Netarchivesuite-users] File to exclude domains?
Sent by: "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
I have a definite recollection of Sara talking about a file you can create
containing domain names to be excluded from a snapshot. But I can't find
any info on that anywhere (other than NAS-1725, though not what was done
with that). Can someone remind me?
(I know you can configure with zeros but a list in a file would be
easier.)
-----
Peter Svanberg
National Library of Sweden
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users