[Netarchivesuite-users] File to exclude domains?

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Nov 15 14:53:43 CET 2023


Hi Peter,

> How do you distribute this file to each job directory?
> In Sara’s example there was an absolute path, /dlweb/data/nas/exclude.txt.
> Is that a common file system available on all harvester hosts? We don’t
> have that at the moment.

The answer is yes. Our crawlers share common disk spaces, and
/dlweb/data/nas/ is one of them.
If you don't have one, copying the updated file to each crawler's file
system will also work. You can change the content of this file while the
crawl is running.

SurtPrefixedDecideRule is a DecideRule, so I guess the right place for it
depends on how your rules are organized.

Thanks for letting us know about the surtsSource.

Best,

Sara



From:    "Peter Svanberg" <Peter.Svanberg at kb.se>
To:      "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date:    15/11/2023 11:50
Subject: Re: [Netarchivesuite-users] File to exclude domains?
Sent by: "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>



A followup and simple(?) question related to the answer:
 
When you specify files to be read from a bean, specified with just a 
filename, no path – how do you distribute this file to each job? I fail to 
find a way to do this.
 
The example was the SurtPrefixedDecideRule bean, where the file was 
specified as
<property name="surtsSourceFile" value="exclude.txt" />
How do you distribute this file to each job directory?
 
In Sara’s example there was an absolute path, /dlweb/data/nas/exclude.txt.
Is that a common file system available on all harvester hosts? We don’t
have that at the moment.
 
Also, I suppose the bean already present in the standard SCOPE sequence,
under the comment
      <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
is the correct place for this SurtPrefixedDecideRule bean?
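
As a rough sketch of that placement: assuming the rule is declared as a
top-level bean with the id rejectExcludedSurts (the id used in Sara's
example) and that the profile keeps the standard Heritrix 3 scope bean, the
rule could be referenced from the DecideRuleSequence as shown here. The
neighbouring rules are elided; only the position of the REJECT rule is
illustrated.

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
        <list>
            <!-- ... earlier rules (initial REJECT, ACCEPT by SURT, hop limits, etc.) ... -->

            <!-- ...but REJECT those from a configurable set of REJECT SURTs... -->
            <!-- Assumption: rejectExcludedSurts is defined elsewhere in the profile -->
            <ref bean="rejectExcludedSurts" />

            <!-- ... later rules ... -->
        </list>
    </property>
</bean>

Since a DecideRuleSequence applies its rules in order and the last rule
that returns a decision wins, a REJECT rule at this position overrides the
earlier ACCEPT rules for any URI that matches the exclusion list.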
 
BTW, according to the Heritrix source code, surtsSourceFile is deprecated;
you should use surtsSource instead, like this:
 
<property name="surtsSource">
    <bean class="org.archive.spring.ConfigFile">
        <property name="path" value="exclude.txt" />
    </bean>
</property>
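
Combined with Sara's bean quoted further down, the non-deprecated form
might look like the sketch below. The bean id, the decision and the other
properties are taken from her example; the absolute path is her site's
shared location and is only an assumption for any other installation.

<bean id="rejectExcludedSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <property name="decision" value="REJECT" />
    <property name="seedsAsSurtPrefixes" value="false" />
    <property name="alsoCheckVia" value="false" />
    <!-- Replaces the deprecated surtsSourceFile property -->
    <property name="surtsSource">
        <bean class="org.archive.spring.ConfigFile">
            <!-- Assumption: a path readable on every harvester host -->
            <property name="path" value="/dlweb/data/nas/exclude.txt" />
        </bean>
    </property>
</bean>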
 
 
-----
Peter Svanberg

 
From: Peter Svanberg
Sent: 20 October 2023 18:54
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?
 
Thank you Sara and Bert! 
 
I was under the impression that it was some special treatment outside of 
the decideRule system, but this is perfect!
 
-----
Peter Sv.
 
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org>
On behalf of sara.aubry at bnf.fr
Sent: 20 October 2023 09:29
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] File to exclude domains?
 
Hello Peter,

We use Heritrix's exclude.txt mechanism, which you can activate with the
following bean in your profile:

<bean id="rejectExcludedSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <!-- Decision value (ACCEPT, REJECT, NONE) -->
    <property name="decision" value="REJECT" />
    <property name="surtsSourceFile" value="/dlweb/data/nas/exclude.txt" />
    <property name="seedsAsSurtPrefixes" value="false" />
    <property name="alsoCheckVia" value="false" />
    <property name="surtsDumpFile" value="/dlweb/data/nas/exclude.dump" />
</bean>
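
For installations without a shared file system, one possible alternative
(not discussed in this thread) is to carry the list inside the profile
itself. The property fragment below is a sketch that assumes Heritrix 3's
org.archive.spring.ConfigString can be used as the surtsSource of this
bean in place of a file; the example.* entries are placeholders, not real
exclusions.

<property name="surtsSource">
    <bean class="org.archive.spring.ConfigString">
        <property name="value">
            <value>
example.com
http://www.example.org/private/
            </value>
        </property>
    </bean>
</property>

The entries are kept flush left so the sketch does not rely on any
whitespace trimming. The trade-off is that the list then lives in the
profile rather than in a file that can be edited while the crawl is
running.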

Best,

Sara




From:    "Peter Svanberg" <Peter.Svanberg at kb.se>
To:      "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date:    19/10/2023 21:11
Subject: [Netarchivesuite-users] File to exclude domains?
Sent by: "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>




I have a definite recollection of Sara talking about a file you can create
containing domain names to be excluded from a snapshot, but I can't find
any information about it anywhere (other than NAS-1725, and I can't tell
what was done with that). Can someone remind me?
 
(I know you can configure with zeros but a list in a file would be 
easier.)
-----

Peter Svanberg
National Library of Sweden
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users
