[Netarchivesuite-users] seedsRedirectNewSeeds

Tue Hejlskov Larsen tlr at kb.dk
Tue Nov 2 18:35:48 CET 2021


Hi Peter

In our defaultorder.xml we have set it as follows:

    <bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">
        <!-- Allow redirected seeds to be accepted as seeds
        In H1, this property belonged to the LinkScoper object, in H3, it is part of the CandidatesProcessor object
        -->
        <property name="seedsRedirectNewSeeds" value="false" />
    </bean>

We set seedsRedirectNewSeeds = false because many redirects either pointed to foreign domains with no Danish content at all, or pointed to other .dk domains that we had already harvested, so we would get many extra harvests and use a lot of extra space. What you lose by not using "seedsRedirectNewSeeds" is the cases where redirects actually point to a non-.dk domain that we would also like to have.
That is why the webdanica project was created: to find that content in another way.

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Tuesday, November 2, 2021 4:21 PM
To: 'netarchivesuite-users at ml.sbforge.org' <netarchivesuite-users at ml.sbforge.org>
Subject: [Netarchivesuite-users] seedsRedirectNewSeeds

seedsRedirectNewSeeds was the parameter I mentioned at the meeting.

Are there any pros and cons of setting this to true or false? I can imagine that following the redirects could cause problems, but does it in practice?
Do those of you who have chosen “false” have any experience with this?

---------
   /**
     * If enabled, any URL found because a seed redirected to it (original seed
     * returned 301 or 302), will also be treated as a seed, as long as the hop
     * count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
     */

    protected static final int SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS = 5;

<bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">
      <property name="seedsRedirectNewSeeds" value="true" />
</bean>
---------
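
To make the rule in the Javadoc above concrete, here is a minimal sketch of the decision it describes. It is not Heritrix's actual implementation: the class name, method name, and parameters are hypothetical, and only the constant name and its value of 5 come from the source quoted above. The sketch also assumes that "hop count" refers to the number of hops leading to the redirect target.

```java
// Hypothetical sketch of the rule described in the Javadoc; not Heritrix code.
final class SeedRedirectSketch {

    // Name and value taken from the source quoted above.
    static final int SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS = 5;

    /**
     * Decides whether a redirect target should itself be treated as a seed.
     *
     * @param seedsRedirectNewSeeds the configured property value
     * @param sourceIsSeed          whether the redirecting URL was a seed
     * @param status                HTTP status returned by the redirecting URL
     * @param hopCount              assumed: hops from the original seed to the target
     */
    static boolean promoteToSeed(boolean seedsRedirectNewSeeds,
                                 boolean sourceIsSeed,
                                 int status,
                                 int hopCount) {
        boolean isRedirect = status == 301 || status == 302;
        return seedsRedirectNewSeeds
                && sourceIsSeed
                && isRedirect
                && hopCount < SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS;
    }

    public static void main(String[] args) {
        // Property enabled, seed answers 301, one hop: target becomes a seed.
        System.out.println(promoteToSeed(true, true, 301, 1));  // true
        // Property enabled, but the hop limit has been reached.
        System.out.println(promoteToSeed(true, true, 302, 5));  // false
        // Property disabled: the target is never promoted.
        System.out.println(promoteToSeed(false, true, 301, 1)); // false
    }
}
```

With the property set to "false", the last case applies to every redirect, so a redirect target is handled as an ordinary discovered link rather than as a new seed.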

Peter Svanberg
Technical officer
Collection and metadata
Collection 1

Kungliga biblioteket
Box 5039, 102 41 Stockholm
Visiting address: Karlavägen 96, Stockholm
+46 10 709 32 78
Peter.Svanberg at kb.se<mailto:Peter.Svanberg at kb.se>
www.kb.se<https://www.kb.se/>

