[Netarchivesuite-users] Exclude Domain from Fullcrawl
Karen Høgsberg
kah at kb.dk
Fri Apr 24 12:08:32 CEST 2009
In the default seed list that is used for snapshot harvesting, you prefix the seed with '#',
e.g. '#http://www.autosscout.dk'.
The domain will then be excluded from harvesting.
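To illustrate the idea, here is a minimal Python sketch. It assumes that a seed line starting with '#' is simply treated as a comment and skipped. This is not NetarchiveSuite code; the function name active_seeds and the seed http://www.example.dk are hypothetical and used only for illustration.

```python
def active_seeds(seed_list_text: str) -> list[str]:
    """Return the seeds that are not commented out with a leading '#'."""
    seeds = []
    for line in seed_list_text.splitlines():
        line = line.strip()
        # Skip blank lines and seeds marked with '#' (assumed comment marker).
        if not line or line.startswith("#"):
            continue
        seeds.append(line)
    return seeds


# Hypothetical default seed list: the second seed is marked for exclusion.
example = """\
http://www.example.dk
#http://www.autosscout.dk
"""

print(active_seeds(example))  # prints ['http://www.example.dk']
```

If every seed in a domain's default seed list is marked this way, the sketch returns an empty list, which corresponds to the domain contributing nothing to the snapshot harvest.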
Kind regards,
Karen Høgsberg
-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of aponb at gmx.at
Sent: 24 April 2009 11:10
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Exclude Domain from Fullcrawl
I would like to know whether there is any way to exclude domains from
a full crawl, other than via the crawlertraps in the settings.xml and the
limits configuration for that domain (which can only be a workaround).
The thing is that if you have some selective crawls which contain seeds
not belonging to your national domain, then that domain will be created
in order to execute the selective crawls. When you start the first
full harvest, that domain will also be crawled, although the domain
doesn't belong to your range.
Another possibility would be to modify the seed that belongs to
the defaultconfig, so that only that seed is crawled during the
domain crawl. That seed could of course be the seed that was used
during the selective crawl. But this link could already be outdated and
wouldn't be crawled at all. It works, but it's a workaround.
I would just like to know what you think about this and how you are
solving this issue.
Thanks for your time
a.
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users