[Netarchivesuite-users] Exclude Domain from Fullcrawl

Karen Høgsberg kah at kb.dk
Fri Apr 24 12:08:32 CEST 2009


In the default seed-list that is used for snapshot harvesting, you comment out the seed by prefixing it with '#',
e.g. '#http://www.autosscout.dk'.

The domain will then be excluded from harvesting.
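
To illustrate the idea, here is a minimal Python sketch of how a seed list with '#'-prefixed entries could be filtered. It is not NetarchiveSuite code; the function name active_seeds and the second seed (http://www.example.dk) are hypothetical, and the assumption is simply that a line starting with '#' is skipped.

```python
def active_seeds(seed_list_text):
    """Return the seeds that are not commented out with a leading '#'."""
    seeds = []
    for line in seed_list_text.splitlines():
        line = line.strip()
        # Skip blank lines and seeds marked with '#' (assumed comment marker).
        if not line or line.startswith("#"):
            continue
        seeds.append(line)
    return seeds


# Hypothetical seed list: the first seed is excluded, the second is kept.
seed_list = """\
#http://www.autosscout.dk
http://www.example.dk
"""

print(active_seeds(seed_list))
```

If every seed in the list is commented out this way, nothing remains for the snapshot harvest to start from for that domain.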

Kind regards,
Karen Høgsberg

-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of aponb at gmx.at
Sent: 24 April 2009 11:10
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Exclude Domain from Fullcrawl

I would like to know if there is any possibility to exclude domains from 
a full crawl, except via the crawlertraps in the settings.xml and the 
limits configuration for that domain (which can only be a workaround).
The thing is that if you have some selective crawls which contain seeds 
not belonging to your national domain, then this domain will be created 
in order to execute the selective crawls. When you start the first 
full harvest, that domain will also be crawled, although the domain 
doesn't belong to your range.
Another possibility would be to modify the seed belonging to the 
defaultconfig, so that only that seed will be crawled during the 
domain crawl. That seed could of course be the seed that was used 
during the selective crawl. But this link could already be outdated and 
wouldn't be crawled at all. Works, but it's a workaround.

I would just like to know what you think about this and how you 
solve this issue.

Thanks for your time
a.

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users



