[Netarchivesuite-users] Exclude Domain from Fullcrawl

Fri Apr 24 12:50:07 CEST 2009

This is still only a workaround, though, because the seeds (and thus the domains) will still be included in jobs; Heritrix simply ignores seeds starting with '#'.

A better solution would be a true/false field in the database at domain level, which the GUI could toggle on the domain page. In addition, when you import seeds with the smart-import function at the bottom of the selective harvest definition page, you could decide whether new domains (automatically added to the system) should be included in snapshots or not. Today you have to go into each domain manually and apply this workaround.

best
Bjarne
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Karen Høgsberg [kah at kb.dk]
Sent: 24 April 2009 12:08
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] Exclude Domain from Fullcrawl

In the default seed list that is used for snapshot harvesting, you prefix the seed with '#',
e.g. '#http://www.autosscout.dk'.

The domain will then be excluded from harvesting.
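
The marking described above can be sketched in a few lines of Python. This is only an illustration working on the plain text of a seed list; the function names are hypothetical and not part of NetarchiveSuite or Heritrix, and the first seed URL is made up for the example.

```python
def comment_out_seeds(seed_list, excluded):
    """Return the seed list with every seed in `excluded` prefixed by '#'.

    Hypothetical helper: it only edits text, it does not talk to
    NetarchiveSuite. Seeds that are already commented out are left alone.
    """
    out = []
    for line in seed_list.splitlines():
        seed = line.strip()
        if seed and not seed.startswith("#") and seed in excluded:
            out.append("#" + seed)
        else:
            out.append(line)
    return "\n".join(out)


def active_seeds(seed_list):
    """Return the seeds that remain active, i.e. lines not starting with '#'."""
    seeds = []
    for line in seed_list.splitlines():
        seed = line.strip()
        if seed and not seed.startswith("#"):
            seeds.append(seed)
    return seeds


# Example seed list; www.example.dk is an assumed placeholder domain.
seeds = "http://www.example.dk\nhttp://www.autosscout.dk"
marked = comment_out_seeds(seeds, {"http://www.autosscout.dk"})
print(marked)
```

The resulting text would then be pasted back into the default seed list of the domain by hand.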

Kind regards,
Karen Høgsberg

-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of aponb at gmx.at
Sent: 24 April 2009 11:10
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Exclude Domain from Fullcrawl

I would like to know whether there is any way to exclude domains from
a full crawl, other than via the crawler traps in settings.xml and the
limits configuration for that domain (which can only be a workaround).
The thing is that if you have selective crawls containing seeds that do
not belong to your national domain, those domains will be created
in order to execute the selective crawls. When you start the first
full harvest, those domains will also be crawled, although they do not
belong to your range.
Another possibility would be to modify the seed belonging to
the defaultconfig, so that only that seed is crawled during the
domain crawl. That seed could, of course, be the one that was used
during the selective crawl. But this link could already be outdated and
would then not be crawled at all. It works, but it is a workaround.

I would just like to know what you think about this and how you are
solving this issue.

Thanks for your time
a.

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users
