[Netarchivesuite-users] Harvest a subdirectory of a domain

Bjarne Andersen netarkivet at statsbiblioteket.dk
Thu Apr 24 10:06:53 CEST 2008


Yes - there certainly is.

You need to make a new configuration for the domain (mydomain.org) that covers only that part.

(1) Add the domain to your system
(2) Add a new configuration (or change defaultconfig)
(3) Select another template for your new configuration (or the changed defaultconfig)
     - you need a path-scope template - the NetarchiveSuite distribution should come with at least two such templates (with 'path' in their names)
     - depending on whether you are using TRUNK from svn or the latest stable release, the templates may be named differently, since we are moving towards DecidingScope in our use of Heritrix.

The seedlist for your new configuration must contain seeds that have a path in them - including a trailing slash
  - www.mydomain.com/subdir - will allow the ENTIRE host www.mydomain.com
  - www.mydomain.com/subdir/ - will only allow the path /subdir/ (and its subdirectories) on the host www.mydomain.com
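
As a rough illustration of the trailing-slash rule above, here is a minimal Python sketch. It is not Heritrix or NetarchiveSuite code: the helper names scope_prefix and in_scope are made up for this example, and it assumes the scope is simply the seed's host plus the seed's path cut off at its last '/'.

```python
from urllib.parse import urlsplit


def scope_prefix(seed):
    """Return (host, path_prefix) for a seed (hypothetical helper).

    Assumption: the path is cut at its last '/', so a final segment
    without a trailing slash is treated like a file name and dropped.
    """
    # The seeds in this mail have no scheme; add one so urlsplit finds the host.
    if "://" not in seed:
        seed = "http://" + seed
    parts = urlsplit(seed)
    path = parts.path or "/"
    prefix = path[: path.rfind("/") + 1]
    return parts.hostname, prefix


def in_scope(url, seed):
    """True if url is on the seed's host and under the seed's path prefix."""
    host, prefix = scope_prefix(seed)
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return parts.hostname == host and (parts.path or "/").startswith(prefix)


# Without the trailing slash the prefix collapses to "/", i.e. the entire host.
print(scope_prefix("www.mydomain.com/subdir"))
# With the trailing slash only /subdir/ and everything below it is allowed.
print(scope_prefix("www.mydomain.com/subdir/"))
print(in_scope("http://www.mydomain.com/other/page.html", "www.mydomain.com/subdir/"))
```

Under these assumptions, the first seed accepts any page on www.mydomain.com, while the second rejects /other/page.html and accepts /subdir/a/b.html.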

good luck.

best
-- 
Bjarne Andersen
Daily Manager - netarchive.dk

State & University Library
Universitetsparken
DK-8000 Aarhus C
T: +45 89462165 - C: +45 25662353
CVR/SE 10100682 - EAN 5798000791084
http://netarchive.dk


Peter Moser wrote:
> Hi!
> 
> Is there a way to harvest only part of a domain? I would like to add, for example, the following link to the seedlist: www.mydomain.org/subdir
> so that only the subdir is fetched.
> If that is not possible with NetarchiveSuite, where can I start changing the application to do so?
> Can I do it if I use only the Heritrix application?
> 
> Thanks in advance for answering!
