[Netarchivesuite-users] Undesired URIs

Wed Feb 10 15:41:01 CET 2010

Dear all, 

as you know, we are currently running our domain harvest, and since Andreas is on holiday for a long period, you will have to bear with my questions :)

We have some URIs from two domains that we would like to exclude from harvesting. The first domain has already been completely crawled in a previous job; the second will likely be crawled in two or three weeks. My questions are:

- In order to avoid crawling some specific URIs within a given domain, is it good practice to add them to the crawler traps for that domain, even if they are not crawler traps in themselves? 
- Which runs will this action affect? I imagine that the crawler traps for a given domain stay in the database after a domain harvest has finished, so they should affect not only the current run but also subsequent ones.
- Is it possible to add crawler traps that influence the behaviour of a job that is already running? (From what I know of Heritrix it should be possible, but I'm not sure.) To what extent will the job be influenced? What happens if the job is crawling exactly that domain at that moment, and when does the crawler pick up updates to the crawler traps?
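
To make the first question concrete, here is a minimal sketch of the kind of exclusion I have in mind. I am assuming that crawler traps are regular expressions matched against the whole URI; the domain, the paths and the function name are hypothetical and not our real ones.

```python
import re

# Hypothetical crawler-trap expressions for an example domain.
# Assumption: each expression must match the entire URI to exclude it.
TRAPS = [
    r".*example\.at/calendar/.*",   # everything below /calendar/
    r".*example\.at/.*\?print=1",   # print views of any page
]

def is_excluded(uri, traps=TRAPS):
    """Return True if any trap expression matches the whole URI."""
    return any(re.fullmatch(pattern, uri) for pattern in traps)

if __name__ == "__main__":
    for uri in (
        "http://www.example.at/calendar/2010/02",
        "http://www.example.at/about.html",
    ):
        print(uri, "excluded" if is_excluded(uri) else "kept")
```

The URIs we want to skip are ordinary pages rather than real traps, so the expressions would be this specific and would not block the rest of the domain.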

Many thanks in advance, 

Eleonora

Eleonora Nicchiarelli Bettelli
Digital Preservation
Austrian National Library
Josefsplatz 1, 1015 Wien

Tel:  +43 1 53 410 686
Fax: +43 1 53 410 610
Web: http://www.onb.ac.at/
Mail: eleonora.nicchiarelli at onb.ac.at