[Netarchivesuite-users] Undesired URIs

Bjarne Andersen bja at statsbiblioteket.dk
Thu Feb 11 08:16:13 CET 2010

See my answers below

Bjarne Andersen
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Nicchiarelli Eleonora [eleonora.nicchiarelli at onb.ac.at]
Sent: 10 February 2010 15:41
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Undesired URIs

Dear all,

As you know, we are currently running our domain harvest, and given that Andreas is on holiday for a long period, you will have to bear with my questions :)

We have some URIs from two domains that we would not like to harvest. The first domain has already been completely crawled in a previous job, the second will likely be crawled in two or three weeks. My questions are:

- In order to avoid crawling some specific URIs within a given domain, is it good practice to add them to the crawler traps for that domain, even if they are not crawler traps in themselves?

** That's what we currently do - if there are URIs that we never want to harvest, we treat them as crawler traps (or spam, or just unwanted material)

- Which runs will this action affect? I imagine that the crawler traps for a given domain stay in the database after a domain harvest has terminated, so they should affect not only the current run but also subsequent ones;

** Adding regular expressions at the domain level will influence all runs with all configurations on that domain. A nice feature we have discussed would be the ability to add similar regexps at the configuration level, to build special configurations that avoid specific things
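The mechanism described above - a per-domain list of regular expressions against which every discovered URI is checked - can be sketched as follows. This is an illustration only; the trap expressions and the helper function are invented for the example, not taken from NetarchiveSuite:

```python
import re

# Hypothetical crawler-trap regexps registered for one domain.
# NetarchiveSuite passes such expressions to Heritrix, which then
# rejects any discovered URI that matches one of them.
crawler_traps = [
    r".*/calendar/.*",      # e.g. an endless calendar generating infinite pages
    r".*\?sessionid=.*",    # e.g. session-id variants of the same page
]

def is_trapped(uri, traps):
    """Return True if the URI matches any crawler-trap expression."""
    return any(re.match(t, uri) for t in traps)

# A URI under /calendar/ is rejected; an ordinary page is kept.
print(is_trapped("http://example.org/calendar/2010/02/", crawler_traps))  # True
print(is_trapped("http://example.org/about.html", crawler_traps))         # False
```

Because the traps are stored at the domain level, every configuration and every future job on that domain applies the same list.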

- Is it possible to add crawler traps that influence the behaviour of a job already running? (From what I know of Heritrix it should be possible, but I'm not sure.) To what extent will the job be influenced? What happens if the job was crawling exactly that domain - or rather, when does the crawler get updates on crawler traps?

** That is possible while Heritrix is running, from within the Heritrix GUI. You have to catch the running job yourself, which is quite hard, especially for domain harvesting, since jobs may be started e.g. during the night. So we only use this feature if we notice crawler traps during a running job.

Changes made directly in the Heritrix GUI do not influence anything in NetarchiveSuite, so if you add a filter directly in Heritrix and want that filter to be permanent, you have to add the same filter in NetarchiveSuite as well. The crawler does not get any updates from NetarchiveSuite once jobs are scheduled: when a domain harvest is scheduled, all jobs are created with the filters currently in NetarchiveSuite, so you can't change anything for an already started harvest except directly in the Heritrix GUI (with the difficulties described above).

We have discussed another nice feature request that would allow global crawler traps, applied to all harvests at all times - these kinds of filters we currently define directly in the harvester templates. If they could be defined in the NetarchiveSuite interface, they might even be loaded dynamically by the harvesters, allowing operators to add filters to scheduled jobs that have not yet been started.

What we currently do in Denmark, if we notice severe problems with a domain (or several domains from e.g. the same hosting company), is to have our network administrators block certain IP numbers until the domain crawl is done. That of course means that we won't harvest anything from those domains, but that is most likely better than offending them. I think we have used this workaround two or three times during the last three years.
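For the globally defined filters mentioned above, the harvester templates are Heritrix 1.x crawl-order XML. A trap list typically appears as a reject rule inside the deciding scope, roughly like the fragment below (the class and attribute names are from Heritrix 1.x; the rule name and regexps here are illustrative, not copied from an actual NetarchiveSuite template):

```xml
<!-- Sketch of a reject rule inside a Heritrix 1.x deciding scope.
     URIs matching any listed regexp are rejected by the scope. -->
<newObject name="rejectTraps"
           class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="list-logic">OR</string>
  <stringList name="regexp-list">
    <string>.*/calendar/.*</string>
    <string>.*\?sessionid=.*</string>
  </stringList>
</newObject>
```

Editing such a rule in a running job via the Heritrix GUI affects only that job; to make the change permanent it must also be added in NetarchiveSuite, as described above.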

Many thanks in advance,


Eleonora Nicchiarelli Bettelli
Digital Preservation
Austrian National Library
Josefsplatz 1, 1015 Wien

Tel:  +43 1 53 410 686
Fax: +43 1 53 410 610
Web: http://www.onb.ac.at/
Mail: eleonora.nicchiarelli at onb.ac.at

NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
