[Netarchivesuite-users] help on crawling behind proxy

Bjarne Andersen bja at statsbiblioteket.dk
Tue Apr 19 18:23:27 CEST 2011

Hi Ruben.

It should be perfectly possible to crawl through a proxy with the current NetarchiveSuite 3.14 and Heritrix 1.14.4.
The Heritrix module FetchHTTP is not new in Heritrix 2.0 at all - it's the most crucial fetcher module of Heritrix, since most of the fetches of an internet crawler are HTTP.

What you have to do is make some minor edits to the harvest templates in NetarchiveSuite.

This is done by downloading them to your local PC from the menu (Definitions -> Edit Harvest Templates) - select a template and choose "save to disk" in the select box.

Then edit the harvest template (it's plain XML), changing the 2 values in the FetchHTTP configuration
<string name="http-proxy-host"/>
<string name="http-proxy-port"/>
to e.g.
<string name="http-proxy-host">proxyhostname</string>
<string name="http-proxy-port">80</string>

After you have edited the XML file you must upload it back to NetarchiveSuite again. Make sure to overwrite the right template when you upload.

If all your crawls need to go through the proxy you will need to do this trick with all your templates. I don't remember the number of different templates in a fresh install, but it shouldn't be that many.
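
If you have several templates saved to disk, the same edit could be scripted instead of done by hand. The following is only a minimal sketch in Python, not part of NetarchiveSuite: the function names and the directory layout are my own assumptions, and it assumes the two settings appear in the template exactly as shown above (either empty or already filled in) and that the files are UTF-8. It changes only those two elements and leaves the rest of the XML untouched.

```python
import re
from pathlib import Path
from xml.sax.saxutils import escape


def set_proxy(template_xml, host, port):
    """Return the template text with the two FetchHTTP proxy values filled in."""
    for name, value in (("http-proxy-host", host), ("http-proxy-port", str(port))):
        # Matches the empty form <string name="..."/> as well as an
        # already filled <string name="...">old value</string>.
        pattern = re.compile(
            r'<string\s+name="' + re.escape(name) + r'"\s*(?:/>|>[^<]*</string>)'
        )
        replacement = '<string name="%s">%s</string>' % (name, escape(value))
        # A function is used as the replacement so the value is inserted literally.
        template_xml, count = pattern.subn(lambda m: replacement, template_xml)
        if count == 0:
            raise ValueError("setting %r not found in template" % name)
    return template_xml


def update_templates(directory, host, port):
    """Rewrite every *.xml file in directory (hypothetical layout: one template per file)."""
    updated = []
    for path in sorted(Path(directory).glob("*.xml")):
        text = path.read_text(encoding="utf-8")  # assumption: templates are UTF-8
        path.write_text(set_proxy(text, host, port), encoding="utf-8")
        updated.append(path.name)
    return updated
```

For example, update_templates("templates", "proxyhostname", 80) would process every template saved in a local directory called "templates" (a made-up name). The templates still have to be uploaded back through the GUI afterwards, one by one.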

best and good luck
Bjarne Andersen

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Ruben [rtmoran at gmail.com]
Sent: 19 April 2011 16:06
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] help on crawling behind proxy

Hi there,

I'm testing Netarchive Suite on a network behind a proxy (seems to be
mandatory here to stay behind the proxy).

I see NetArchive 3.14 uses Heritrix.1.14.4.jar, but for crawling
behind a proxy I found there is a Heritrix module since version 2.0.

Question is:

How can I crawl behind a proxy with NetArchive ?

Can version 2.0 of Heritrix be used with NetarchiveSuite-3.14.0 to
make it through a proxy?

Is there any way of telling Heritrix 1.14.4 to use an HTTP proxy? (I
already tried system-wide/environment/Java proxy settings, no luck.)

Thanks in advance.


Ruben Tato
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
