[Netarchivesuite-users] help on crawling behind proxy
bja at statsbiblioteket.dk
Tue Apr 19 18:23:27 CEST 2011
It should be perfectly possible to crawl through a proxy with the current NetarchiveSuite 3.14 and Heritrix 1.14.4.
The Heritrix module FetchHTTP is not new in Heritrix 2.0 at all. It is the most crucial fetcher module in Heritrix, since most of an internet crawler's fetches are HTTP.
What you have to do is make some minor edits to the harvest templates in NetarchiveSuite.
This is done by downloading them to your local PC from the menu (Definitions -> Edit Harvest Templates): select a template and choose "save to disk" in the select box.
Then edit the harvest template (it's plain XML), in particular the 2 proxy values in the FetchHTTP configuration.
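For illustration, here is a rough sketch of that edit done with a small Python script instead of by hand. It assumes the two values are the http-proxy-host and http-proxy-port settings of FetchHTTP as they appear in Heritrix 1.x order files. The XML fragment is a trimmed-down, hypothetical stand-in for a real template, and the function name set_proxy and the proxy address are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a harvest template (order.xml).
# A real template contains many more settings around the FetchHTTP element.
SAMPLE_TEMPLATE = """<crawl-order>
  <controller>
    <map name="fetch-processors">
      <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP">
        <string name="http-proxy-host"></string>
        <string name="http-proxy-port"></string>
      </newObject>
    </map>
  </controller>
</crawl-order>"""


def set_proxy(template_xml: str, host: str, port: int) -> str:
    """Return the template with the FetchHTTP proxy host and port filled in."""
    root = ET.fromstring(template_xml)

    # Find every FetchHTTP processor by its class attribute.
    fetchers = [
        el for el in root.iter("newObject")
        if el.get("class", "").endswith("FetchHTTP")
    ]
    if not fetchers:
        raise ValueError("no FetchHTTP configuration found in template")

    values = {"http-proxy-host": host, "http-proxy-port": str(port)}
    for fetcher in fetchers:
        for name, value in values.items():
            node = fetcher.find(f"string[@name='{name}']")
            if node is None:
                # Assumption: a missing setting can simply be appended.
                node = ET.SubElement(fetcher, "string", {"name": name})
            node.text = value

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # proxy.example.org:3128 is a placeholder, not a real proxy.
    print(set_proxy(SAMPLE_TEMPLATE, "proxy.example.org", 3128))
```

Note that ElementTree drops XML comments and may rename namespace prefixes when it rewrites a file, so for just a few templates editing the two values by hand in a text editor is just as good.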
After you have edited the XML file, you must upload it back to NetarchiveSuite. Make sure you replace the right template when you upload.
If all your crawls need to go through the proxy, you will need to do this trick with all your templates. I don't remember the number of different templates in a fresh install, but it shouldn't be that many.
Best regards, and good luck
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Ruben [rtmoran at gmail.com]
Sent: 19 April 2011 16:06
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] help on crawling behind proxy
I'm testing Netarchive Suite on a network behind a proxy (seems to be
mandatory here to stay behind the proxy).
I see NetArchive 3.14 uses Heritrix.1.14.4.jar, but for crawling
behind a proxy I found there is a Heritrix module since version 2.0.
How can I crawl behind a proxy with NetArchive?
Can version 2.0 of Heritrix be used with NetarchiveSuite-3.14.0 to
crawl through a proxy?
Is there any way of telling Heritrix 1.14.4 to use an HTTP proxy? (I
already tried system-wide/environment/Java proxy settings, no luck.)
Thanks in advance.