[Netarchivesuite-users] help on crawling behind proxy

Ruben rtmoran at gmail.com
Tue Apr 19 18:49:37 CEST 2011


Many thanks!

I'm sorry, I confused http-proxy-host with FetchHTTP (of course FetchHTTP
has been included from the beginning :))

https://webarchive.jira.com/wiki/display/Heritrix/FetchHTTP+http-proxy-host
(saw it here)

Many thanks for your help, I'll try that!

Cheers!


2011/4/19 Bjarne Andersen <bja at statsbiblioteket.dk>:
> Hi Ruben.
>
> It should be perfectly possible to crawl through a proxy with the current NetarchiveSuite 3.14 and Heritrix 1.14.4.
> The Heritrix module FetchHTTP is not new to Heritrix 2.0 at all - it's the most crucial fetcher module of Heritrix, since most of the fetches of an internet crawler are HTTP.
>
> What you have to do is make some minor edits to the harvest templates in NetarchiveSuite.
>
> This is done by downloading them to your local PC from the menu (Definitions -> Edit Harvest Templates): select a template and choose "save to disk" in the select box.
>
> Then edit the harvest template (it's plain XML), in particular the two values in the FetchHTTP configuration
> <string name="http-proxy-host"/>
> <string name="http-proxy-port"/>
> to e.g.
> <string name="http-proxy-host">proxyhostname</string>
> <string name="http-proxy-port">80</string>
>
> After you have edited the XML file you must upload it back to NetarchiveSuite again. Make sure to overwrite the right template when you upload.
>
> If all your crawls need to go through the proxy you will need to do this trick with all your templates. I don't remember the number of different templates in a fresh install, but it shouldn't be that many.
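
When several templates need the same change, the edit described above could be scripted once the templates have been saved to disk. The following is only a minimal sketch in Python, not part of NetarchiveSuite or Heritrix: the function name set_proxy is hypothetical, and it assumes the template contains the two elements exactly as quoted above (<string name="http-proxy-host"/> and <string name="http-proxy-port"/>) with no XML namespace on the element names. Note that ElementTree does not preserve comments or the XML declaration, so compare the result with the original before uploading.

```python
import xml.etree.ElementTree as ET


def set_proxy(template_xml: str, host: str, port: int) -> str:
    """Return template_xml with the two proxy values filled in.

    Hypothetical helper: it looks for <string name="http-proxy-host"> and
    <string name="http-proxy-port"> anywhere in the document and sets
    their text, leaving every other element untouched.
    """
    root = ET.fromstring(template_xml)
    values = {"http-proxy-host": host, "http-proxy-port": str(port)}
    found = set()
    for elem in root.iter("string"):
        name = elem.get("name")
        if name in values:
            elem.text = values[name]
            found.add(name)
    # Fail loudly instead of returning a template that still has no proxy.
    missing = set(values) - found
    if missing:
        raise ValueError("template lacks: " + ", ".join(sorted(missing)))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A tiny stand-in document, not a real harvest template.
    sample = (
        "<example>"
        '<string name="http-proxy-host"/>'
        '<string name="http-proxy-port"/>'
        "</example>"
    )
    print(set_proxy(sample, "proxyhostname", 80))
```

To use it on real templates, read each downloaded file into a string, pass it through set_proxy, write the result to a new file, and then upload that file over the matching template as described above.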
>
> best and good luck
> Bjarne Andersen
> Netarchive.dk
>
> ________________________________________
> From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Ruben [rtmoran at gmail.com]
> Sent: 19 April 2011 16:06
> To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
> Subject: [Netarchivesuite-users] help on crawling behind proxy
>
> Hi there,
>
> I'm testing NetarchiveSuite on a network behind a proxy (it seems to be
> mandatory here to stay behind the proxy).
>
> I see NetarchiveSuite 3.14 uses Heritrix.1.14.4.jar, but for crawling
> behind a proxy I found there is a Heritrix module (FetchHTTP) that has
> existed since version 2.0.
>
> Question is:
>
> How can I crawl behind a proxy with NetarchiveSuite?
>
> Can version 2.0 of Heritrix be used with NetarchiveSuite-3.14.0 to
> make it through a proxy?
>
> Is there any way of telling Heritrix 1.14.4 to use an HTTP proxy? (I
> already tried system-wide/environment/Java proxy settings, no luck.)
>
>
> Thanks in advance.
>
>
> Cheers!
>
>
> --
> Ruben Tato
> --
> http://bentamor.wordpress.com
> http://outcampaign.org/
> http://mundodetraca.blogspot.com
> _______________________________________________
> NetarchiveSuite-users mailing list
> NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
> https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users
>



-- 
Ruben Tato
--
http://bentamor.wordpress.com
http://outcampaign.org/
http://mundodetraca.blogspot.com



