[Netarchivesuite-users] Getting round 500 response and NullReferenceException

Cunnea, Paul p.cunnea at nls.uk
Thu Nov 20 19:47:19 CET 2008


Thanks for this Kaare - nice and simple solution and works like a dream!

Is there any reason why we shouldn't retain this as part of our default template?

Cheers, and thanks for your help,
Paul

-----Original Message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On Behalf Of Kaare Fiedler Christiansen
Sent: 02 November 2008 15:10
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] Getting round 500 response and NullReferenceException

Hi,

This is really a bug on the Scottish Government website, but I seem to
have nailed it and found a workaround.

Heritrix sends out only a very basic set of HTTP headers when doing an
HTTP request. The headers sent include

  User-Agent: Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 +http://somesite.example.com)
  From: someemail at example.com
  Connection: close
  Host: www.scotland.gov.uk

This is not enough for the website at www.scotland.gov.uk.
As you saw, it returns an error code of 500, and the harvested page
contains a null-pointer exception from the code running on the server.

So I investigated what other headers might be necessary. I noticed that
"wget" _was_ able to harvest the site, and the only extra header sent by
wget was
  Accept: */*

So I edited the Heritrix order.xml template, to set

        <stringList name="accept-headers">
          <string>Accept: */*</string>
        </stringList>

in the FetchHTTP object, and after that it seems the site harvests fine!
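
The comparison described above (same request, with and without the
Accept header) could be reproduced outside Heritrix with a short script.
This is only a minimal sketch, not part of the original investigation:
the function names build_request and fetch_status are hypothetical, the
From address is a placeholder, and the site may no longer behave as it
did in 2008. It assumes Python's urllib does not add an Accept header of
its own, so the only difference between the two requests is the one
under test.

```python
import urllib.error
import urllib.request

# Header values copied from the list above; the address is a placeholder.
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 "
                  "+http://somesite.example.com)",
    "From": "someemail@example.com",
    "Connection": "close",
}


def build_request(url, send_accept):
    """Build a request with the basic headers, optionally adding Accept."""
    headers = dict(BASE_HEADERS)
    if send_accept:
        headers["Accept"] = "*/*"
    return urllib.request.Request(url, headers=headers)


def fetch_status(url, send_accept):
    """Return the HTTP status code for the URL, including error codes."""
    request = build_request(url, send_accept)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is what we want to compare.
        return err.code
```

Calling fetch_status("http://www.scotland.gov.uk/Home", False) and then
the same with True would show whether the missing header alone accounts
for the 500 response.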

I hope this will help you.

Best,
  Kåre Fiedler Christiansen
  NetarchiveSuite developer

On Fri, 2008-10-31 at 15:07 +0100, Cunnea, Paul wrote:
> Hi,
> 
>  
> 
> This may be one for the Heritrix list, but I thought I would try here
> first as we are using Netarchive (with Heritrix 1.12) - we're still
> essentially novices at using Netarchive and Heritrix here at the
> National Library of Scotland.
> 
>  
> 
> We are getting 500 internal error responses when attempting to archive
> a site (http://www.scotland.gov.uk/) - it seems to get the robots.txt,
> a redirect, then nothing else.  Same result with additional seeds.
> Initial crawl ignored robots.txt, but we get the same result when
> using classic.
> 
>  
> 
> We have replicated the problem using standalone Heritrix 1.14, but are
> able to archive the site using an alternative crawler. We're assuming
> the problem lies with the website and how Heritrix is fetching
> content, but would like to know if there is anything we can do via the
> harvest template settings before contacting the website owner.
> 
>  
> 
> Excerpt from crawl log:
> 
> 
> 
> metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 40
> [rescode] [#urls]
> 1 1
> 200 1
> 302 1
> 500 1
>  
> metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 106
> [code] [status] [seed] [redirect]
> 302 CRAWLED http://www.scotland.gov.uk/ http://www.scotland.gov.uk/Home
>  
> metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 761
> 2008-10-29T14:35:05.238Z     1         60 dns:www.scotland.gov.uk P http://www.scotland.gov.uk/ text/dns #001 20081029143504818+82 sha1:YW3TTZVRWR66P5FJGU3M6H6RTC73JCPA - content-size:60
> 2008-10-29T14:35:05.682Z   200        214 http://www.scotland.gov.uk/robots.txt P http://www.scotland.gov.uk/ text/plain #003 20081029143505562+115 sha1:EXHMPB3HYORL26TZO5SEWZCFJMOLOGHE - content-size:579
> 2008-10-29T14:35:06.084Z   302        122 http://www.scotland.gov.uk/ - - text/html #001 20081029143505992+73 sha1:LO334SHJODRDP46VXYE6E66HX4TGCHNN - content-size:476,3t
> 2008-10-29T14:35:06.522Z   500       4602 http://www.scotland.gov.uk/Home R http://www.scotland.gov.uk/ text/html #003 20081029143506393+119 sha1:T2SCFPKKFQPTMRPWNQVI7QLJLY6V3KYO - content-size:4956
>  
> metadata://netarkivet.dk/crawl/logs/local-errors.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143503 text/plain 0
>  
> metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 472
> 20081029143504 CRAWL RESUMED - Running
>            timestamp  discovered      queued   downloaded       doc/s(avg)  KB/s(avg)   dl-failures   busy-thread   mem-use-KB  heap-size-KB   congestion   max-depth   avg-depth
> 20081029143506 CRAWL ENDING - Finished
> 2008-10-29T14:35:06Z           4           0            4             4(4)       5(5)             0             0        20045         33792            1           0           0
> 20081029143506 CRAWL ENDED - Finished
>  
> 
>  
> 
> When viewing via the proxy viewer it comes up with unhandled exception
> error - 
> 
>  
> 
> Exception Details: System.NullReferenceException: Object reference not
> set to an instance of an object.
> 
>  
> 
> The stack trace is:
> 
>  
> 
> [NullReferenceException: Object reference not set to an instance of an object.]
>    ScottishExecutive.PageCache.ServePage(String pgAlias) +272
>    ASP.global_asax.Application_ResolveRequestCache(Object sender, EventArgs e) +181
>    System.Web.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +92
>    System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +64
> 
> 
> 
>  
> 
> Thanks for any advice,
> 
> Paul
> 
>  
> 
>  
> 
> Paul Cunnea
> 
> Digital Collections Manager
> 
> National Library of Scotland
> 
> t: +44-131-623-4671  e: p.cunnea at nls.uk
> 
> 
>  
> 
> 
> 
> ***********************************************************************
> Visit the National Library of Scotland online at www.nls.uk
> 
> CELEBRATING 500 YEARS OF SCOTTISH PRINTING 1508-2008
> http://www.500yearsofprinting.org
> ***********************************************************************
> Please consider the environment before printing this e-mail.
> 
> This communication is intended for the addressee(s) only. If you
> are not the intended recipient, please notify the ICT Helpdesk on
> +44 131 623 3700 or ict at nls.uk and delete this e-mail. The
> statements and opinions expressed in this message are those of the
> author and do not necessarily reflect those of the National Library of
> Scotland. The National Library of Scotland is a registered Scottish
> charity. Scottish Charity No. SC011086. This message is subject to the
> Data Protection Act 1998 
> and Freedom of Information (Scotland) Act 2002 and has been 
> scanned by MessageLabs.
> ***********************************************************************

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users
