[Netarchivesuite-users] Getting round 500 response and NullReferenceException

Kaare Fiedler Christiansen kfc at statsbiblioteket.dk
Sun Nov 2 16:10:05 CET 2008


Hi,

This is really a bug on the Scottish Government website, but I seem to
have nailed it and found a workaround.

Heritrix sends out only a very basic set of HTTP headers when doing an
HTTP request. The headers sent include

  User-Agent: Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 +http://somesite.example.com)
  From: someemail at example.com
  Connection: close
  Host: www.scotland.gov.uk

This is not enough for the website at www.scotland.gov.uk.
As you saw, it responds with status code 500, and the harvested page
contains a NullReferenceException (the .NET counterpart of a null
pointer exception) thrown by the code running on the server.

So I investigated what other headers might be necessary. I noticed that
"wget" _was_ able to harvest the site, and the only extra header sent by
wget was
  Accept: */*
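
To check this kind of header sensitivity outside the crawler, a small
script can build the two requests side by side. This is only a sketch:
the helper names build_request and fetch_status are made up for
illustration, the header values are the placeholders quoted above, and
Python's urllib adds a few headers of its own when sending, so it does
not reproduce what Heritrix sends byte for byte.

```python
import urllib.error
import urllib.request

# Headers similar to the ones Heritrix sent (placeholder values from this mail).
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 "
                  "+http://somesite.example.com)",
    "From": "someemail@example.com",
    "Connection": "close",
}


def build_request(url, with_accept):
    """Build a GET request with the basic headers, optionally adding Accept: */*."""
    req = urllib.request.Request(url)
    for name, value in BASE_HEADERS.items():
        req.add_header(name, value)
    if with_accept:
        req.add_header("Accept", "*/*")
    return req


def fetch_status(req):
    """Send the request and return the HTTP status code (needs network access)."""
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Calling fetch_status on build_request(url, False) and on
build_request(url, True) would show whether a server answers the two
differently. I have left out expected status codes, since they depend
on the server at the time of the request.
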

So I edited the Heritrix order.xml template to set

        <stringList name="accept-headers">
          <string>Accept: */*</string>
        </stringList>

in the FetchHTTP object, and after that it seems the site harvests fine!
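
If several templates need the same change, the edit could be scripted.
The following is a minimal sketch, not part of NetarchiveSuite: the
function add_accept_header is hypothetical, and SAMPLE is a cut-down
stand-in for the FetchHTTP part of an order.xml (a real template holds
many more settings and should be backed up before being rewritten).

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the FetchHTTP section of an order.xml (assumption:
# the real template nests this deeper and has many more settings).
SAMPLE = """<newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP">
  <stringList name="accept-headers">
  </stringList>
</newObject>"""


def add_accept_header(root, header="Accept: */*"):
    """Add `header` to every accept-headers stringList under root.

    Lists that already contain the header are left alone.
    Returns the number of lists that were changed.
    """
    changed = 0
    for lst in root.iter("stringList"):
        if lst.get("name") != "accept-headers":
            continue
        existing = [s.text for s in lst.findall("string")]
        if header not in existing:
            ET.SubElement(lst, "string").text = header
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
add_accept_header(root)
print(ET.tostring(root, encoding="unicode"))
```

For a file on disk, ET.parse(path).getroot() would take the place of
ET.fromstring(SAMPLE), followed by writing the tree back out.
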

I hope this will help you.

Best,
  Kåre Fiedler Christiansen
  NetarchiveSuite developer

On Fri, 2008-10-31 at 15:07 +0100, Cunnea, Paul wrote:
> Hi,
> 
>  
> 
> This may be one for the Heritrix list, but I thought I would try here
> first as we are using Netarchive (with Heritrix 1.12) - we’re still
> essentially novices at using Netarchive and Heritrix here at the
> National Library of Scotland.
> 
>  
> 
> We are getting 500 internal error responses when attempting to archive
> a site (http://www.scotland.gov.uk/) – it seems to get the robots.txt,
> a redirect, then nothing else.  Same result with additional seeds.
> Initial crawl ignored robots.txt, but we get the same result when
> using classic.
> 
>  
> 
> We have replicated the problem using standalone Heritrix 1.14, but are
> able to archive the site using an alternative crawler. We’re assuming
> the problem lies with the website and how Heritrix is fetching
> content, but would like to know if there is anything we can do via the
> harvest template settings before contacting the website owner.
> 
>  
> 
> Excerpt from crawl log:
> 
> 
> 
> metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 40
> [rescode] [#urls]
> 1 1
> 200 1
> 302 1
> 500 1
>  
> metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 106
> [code] [status] [seed] [redirect]
> 302 CRAWLED http://www.scotland.gov.uk/ http://www.scotland.gov.uk/Home
>  
> metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 761
> 2008-10-29T14:35:05.238Z     1         60 dns:www.scotland.gov.uk P http://www.scotland.gov.uk/ text/dns #001 20081029143504818+82 sha1:YW3TTZVRWR66P5FJGU3M6H6RTC73JCPA - content-size:60
> 2008-10-29T14:35:05.682Z   200        214 http://www.scotland.gov.uk/robots.txt P http://www.scotland.gov.uk/ text/plain #003 20081029143505562+115 sha1:EXHMPB3HYORL26TZO5SEWZCFJMOLOGHE - content-size:579
> 2008-10-29T14:35:06.084Z   302        122 http://www.scotland.gov.uk/ - - text/html #001 20081029143505992+73 sha1:LO334SHJODRDP46VXYE6E66HX4TGCHNN - content-size:476,3t
> 2008-10-29T14:35:06.522Z   500       4602 http://www.scotland.gov.uk/Home R http://www.scotland.gov.uk/ text/html #003 20081029143506393+119 sha1:T2SCFPKKFQPTMRPWNQVI7QLJLY6V3KYO - content-size:4956
>  
> metadata://netarkivet.dk/crawl/logs/local-errors.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143503 text/plain 0
>  
> metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 472
> 20081029143504 CRAWL RESUMED - Running
>            timestamp  discovered      queued   downloaded       doc/s(avg)  KB/s(avg)   dl-failures   busy-thread   mem-use-KB  heap-size-KB   congestion   max-depth   avg-depth
> 20081029143506 CRAWL ENDING - Finished
> 2008-10-29T14:35:06Z           4           0            4             4(4)       5(5)             0             0        20045         33792            1           0           0
> 20081029143506 CRAWL ENDED - Finished
>  
> 
>  
> 
> When viewing via the proxy viewer it comes up with unhandled exception
> error – 
> 
>  
> 
> Exception Details: System.NullReferenceException: Object reference not
> set to an instance of an object.
> 
>  
> 
> The stack trace is:
> 
>  
> 
> [NullReferenceException: Object reference not set to an instance of an object.]
>    ScottishExecutive.PageCache.ServePage(String pgAlias) +272
>    ASP.global_asax.Application_ResolveRequestCache(Object sender, EventArgs e) +181
>    System.Web.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +92
>    System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +64
> 
> 
> 
>  
> 
> Thanks for any advice,
> 
> Paul
> 
>  
> 
>  
> 
> Paul Cunnea
> 
> Digital Collections Manager
> 
> National Library of Scotland
> 
> t: +44-131-623-4671  e: p.cunnea at nls.uk



