[Netarchivesuite-users] Getting round 500 response and NullReferenceException
Kaare Fiedler Christiansen
kfc at statsbiblioteket.dk
Sun Nov 2 16:10:05 CET 2008
Hi,
This is really a bug on the Scottish Government website, but I seem to
have nailed it and found a workaround.
Heritrix sends only a very basic set of HTTP headers with each HTTP
request. The headers sent include:
  User-Agent: Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 +http://somesite.example.com)
  From: someemail at example.com
  Connection: close
  Host: www.scotland.gov.uk
That is not enough for the web server at www.scotland.gov.uk.
As you saw, it responds with status code 500, and the harvested page
contains a NullReferenceException thrown by the code running on the server.
So I investigated what other headers might be necessary. I noticed that
"wget" _was_ able to harvest the site, and the only extra header sent by
wget was
Accept: */*
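For anyone who wants to repeat the comparison on another site, here is a minimal Python sketch of the header diff. The Heritrix headers are the ones listed above; the wget User-Agent value is only a placeholder (I am assuming nothing about the exact string wget sends), and the function names are made up for illustration.

```python
def parse_headers(raw: str) -> dict[str, str]:
    """Parse 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split at the first colon only, so URLs inside values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def extra_headers(working: str, failing: str) -> dict[str, str]:
    """Return headers sent by the working client but not by the failing one."""
    ok, bad = parse_headers(working), parse_headers(failing)
    return {name: value for name, value in ok.items() if name not in bad}


# Headers Heritrix sent, as listed in this mail.
HERITRIX = """
User-Agent: Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 +http://somesite.example.com)
From: someemail at example.com
Connection: close
Host: www.scotland.gov.uk
"""

# Headers from the working client. The User-Agent value is a placeholder.
WGET = """
User-Agent: wget-placeholder
Accept: */*
Host: www.scotland.gov.uk
Connection: close
"""

print(extra_headers(WGET, HERITRIX))
```

Header names are compared case-insensitively, since HTTP header names are not case-sensitive.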
So I edited the Heritrix order.xml template to set
  <stringList name="accept-headers">
    <string>Accept: */*</string>
  </stringList>
in the FetchHTTP object, and after that the site seems to harvest fine!
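If you want to check a site outside Heritrix before changing a template, something like the following Python sketch may help. It is only an approximation: urllib adds a few headers of its own (for example Accept-Encoding), so the request is not identical to what Heritrix sends, and the site may of course behave differently by the time you read this. The User-Agent and From values are the example values from this mail, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

URL = "http://www.scotland.gov.uk/Home"


def build_request(url: str, send_accept: bool) -> urllib.request.Request:
    """Build a request with Heritrix-like headers, optionally adding Accept."""
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127 "
                      "+http://somesite.example.com)",
        "From": "someemail@example.com",  # placeholder address
    }
    if send_accept:
        headers["Accept"] = "*/*"
    return urllib.request.Request(url, headers=headers)


def status_of(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Calling status_of(build_request(URL, False)) and status_of(build_request(URL, True)) and comparing the two results shows whether the Accept header makes the difference.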
I hope this will help you.
Best,
Kåre Fiedler Christiansen
NetarchiveSuite developer
On Fri, 2008-10-31 at 15:07 +0100, Cunnea, Paul wrote:
> Hi,
>
> This may be one for the Heritrix list, but I thought I would try here
> first as we are using Netarchive (with Heritrix 1.12) - we’re still
> essentially novices at using Netarchive and Heritrix here at the
> National Library of Scotland.
>
> We are getting 500 internal error responses when attempting to archive
> a site (http://www.scotland.gov.uk/): it seems to get the robots.txt,
> a redirect, then nothing else. We get the same result with additional
> seeds. The initial crawl ignored robots.txt, but we get the same result
> when using the classic robots.txt policy.
>
> We have replicated the problem using standalone Heritrix 1.14, but are
> able to archive the site using an alternative crawler. We’re assuming
> the problem lies with the website and how Heritrix is fetching
> content, but would like to know if there is anything we can do via the
> harvest template settings before contacting the website owner.
>
> Excerpt from crawl log:
>
> metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 40
> [rescode] [#urls]
> 1 1
> 200 1
> 302 1
> 500 1
>
> metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 106
> [code] [status] [seed] [redirect]
> 302 CRAWLED http://www.scotland.gov.uk/ http://www.scotland.gov.uk/Home
>
> metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 761
> 2008-10-29T14:35:05.238Z 1 60 dns:www.scotland.gov.uk P http://www.scotland.gov.uk/ text/dns #001 20081029143504818+82 sha1:YW3TTZVRWR66P5FJGU3M6H6RTC73JCPA - content-size:60
> 2008-10-29T14:35:05.682Z 200 214 http://www.scotland.gov.uk/robots.txt P http://www.scotland.gov.uk/ text/plain #003 20081029143505562+115 sha1:EXHMPB3HYORL26TZO5SEWZCFJMOLOGHE - content-size:579
> 2008-10-29T14:35:06.084Z 302 122 http://www.scotland.gov.uk/ - - text/html #001 20081029143505992+73 sha1:LO334SHJODRDP46VXYE6E66HX4TGCHNN - content-size:476,3t
> 2008-10-29T14:35:06.522Z 500 4602 http://www.scotland.gov.uk/Home R http://www.scotland.gov.uk/ text/html #003 20081029143506393+119 sha1:T2SCFPKKFQPTMRPWNQVI7QLJLY6V3KYO - content-size:4956
>
> metadata://netarkivet.dk/crawl/logs/local-errors.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143503 text/plain 0
>
> metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=1.12.1b&harvestid=63&jobid=286 127.0.0.1 20081029143506 text/plain 472
> 20081029143504 CRAWL RESUMED - Running
> timestamp discovered queued downloaded doc/s(avg) KB/s(avg) dl-failures busy-thread mem-use-KB heap-size-KB congestion max-depth avg-depth
> 20081029143506 CRAWL ENDING - Finished
> 2008-10-29T14:35:06Z 4 0 4 4(4) 5(5) 0 0 20045 33792 1 0 0
> 20081029143506 CRAWL ENDED - Finished
>
> When viewing via the proxy viewer, it shows an unhandled exception
> error:
>
> Exception Details: System.NullReferenceException: Object reference not
> set to an instance of an object.
>
> The stack trace is:
>
> [NullReferenceException: Object reference not set to an instance of an object.]
>    ScottishExecutive.PageCache.ServePage(String pgAlias) +272
>    ASP.global_asax.Application_ResolveRequestCache(Object sender, EventArgs e) +181
>    System.Web.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +92
>    System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +64
>
> Thanks for any advice,
>
> Paul
>
> Paul Cunnea
>
> Digital Collections Manager
>
> National Library of Scotland
>
> t: +44-131-623-4671 e: p.cunnea at nls.uk
>