[Netarchivesuite-users] Recognize these problems?
Peter Svanberg
Peter.Svanberg at kb.se
Tue Dec 3 11:52:11 CET 2024
Hello!
We had crawling problems last week during a broad crawl. Many harvest jobs stopped for different reasons. My impression is that it boils down to server resource problems, i.e. that the servers were overloaded. By resubmitting the stopped jobs we were able to finish the crawl.
Four symptoms:
Problems when retrieving Heritrix status
· java.lang.RuntimeException: Exception during crawl
This is the most frequent one, I think. NAS 7.6 includes fixes Colin made for NullPointerException problems, but we used 7.5 for this crawl. (We will upgrade!)
Heritrix is asked via a local HTTP request to deliver the crawl status, and that request fails. Unfortunately, no information is saved or displayed about why or how it fails.
java.lang.NullPointerException: null
at dk.netarkivet.harvester.heritrix3.controller.HeritrixController.getCrawlServiceAttributes(HeritrixController.java:438)
Problems connecting to Heritrix
· dk.netarkivet.common.exceptions.IOFailure: Heritrix3 wrapper could not connect to Heritrix3. Resultstate = -2
09:42:21.474 INFO d.n.h.h.c.AbstractRestHeritrixController - Heritrix3 engine launched successfully
09:43:47.721 ERROR d.n.h.h.c.HeritrixController - Heritrix3 wrapper could not connect to Heritrix3. Resultstate = -2
org.apache.http.conn.HttpHostConnectException: Connect to kw3-harvester18.kb.se:8223 [kw3-harvester18.kb.se/193.10.72.213] failed: Connection refused (Connection refused)
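To see from the outside how long it takes before the Heritrix port starts accepting connections after a launch, a small polling script can be run on the harvester machine. This is only a diagnostic sketch: the function name `wait_for_port` and the timeout values are made up for illustration, and it does not reproduce how NetarchiveSuite itself connects to Heritrix.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s has passed.

    Hypothetical helper for diagnosis only; not part of NetarchiveSuite.
    Returns True as soon as a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # when nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    # Host and port taken from the log excerpt above.
    start = time.monotonic()
    ok = wait_for_port("kw3-harvester18.kb.se", 8223)
    print(f"reachable={ok} after {time.monotonic() - start:.1f}s")
```

If the port only becomes reachable well after the "engine launched successfully" line, that would fit the overload theory, but the script alone cannot prove it.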
Port problems
· dk.netarkivet.common.exceptions.IOFailure: Port 8223 already in use, or port is out of range
09:46:02.593 INFO d.n.harvester.heritrix3.HarvestJob - Starting crawl of job : 58331
09:46:02.701 INFO d.n.h.h.HeritrixLauncherAbstract - Make the template ready for Heritrix3
09:46:05.346 WARN d.n.h.h.HarvestControllerServer - Error during crawling. The crawl may have been only partially completed.
dk.netarkivet.common.exceptions.IOFailure: Port 8223 already in use, or port is out of range
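A quick way to check whether a port is still held by some process (for example a Heritrix instance left over from a stopped job) is to try to bind to it. The sketch below is an assumption-laden diagnostic, not NetarchiveSuite code: `port_is_free` is a hypothetical helper, and it binds without SO_REUSEADDR, so it may be stricter than the check the harvester itself performs.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to host:port right now.

    Hypothetical helper for diagnosis only; not part of NetarchiveSuite.
    """
    # Ports outside 1..65535 can never be bound explicitly.
    if not 0 < port < 65536:
        return False
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "Address already in use": another process holds it.
            return False
    return True


if __name__ == "__main__":
    # 8223 is the port from the error message above.
    print("port 8223 free:", port_is_free(8223))
```

Run on the harvester right after a failed job, this shows whether the port is occupied; which process holds it has to be looked up separately with the operating system's tools.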
Problems with communication with Heritrix
· dk.netarkivet.common.exceptions.IOFailure: Unknown error during communication with heritrix3
03:18:28.971 INFO d.n.h.h.c.FrontierReportAnalyzer - Generated full Heritrix frontier report in 00d 00:05:14.
03:18:36.212 INFO d.n.h.h.c.FrontierReportAnalyzer - Applied filter dk.netarkivet.harvester.harvesting.frontier.TopTotalEnqueuesFilter to full frontier report, this took 00d 00:00:05.
03:25:55.981 ERROR o.n.h.xmlutils.XmlErrorHandler - SAX parsing error!
org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
03:25:56.897 ERROR o.n.h.xmlutils.XmlValidator - Exception validating XML stream!
org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
Abrupt interruption of job termination
13:04:34.303 INFO d.n.h.h.m.MetadataFileWriterWarc - snapshot1/58248_1732563927633/heritrix3/jobs/58248_1732563927633/logs/uri-errors.log 1689516
13:04:35.142 INFO d.n.h.heritrix3.HarvestDocumentation - Looking for files not having harvestprefix '58248-129'
13:04:35.468 INFO d.n.c.utils.batch.BatchLocalFiles - The batchjob 'class dk.netarkivet.common.utils.cdx.ArchiveExtractCDXJob' has run for 0 seconds and has reached file '58248-129-20241125194557725-00002-svep_kw3-harvester17.kb.se.warc.gz', which is number 1 out of 1
13:04:35.468 INFO d.n.c.utils.archive.ArchiveBatchJob - Processing archive file: 58248-129-20241125194557725-00002-svep_kw3-harvester17.kb.se.warc.gz
(And here it ends.)
Peter Svanberg
Technical officer
Collection and Metadata
Film, Games, Sheet Music and Web
Kungliga biblioteket (National Library of Sweden)
Box 5039, 102 41 Stockholm
Visiting address: Karlavägen 96, Stockholm
010-709 32 78
Peter.Svanberg at kb.se
www.kb.se