[Netarchivesuite-devel] Known problems are back

aponb at gmx.at aponb at gmx.at
Fri Aug 10 12:03:40 CEST 2018


After starting our yearly domain crawl, we are experiencing a combination 
of known problems.
On the one hand, the "Multiple duplicate Jobs Created" problem 
(https://sbforge.org/jira/browse/NAS-2682) is hitting our daily crawls. 
Not only are duplicate jobs created, but the affected HarvestDefinitions 
are also deactivated. This is annoying, because the definitions have to 
be reactivated manually; otherwise no further crawls take place.
This is the log message:
20:14:34.445 WARN  d.n.h.scheduler.HarvestJobGenerator - Exception while scheduling harvestdefinition #105(20180630_EU_Ratspraesidentschaft2018). The harvestdefinition has been deactivated!
dk.netarkivet.common.exceptions.PermissionDenied: Somebody else must have updated HD #105: '20180630_EU_Ratspraesidentschaft2018' since edition 51, not updating
         at dk.netarkivet.harvester.datamodel.HarvestDefinitionDBDAO.update(HarvestDefinitionDBDAO.java:459)
         at dk.netarkivet.harvester.scheduler.HarvestJobGenerator$JobGeneratorTask$JobGeneratorThread.run(HarvestJobGenerator.java:256)
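
The "since edition 51, not updating" wording reads like an edition-based (optimistic locking) check: an update is rejected when the stored edition no longer matches the one the caller read. The following is only a minimal sketch of that general pattern, assuming two callers read the same edition before either writes. It is not the NetarchiveSuite code; the class EditionCheckSketch, the Store type and its methods are hypothetical names made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

public class EditionCheckSketch {
    /** Hypothetical stand-in for a stored record guarded by an edition counter. */
    public static final class Store {
        // Start at 51 only to mirror the number in the log message above.
        private final AtomicLong edition = new AtomicLong(51);

        /** Succeeds only if the caller still holds the current edition. */
        public boolean update(long editionSeenByCaller) {
            return edition.compareAndSet(editionSeenByCaller, editionSeenByCaller + 1);
        }

        public long currentEdition() {
            return edition.get();
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        // Two callers read the same edition before either of them writes.
        long seenByA = store.currentEdition();
        long seenByB = store.currentEdition();
        System.out.println("A updates: " + store.update(seenByA));
        // B now holds a stale edition, so its update is rejected.
        System.out.println("B updates: " + store.update(seenByB));
    }
}
```

If something like this is what happens here, two generator runs working on the same HarvestDefinition at the same time would explain both the duplicate jobs and the rejected update, but that is an assumption, not something the log proves.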


On the other hand, we get the following error, known as the 
"Heritrix Address already in use" bug (e.g. 
https://sbforge.org/jira/browse/NAS-1377 or 
https://sbforge.org/jira/browse/NAS-2477), which was already 
discussed some time ago. Since we started our domain crawl, it has been 
happening constantly. Half of the jobs of our daily crawls fail with the 
following exception: "dk.netarkivet.common.exceptions.IOFailure: Port 
XXXX already in use, or port is out of range". The same happens to jobs 
of the domain crawl, so we constantly have to resubmit failed jobs to 
get a full crawl. Normally all of this should work automatically.

We do not have duplicate ports on our crawler machines, and we have 
been using the same deploy settings for years.
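
To rule out another process holding a port at the moment a job starts, a small standalone probe along the following lines could be run on a crawler machine. It is only a sketch using the standard java.net API and is not part of NetarchiveSuite or Heritrix; the class name PortProbe and the method isFree are hypothetical, and the port numbers to check are passed on the command line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortProbe {
    /** Returns true if a listening socket can currently be bound to the port on all local interfaces. */
    public static boolean isFree(int port) {
        if (port < 1 || port > 65535) {
            return false; // out of range
        }
        try (ServerSocket socket = new ServerSocket()) {
            // Do not ask for address reuse, so the bind fails if anything still holds the port.
            socket.setReuseAddress(false);
            socket.bind(new InetSocketAddress(port));
            return true;
        } catch (IOException e) {
            return false; // bind failed, e.g. address already in use
        }
    }

    public static void main(String[] args) {
        for (String arg : args) {
            int port = Integer.parseInt(arg);
            System.out.println(port + (isFree(port) ? " free" : " in use or out of range"));
        }
    }
}
```

Running it repeatedly (for example `java PortProbe XXXX` with the port from the failing job) while jobs are being started would show whether the port is really occupied at that moment or whether the error is raised although the port is free.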

Does anyone have ideas for workarounds? That would be great, because 
the "Heritrix Address already in use" bug in particular is seriously 
disrupting our daily work.

Regards
a.


