<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
After starting our yearly Domain Crawl we are experiencing a
combination of known problems.<br>
On the one hand, the "Multiple duplicate Jobs Created" bug
(<a class="moz-txt-link-freetext" href="https://sbforge.org/jira/browse/NAS-2682">https://sbforge.org/jira/browse/NAS-2682</a>) hits our daily
crawls: not only are duplicate jobs created, but the affected
HarvestDefinitions are also deactivated. This is annoying, because
you have to reactivate these definitions manually, otherwise there
will be no further crawls.<br>
This is the log message:<br>
<pre>
20:14:34.445 WARN d.n.h.scheduler.HarvestJobGenerator - Exception while scheduling harvestdefinition #105(20180630_EU_Ratspraesidentschaft2018). The harvestdefinition has been deactivated!
dk.netarkivet.common.exceptions.PermissionDenied: Somebody else must have updated HD #105: '20180630_EU_Ratspraesidentschaft2018' since edition 51, not updating
        at dk.netarkivet.harvester.datamodel.HarvestDefinitionDBDAO.update(HarvestDefinitionDBDAO.java:459)
        at dk.netarkivet.harvester.scheduler.HarvestJobGenerator$JobGeneratorTask$JobGeneratorThread.run(HarvestJobGenerator.java:256)
</pre>
<br>
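A stop-gap that mimics the manual "Activate" click is to flip the
flag directly in the harvest database. This is only a minimal
sketch: the JDBC URL and credentials are placeholders, and the
table/column names are our reading of the NetarchiveSuite harvest
database schema, so please check them against your installation:<br>
<pre>
// Re-activate harvest definitions that the scheduler deactivated
// after the optimistic-locking failure above. Assumption: the
// harvest database has a harvestdefinitions table with an isactive
// column, as in the stock NetarchiveSuite schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ReactivateHarvestDefinitions {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://harvestdb:5432/nas"; // placeholder
        try (Connection con = DriverManager.getConnection(url, "nas", "secret");
             PreparedStatement ps = con.prepareStatement(
                 // Careful: this re-activates every inactive definition,
                 // including ones deactivated on purpose; narrow the
                 // WHERE clause (e.g. by name) if that matters to you.
                 "UPDATE harvestdefinitions SET isactive = 1 WHERE isactive = 0")) {
            System.out.println("Re-activated " + ps.executeUpdate()
                    + " harvest definitions");
        }
    }
}
</pre>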
<br>
On the other hand, we get the following error, known as the
"Heritrix Address already in use" bug (e.g.
<a class="moz-txt-link-freetext" href="https://sbforge.org/jira/browse/NAS-1377">https://sbforge.org/jira/browse/NAS-1377</a> or
<a class="moz-txt-link-freetext" href="https://sbforge.org/jira/browse/NAS-2477">https://sbforge.org/jira/browse/NAS-2477</a>), which has been
discussed here before. Since starting our domain crawl, this
happens all the time: about half of the jobs of our daily crawls
fail with the exception
"dk.netarkivet.common.exceptions.IOFailure: Port XXXX already in
use, or port is out of range". The same happens to jobs of the
domain crawl, so we constantly have to resubmit failed jobs to get
a complete crawl, although all of this should normally happen
automatically.<br>
<br>
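To see whether the configured port is genuinely still bound at that
moment (for example by a Heritrix JVM left over from a previous
job), a quick probe on the crawler machine helps; the port number
is the one from your deploy settings:<br>
<pre>
// Try to bind the Heritrix port ourselves: if this fails, some
// process still holds it and the failed job cannot start Heritrix.
import java.net.ServerSocket;

public class PortProbe {
    public static void main(String[] args) throws Exception {
        int port = Integer.parseInt(args[0]); // e.g. the Heritrix GUI port
        try (ServerSocket s = new ServerSocket(port)) {
            System.out.println("Port " + port + " is free");
        } catch (java.io.IOException e) {
            // Find the holder with e.g. 'netstat -tlnp' and kill it
            // before resubmitting the job.
            System.out.println("Port " + port + " is in use: " + e);
        }
    }
}
</pre>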
We do not have duplicate port assignments on our crawler machines,
and we have been using the same deploy settings for years.<br>
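For completeness, this is how we double-check that claim: scan the
deploy file for port values that occur more than once. The regular
expression assumes the port elements end in "Port" (guiPort,
jmxPort, ...), which is an assumption based on our settings, so
adjust it to your deploy file. If your deploy file covers several
machines, ports may legitimately repeat across machines, so apply
this per machine section:<br>
<pre>
// Report any port number that is assigned twice in the deploy XML.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDuplicatePorts {
    public static void main(String[] args) throws Exception {
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), "UTF-8");
        Matcher m = Pattern.compile("Port>(\\d+)").matcher(xml);
        StringBuilder seen = new StringBuilder();
        while (m.find()) {
            String port = "," + m.group(1) + ",";
            if (seen.indexOf(port) >= 0) {
                System.out.println("Port assigned twice: " + m.group(1));
            }
            seen.append(port);
        }
    }
}
</pre>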
<br>
Does anyone have ideas for workarounds? That would be great,
because the "Heritrix Address already in use" bug in particular is
seriously disrupting our daily work.<br>
<br>
Regards<br>
a.<br>
<br>
<h1 id="summary-val"><br>
</h1>
</body>
</html>