<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
<div class="moz-cite-prefix">
<div class="moz-cite-prefix">Hi Tue,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Thanks for your answer.</div>
<div class="moz-cite-prefix">I am running the Indexserver and
Harvester on CentOS 7.9.2009 with</div>
<div class="moz-cite-prefix">openjdk version "1.8.0_202"<br>
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_202-b08)<br>
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.202-b08, mixed
mode)</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">At first I also suspected a
hard disk problem, but that seems unlikely to happen on all Harvester
machines simultaneously.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Thanks for checking!</div>
<div class="moz-cite-prefix"><br>
</div>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 17.08.22 at 14:00, Tue
Hejlskov Larsen wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9a211b64857241e2bc7c4bc42c78ebb9@kb.dk">
<pre class="moz-quote-pre" wrap="">No - I ran through all our current 7.3 harvester logs and the Indexserver log.
In the latest broad crawls the dedup indexes were between 85 and 107 GB unzipped, without any errors.
Index server java version on RHEL6
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
Harvester java version on RHEL6 - RHEL8
[prod@kb-prod-har-001 ~]$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Best regards
Tue
________________________________
From: Netarchivesuite-devel <a class="moz-txt-link-rfc2396E" href="mailto:netarchivesuite-devel-bounces@ml.sbforge.org"><netarchivesuite-devel-bounces@ml.sbforge.org></a> on behalf of <a class="moz-txt-link-abbreviated" href="mailto:aponb@gmx.at">aponb@gmx.at</a> <a class="moz-txt-link-rfc2396E" href="mailto:aponb@gmx.at"><aponb@gmx.at></a>
Sent: 17 August 2022 12:59
To: <a class="moz-txt-link-abbreviated" href="mailto:netarchivesuite-devel@ml.sbforge.org">netarchivesuite-devel@ml.sbforge.org</a>
Subject: [Netarchivesuite-devel] corrupt GZIP Trailer
Dear all,
we are experiencing a problem with deduplication during our domain
crawl. There are IOFailures due to "Corrupt GZIP trailer" when unzipping
the index, which is built from thousands of previously crawled jobs.
Deduplication for the small daily crawls is working.
This is what the exception looks like:
4:53:53.703 WARN dk.netarkivet.common.utils.FileUtils - Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
java.util.zip.ZipException: Corrupt GZIP trailer
    at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
    at dk.netarkivet.common.utils.LargeFileGZIPInputStream.read(LargeFileGZIPInputStream.java:67)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:862)
    at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.unzipAndDeleteRemoteFile(IndexRequestClient.java:256)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.gunzipToDir(IndexRequestClient.java:231)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:196)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63)
    at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146)
    at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85)
    at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:481)
14:54:03.015 WARN d.n.h.i.d.IndexRequestClient - IOFailure during unzipping of index
dk.netarkivet.common.exceptions.IOFailure: Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
    at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:871)
    at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
I know about the issue with large zip files and the 2 GB limit, which
should be fixed in Java 8, which we are using.
Test programs like those mentioned in
<a class="moz-txt-link-freetext" href="https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383">https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383</a>
(JDK-6599383, "Unable to open zip files more than 2GB in size") ran
without issues.
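A minimal sketch of a standalone trailer check, for illustration only. It
assumes plain java.util.zip.GZIPInputStream rather than NetarchiveSuite's
LargeFileGZIPInputStream, so its result may differ from what the Harvester
sees. The class name GzipTrailerCheck and the method countUncompressedBytes
are made up for this example. It decompresses one gzip file to the end and
discards the data, so a CRC-32 or length mismatch in the trailer surfaces as
a ZipException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of NetarchiveSuite.
class GzipTrailerCheck {

    // Reads the gzip file to the end and returns the number of
    // uncompressed bytes. GZIPInputStream compares the CRC-32 and the
    // length stored in the trailer against the decompressed data and
    // throws java.util.zip.ZipException if they do not match.
    static long countUncompressedBytes(Path gzFile) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        try (InputStream in =
                new GZIPInputStream(Files.newInputStream(gzFile), buffer.length)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java GzipTrailerCheck FILE.gz");
            System.exit(2);
        }
        long bytes = countUncompressedBytes(Paths.get(args[0]));
        System.out.println("Trailer OK, " + bytes + " uncompressed bytes");
    }
}
```

Running this against the same gzipped index file on the Indexserver and on a
Harvester could help narrow down whether the file is already damaged at the
source or only after transfer.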
We had this problem in the 1st stage with NAS 7.3 and now also in the
2nd stage, now using NAS 7.4.1 and after completely clearing the cache on all
machines.
Has anybody had this problem in the last year, or does anybody have an
idea what the problem could be?
Thanks for reading
Andreas
_______________________________________________
Netarchivesuite-devel mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Netarchivesuite-devel@ml.sbforge.org">Netarchivesuite-devel@ml.sbforge.org</a>
<a class="moz-txt-link-freetext" href="https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel">https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel</a>
</pre>
</blockquote>
<p><br>
</p>
</body>
</html>