<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p>No, I went through all our current 7.3 harvester logs and the index server log.</p>
<p>In the latest broad crawls the dedup indexes were between 85 and 107 G unzipped, without any error.</p>
<p><br>
</p>
<p>Index server Java version on RHEL6:</p>
<p></p>
<div>java version "1.8.0_91"</div>
<div>Java(TM) SE Runtime Environment (build 1.8.0_91-b14)</div>
<div>Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)</div>
<div><br>
</div>
<div>Harvester Java version on RHEL6 - RHEL8:</div>
<div>
<div>[prod@kb-prod-har-001 ~]$ java -version</div>
<div>java version "1.8.0_151"</div>
<div>Java(TM) SE Runtime Environment (build 1.8.0_151-b12)</div>
<div>Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)</div>
<div><br>
</div>
Best regards</div>
<div>Tue</div>
<div><br>
</div>
<p></p>
<br>
<div style="color: rgb(0, 0, 0);">
<div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Netarchivesuite-devel &lt;netarchivesuite-devel-bounces@ml.sbforge.org&gt; on behalf of aponb@gmx.at &lt;aponb@gmx.at&gt;<br>
<b>Sent:</b> 17 August 2022 12:59<br>
<b>To:</b> netarchivesuite-devel@ml.sbforge.org<br>
<b>Subject:</b> [Netarchivesuite-devel] corrupt GZIP Trailer</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">Dear all,<br>
<br>
we are experiencing a problem with deduplication during our domain<br>
crawl. There are IOFailures due to "Corrupt GZIP trailer" when unzipping<br>
the index, which is built from thousands of previously crawled jobs.<br>
Deduplication for the small daily crawls is working.<br>
<br>
This is how the exception looks:<br>
<br>
4:53:53.703 WARN dk.netarkivet.common.utils.FileUtils - Error writing<br>
stream to file<br>
'/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.<br>
java.util.zip.ZipException: Corrupt GZIP trailer<br>
at<br>
java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)<br>
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)<br>
at<br>
dk.netarkivet.common.utils.LargeFileGZIPInputStream.read(LargeFileGZIPInputStream.java:67)<br>
at java.io.FilterInputStream.read(FilterInputStream.java:107)<br>
at<br>
dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:862)<br>
at<br>
dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)<br>
at<br>
dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.unzipAndDeleteRemoteFile(IndexRequestClient.java:256)<br>
at<br>
dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.gunzipToDir(IndexRequestClient.java:231)<br>
at<br>
dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:196)<br>
at<br>
dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63)<br>
at<br>
dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146)<br>
at<br>
dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203)<br>
at<br>
dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63)<br>
at<br>
dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228)<br>
at<br>
dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171)<br>
at<br>
dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85)<br>
at<br>
dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:481)<br>
14:54:03.015 WARN d.n.h.i.d.IndexRequestClient - IOFailure during<br>
unzipping of index<br>
dk.netarkivet.common.exceptions.IOFailure: Error writing stream to file<br>
'/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.<br>
at<br>
dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:871)<br>
at<br>
dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)<br>
<br>
I know about the issue with large zip files and the 2 GB limit, which<br>
should be fixed in Java 8, which we are using.<br>
Test programs such as those mentioned in<br>
<a href="https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383" id="LPlnk142151" previewremoved="true">https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383</a><br>
(JDK-6599383, "Unable to open zip files more than 2GB in size") ran<br>
without issues.<br>
<br>
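A test program of the kind mentioned above can be quite small. The<br>
following is only a minimal sketch, assuming the plain JDK<br>
java.util.zip.GZIPInputStream rather than the NetarchiveSuite<br>
LargeFileGZIPInputStream from the stack trace; the class name GzipCheck<br>
and its method are hypothetical and not part of NAS. It reads a .gz file<br>
to the end, so that the trailer check is triggered, and counts the<br>
decompressed bytes:<br>
<br>
<pre>
```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only; not part of NetarchiveSuite.
class GzipCheck {

    // Decompresses the whole file and returns the number of bytes produced.
    // GZIPInputStream compares the CRC-32 and length stored in the GZIP
    // trailer with the decompressed data and throws
    // java.util.zip.ZipException if they do not match.
    static long countDecompressedBytes(Path gzFile) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0; // long, because the unzipped data may exceed 2 GB
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java GzipCheck FILE.gz");
            return;
        }
        long bytes = countDecompressedBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": OK, " + bytes + " bytes decompressed");
    }
}
```
</pre>
Running such a check against the same compressed file on the index<br>
server and on a harvester might help to tell a damaged file apart from<br>
a problem in the unzipping code.<br>
<br>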
We had this problem in the 1st stage with NAS 7.3 and now also in the<br>
2nd stage, using NAS 7.4.1 and after completely clearing the cache on all<br>
machines.<br>
<br>
Has anybody had this problem in the last year, or does anybody have an<br>
idea what the problem could be?<br>
<br>
Thanks for reading<br>
Andreas<br>
<br>
_______________________________________________<br>
Netarchivesuite-devel mailing list<br>
Netarchivesuite-devel@ml.sbforge.org<br>
<a href="https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel" id="LPlnk785855" previewremoved="true">https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel</a><br>
</div>
</span></font></div>
</div>
</body>
</html>