[Netarchivesuite-devel] corrupt GZIP Trailer

Tue Hejlskov Larsen tlr at kb.dk
Wed Aug 17 14:00:12 CEST 2022


No - I ran through all our current 7.3 harvester logs and the Indexserver log.

In the latest broad crawls the dedup indexes were between 85 and 107 GB unzipped, without any error.


Index server java version on RHEL6

java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

Harvester java version on RHEL6 - RHEL8
[prod at kb-prod-har-001 ~]$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

Best regards
Tue



________________________________
From: Netarchivesuite-devel <netarchivesuite-devel-bounces at ml.sbforge.org> on behalf of aponb at gmx.at <aponb at gmx.at>
Sent: 17 August 2022 12:59
To: netarchivesuite-devel at ml.sbforge.org
Subject: [Netarchivesuite-devel] corrupt GZIP Trailer

Dear all,

we are experiencing a problem with deduplication during our Domain
Crawl. There are IOFailures due to a "Corrupt GZIP trailer" when unzipping
the index, which is built from thousands of previously crawled jobs.
Deduplication for the daily small crawls is working.

This is how the exception looks:

4:53:53.703 WARN  dk.netarkivet.common.utils.FileUtils - Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
java.util.zip.ZipException: Corrupt GZIP trailer
        at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
        at dk.netarkivet.common.utils.LargeFileGZIPInputStream.read(LargeFileGZIPInputStream.java:67)
        at java.io.FilterInputStream.read(FilterInputStream.java:107)
        at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:862)
        at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
        at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.unzipAndDeleteRemoteFile(IndexRequestClient.java:256)
        at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.gunzipToDir(IndexRequestClient.java:231)
        at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:196)
        at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63)
        at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146)
        at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203)
        at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63)
        at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228)
        at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171)
        at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85)
        at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:481)
14:54:03.015 WARN  d.n.h.i.d.IndexRequestClient - IOFailure during unzipping of index
dk.netarkivet.common.exceptions.IOFailure: Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
        at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:871)
        at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
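
To see whether the gzipped index file itself is broken or whether only the unzipping path inside NAS fails, one could stream the file through the plain JDK GZIPInputStream outside NetarchiveSuite; if the file is corrupt, this reproduces the same ZipException. A minimal sketch (class name and command-line argument are just placeholders, not from our setup):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GunzipCheck {
    public static void main(String[] args) throws Exception {
        // Stream the gzipped file through the plain JDK GZIPInputStream and
        // count the uncompressed bytes; a broken trailer ends in the same
        // java.util.zip.ZipException: Corrupt GZIP trailer as in the log above.
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]), 65536)) {
            byte[] buf = new byte[65536];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            System.out.println("Uncompressed size: " + total + " bytes");
        }
    }
}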

I know about the issue with big zip files and the 2 GB limit, which
should be fixed in Java 8, which we are using.
Test programs like the ones mentioned in
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383
(JDK-6599383: Unable to open zip files more than 2GB in size)
run without issues.
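
For completeness: per RFC 1952 the gzip trailer is just 8 bytes, the CRC-32 of the uncompressed data followed by ISIZE, the uncompressed length modulo 2^32, both little-endian, and a mismatch in either check is one way GZIPInputStream ends up reporting "Corrupt GZIP trailer". A small sketch to dump those two trailer fields from the end of a .gz file, so they can be compared with the expected uncompressed size (class name and argument are placeholders, not from our setup):

import java.io.IOException;
import java.io.RandomAccessFile;

public class GzipTrailerDump {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
            // Per RFC 1952 the last 8 bytes of a single-member gzip file are
            // CRC32 and ISIZE (uncompressed size modulo 2^32), little-endian.
            f.seek(f.length() - 8);
            long crc = readLittleEndianUInt(f);
            long isize = readLittleEndianUInt(f);
            System.out.printf("CRC32=%08x ISIZE=%d (uncompressed size mod 2^32)%n", crc, isize);
        }
    }

    private static long readLittleEndianUInt(RandomAccessFile f) throws IOException {
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value |= (long) (f.read() & 0xff) << (8 * i);
        }
        return value;
    }
}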

We had this problem in the first stage with NAS 7.3 and now also in the
second stage, now using NAS 7.4.1 and after completely cleaning the cache on all
machines.

Has anybody had this problem in the last year, or does anybody have an
idea what the cause could be?

Thanks for reading
Andreas

_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel