[Netarchivesuite-devel] corrupt GZIP Trailer
Tue Hejlskov Larsen
tlr at kb.dk
Wed Aug 17 14:00:12 CEST 2022
No - I ran through all our current 7.3 harvester logs and the index server log.
In the latest broad crawls the dedup indexes were between 85 and 107 GB unzipped, without any error.
Index server java version on RHEL6
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
Harvester java version on RHEL6 - RHEL8
[prod at kb-prod-har-001 ~]$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Best regards
Tue
________________________________
From: Netarchivesuite-devel <netarchivesuite-devel-bounces at ml.sbforge.org> on behalf of aponb at gmx.at <aponb at gmx.at>
Sent: 17 August 2022 12:59
To: netarchivesuite-devel at ml.sbforge.org
Subject: [Netarchivesuite-devel] corrupt GZIP Trailer
Dear all,
we are experiencing a problem with deduplication during our Domain
Crawl. There are IOFailures due to "Corrupt GZIP trailer" when unzipping
the index, which is based on thousands of previously crawled jobs.
Deduplication for the small daily crawls is working.
This is how the exception looks:
4:53:53.703 WARN dk.netarkivet.common.utils.FileUtils - Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
java.util.zip.ZipException: Corrupt GZIP trailer
    at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
    at dk.netarkivet.common.utils.LargeFileGZIPInputStream.read(LargeFileGZIPInputStream.java:67)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:862)
    at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.unzipAndDeleteRemoteFile(IndexRequestClient.java:256)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.gunzipToDir(IndexRequestClient.java:231)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:196)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63)
    at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146)
    at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85)
    at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:481)
14:54:03.015 WARN d.n.h.i.d.IndexRequestClient - IOFailure during unzipping of index
dk.netarkivet.common.exceptions.IOFailure: Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
    at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:871)
    at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
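One way to narrow this down might be to check a transferred .gz file outside NetarchiveSuite. The following is only a minimal sketch: it uses the plain java.util.zip.GZIPInputStream (not NetarchiveSuite's LargeFileGZIPInputStream), and the class name GzipCheck and the method countUncompressedBytes are made up for illustration. It decompresses the whole file, discards the output, and lets a damaged trailer surface as a ZipException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of NetarchiveSuite.
public class GzipCheck {

    /**
     * Decompresses the whole file and discards the output.
     * Returns the number of uncompressed bytes read. A bad trailer is
     * reported by the JDK as java.util.zip.ZipException (an IOException).
     */
    public static long countUncompressedBytes(Path gzipFile) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        try (InputStream in =
                new GZIPInputStream(Files.newInputStream(gzipFile), buffer.length)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Each argument is treated as the path of one .gz file to verify.
        for (String arg : args) {
            try {
                long bytes = countUncompressedBytes(Paths.get(arg));
                System.out.println(arg + ": OK, " + bytes + " uncompressed bytes");
            } catch (IOException e) {
                System.out.println(arg + ": FAILED, " + e);
            }
        }
    }
}
```

If the same file passes this check on the index server but fails on the harvester, that would point towards the transfer or the storage rather than the unzip code, though this is only a guess.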
I know about the issue with big zip files and the 2 GB limit, which
should be fixed in Java 8, which we are using.
Test programs as mentioned in
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383
(JDK-6599383, "Unable to open zip files more than 2GB in size")
were running without issues.
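As background on size limits: according to RFC 1952, a gzip member ends with an 8-byte trailer holding the CRC-32 of the uncompressed data and ISIZE, the uncompressed length modulo 2^32, both little-endian. The sketch below reads those two fields from the end of a file so they can be compared between machines. It assumes a single-member gzip file (with concatenated members the last 8 bytes describe only the last member), and the names GzipTrailer, readTrailer and expectedIsize are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of NetarchiveSuite or the JDK.
public class GzipTrailer {

    /**
     * Returns {crc32, isize} read from the last 8 bytes of the file,
     * each as an unsigned 32-bit value stored in a long.
     */
    public static long[] readTrailer(String gzipFile) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(gzipFile, "r")) {
            // 10-byte header plus 8-byte trailer is the bare minimum.
            if (raf.length() < 18) {
                throw new IOException("File too short to be a gzip file");
            }
            byte[] tail = new byte[8];
            raf.seek(raf.length() - 8);
            raf.readFully(tail);
            ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
            long crc = buf.getInt() & 0xffffffffL;
            long isize = buf.getInt() & 0xffffffffL;
            return new long[] {crc, isize};
        }
    }

    /** ISIZE value a correct trailer should hold for the given uncompressed length. */
    public static long expectedIsize(long uncompressedLength) {
        return uncompressedLength & 0xffffffffL;
    }

    public static void main(String[] args) throws IOException {
        long[] t = readTrailer(args[0]);
        System.out.println("CRC32=" + Long.toHexString(t[0]) + " ISIZE=" + t[1]);
    }
}
```

Because ISIZE wraps at 4 GiB, it cannot by itself confirm the size of an index of many gigabytes, but a trailer that differs between the index server's copy and the harvester's copy would suggest the file changed in transit.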
We had that problem in the 1st stage with NAS 7.3 and now also in the
2nd stage, now using NAS 7.4.1 and after completely cleaning the cache on all
machines.
Has anybody had this problem in the last year, or does anybody have an
idea what the problem could be?
Thanks for reading
Andreas