[Netarchivesuite-devel] corrupt GZIP Trailer
aponb at gmx.at
Wed Aug 17 12:59:52 CEST 2022
Dear all,
we are experiencing a problem with deduplication during our Domain
Crawl. There are IOFailures due to "Corrupt GZIP trailer" when unzipping
the index, which is based on thousands of previously crawled jobs.
Deduplication for the small daily crawls is working.
This is how the exception looks:
4:53:53.703 WARN dk.netarkivet.common.utils.FileUtils - Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
java.util.zip.ZipException: Corrupt GZIP trailer
    at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
    at dk.netarkivet.common.utils.LargeFileGZIPInputStream.read(LargeFileGZIPInputStream.java:67)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:862)
    at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.unzipAndDeleteRemoteFile(IndexRequestClient.java:256)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.gunzipToDir(IndexRequestClient.java:231)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:196)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63)
    at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146)
    at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203)
    at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171)
    at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85)
    at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:481)
14:54:03.015 WARN d.n.h.i.d.IndexRequestClient - IOFailure during unzipping of index
dk.netarkivet.common.exceptions.IOFailure: Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
    at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:871)
    at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
I know about the issue with big gzip files and the 2 GB boundary, which
should be fixed in Java 8, which we are using.
Test programs as mentioned in
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383 were
running without issues.
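For anyone who wants to see where the message comes from: "Corrupt GZIP trailer" is thrown by java.util.zip.GZIPInputStream when the CRC32/ISIZE trailer check at the end of a gzip member fails. The following is just a plain-JDK sketch (not NetarchiveSuite code) that reproduces the symptom by flipping a single byte in the trailer; class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipException;

public class GzipTrailerDemo {

    // Gzip a string in memory.
    static byte[] gzip(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes("UTF-8"));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Gunzip fully; return the decoded text, or the ZipException message
    // if the stream (e.g. its trailer) is corrupt.
    static String gunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        } catch (ZipException e) {
            return "ZipException: " + e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] good = gzip("hello");
        System.out.println(gunzip(good)); // hello

        // Flip one bit in the final trailer byte (high byte of the
        // little-endian ISIZE field). The deflated payload is untouched,
        // so decompression itself succeeds, but the trailer check fails.
        byte[] bad = good.clone();
        bad[bad.length - 1] ^= 0x01;
        System.out.println(gunzip(bad)); // ZipException: Corrupt GZIP trailer
    }
}
```

Since the payload inflates fine and only the trailing 8 bytes are checked last, a consistently failing cache file could also mean the gzipped index was truncated or garbled in transfer, not necessarily the old Inflater 2 GB bug.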
We had that problem in the first stage with NAS 7.3 and now also in the
second stage, now using NAS 7.4.1 and with the cache completely cleaned
on all machines.
Has anybody had this problem in the last year, or does anybody have an
idea what the cause could be?
Thanks for reading
Andreas