[Netarchivesuite-devel] corrupt GZIP Trailer

aponb at gmx.at
Thu Aug 18 09:44:17 CEST 2022


Hi Tue,

thanks for your answer.
So I am running the Indexserver and Harvesters on CentOS 7.9.2009 with
openjdk version "1.8.0_202"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_202-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.202-b08, mixed mode)

At first I was also thinking of a hard disk problem. But it occurs on all
harvester machines simultaneously ...
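To rule that out, a minimal check is to stream the fetched index file
completely through GZIPInputStream before NAS touches it; the trailer is
only verified once the stream is exhausted. A sketch of mine (the class
name and path-as-argument are made up, not from NAS):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class GzipCheck {
        public static void main(String[] args) throws IOException {
            byte[] buf = new byte[8192];
            long total = 0;
            try (GZIPInputStream in =
                     new GZIPInputStream(new FileInputStream(args[0]))) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n; // readTrailer() fires at end of stream
                }
            }
            // Reaching this line means CRC32 and ISIZE matched.
            System.out.println("OK, uncompressed bytes: " + total);
        }
    }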

Thanks for checking!



On 17.08.22 at 14:00, Tue Hejlskov Larsen wrote:
> No - I ran through all our current 7.3 harvester logs and the Indexserver log.
>
> In the latest broad crawls the dedup indexes were between 85 and 107 GB unzipped, without any error.
>
>
> Index server java version on RHEL6
>
> java version "1.8.0_91"
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
>
> Harvester java version on RHEL6 - RHEL8
> [prod at kb-prod-har-001 ~]$ java -version
> java version "1.8.0_151"
> Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
>
> Best regards
> Tue
>
>
>
> ________________________________
> From: Netarchivesuite-devel <netarchivesuite-devel-bounces at ml.sbforge.org> on behalf of aponb at gmx.at <aponb at gmx.at>
> Sent: 17 August 2022 12:59
> To: netarchivesuite-devel at ml.sbforge.org
> Subject: [Netarchivesuite-devel] corrupt GZIP Trailer
>
> Dear all,
>
> we are experiencing a problem with deduplication during our Domain
> Crawl. There are IOFailures due to a "Corrupt GZIP trailer" when
> unzipping the index, which is built from thousands of previously
> crawled jobs. Deduplication for the small daily crawls is working.
>
> This is what the exception looks like:
>
> 14:53:53.703 WARN  dk.netarkivet.common.utils.FileUtils - Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
> java.util.zip.ZipException: Corrupt GZIP trailer
>         at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
>         at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
>         at dk.netarkivet.common.utils.LargeFileGZIPInputStream.read(LargeFileGZIPInputStream.java:67)
>         at java.io.FilterInputStream.read(FilterInputStream.java:107)
>         at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:862)
>         at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
>         at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.unzipAndDeleteRemoteFile(IndexRequestClient.java:256)
>         at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.gunzipToDir(IndexRequestClient.java:231)
>         at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:196)
>         at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63)
>         at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146)
>         at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203)
>         at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63)
>         at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228)
>         at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171)
>         at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85)
>         at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:481)
> 14:54:03.015 WARN  d.n.h.i.d.IndexRequestClient - IOFailure during unzipping of index
> dk.netarkivet.common.exceptions.IOFailure: Error writing stream to file '/data/nas/cache/DEDUP_CRAWL_LOG/122724-122725-122726-122727-e5257b225fa6913a97a903499b63f9d1-cache6598946241324751116.tmp/_bv.fdt'.
>         at dk.netarkivet.common.utils.FileUtils.writeStreamToFile(FileUtils.java:871)
>         at dk.netarkivet.common.utils.ZipUtils.gunzipFile(ZipUtils.java:272)
>
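> (For what it's worth: the gzip trailer is just 8 bytes, CRC32 followed
> by ISIZE = uncompressed size mod 2^32, both little-endian. One can dump
> it directly and compare against what GZIPInputStream computes. A sketch
> of mine, assuming a single-member gzip file; the class name is made up:
>
>     import java.io.IOException;
>     import java.io.RandomAccessFile;
>
>     public class GzipTrailerDump {
>         public static void main(String[] args) throws IOException {
>             try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
>                 f.seek(f.length() - 8);   // trailer is the last 8 bytes
>                 byte[] t = new byte[8];
>                 f.readFully(t);
>                 long crc   = le32(t, 0);  // stored CRC32 of the payload
>                 long isize = le32(t, 4);  // stored size mod 2^32
>                 System.out.printf("CRC32=%08x ISIZE(mod 2^32)=%d%n",
>                                   crc, isize);
>             }
>         }
>
>         // little-endian unsigned 32-bit read
>         private static long le32(byte[] b, int off) {
>             return (b[off] & 0xffL)
>                  | (b[off + 1] & 0xffL) << 8
>                  | (b[off + 2] & 0xffL) << 16
>                  | (b[off + 3] & 0xffL) << 24;
>         }
>     }
>
> "Corrupt GZIP trailer" means exactly this comparison failing in
> GZIPInputStream.readTrailer().)
>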
> I know about the issue with big zip files and the 2 GB limit, which
> should be fixed in Java 8, which we are using. Test programs as
> mentioned in https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6599383
> (JDK-6599383: "Unable to open zip files more than 2GB in size",
> core-libs/java.util.jar) run without issues.
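>
> A minimal sketch of the kind of test we ran (mine, not the exact
> program from the bug report, which targets ZipFile; this one targets
> GZIP, where our failure occurs): round-trip ~5 GiB of data, so that
> ISIZE wraps past 2^32, and check that the trailer still validates.
>
>     import java.io.*;
>     import java.util.zip.GZIPInputStream;
>     import java.util.zip.GZIPOutputStream;
>
>     public class LargeGzipRoundTrip {
>         public static void main(String[] args) throws IOException {
>             File f = File.createTempFile("large", ".gz");
>             byte[] chunk = new byte[1 << 20];   // 1 MiB of zeros
>             long target = 5L << 30;             // ~5 GiB uncompressed, > 2^32
>             try (GZIPOutputStream out = new GZIPOutputStream(
>                     new BufferedOutputStream(new FileOutputStream(f)))) {
>                 for (long w = 0; w < target; w += chunk.length) {
>                     out.write(chunk);
>                 }
>             }
>             long total = 0;
>             try (GZIPInputStream in = new GZIPInputStream(
>                     new BufferedInputStream(new FileInputStream(f)))) {
>                 int n;
>                 while ((n = in.read(chunk)) != -1) total += n;
>             }
>             // Java 8 masks ISIZE to 32 bits when checking the trailer,
>             // so this should not throw despite the wraparound.
>             System.out.println("read back " + total + " bytes");
>             f.delete();
>         }
>     }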
>
> We had this problem in the first stage with NAS 7.3 and now also in the
> second stage, now on NAS 7.4.1 and after completely clearing the cache
> on all machines.
>
> Has anybody had this problem in the last year, or does anybody have an
> idea what the cause could be?
>
> Thanks for reading
> Andreas
>
> _______________________________________________
> Netarchivesuite-devel mailing list
> Netarchivesuite-devel at ml.sbforge.org
> https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel

