[Netarchivesuite-users] Question about a problem with NAS QA Viewer

Bjarne Andersen bja at statsbiblioteket.dk
Mon Dec 21 12:58:40 CET 2015

Very good question.
I'm not sure NAS gets tested with compressed WARCs, since netarchive.dk has always used non-compressed (W)ARCs.

Can you see if
actually looks like a WARC file?
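
One way to check this is to look at the first bytes of the file. The Python sketch below is only an illustration and not part of NetarchiveSuite: `sniff_archive` and `classify` are hypothetical helper names. It relies on three facts: gzip data starts with the bytes 1f 8b, WARC records start with `WARC/`, and ARC files start with `filedesc://`.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def classify(head):
    """Classify the first bytes of an uncompressed archive file."""
    if head.startswith(b"WARC/"):
        return "warc"
    if head.startswith(b"filedesc://"):
        return "arc"
    return "unknown"


def sniff_archive(path):
    """Return 'warc', 'arc' or 'unknown', prefixed with 'gzip+' if gzipped."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(GZIP_MAGIC):
        # Decompress just enough to see how the first record starts.
        with gzip.open(path, "rb") as g:
            return "gzip+" + classify(g.read(16))
    return classify(head)
```

For the file named in the log this would be called as `sniff_archive("/netarchive/WARC_Archive/filedir/5-metadata-1.warc")`.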

And are there any entries in this file that look like CDX files, one for every .warc.gz file that was generated during the harvest?
The CDX entries are generated by the crawler just after the crawl finishes, and it might be this code that failed because of the gzipped WARCs.
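
To see which CDX records the metadata file actually contains, one could scan its record headers. The sketch below assumes an uncompressed metadata WARC and reuses the URL pattern from the batch-job arguments in the quoted log (`metadata://[^/]*/crawl/index/cdx.*`). `cdx_record_uris` is a hypothetical helper, and the line-by-line scan is a heuristic, not a real WARC parser.

```python
import re

# URL pattern taken from the GetMetadataArchiveBatchJob arguments in the log.
CDX_URI = re.compile(rb"metadata://[^/]*/crawl/index/cdx.*")


def cdx_record_uris(path):
    """Return the target URIs of records whose URI looks like a CDX index."""
    uris = []
    with open(path, "rb") as f:
        for line in f:
            if not line.lower().startswith(b"warc-target-uri:"):
                continue
            # Some writers wrap the URI in angle brackets; drop them if present.
            uri = line.split(b":", 1)[1].strip().strip(b"<>")
            if CDX_URI.fullmatch(uri):
                uris.append(uri.decode("utf-8", "replace"))
    return uris
```

Comparing the length of the returned list with the number of .warc.gz files from the harvest would show whether any CDX record is missing.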

Bjarne Andersen

From: NetarchiveSuite-users [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On Behalf Of Navarro Guillén, Soledad
Sent: Monday, December 21, 2015 12:53 PM
To: 'netarchivesuite-users at ml.sbforge.org' <netarchivesuite-users at ml.sbforge.org>
Cc: Pérez Morillo, Mar <mar.perez at bne.es>; García Arratia, Juan Carlos <juancarlos.garcia at bne.es>; Archivoweb <archivoweb at bne.es>; Monzón, Fernando <f.monzon at bne.es>
Subject: [Netarchivesuite-users] Question about a problem with NAS QA Viewer

Dear all,

In the National Library of Spain Web Archive we have recently changed from NAS 4.2 to NAS 4.4 and we have a problem with the NAS QA Viewer.

When compression is enabled in the NAS 4.4 templates (changing only what is highlighted, and only in the WARC section, not the ARC section), the NAS QA Viewer does not work. The files generated by the harvest are of type warc.gz.

        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
        <boolean name="compress">true</boolean>
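
To confirm which writer sections of a template have compression switched on, the template can be inspected with a few lines of Python. This is a sketch under assumptions: it expects a well-formed XML template in which each writer is a `newObject` element with a direct `boolean` child named `compress`. The sample template, its class names and the `ARCArchiver` name are made up for illustration and are not copied from a real NAS template.

```python
import xml.etree.ElementTree as ET


def compress_flags(template_xml):
    """Map each newObject name to the value of its direct 'compress' boolean."""
    root = ET.fromstring(template_xml)
    flags = {}
    for obj in root.iter("newObject"):
        for child in obj.findall("boolean"):
            if child.get("name") == "compress":
                flags[obj.get("name")] = (child.text or "").strip() == "true"
    return flags


# Hypothetical, heavily shortened template used only to exercise the function.
SAMPLE = """
<crawl-order>
  <newObject name="ARCArchiver" class="example.ArcWriter">
    <boolean name="compress">false</boolean>
  </newObject>
  <newObject name="WARCArchiver" class="example.WarcWriter">
    <boolean name="compress">true</boolean>
  </newObject>
</crawl-order>
"""

print(compress_flags(SAMPLE))
```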

This is the error that appears in the graphic interface:

[cid:image001.gif at 01D138D8.B81FFAD0]

And this is the error that appears in the logs:

FINE: Caught exception while running batch job on file /netarchive/WARC_Archive/filedir/5-metadata-1.warc, position 4232857:
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob.processRecord(GetMetadataArchiveBatchJob.java:95)
at dk.netarkivet.common.utils.archive.ArchiveBatchJob.processFile(ArchiveBatchJob.java:124)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:168)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:115)
at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:246)
at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:428)

Dec 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
FINE: Batch: Job dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob, with arguments: URLMatcher = metadata://[^/]*/crawl/index/cdx.*, mimeMatcher = application/x-cdx finished at Mon Dec 14 11:04:50 CET 2015
Dec 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
INFO: Finished batch job on bitarchive application with id '': 'dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob', on filename-pattern: '5-metadata-[0-9]+.(w)?arc' + with result: 1 failures in processing 1 files at
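
The filename pattern in the last log line determines which files the batch job runs on. The small Python check below assumes the pattern is applied as a full match against the file name (the log does not show how NetarchiveSuite applies it). All file names other than `5-metadata-1.warc` are made up for illustration.

```python
import re

# Pattern copied from the log line; the unescaped '.' matches any character.
PATTERN = re.compile(r"5-metadata-[0-9]+.(w)?arc")

candidates = [
    "5-metadata-1.warc",           # file named in the exception
    "5-metadata-1.arc",            # hypothetical ARC variant
    "5-metadata-1.warc.gz",        # hypothetical gzipped variant
    "5-1-20151214-00000.warc.gz",  # hypothetical harvested data file
]

# Keep only the names that the pattern matches in full.
selected = [name for name in candidates if PATTERN.fullmatch(name)]
print(selected)
```

Under that assumption, the pattern selects the job's uncompressed metadata file but would not pick up a metadata file written with a .warc.gz suffix.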

Do you know if there is a way to solve it?

Thank you very much and happy Christmas,

Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 5994 bytes
Desc: image001.gif
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20151221/e8af3c84/attachment-0001.gif>
