[Netarchivesuite-users] Question about a problem with NAS QA Viewer

Navarro Guillén, Soledad soledad.navarro at bne.es
Mon Dec 21 13:02:50 CET 2015


Thank you very much, Bjarne. We will look into it with our IT team and let you know.

Regards,

Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102

From: Bjarne Andersen [mailto:bja at statsbiblioteket.dk]
Sent: Monday, December 21, 2015 12:59
To: 'netarchivesuite-users at ml.sbforge.org'
CC: Pérez Morillo, Mar; García Arratia, Juan Carlos; Archivoweb; Monzón, Fernando
Subject: RE: Question about a problem with NAS QA Viewer

Very good question.
I'm not sure NAS gets tested with compressed WARCs, since netarchive.dk has always used uncompressed (W)ARCs.

Can you check whether
/netarchive/WARC_Archive/filedir/5-metadata-1.warc
actually looks like a WARC file?

Also, does this file contain entries that look like CDX files, one for every .warc.gz file that was generated during the harvest?
The CDX entries are generated by the crawler just after the crawl finishes, and it may be this code that failed because of the gzipped WARCs.
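
Both checks can be scripted. The following is a minimal sketch in Python, not part of NetarchiveSuite: the function names are hypothetical, the parser assumes an uncompressed WARC with simple one-line headers, and only the URL pattern is taken from the batch job's log output.

```python
import re

# URL pattern copied from the batch job's log line; everything else is illustrative.
CDX_URI = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")


def sniff_format(path):
    """Return 'gzip', 'warc' or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(5)
    if head[:2] == b"\x1f\x8b":      # gzip magic number
        return "gzip"
    if head == b"WARC/":             # a WARC record starts with a version line
        return "warc"
    return "unknown"


def iter_record_headers(path):
    """Yield a dict of header fields for each record of an uncompressed WARC file."""
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                return                       # end of file
            if not line.startswith(b"WARC/"):
                continue                     # blank separator lines between records
            headers = {}
            for raw in iter(f.readline, b""):
                raw = raw.strip()
                if not raw:
                    break                    # an empty line ends the header block
                name, _, value = raw.partition(b":")
                key = name.decode("utf-8", "replace").strip().lower()
                headers[key] = value.decode("utf-8", "replace").strip()
            # Skip the payload so that payload bytes are never parsed as headers.
            f.seek(int(headers.get("content-length", "0")), 1)
            yield headers


def cdx_record_uris(path):
    """Return the target URIs of all records that look like CDX index records."""
    uris = []
    for headers in iter_record_headers(path):
        uri = headers.get("warc-target-uri")
        if uri is not None and CDX_URI.fullmatch(uri):
            uris.append(uri)
    return uris
```

Calling `sniff_format` on the metadata file shows whether it is plain WARC or gzip data, and `cdx_record_uris` lists the CDX records it contains, which can then be compared by hand with the .warc.gz files produced by the harvest.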

Best
Bjarne Andersen

From: NetarchiveSuite-users [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On Behalf Of Navarro Guillén, Soledad
Sent: Monday, December 21, 2015 12:53 PM
To: 'netarchivesuite-users at ml.sbforge.org'
Cc: Pérez Morillo, Mar <mar.perez at bne.es>; García Arratia, Juan Carlos <juancarlos.garcia at bne.es>; Archivoweb <archivoweb at bne.es>; Monzón, Fernando <f.monzon at bne.es>
Subject: [Netarchivesuite-users] Question about a problem with NAS QA Viewer

Dear all,

At the National Library of Spain Web Archive we recently moved from NAS 4.2 to NAS 4.4, and we now have a problem with the NAS QA Viewer.

When we enable compression in the NAS 4.4 templates (changing only what is highlighted, and only in the WARC section, not the ARC section), the NAS QA Viewer does not work. The files generated by the harvest are of type warc.gz.


        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">true</boolean>

This is the error that appears in the graphical interface:

[cid:image001.gif at 01D138D8.B81FFAD0]


And this is the error that appears in the logs:

DETALLADO: Caught exception while running batch job on file /netarchive/WARC_Archive/filedir/5-metadata-1.warc, position 4232857:
null
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob.processRecord(GetMetadataArchiveBatchJob.java:95)
at dk.netarkivet.common.utils.archive.ArchiveBatchJob.processFile(ArchiveBatchJob.java:124)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:168)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:115)
at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:246)
at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:428)

dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
DETALLADO: Batch: Job dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob, with arguments: URLMatcher = metadata://[^/]*/crawl/index/cdx.*, mimeMatcher = application/x-cdx finished at Mon Dec 14 11:04:50 CET 2015
dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
INFORMACIÓN: Finished batch job on bitarchive application with id '192.168.81.37_BitApp_2': 'dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob', on filename-pattern: '5-metadata-[0-9]+.(w)?arc' + with result: 1 failures in processing 1 files at 192.168.81.37_BitApp_2
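
For readers following the trace: in Java, `Pattern.matcher(null)` throws a NullPointerException in exactly this way, so one reading (not confirmed against the NAS source) is that a record reached the matcher with no URL or no MIME type. The sketch below, in Python with hypothetical function names, shows a null-safe version of the same test using the three patterns from the log lines above. It assumes each pattern must match the whole string.

```python
import re

# Patterns copied verbatim from the log lines above.
URL_MATCHER = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")
MIME_MATCHER = re.compile(r"application/x-cdx")
FILENAME_PATTERN = re.compile(r"5-metadata-[0-9]+.(w)?arc")


def is_cdx_record(url, mime_type):
    """Null-safe URL/MIME test; None plays the role of Java's null here."""
    if url is None or mime_type is None:
        return False  # skip the record instead of failing the whole file
    return bool(URL_MATCHER.fullmatch(url) and MIME_MATCHER.fullmatch(mime_type))


def file_selected(filename):
    """Assumption: the filename pattern is applied as a full match."""
    return FILENAME_PATTERN.fullmatch(filename) is not None
```

Under that full-match assumption, `file_selected("5-metadata-1.warc")` is true while `file_selected("5-metadata-1.warc.gz")` is false, so a gzipped metadata file would not even be picked up by this batch job.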

Do you know if there is a way to solve it?

Thank you very much and happy Christmas,


Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102


-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 5994 bytes
Desc: image001.gif
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20151221/27bd5102/attachment-0001.gif>

