[Netarchivesuite-users] RV: Question about a problem with NAS QA Viewer

Navarro Guillén, Soledad soledad.navarro at bne.es
Wed Feb 10 10:05:49 CET 2016


Hi Bjarne,

Thank you very much for your reply. Below you can see the answers from our IT team.

We hope they give you some clues that help us solve the problem.

Thank you for your help!
Regards,

Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102

From: Bjarne Andersen [mailto:bja at statsbiblioteket.dk]
Sent: Monday, 21 December 2015 12:59
To: 'netarchivesuite-users at ml.sbforge.org'
CC: Pérez Morillo, Mar; García Arratia, Juan Carlos; Archivoweb; Monzón, Fernando
Subject: RE: Question about a problem with NAS QA Viewer

Very good question.
I'm not sure NAS gets tested with compressed WARCs, since netarchive.dk has always used uncompressed (W)ARCs.

Can you see if
/netarchive/WARC_Archive/filedir/5-metadata-1.warc
actually looks like a WARC file?

# file /WARC/Archive_2/filedir/5-metadata-1.warc:
/WARC/Archive_2/filedir/5-metadata-1.warc: WARC Archive version 1.0\015

This is the content of the file header
-rwxrwxrwx 1 510 511 20M dic  2 10:10 /WARC/Archive_2/filedir/5-metadata-1.warc

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-12-01T16:19:57Z
WARC-Filename: 5-metadata-1.warc
WARC-Block-Digest: sha1:13007b224f5732e99238ad14ead1304f505a2ce5
WARC-Record-ID: <urn:uuid:63f0dce5-3936-4beb-a6f8-3ee02d5ea96e>
Content-Type: application/warc-fields
Content-Length: 231

software: NetarchiveSuite/Version: 4.4.1 status RELEASE/https://sbforge.org/display/NAS
ip: 192.168.81.60
hostname: HDLS005.bne.local
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 5
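
The same check can be scripted by looking at the first bytes of the file. The following is only a minimal sketch: the function names and the sample file names are made up for illustration and are not part of NAS. It relies on two facts: a gzip file starts with the bytes 0x1f 0x8b, and an uncompressed WARC file starts with "WARC/".

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip member
WARC_MAGIC = b"WARC/"      # start of an uncompressed WARC record


def sniff_archive(path):
    """Return 'gzip', 'warc' or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(5)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(WARC_MAGIC):
        return "warc"
    return "unknown"


def demo():
    """Write one plain and one gzipped sample file (hypothetical names) and sniff both."""
    record = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n"
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "sample.warc")
        packed = os.path.join(tmp, "sample.warc.gz")
        with open(plain, "wb") as handle:
            handle.write(record)
        with gzip.open(packed, "wb") as handle:
            handle.write(record)
        return sniff_archive(plain), sniff_archive(packed)


if __name__ == "__main__":
    print(demo())
```

Run against the metadata file discussed here, sniff_archive would be expected to report 'warc', in line with the output of the file command above.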

And are there any entries in this file that look like CDX files for every .warc.gz file that was generated during the harvest?

These are the entries in 5-metadata-1.warc that concern the warc.gz files:
5-2-20151201115953-00000-HDLS005.bne.local.warc.gz -1 -1 642244893
5-2-20151201120002-00004-HDLS005.bne.local.warc.gz -1 -1 972523874
5-2-20151201115953-00002-HDLS005.bne.local.warc.gz -1 -1 733763695
5-2-20151201120002-00003-HDLS005.bne.local.warc.gz -1 -1 587720049
5-2-20151201115953-00001-HDLS005.bne.local.warc.gz -1 -1 869623474
2015-12-01 11:59:53.106 INFORMACIÓN thread-125 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.106 INFORMACIÓN thread-123 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.107 INFORMACIÓN thread-127 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-67 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-109 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz.open
2015-12-01 16:11:37.489 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz, size 642244893
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz, size 972523874
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz, size 733763695
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz, size 587720049
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz, size 869623474
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00000-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00004-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00002-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00003-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00001-HDLS005.bne.local.warc.gz


The CDX entries are generated by the crawler just after the crawl finishes, and it might be this code that has failed because of the gzipped WARCs.

These are examples of the CDX entries that the file contains:
2015-12-01T12:00:04.450Z   404       2842 http://www.20minutos.es/34616581/20minutos.es/portada_Position3 X http://www.20minutos.es/ text/html #020 20151201120004275+164 sha1:MMF5BQCSKN4MY35645HOLIOZ3CCCJHHE - content-size:3218
2015-12-01T12:00:04.452Z   302        198 http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg ER http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/ text/html #069 20151201120004156+294 sha1:HKJLZMVYW736P4I4UZA3JTFPSNA3F7ER - content-size:612
2015-12-01T12:00:04.515Z   200       9722 http://www.grupo20minutos.com/img/gon.png EXE http://www.grupo20minutos.com/contacto.html image/png #016 20151201120004474+39 sha1:23HRHBEHEB3OCXMILOEGG7YLIOALXOIV - content-size:10030
2015-12-01T12:00:04.530Z   200        246 http://www.sixtelekurs.fr/finfeed/antena3/images/bt_modulo_ibex-on.gif EE http://www.sixtelekurs.fr/finfeed/antena3/portada_or.hts image/gif #007 20151201120004255+274 sha1:2AITMHIA45NDTX77OKJVOLOBLWNPPMIR - content-size:460
2015-12-01T12:00:04.533Z   200      37095 http://publi.atresadvertising.com/autopromos/banner_afilados_clasico_980x90px.gif ERR http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg image/gif #002 20151201120004453+77 sha1:BCB4TS63C4V7P4PJDJW5N3W66KVGTMEK - content-size:37423
2015-12-01T12:00:04.551Z   404       1245 http://logi242.xiti.com/robots.txt EEP http://logi242.xiti.com/hit.xiti?s=513357&s2=1&p=portada::sin_url&di=&an=&ac= text/html #034 20151201120004424+126 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 - content-size:1424
2015-12-01T12:00:04.581Z     1         67 dns:googleads.g.doubleclick.net EXP http://googleads.g.doubleclick.net/pagead/viewthroughconversion/941057382/?value=0&guid=ON&script=0 text/dns #041 20151201120004577+3 sha1:KNTS5M37XFDS433P5BZH4EUWAL3F6NQP - content-size:67
2015-12-01T12:00:04.590Z   200         26 https://download.macromedia.com/robots.txt EEP https://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab text/plain #050 20151201120004004+586 sha1:MNSXZO35OCDMK2YM2TS4NGM3W2BWMSDI - content-size:272
2015-12-01T12:00:04.597Z   200      27289 http://pagead2.googlesyndication.com/pagead/show_ads.js EX http://www.google.com/recaptcha/api/js/recaptcha_ajax.js text/javascript #100 20151201120004521+67 sha1:ZYPHXVGKBLVTSFXLPEIBSIY3OE6AEXIO - content-size:27769,3t
2015-12-01T12:00:04.597Z   200      22736 https://fonts.gstatic.com/s/montserrat/v6/IQHow_FEYlDC4Gzy_m8fcvEr6Hm6RMS0v1dtXsGir4g.ttf EE https://fonts.googleapis.com/css?family=Montserrat:700,400 font/ttf #033 20151201120004408+186 sha1:J66G67DFB45TQSWJNXLUZANXI2KSSRRM - content-size:23225
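
Each entry above, read as a single line, has twelve whitespace-separated fields. As a rough sketch for inspecting such lines, the snippet below splits one of them into named fields. The field names are an assumption: they follow the column order of the Heritrix crawl.log layout, which these lines appear to match, and neither the names nor the helper come from NAS itself.

```python
from collections import namedtuple

# Assumed field names, following the Heritrix crawl.log column order.
CrawlLogEntry = namedtuple("CrawlLogEntry", [
    "timestamp", "status", "size", "uri", "discovery_path", "referrer",
    "mime_type", "worker_thread", "fetch_time", "digest", "source_tag",
    "annotations",
])


def parse_crawl_log_line(line):
    """Split one log line into its twelve whitespace-separated fields."""
    # The last field may itself contain spaces, so split at most 11 times.
    fields = line.strip().split(None, 11)
    if len(fields) != 12:
        raise ValueError("expected 12 fields, got %d" % len(fields))
    return CrawlLogEntry(*fields)


# One of the entries quoted above, joined into a single line.
SAMPLE = (
    "2015-12-01T12:00:04.515Z   200       9722 "
    "http://www.grupo20minutos.com/img/gon.png EXE "
    "http://www.grupo20minutos.com/contacto.html image/png #016 "
    "20151201120004474+39 sha1:23HRHBEHEB3OCXMILOEGG7YLIOALXOIV - "
    "content-size:10030"
)

entry = parse_crawl_log_line(SAMPLE)

if __name__ == "__main__":
    print(entry.status, entry.mime_type, entry.uri)
```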

Best
Bjarne Andersen

From: NetarchiveSuite-users [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On Behalf Of Navarro Guillén, Soledad
Sent: Monday, December 21, 2015 12:53 PM
To: 'netarchivesuite-users at ml.sbforge.org'
Cc: Pérez Morillo, Mar <mar.perez at bne.es>; García Arratia, Juan Carlos <juancarlos.garcia at bne.es>; Archivoweb <archivoweb at bne.es>; Monzón, Fernando <f.monzon at bne.es>
Subject: [Netarchivesuite-users] Question about a problem with NAS QA Viewer

Dear all,

At the National Library of Spain Web Archive we recently upgraded from NAS 4.2 to NAS 4.4, and we have a problem with the NAS QA Viewer.

When we enable compression in the NAS 4.4 templates (changing only what is highlighted, and only in the WARC section, not the ARC section), the NAS QA Viewer does not work. The files generated by the harvest are of type warc.gz.


        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">true</boolean>

This is the error that appears in the graphic interface:

[Image: image001.gif]


And this is the error that appears in the logs:

DETALLADO: Caught exception while running batch job on file /netarchive/WARC_Archive/filedir/5-metadata-1.warc, position 4232857:
null
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob.processRecord(GetMetadataArchiveBatchJob.java:95)
at dk.netarkivet.common.utils.archive.ArchiveBatchJob.processFile(ArchiveBatchJob.java:124)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:168)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:115)
at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:246)
at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:428)

dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
DETALLADO: Batch: Job dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob, with arguments: URLMatcher = metadata://[^/]*/crawl/index/cdx.*, mimeMatcher = application/x-cdx finished at Mon Dec 14 11:04:50 CET 2015
dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
INFORMACIÓN: Finished batch job on bitarchive application with id '192.168.81.37_BitApp_2': 'dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob', on filename-pattern: '5-metadata-[0-9]+.(w)?arc' + with result: 1 failures in processing 1 files at 192.168.81.37_BitApp_2
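
The patterns printed in the log above can be tried against the names seen in this thread. In Java, Pattern.matcher throws a NullPointerException when it is given a null input, which is consistent with the stack trace; the thread does not show which value was null. The snippet below is a hypothetical sketch, not the NAS code: the helper names are invented, whole-string matching is an assumption, and the None check only illustrates one way a missing value could be treated as "no match".

```python
import re

# Patterns copied from the log lines above.
URL_MATCHER = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")
MIME_MATCHER = re.compile(r"application/x-cdx")
FILENAME_PATTERN = re.compile(r"5-metadata-[0-9]+.(w)?arc")

# One of the WARC-Target-URI values reported for the metadata file.
TARGET_URI = (
    "metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0"
    "&harvestid=2&jobid=5"
    "&filename=5-2-20151201115953-00000-HDLS005.bne.local.warc.gz"
)


def full_match(pattern, value):
    """Whole-string match that treats a missing value as 'no match'."""
    if value is None:
        return False
    return pattern.fullmatch(value) is not None


def is_cdx_record(url, mime):
    """True only when both the record URL and its MIME type match."""
    return full_match(URL_MATCHER, url) and full_match(MIME_MATCHER, mime)


if __name__ == "__main__":
    print(full_match(FILENAME_PATTERN, "5-metadata-1.warc"))
    print(is_cdx_record(TARGET_URI, "application/x-cdx"))
    print(is_cdx_record(TARGET_URI, None))
```

Under these assumptions the metadata file name and the WARC-Target-URI both match their patterns, so the patterns themselves do not look like the cause; a record with a missing URL or MIME type would be skipped by this sketch instead of raising an error.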

Do you know if there is a way to solve it?

Thank you very much and happy Christmas,


Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102


-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 5994 bytes
Desc: image001.gif
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20160210/cda63103/attachment-0001.gif>

