[Netarchivesuite-users] RV: Question about a problem with NAS QAViewer

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Feb 10 10:38:40 CET 2016


Hi everyone,

NAS 4.4 ViewerProxy does work on compressed WARCs; we are using it.
The lines you sent are crawl.log entries, not CDX entries.

If you go to a "Details for Job X" page, set the proxy in your browser, and 
then click on the "Browse reports for jobs" link, you should get a list of 
all the CRL files included in your metadata WARC file.

Then click on a CDX link, for example:
metadata://netarchivesuite.bnf.fr/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=28&jobid=17757&filename=BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz
It should contain lines formatted like these:
http://location-vente-immobilier.leparisien.fr/1.1.2 37.25.56.70 20160209093941 text/html 27716 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999651446 ac51351bffe7cd4a7ae5fac411ddcadb

http://www.leprogres.fr/lyon/lyon-8e 145.226.55.19 20160209093940 text/html 160322 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999658163 d7198ed39931a7b6d02b6b4d509d15b3

http://m.jactiv.ouest-france.fr/vie-pratique/forme-sante/comment-bien-composer-son-petit-dejeuner-13116?utm_source=of.fr&utm_medium=coldroiteflux&utm_campaign=liens
of.fr 212.95.72.4 20160209093940 text/html 59567 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999686432 c2e8d7506d76f07cf4fa082f9fd1352f

If the CDX records are empty, that would explain why your ViewerProxy is 
not working. And if this is the case, your deduplication is not working 
either.
When this happened in our test environment, we managed to fix it by 
switching to another minor version of Java 1.6. We are still using 1.6.0_17.
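
For anyone who wants to check a CDX record outside the browser, here is a minimal Python sketch that splits one line of the kind quoted above into fields. The field names (URL, IP, timestamp, MIME type, length, file name, offset, checksum) are only a reading of the sample lines and are assumptions, not taken from the NAS documentation; parse_cdx_line is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class CdxEntry(NamedTuple):
    # Field names are assumptions inferred from the sample lines.
    url: str
    ip: str
    timestamp: str
    mime_type: str
    length: int
    filename: str
    offset: int
    checksum: str


def parse_cdx_line(line: str) -> Optional[CdxEntry]:
    """Return a CdxEntry for an 8-field line, or None if the line does not fit."""
    parts = line.split()
    if len(parts) != 8:
        return None
    url, ip, timestamp, mime_type, length, filename, offset, checksum = parts
    # Length and offset must be plain non-negative integers.
    if not (length.isdigit() and offset.isdigit()):
        return None
    return CdxEntry(url, ip, timestamp, mime_type,
                    int(length), filename, int(offset), checksum)
```

A crawl.log line has more whitespace-separated fields than this, so the helper returns None for it, which is one quick way to tell the two kinds of lines apart.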

Best,

Sara







From:    "Navarro Guillén,Soledad" <soledad.navarro at bne.es>
To:      "'bja at statsbiblioteket.dk'" <bja at statsbiblioteket.dk>
Cc:      Archivoweb <archivoweb at bne.es>, PE UCI Sistemas TSL 
<pe.uci.sistemas.tsl at bne.es>, "'netarchivesuite-users at ml.sbforge.org'" 
<netarchivesuite-users at ml.sbforge.org>, "Pérez Morillo,Mar" 
<mar.perez at bne.es>, "García Arratia,Juan Carlos" 
<juancarlos.garcia at bne.es>, "Monzón,Fernando" <f.monzon at bne.es>
Date:    10/02/2016 10:06
Subject: [Netarchivesuite-users] RV: Question about a problem with NAS QA Viewer
Sent by: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org>



Hi Bjarne,
 
Thank you very much for your reply. Below you can see the replies from our 
IT Team. 
 
We hope they give you some clue that can help us to solve the problem.
 
Thank you for your help!
Regards,
 
Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102
 
From: Bjarne Andersen [mailto:bja at statsbiblioteket.dk] 
Sent: Monday, 21 December 2015 12:59
To: 'netarchivesuite-users at ml.sbforge.org'
Cc: Pérez Morillo, Mar; García Arratia, Juan Carlos; Archivoweb; Monzón, Fernando
Subject: RE: Question about a problem with NAS QA Viewer
 
Very good question.
I'm not sure NAS gets tested with compressed WARCs, since netarchive.dk 
has always used non-compressed (W)ARCs.
 
Can you check whether
/netarchive/WARC_Archive/filedir/5-metadata-1.warc
actually looks like a WARC file?
 
# file /WARC/Archive_2/filedir/5-metadata-1.warc:
/WARC/Archive_2/filedir/5-metadata-1.warc: WARC Archive version 1.0\015

This is the file listing, followed by the content of the file header:
-rwxrwxrwx 1 510 511 20M dic  2 10:10 /WARC/Archive_2/filedir/5-metadata-1.warc

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-12-01T16:19:57Z
WARC-Filename: 5-metadata-1.warc
WARC-Block-Digest: sha1:13007b224f5732e99238ad14ead1304f505a2ce5
WARC-Record-ID: <urn:uuid:63f0dce5-3936-4beb-a6f8-3ee02d5ea96e>
Content-Type: application/warc-fields
Content-Length: 231

software: NetarchiveSuite/Version: 4.4.1 status RELEASE/
https://sbforge.org/display/NAS
ip: 192.168.81.60
hostname: HDLS005.bne.local
conformsTo: 
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 5
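
The same check can be scripted. Below is a minimal Python sketch, assuming only that a gzip file starts with the magic bytes 1f 8b and that a WARC record starts with a "WARC/" version line; sniff_warc is a hypothetical helper name, not part of NAS.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member


def sniff_warc(path: str) -> str:
    """Classify a file as 'plain WARC', 'gzip WARC' or 'not a WARC'."""
    with open(path, "rb") as fh:
        head = fh.read(2)
    if head == GZIP_MAGIC:
        # Decompress just enough to read the first line of the first record.
        with gzip.open(path, "rb") as fh:
            first_line = fh.readline()
        kind = "gzip"
    else:
        with open(path, "rb") as fh:
            first_line = fh.readline()
        kind = "plain"
    if first_line.startswith(b"WARC/"):
        return kind + " WARC"
    return "not a WARC"
```

On the metadata file shown above this should report a plain WARC, which matches the output of the file command; the harvested .warc.gz data files would be expected to report a gzip WARC.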

And are there any entries in this file that look like CDX files, one for 
every .warc.gz file that was generated during the harvest?
 
These are the entries in /5-metadata-1.warc that concern the warc.gz 
files:
5-2-20151201115953-00000-HDLS005.bne.local.warc.gz -1 -1 642244893
5-2-20151201120002-00004-HDLS005.bne.local.warc.gz -1 -1 972523874
5-2-20151201115953-00002-HDLS005.bne.local.warc.gz -1 -1 733763695
5-2-20151201120002-00003-HDLS005.bne.local.warc.gz -1 -1 587720049
5-2-20151201115953-00001-HDLS005.bne.local.warc.gz -1 -1 869623474
2015-12-01 11:59:53.106 INFORMACIÓN thread-125 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.106 INFORMACIÓN thread-123 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.107 INFORMACIÓN thread-127 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-67 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-109 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz.open
2015-12-01 16:11:37.489 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz, size 642244893
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz, size 972523874
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz, size 733763695
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz, size 587720049
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz, size 869623474
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00000-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00004-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00002-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00003-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00001-HDLS005.bne.local.warc.gz
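
To check that there is one CDX record per data file, the WARC-Target-URI values above can be taken apart with the Python standard library. This is only a sketch: parse_cdx_metadata_uri is a hypothetical helper, and the assumption that harvestid, jobid and filename are always present is based solely on the URIs quoted in this thread.

```python
from urllib.parse import parse_qs, urlsplit


def parse_cdx_metadata_uri(uri: str) -> dict:
    """Extract host, harvest id, job id and data file name from a CDX metadata URI."""
    parts = urlsplit(uri)
    if parts.scheme != "metadata" or parts.path != "/crawl/index/cdx":
        raise ValueError("not a CDX metadata URI: " + uri)
    query = parse_qs(parts.query)
    return {
        "host": parts.netloc,
        # Assumed to be present, as in the URIs quoted in this thread.
        "harvestid": int(query["harvestid"][0]),
        "jobid": int(query["jobid"][0]),
        "filename": query["filename"][0],
    }
```

Collecting the "filename" values from all such records and comparing them with the list of .warc.gz files closed by the crawler would show whether any CDX record is missing.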



The CDX entries are generated by the crawler just after the crawl finishes, 
and it might be this code that failed because of the gzipped WARCs.

These are examples of the CDX entries that contain the file:
2015-12-01T12:00:04.450Z   404       2842 http://www.20minutos.es/34616581/20minutos.es/portada_Position3 X http://www.20minutos.es/ text/html #020 20151201120004275+164 sha1:MMF5BQCSKN4MY35645HOLIOZ3CCCJHHE - content-size:3218
2015-12-01T12:00:04.452Z   302        198 http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg ER http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/ text/html #069 20151201120004156+294 sha1:HKJLZMVYW736P4I4UZA3JTFPSNA3F7ER - content-size:612
2015-12-01T12:00:04.515Z   200       9722 http://www.grupo20minutos.com/img/gon.png EXE http://www.grupo20minutos.com/contacto.html image/png #016 20151201120004474+39 sha1:23HRHBEHEB3OCXMILOEGG7YLIOALXOIV - content-size:10030
2015-12-01T12:00:04.530Z   200        246 http://www.sixtelekurs.fr/finfeed/antena3/images/bt_modulo_ibex-on.gif EE http://www.sixtelekurs.fr/finfeed/antena3/portada_or.hts image/gif #007 20151201120004255+274 sha1:2AITMHIA45NDTX77OKJVOLOBLWNPPMIR - content-size:460
2015-12-01T12:00:04.533Z   200      37095 http://publi.atresadvertising.com/autopromos/banner_afilados_clasico_980x90px.gif ERR http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg image/gif #002 20151201120004453+77 sha1:BCB4TS63C4V7P4PJDJW5N3W66KVGTMEK - content-size:37423
2015-12-01T12:00:04.551Z   404       1245 http://logi242.xiti.com/robots.txt EEP http://logi242.xiti.com/hit.xiti?s=513357&s2=1&p=portada::sin_url&di=&an=&ac= text/html #034 20151201120004424+126 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 - content-size:1424
2015-12-01T12:00:04.581Z     1         67 dns:googleads.g.doubleclick.net EXP http://googleads.g.doubleclick.net/pagead/viewthroughconversion/941057382/?value=0&guid=ON&script=0 text/dns #041 20151201120004577+3 sha1:KNTS5M37XFDS433P5BZH4EUWAL3F6NQP - content-size:67
2015-12-01T12:00:04.590Z   200         26 https://download.macromedia.com/robots.txt EEP https://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab text/plain #050 20151201120004004+586 sha1:MNSXZO35OCDMK2YM2TS4NGM3W2BWMSDI - content-size:272
2015-12-01T12:00:04.597Z   200      27289 http://pagead2.googlesyndication.com/pagead/show_ads.js EX http://www.google.com/recaptcha/api/js/recaptcha_ajax.js text/javascript #100 20151201120004521+67 sha1:ZYPHXVGKBLVTSFXLPEIBSIY3OE6AEXIO - content-size:27769,3t
2015-12-01T12:00:04.597Z   200      22736 https://fonts.gstatic.com/s/montserrat/v6/IQHow_FEYlDC4Gzy_m8fcvEr6Hm6RMS0v1dtXsGir4g.ttf EE https://fonts.googleapis.com/css?family=Montserrat:700,400 font/ttf #033 20151201120004408+186 sha1:J66G67DFB45TQSWJNXLUZANXI2KSSRRM - content-size:23225

Best
Bjarne Andersen
 
From: NetarchiveSuite-users [
mailto:netarchivesuite-users-bounces at ml.sbforge.org] On Behalf Of Navarro 
Guillén, Soledad
Sent: Monday, December 21, 2015 12:53 PM
To: 'netarchivesuite-users at ml.sbforge.org' <
netarchivesuite-users at ml.sbforge.org>
Cc: Pérez Morillo, Mar <mar.perez at bne.es>; García Arratia, Juan Carlos <
juancarlos.garcia at bne.es>; Archivoweb <archivoweb at bne.es>; Monzón, 
Fernando <f.monzon at bne.es>
Subject: [Netarchivesuite-users] Question about a problem with NAS QA 
Viewer
 
Dear all,
 
At the National Library of Spain Web Archive we have recently upgraded from 
NAS 4.2 to NAS 4.4, and we have a problem with the NAS QA Viewer.
 
When we enable compression in the NAS 4.4 templates (changing only what is 
highlighted, and only in the WARC section, not the ARC section), the NAS QA 
Viewer does not work. The files generated by the harvest are of type warc.gz.


        <newObject name="WARCArchiver#decide-rules" 
class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">true</boolean>

This is the error that appears in the graphical interface:




And this is the error that appears in the logs:

DETALLADO: Caught exception while running batch job on file /netarchive/WARC_Archive/filedir/5-metadata-1.warc, position 4232857:
null
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob.processRecord(GetMetadataArchiveBatchJob.java:95)
at dk.netarkivet.common.utils.archive.ArchiveBatchJob.processFile(ArchiveBatchJob.java:124)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:168)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:115)
at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:246)
at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:428)


dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
DETALLADO: Batch: Job dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob, with arguments: URLMatcher = metadata://[^/]*/crawl/index/cdx.*, mimeMatcher = application/x-cdx finished at Mon Dec 14 11:04:50 CET 2015
dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
INFORMACIÓN: Finished batch job on bitarchive application with id '192.168.81.37_BitApp_2': 'dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob', on filename-pattern: '5-metadata-[0-9]+.(w)?arc' + with result: 1 failures in processing 1 files at 192.168.81.37_BitApp_2
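
A note on the stack trace: in the JDK, a NullPointerException inside Matcher.getTextLength, reached through Pattern.matcher, is what happens when the text handed to the regex is null. That suggests (this is an assumption, the NAS source has not been checked here) that a record in the metadata file had no URL or no MIME type when the two matchers from the log were applied. The Python sketch below only illustrates the idea of a null-safe filter using the two patterns from the log; record_matches is a hypothetical name and the use of full matching is also an assumption.

```python
import re
from typing import Optional

# Patterns copied from the batch job arguments in the log above.
URL_MATCHER = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")
MIME_MATCHER = re.compile(r"application/x-cdx")


def record_matches(url: Optional[str], mime: Optional[str]) -> bool:
    """Return True only if both fields are present and match their patterns."""
    # A record with a missing field is skipped instead of raising an error.
    if url is None or mime is None:
        return False
    return (URL_MATCHER.fullmatch(url) is not None
            and MIME_MATCHER.fullmatch(mime) is not None)
```

With such a guard, a record lacking one of the two fields would simply be ignored, and the remaining CDX records in the file could still be processed.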
 
Do you know if there is a way to solve it?
 
Thank you very much and happy Christmas,
 
 
Soledad Navarro
 