[Netarchivesuite-users] RV: Question about a problem with NAS QAViewer

Navarro Guillén, Soledad soledad.navarro at bne.es
Wed Feb 10 10:46:08 CET 2016


Hi Sara,

Thank you very much! We're going to try it, and we'll tell you if it works.

Best

Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102

From: sara.aubry at bnf.fr [mailto:sara.aubry at bnf.fr]
Sent: Wednesday, February 10, 2016 10:39
To: netarchivesuite-users at ml.sbforge.org
Cc: Archivoweb; bja at statsbiblioteket.dk; Monzón, Fernando; García Arratia, Juan Carlos; Pérez Morillo, Mar; PE UCI Sistemas TSL
Subject: RE: [Netarchivesuite-users] RV: Question about a problem with NAS QAViewer

Hi everyone,

The NAS 4.4 ViewerProxy does work on compressed WARCs. We are using it.
The lines you sent are crawl.log entries, not CDX entries.

Go to a "Details for Job X" page, set the proxy in your browser, then click
the "Browse reports for jobs" link.
You should see a list of all the CRL files included in your metadata WARC file.

Then click on a CDX link:
metadata://netarchivesuite.bnf.fr/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=28&jobid=17757&filename=BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz
It should contain lines formatted like these:
http://location-vente-immobilier.leparisien.fr/1.1.237.25.56.70 20160209093941 text/html 27716 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999651446 ac51351bffe7cd4a7ae5fac411ddcadb

http://www.leprogres.fr/lyon/lyon-8e145.226.55.19 20160209093940 text/html 160322 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999658163 d7198ed39931a7b6d02b6b4d509d15b3

http://m.jactiv.ouest-france.fr/vie-pratique/forme-sante/comment-bien-composer-son-petit-dejeuner-13116?utm_source=of.fr&utm_medium=coldroiteflux&utm_campaign=liens
of.fr 212.95.72.4 20160209093940 text/html 59567 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999686432 c2e8d7506d76f07cf4fa082f9fd1352f

If they are empty, that would explain why your ViewerProxy is not working. And if that is the case, your deduplication is not working either.
When this happened in our test environment, we managed to fix it by switching to another minor version of Java 1.6. We are still using 1.6.0_17.
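
The same check can be made directly on the metadata file, without going through the browser. What follows is only a minimal sketch, assuming the metadata file is an uncompressed WARC (as 5-metadata-1.warc is in this thread); `cdx_record_sizes` is a hypothetical helper written for illustration and is not part of NetarchiveSuite. It walks the records and reports the payload length of every record whose WARC-Target-URI looks like a CDX index URI, so a length of 0 would correspond to the "empty" case described above.

```python
import io
import re

# Same shape as the URLMatcher quoted in the batch-job log of this thread.
CDX_URI = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")


def cdx_record_sizes(stream):
    """Yield (target_uri, content_length) for each CDX record.

    `stream` must be a seekable binary stream over an uncompressed WARC.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            text = raw.decode("utf-8", "replace").strip()
            if not text:
                break  # empty line ends the header block
            name, _, value = text.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        stream.seek(length, io.SEEK_CUR)  # skip the payload without reading it
        uri = headers.get("warc-target-uri", "")
        if CDX_URI.match(uri):
            yield uri, length
```

Typical use would be to open the metadata file with `open(path, "rb")`, loop over `cdx_record_sizes(fh)`, and look for URIs reported with a length of 0.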

Best,

Sara







From: "Navarro Guillén, Soledad" <soledad.navarro at bne.es>
To: "'bja at statsbiblioteket.dk'" <bja at statsbiblioteket.dk>
Cc: Archivoweb <archivoweb at bne.es>, PE UCI Sistemas TSL <pe.uci.sistemas.tsl at bne.es>, "'netarchivesuite-users at ml.sbforge.org'" <netarchivesuite-users at ml.sbforge.org>, "Pérez Morillo, Mar" <mar.perez at bne.es>, "García Arratia, Juan Carlos" <juancarlos.garcia at bne.es>, "Monzón, Fernando" <f.monzon at bne.es>
Date: 10/02/2016 10:06
Subject: [Netarchivesuite-users] RV: Question about a problem with NAS QA Viewer
Sent by: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________



Hi Bjarne,

Thank you very much for your reply. Below you can see the replies from our IT Team.

We hope they give you some clue that can help us to solve the problem.

Thank you for your help!
Regards,

Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102

From: Bjarne Andersen [mailto:bja at statsbiblioteket.dk]
Sent: Monday, December 21, 2015 12:59
To: 'netarchivesuite-users at ml.sbforge.org'
Cc: Pérez Morillo, Mar; García Arratia, Juan Carlos; Archivoweb; Monzón, Fernando
Subject: RE: Question about a problem with NAS QA Viewer

Very good question.
I'm not sure NAS gets tested with compressed WARCs, since netarchive.dk has always used uncompressed (W)ARCs.

Can you check whether
/netarchive/WARC_Archive/filedir/5-metadata-1.warc
actually looks like a WARC file?

# file /WARC/Archive_2/filedir/5-metadata-1.warc:
/WARC/Archive_2/filedir/5-metadata-1.warc: WARC Archive version 1.0\015

This is the content of the file header
-rwxrwxrwx 1 510 511 20M dic  2 10:10 /WARC/Archive_2/filedir/5-metadata-1.warc

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-12-01T16:19:57Z
WARC-Filename: 5-metadata-1.warc
WARC-Block-Digest: sha1:13007b224f5732e99238ad14ead1304f505a2ce5
WARC-Record-ID: <urn:uuid:63f0dce5-3936-4beb-a6f8-3ee02d5ea96e>
Content-Type: application/warc-fields
Content-Length: 231

software: NetarchiveSuite/Version: 4.4.1 status RELEASE/https://sbforge.org/display/NAS
ip: 192.168.81.60
hostname: HDLS005.bne.local
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 5
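
The `file` check above can be approximated in a script when many files need to be inspected. This is a rough sketch under my own assumptions, not a NetarchiveSuite tool: `sniff_warc` is a hypothetical name, and it only looks at the first bytes (the "WARC/" version line, or the gzip magic number followed by a decompressed "WARC/"), so it says nothing about whether the rest of the file is valid.

```python
import gzip
import zlib

WARC_MAGIC = b"WARC/"
GZIP_MAGIC = b"\x1f\x8b"


def sniff_warc(path):
    """Return 'warc', 'warc.gz' or 'unknown' based on the first bytes only."""
    with open(path, "rb") as fh:
        head = fh.read(5)
    if head.startswith(WARC_MAGIC):
        return "warc"
    if head.startswith(GZIP_MAGIC):
        try:
            # Decompress just enough to see the start of the first record.
            with gzip.open(path, "rb") as gz:
                if gz.read(5) == WARC_MAGIC:
                    return "warc.gz"
        except (OSError, EOFError, zlib.error):
            pass  # truncated or corrupt gzip data
    return "unknown"
```

For the file discussed here, `sniff_warc("/WARC/Archive_2/filedir/5-metadata-1.warc")` would be expected to return "warc", given the header shown above.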

And are there any entries in this file that look like CDX files for every .warc.gz file that was generated during the harvest?

These are the entries in /5-metadata-1.warc that concern the warc.gz files:
5-2-20151201115953-00000-HDLS005.bne.local.warc.gz -1 -1 642244893
5-2-20151201120002-00004-HDLS005.bne.local.warc.gz -1 -1 972523874
5-2-20151201115953-00002-HDLS005.bne.local.warc.gz -1 -1 733763695
5-2-20151201120002-00003-HDLS005.bne.local.warc.gz -1 -1 587720049
5-2-20151201115953-00001-HDLS005.bne.local.warc.gz -1 -1 869623474
2015-12-01 11:59:53.106 INFORMACIÓN thread-125 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.106 INFORMACIÓN thread-123 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.107 INFORMACIÓN thread-127 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-67 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-109 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz.open
2015-12-01 16:11:37.489 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz, size 642244893
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz, size 972523874
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz, size 733763695
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz, size 587720049
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz, size 869623474
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00000-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00004-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00002-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00003-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00001-HDLS005.bne.local.warc.gz


The CDX entries are generated by the crawler just after the crawl finishes, and it might be this code that has failed because of the gzipped WARCs.
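
To answer the "for every .warc.gz file" part of the question systematically, the `filename` parameter of each CDX WARC-Target-URI can be compared with the list of harvested files. The sketch below is only an illustration: `cdx_filename` and `missing_cdx` are hypothetical helpers, and the URI layout is assumed from the WARC-Target-URI lines quoted above. Note that it only checks that a CDX record exists per file, not that the record has any content.

```python
from urllib.parse import parse_qs, urlsplit


def cdx_filename(target_uri):
    """Return the `filename` parameter of a CDX metadata URI, or None."""
    parts = urlsplit(target_uri)
    if parts.scheme != "metadata" or not parts.path.startswith("/crawl/index/cdx"):
        return None
    values = parse_qs(parts.query).get("filename")
    return values[0] if values else None


def missing_cdx(warc_files, target_uris):
    """Return the harvested files that have no CDX record, sorted by name."""
    indexed = {cdx_filename(uri) for uri in target_uris}
    return sorted(set(warc_files) - indexed)
```

With the five files and five WARC-Target-URI lines listed in this message, `missing_cdx` would return an empty list, which is consistent with the records being present but possibly empty.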

These are examples of the CDX entries that contain the file:
2015-12-01T12:00:04.450Z   404       2842 http://www.20minutos.es/34616581/20minutos.es/portada_Position3 X http://www.20minutos.es/ text/html #020 20151201120004275+164 sha1:MMF5BQCSKN4MY35645HOLIOZ3CCCJHHE - content-size:3218
2015-12-01T12:00:04.452Z   302        198 http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg ER http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/ text/html #069 20151201120004156+294 sha1:HKJLZMVYW736P4I4UZA3JTFPSNA3F7ER - content-size:612
2015-12-01T12:00:04.515Z   200       9722 http://www.grupo20minutos.com/img/gon.png EXE http://www.grupo20minutos.com/contacto.html image/png #016 20151201120004474+39 sha1:23HRHBEHEB3OCXMILOEGG7YLIOALXOIV - content-size:10030
2015-12-01T12:00:04.530Z   200        246 http://www.sixtelekurs.fr/finfeed/antena3/images/bt_modulo_ibex-on.gif EE http://www.sixtelekurs.fr/finfeed/antena3/portada_or.hts image/gif #007 20151201120004255+274 sha1:2AITMHIA45NDTX77OKJVOLOBLWNPPMIR - content-size:460
2015-12-01T12:00:04.533Z   200      37095 http://publi.atresadvertising.com/autopromos/banner_afilados_clasico_980x90px.gif ERR http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg image/gif #002 20151201120004453+77 sha1:BCB4TS63C4V7P4PJDJW5N3W66KVGTMEK - content-size:37423
2015-12-01T12:00:04.551Z   404       1245 http://logi242.xiti.com/robots.txt EEP http://logi242.xiti.com/hit.xiti?s=513357&s2=1&p=portada::sin_url&di=&an=&ac= text/html #034 20151201120004424+126 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 - content-size:1424
2015-12-01T12:00:04.581Z     1         67 dns:googleads.g.doubleclick.net EXP http://googleads.g.doubleclick.net/pagead/viewthroughconversion/941057382/?value=0&guid=ON&script=0 text/dns #041 20151201120004577+3 sha1:KNTS5M37XFDS433P5BZH4EUWAL3F6NQP - content-size:67
2015-12-01T12:00:04.590Z   200         26 https://download.macromedia.com/robots.txt EEP https://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab text/plain #050 20151201120004004+586 sha1:MNSXZO35OCDMK2YM2TS4NGM3W2BWMSDI - content-size:272
2015-12-01T12:00:04.597Z   200      27289 http://pagead2.googlesyndication.com/pagead/show_ads.js EX http://www.google.com/recaptcha/api/js/recaptcha_ajax.js text/javascript #100 20151201120004521+67 sha1:ZYPHXVGKBLVTSFXLPEIBSIY3OE6AEXIO - content-size:27769,3t
2015-12-01T12:00:04.597Z   200      22736 https://fonts.gstatic.com/s/montserrat/v6/IQHow_FEYlDC4Gzy_m8fcvEr6Hm6RMS0v1dtXsGir4g.ttf EE https://fonts.googleapis.com/css?family=Montserrat:700,400 font/ttf #033 20151201120004408+186 sha1:J66G67DFB45TQSWJNXLUZANXI2KSSRRM - content-size:23225
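
As the reply at the top of this thread points out, lines of this shape are crawl.log entries rather than CDX entries. One way to tell them apart is to split a line into the twelve whitespace-separated fields that, to my understanding, a Heritrix crawl.log line carries. The following is a sketch only: the field names are my own labels, `parse_crawl_log_line` is a hypothetical helper, and the sample line is the DNS entry quoted above with its wrapped pieces rejoined.

```python
from collections import namedtuple

# Field labels are illustrative; they follow the usual crawl.log column order.
CrawlLogEntry = namedtuple(
    "CrawlLogEntry",
    "timestamp status size uri path referrer mime thread "
    "fetch_time digest source_tag annotations",
)


def parse_crawl_log_line(line):
    """Split one crawl.log line into its twelve fields."""
    fields = line.strip().split(None, 11)
    if len(fields) != 12:
        raise ValueError("expected 12 fields, got %d" % len(fields))
    return CrawlLogEntry(*fields)


SAMPLE = (
    "2015-12-01T12:00:04.581Z     1         67 dns:googleads.g.doubleclick.net EXP "
    "http://googleads.g.doubleclick.net/pagead/viewthroughconversion/941057382/"
    "?value=0&guid=ON&script=0 text/dns #041 20151201120004577+3 "
    "sha1:KNTS5M37XFDS433P5BZH4EUWAL3F6NQP - content-size:67"
)

entry = parse_crawl_log_line(SAMPLE)
print(entry.uri, entry.mime, entry.annotations)
```

A CDX line, by contrast, names the archive file and an offset into it, which is what the ViewerProxy needs in order to locate a record.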

Best
Bjarne Andersen

From: NetarchiveSuite-users [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On Behalf Of Navarro Guillén, Soledad
Sent: Monday, December 21, 2015 12:53 PM
To: 'netarchivesuite-users at ml.sbforge.org' <netarchivesuite-users at ml.sbforge.org>
Cc: Pérez Morillo, Mar <mar.perez at bne.es>; García Arratia, Juan Carlos <juancarlos.garcia at bne.es>; Archivoweb <archivoweb at bne.es>; Monzón, Fernando <f.monzon at bne.es>
Subject: [Netarchivesuite-users] Question about a problem with NAS QA Viewer

Dear all,

At the National Library of Spain Web Archive we recently changed from NAS 4.2 to NAS 4.4, and we now have a problem with the NAS QA Viewer.

When we enable compression in the NAS 4.4 templates (changing only what is highlighted, and only in the WARC section, not the ARC section), the NAS QA Viewer does not work. The files generated by the harvest are of type warc.gz.


       <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
         <map name="rules">
         </map>
       </newObject>
       <boolean name="compress">true</boolean>

This is the error that appears in the graphical interface:

[cid:image001.gif at 01D138D8.B81FFAD0]


And this is the error that appears in the logs:

DETALLADO: Caught exception while running batch job on file /netarchive/WARC_Archive/filedir/5-metadata-1.warc, position 4232857:
null
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob.processRecord(GetMetadataArchiveBatchJob.java:95)
at dk.netarkivet.common.utils.archive.ArchiveBatchJob.processFile(ArchiveBatchJob.java:124)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:168)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:115)
at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:246)
at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:428)

dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
DETALLADO: Batch: Job dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob, with arguments: URLMatcher = metadata://[^/]*/crawl/index/cdx.*, mimeMatcher = application/x-cdx finished at Mon Dec 14 11:04:50 CET 2015
dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
INFORMACIÓN: Finished batch job on bitarchive application with id '192.168.81.37_BitApp_2': 'dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob', on filename-pattern: '5-metadata-[0-9]+.(w)?arc' + with result: 1 failures in processing 1 files at 192.168.81.37_BitApp_2
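
The stack trace shows `Pattern.matcher` receiving a null input inside `GetMetadataArchiveBatchJob.processRecord`, which suggests that at position 4232857 the job found a record with no value for one of the fields it tries to match. Which field that was cannot be told from the log. As an illustration only, the sketch below rebuilds the three patterns quoted in the log in Python and adds a guard against missing values. `record_matches` is a hypothetical function, and whether the real job requires a full match or a partial one is an assumption on my part.

```python
import re

# Patterns copied from the batch-job log lines above.
URL_MATCHER = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")
MIME_MATCHER = re.compile(r"application/x-cdx")
FILENAME_PATTERN = re.compile(r"5-metadata-[0-9]+.(w)?arc")


def record_matches(url, mime):
    """Return True if a record looks like a CDX record.

    A record with a missing URL or MIME type is treated as a non-match
    instead of being passed to the regex engine.
    """
    if url is None or mime is None:
        return False
    return bool(URL_MATCHER.fullmatch(url)) and bool(MIME_MATCHER.fullmatch(mime))
```

Under these assumptions the file name 5-metadata-1.warc matches `FILENAME_PATTERN`, so the batch job did select the right file; the failure came later, while iterating over its records.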

Do you know if there is a way to solve it?

Thank you very much and happy Christmas,


Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 5994 bytes
Desc: image001.gif
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20160210/4dd6a34e/attachment-0001.gif>

