[Netarchivesuite-users] RV: Question about a problem with NAS QAViewer
sara.aubry at bnf.fr
Wed Feb 10 10:38:40 CET 2016
Hi everyone,
NAS 4.4 ViewerProxy actually works on compressed WARCs. We are using it.
The lines you sent are crawl.log entries, not CDX entries.
If you go to a "Details for Job X" page, set the proxy in your browser, then
click on the "Browse reports for jobs" link, you should get a list of all
the CRL files included in your metadata WARC file.
Then click on a CDX link such as:
metadata://netarchivesuite.bnf.fr/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=28&jobid=17757&filename=BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz
It should contain lines formatted like these:
http://location-vente-immobilier.leparisien.fr/1.1.2 37.25.56.70 20160209093941 text/html 27716 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999651446 ac51351bffe7cd4a7ae5fac411ddcadb
http://www.leprogres.fr/lyon/lyon-8e 145.226.55.19 20160209093940 text/html 160322 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999658163 d7198ed39931a7b6d02b6b4d509d15b3
http://m.jactiv.ouest-france.fr/vie-pratique/forme-sante/comment-bien-composer-son-petit-dejeuner-13116?utm_source=of.fr&utm_medium=coldroiteflux&utm_campaign=liens of.fr 212.95.72.4 20160209093940 text/html 59567 BnF-17757-28-20160209090150-00000-ciblee_2016_gulliver115.bnf.fr.warc.gz 999686432 c2e8d7506d76f07cf4fa082f9fd1352f
If they are empty, that should explain why your ViewerProxy is not
working. And if this is the case, your deduplication is not working
either.
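To tell quickly whether a line is one of these CDX lines or a crawl.log entry, here is a minimal Python sketch. It assumes that a CDX line has exactly eight whitespace-separated fields, as in the first two samples above. The field names (url, ip, timestamp, mime, length, filename, offset, checksum) are only my reading of those samples, and parse_cdx_line is a hypothetical helper, not part of NAS.

```python
from typing import NamedTuple, Optional


class CdxEntry(NamedTuple):
    # Field names are assumptions based on the sample lines, not an official schema.
    url: str
    ip: str
    timestamp: str
    mime: str
    length: int
    filename: str
    offset: int
    checksum: str


def parse_cdx_line(line: str) -> Optional[CdxEntry]:
    """Return a CdxEntry if the line looks like the samples, else None."""
    parts = line.split()
    if len(parts) != 8:
        return None
    url, ip, timestamp, mime, length, filename, offset, checksum = parts
    # The samples carry a 14-digit timestamp and two plain integers.
    if not (len(timestamp) == 14 and timestamp.isdigit()):
        return None
    if not (length.isdigit() and offset.isdigit()):
        return None
    return CdxEntry(url, ip, timestamp, mime, int(length),
                    filename, int(offset), checksum)
```

A crawl.log entry starts with an ISO timestamp and has more than eight fields, so it comes back as None. An empty CDX record gives no entries at all.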
When this happened in our test environment, we managed to fix it by
using another minor version of Java 1.6. We are still using 1.6.0_17.
Best,
Sara
From: "Navarro Guillén,Soledad" <soledad.navarro at bne.es>
To: "'bja at statsbiblioteket.dk'" <bja at statsbiblioteket.dk>
Cc: Archivoweb <archivoweb at bne.es>, PE UCI Sistemas TSL
<pe.uci.sistemas.tsl at bne.es>, "'netarchivesuite-users at ml.sbforge.org'"
<netarchivesuite-users at ml.sbforge.org>, "Pérez Morillo,Mar"
<mar.perez at bne.es>, "García Arratia,Juan Carlos"
<juancarlos.garcia at bne.es>, "Monzón,Fernando" <f.monzon at bne.es>
Date: 10/02/2016 10:06
Subject: [Netarchivesuite-users] RV: Question about a problem with NAS QA Viewer
Sent by: NetarchiveSuite-users
<netarchivesuite-users-bounces at ml.sbforge.org>
Hi Bjarne,
Thank you very much for your reply. Below you can see the replies from our
IT team.
We hope they give you some clues that can help us solve the problem.
Thank you for your help!
Regards,
Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102
From: Bjarne Andersen [mailto:bja at statsbiblioteket.dk]
Sent: Monday, 21 December 2015 12:59
To: 'netarchivesuite-users at ml.sbforge.org'
Cc: Pérez Morillo, Mar; García Arratia, Juan Carlos; Archivoweb; Monzón, Fernando
Subject: RE: Question about a problem with NAS QA Viewer
Very good question.
I'm not sure NAS gets tested with compressed WARCs, since netarchive.dk
has always used non-compressed (W)ARCs.
Can you check whether
/netarchive/WARC_Archive/filedir/5-metadata-1.warc
actually looks like a WARC file?
# file /WARC/Archive_2/filedir/5-metadata-1.warc:
/WARC/Archive_2/filedir/5-metadata-1.warc: WARC Archive version 1.0\015
This is the content of the file header
-rwxrwxrwx 1 510 511 20M dic 2 10:10
/WARC/Archive_2/filedir/5-metadata-1.warc
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-12-01T16:19:57Z
WARC-Filename: 5-metadata-1.warc
WARC-Block-Digest: sha1:13007b224f5732e99238ad14ead1304f505a2ce5
WARC-Record-ID: <urn:uuid:63f0dce5-3936-4beb-a6f8-3ee02d5ea96e>
Content-Type: application/warc-fields
Content-Length: 231
software: NetarchiveSuite/Version: 4.4.1 status RELEASE/
https://sbforge.org/display/NAS
ip: 192.168.81.60
hostname: HDLS005.bne.local
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 5
And are there any entries in this file that look like CDX files for every
.warc.gz file that was generated during the harvest?
These are the entries in /5-metadata-1.warc that concern the
warc.gz files:
5-2-20151201115953-00000-HDLS005.bne.local.warc.gz -1 -1 642244893
5-2-20151201120002-00004-HDLS005.bne.local.warc.gz -1 -1 972523874
5-2-20151201115953-00002-HDLS005.bne.local.warc.gz -1 -1 733763695
5-2-20151201120002-00003-HDLS005.bne.local.warc.gz -1 -1 587720049
5-2-20151201115953-00001-HDLS005.bne.local.warc.gz -1 -1 869623474
2015-12-01 11:59:53.106 INFORMACIÓN thread-125 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.106 INFORMACIÓN thread-123 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz.open
2015-12-01 11:59:53.107 INFORMACIÓN thread-127 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-67 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz.open
2015-12-01 12:00:02.515 INFORMACIÓN thread-109 org.archive.io.WriterPoolMember.createFile() Opened /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz.open
2015-12-01 16:11:37.489 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00000-HDLS005.bne.local.warc.gz, size 642244893
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00004-HDLS005.bne.local.warc.gz, size 972523874
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00002-HDLS005.bne.local.warc.gz, size 733763695
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201120002-00003-HDLS005.bne.local.warc.gz, size 587720049
2015-12-01 16:11:37.490 INFORMACIÓN thread-96 org.archive.io.WriterPoolMember.close() Closed /netarchive/BNE/harvester_high/5_1448971188988/warcs/5-2-20151201115953-00001-HDLS005.bne.local.warc.gz, size 869623474
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00000-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00004-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00002-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201120002-00003-HDLS005.bne.local.warc.gz
WARC-Target-URI: metadata://netarkivet.dk/crawl/index/cdx?majorversion=2&minorversion=0&harvestid=2&jobid=5&filename=5-2-20151201115953-00001-HDLS005.bne.local.warc.gz
The CDX entries are generated by the crawler just after the crawl finishes,
and it might be this code that has failed because of the gzipped WARCs.
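One way to check this is to walk the metadata WARC and print the payload size of each CDX record. A size of 0 would mean the CDX generation produced nothing for that file. The following is only a rough Python sketch under some assumptions: the metadata file is an uncompressed WARC as shown above, every record has a Content-Length header, and the function name iter_cdx_records is made up for this illustration. The URI pattern is the URLMatcher from the batch job log.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple

# Same pattern as the URLMatcher argument reported in the batch job log.
CDX_URI = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")


def iter_cdx_records(stream: BinaryIO) -> Iterator[Tuple[str, int]]:
    """Yield (WARC-Target-URI, Content-Length) for each CDX record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the record header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        stream.seek(length, io.SEEK_CUR)  # skip the payload itself
        uri = headers.get("warc-target-uri", "")
        if CDX_URI.fullmatch(uri):
            yield uri, length
```

It could be used as `with open("/WARC/Archive_2/filedir/5-metadata-1.warc", "rb") as f: print(list(iter_cdx_records(f)))`.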
These are examples of the CDX entries that contain the file:
2015-12-01T12:00:04.450Z 404 2842 http://www.20minutos.es/34616581/20minutos.es/portada_Position3 X http://www.20minutos.es/ text/html #020 20151201120004275+164 sha1:MMF5BQCSKN4MY35645HOLIOZ3CCCJHHE - content-size:3218
2015-12-01T12:00:04.452Z 302 198 http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg ER http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/ text/html #069 20151201120004156+294 sha1:HKJLZMVYW736P4I4UZA3JTFPSNA3F7ER - content-size:612
2015-12-01T12:00:04.515Z 200 9722 http://www.grupo20minutos.com/img/gon.png EXE http://www.grupo20minutos.com/contacto.html image/png #016 20151201120004474+39 sha1:23HRHBEHEB3OCXMILOEGG7YLIOALXOIV - content-size:10030
2015-12-01T12:00:04.530Z 200 246 http://www.sixtelekurs.fr/finfeed/antena3/images/bt_modulo_ibex-on.gif EE http://www.sixtelekurs.fr/finfeed/antena3/portada_or.hts image/gif #007 20151201120004255+274 sha1:2AITMHIA45NDTX77OKJVOLOBLWNPPMIR - content-size:460
2015-12-01T12:00:04.533Z 200 37095 http://publi.atresadvertising.com/autopromos/banner_afilados_clasico_980x90px.gif ERR http://www.smartadserver.com/call/pubi/15272/114187/4634/S/%5Btimestamp%5D/%20http://publi.atresadvertising.com/autopromos/MPW980x90px.jpg image/gif #002 20151201120004453+77 sha1:BCB4TS63C4V7P4PJDJW5N3W66KVGTMEK - content-size:37423
2015-12-01T12:00:04.551Z 404 1245 http://logi242.xiti.com/robots.txt EEP http://logi242.xiti.com/hit.xiti?s=513357&s2=1&p=portada::sin_url&di=&an=&ac= text/html #034 20151201120004424+126 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 - content-size:1424
2015-12-01T12:00:04.581Z 1 67 dns:googleads.g.doubleclick.net EXP http://googleads.g.doubleclick.net/pagead/viewthroughconversion/941057382/?value=0&guid=ON&script=0 text/dns #041 20151201120004577+3 sha1:KNTS5M37XFDS433P5BZH4EUWAL3F6NQP - content-size:67
2015-12-01T12:00:04.590Z 200 26 https://download.macromedia.com/robots.txt EEP https://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab text/plain #050 20151201120004004+586 sha1:MNSXZO35OCDMK2YM2TS4NGM3W2BWMSDI - content-size:272
2015-12-01T12:00:04.597Z 200 27289 http://pagead2.googlesyndication.com/pagead/show_ads.js EX http://www.google.com/recaptcha/api/js/recaptcha_ajax.js text/javascript #100 20151201120004521+67 sha1:ZYPHXVGKBLVTSFXLPEIBSIY3OE6AEXIO - content-size:27769,3t
2015-12-01T12:00:04.597Z 200 22736 https://fonts.gstatic.com/s/montserrat/v6/IQHow_FEYlDC4Gzy_m8fcvEr6Hm6RMS0v1dtXsGir4g.ttf EE https://fonts.googleapis.com/css?family=Montserrat:700,400 font/ttf #033 20151201120004408+186 sha1:J66G67DFB45TQSWJNXLUZANXI2KSSRRM - content-size:23225
Best
Bjarne Andersen
From: NetarchiveSuite-users [
mailto:netarchivesuite-users-bounces at ml.sbforge.org] On Behalf Of Navarro
Guillén, Soledad
Sent: Monday, December 21, 2015 12:53 PM
To: 'netarchivesuite-users at ml.sbforge.org' <
netarchivesuite-users at ml.sbforge.org>
Cc: Pérez Morillo, Mar <mar.perez at bne.es>; García Arratia, Juan Carlos <
juancarlos.garcia at bne.es>; Archivoweb <archivoweb at bne.es>; Monzón,
Fernando <f.monzon at bne.es>
Subject: [Netarchivesuite-users] Question about a problem with NAS QA
Viewer
Dear all,
At the National Library of Spain Web Archive we recently changed from
NAS 4.2 to NAS 4.4, and we have a problem with the NAS QA Viewer.
When we enable compression in the NAS 4.4 templates (changing only what is
highlighted, and only in the WARC section, not the ARC section), the NAS QA
Viewer does not work.
The files generated by the harvest are of type warc.gz.
<newObject name="WARCArchiver#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
<map name="rules">
</map>
</newObject>
<boolean name="compress">true</boolean>
This is the error that appears in the graphical interface:
And this is the error that appears in the logs:
DETALLADO: Caught exception while running batch job on file /netarchive/WARC_Archive/filedir/5-metadata-1.warc, position 4232857: null
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
at java.util.regex.Matcher.reset(Matcher.java:308)
at java.util.regex.Matcher.<init>(Matcher.java:228)
at java.util.regex.Pattern.matcher(Pattern.java:1088)
at dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob.processRecord(GetMetadataArchiveBatchJob.java:95)
at dk.netarkivet.common.utils.archive.ArchiveBatchJob.processFile(ArchiveBatchJob.java:124)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:168)
at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:115)
at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:246)
at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:428)
dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
DETALLADO: Batch: Job dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob, with arguments: URLMatcher = metadata://[^/]*/crawl/index/cdx.*, mimeMatcher = application/x-cdx finished at Mon Dec 14 11:04:50 CET 2015
dic 14, 2015 11:04:50 AM dk.netarkivet.archive.bitarchive.Bitarchive batch
INFORMACIÓN: Finished batch job on bitarchive application with id '192.168.81.37_BitApp_2': 'dk.netarkivet.harvester.indexserver.GetMetadataArchiveBatchJob', on filename-pattern: '5-metadata-[0-9]+.(w)?arc' + with result: 1 failures in processing 1 files at 192.168.81.37_BitApp_2
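For readers of this trace: in Java, Pattern.matcher(null) throws a NullPointerException inside Matcher.getTextLength, which is what the top frames show. The log does not say which value was null (the record URL, its MIME type, or something else), so the following is only an illustration of the kind of filter the batch job arguments describe, written in Python with an explicit guard for missing values. The function name record_matches and the use of full matching are assumptions, not the NAS implementation.

```python
import re
from typing import Optional

# Patterns copied from the batch job arguments in the log.
URL_MATCHER = re.compile(r"metadata://[^/]*/crawl/index/cdx.*")
MIME_MATCHER = re.compile(r"application/x-cdx")


def record_matches(url: Optional[str], mime: Optional[str]) -> bool:
    """Return True only if both the URL and the MIME type match.

    A record with a missing URL or MIME type is treated as a non-match
    instead of being passed to the regex engine.
    """
    if url is None or mime is None:
        return False
    return (URL_MATCHER.fullmatch(url) is not None
            and MIME_MATCHER.fullmatch(mime) is not None)
```

With such a guard, a record without one of the two values would simply be skipped rather than failing the whole batch job.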
Do you know if there is a way to solve it?
Thank you very much and happy Christmas,
Soledad Navarro
Área de Gestión del Depósito de las Publicaciones en Línea
Biblioteca Nacional de España
Paseo de Recoletos, 20-22. Madrid 28001
Tlf: (0034)91 516 81 18 - Ext. 218
Fax: (0034) 915168102