[Netarchivesuite-curator] October update from Netarkivet/Denmark

Anders Klindt Myrvoll ANKM at kb.dk
Mon Oct 5 22:03:54 CEST 2020


Dear all,

In brief, here is what we worked on since our last meeting:

Broad crawl
Step 2 is proceeding well.

Event crawl
We decided to continue the event crawl on Corona in Denmark, but with lower frequency, a greatly reduced number of 0-hop sites, and minimal curatorial activity.

Alexandre, trainee
Alexandre has arrived and is up and running, working remotely from Copenhagen with the rest of the team. We are almost done with the intro program and are looking into what will give the most value to Alexandre, Netarkivet, and BnF.

IT-University in Copenhagen
The collaboration with the IT-University in Copenhagen is moving forward.

Youtube
We have experimented with capturing embedded video content, and so far the results are great (except that WARC validation fails on revisit records).

WARC-file-validation
We are working on finalizing a workflow from Webrecorder/Conifer.org to Netarkivet. Being able to validate WARC files correctly is a big part of getting the right level of preservation (we use JWAT for this). But it's a bit complicated; see for instance this OPF blog post by Remco van Veenendaal from Holland: https://openpreservation.org/blogs/warc-validation-tool-experiences/

(How) are you validating WARC files? And what do you see as the future for this?

Added info from Tue
We have just discovered the following errors in our WARC files. They seem to go back to when we activated revisit generation in 2017/2018, after the compression of the Netarchive.
It is our NAS code that generates an invalid 'WARC-Payload-Digest' format.
We had never before tested whether our revisit 'WARC-Payload-Digest' format was valid according to the WARC standard.

e.g.

Error in '/home/prod/317160-265-20190806070159701-00000-sb-prod-har-006.statsbiblioteket.dk.warc.gz'
       Offset: 160011 (0x2710b)
  Record Type: 'revisit'
         Type: INVALID_EXPECTED
       Entity: 'WARC-Payload-Digest' value
        Value: ICLH3F6J3NMEIBRGD7ICP255OXIUDRWH
     Expected: <digest-algorithm>:<digest-encoded>

See:
http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf 2008
http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1-1_latestdraft.pdf 2017
We are missing the "<digest-algorithm>:" prefix.
BTW, the Webrecorder revisit WARC files have the same revisit 'WARC-Payload-Digest' format issue.
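
To illustrate the format check behind the error above, here is a minimal Python sketch. It is not how JWAT or NAS implements it. The function names are hypothetical, the token pattern is only an approximation of the grammar in the WARC standard, and the suggestion to prefix "sha1" rests on the assumption that a bare 32-character base32 value is a SHA-1 digest; that would need to be confirmed against the code that wrote the records.

```python
import re
from typing import Optional

# Approximation of the "token" production used for <digest-algorithm> and
# <digest-encoded>: visible ASCII without separator characters.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_LABELLED_DIGEST = re.compile(rf"{_TOKEN}:{_TOKEN}")

# A SHA-1 digest is 160 bits, which is exactly 32 characters in base32.
_BARE_BASE32_SHA1 = re.compile(r"[A-Z2-7]{32}")


def is_labelled_digest(value: str) -> bool:
    """Return True if value has the form <digest-algorithm>:<digest-encoded>."""
    return _LABELLED_DIGEST.fullmatch(value.strip()) is not None


def suggest_fix(value: str, assumed_algorithm: str = "sha1") -> Optional[str]:
    """Suggest a corrected header value, or None if no safe guess exists.

    ASSUMPTION: a bare 32-character base32 string is a SHA-1 digest.
    """
    value = value.strip()
    if is_labelled_digest(value):
        return value  # already in the expected form
    if _BARE_BASE32_SHA1.fullmatch(value):
        return f"{assumed_algorithm}:{value}"
    return None  # unknown shape: do not guess


if __name__ == "__main__":
    bad = "ICLH3F6J3NMEIBRGD7ICP255OXIUDRWH"  # value from the error report
    print(is_labelled_digest(bad))
    print(suggest_fix(bad))
```

A check like this only looks at the header syntax; it does not recompute the digest from the payload, so it cannot confirm that the digest itself is correct.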

All the best on behalf of the Netarkivet-Team
Anders

Anders Klindt Myrvoll
Faglig leder - Netarkivet
Programme Manager - the Danish web archive

Digital Kulturarv
Digital Cultural Heritage

+45 26850080
ANKM at kb.dk



Det Kgl. Bibliotek
Royal Danish Library

Søren Kierkegaards Plads 1
DK-1221 København K
+45 3347 4747

CVR 2898 8842
EAN 5798 000 795297

