[Netarchivesuite-users] Questions about deduplication (and reduplication)

sara.aubry at bnf.fr sara.aubry at bnf.fr
Mon Jan 13 17:47:04 CET 2020

Hi Peter,

For BnF,
1) yes
2) you probably mean focused crawls: yes 
3) URL
4) Only when we have a major change in the crawler or the data format. 
Which means, the least possible.
Because it really saves a lot of space, and also because we don't care 
about intervals between WARC files: that's what WARC revisit records were 
made for.
Deduplication also sometimes incidentally restarts when the previous 
capture of a harvest has not finished (either at the crawl stage or the 
post-processing stage) or has crashed.



From:    "Peter Svanberg" <Peter.Svanberg at kb.se>
To:      "netarchivesuite-users at ml.sbforge.org" 
<netarchivesuite-users at ml.sbforge.org>
Date:    13/01/2020 17:31
Subject: [Netarchivesuite-users] Questions about deduplication (and 
reduplication)
Sent by: "NetarchiveSuite-users" 
<netarchivesuite-users-bounces at ml.sbforge.org>

I’m trying to understand how NAS and Heritrix handle deduplication, which 
led to an internal discussion about its overall pros and cons. I 
then found Kristinn Sigurðsson’s interesting web archiving blog articles. 
He has written about de- and reduplication: 
Some short questions about the deduplication in NAS 
(is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick 
answers on (from all NAS user sites):
1)      Do you use deduplication for snapshot harvests (broad crawls)?
2)      Do you use deduplication for snapshot harvests?
3)      Which matching method do you use – DIGEST or URL?
4)      Do you “restart” the deduplication at intervals? How long 
By (4) I mean you do a harvest with no deduplication, limiting the number 
of dependencies between WARC files. (Somewhat like total and incremental 
backups.) Maybe you just do deduplication between the 2–3 steps in a 
broad crawl? Or between the last X broad crawls?
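
To make the terms in questions 3 and 4 concrete, here is a minimal Python sketch. It is not the actual is.hi.bok.deduplicator.DeDuplicator logic, only an illustration under simplifying assumptions: the index is an in-memory structure built from an earlier harvest, and the names Capture, DedupIndex and harvest are hypothetical. With URL matching, a capture counts as a duplicate only if the same URL was stored earlier with the same digest; with DIGEST matching, any earlier record with the same payload digest is enough. A "restart" corresponds to harvesting against an empty index, so every capture is written in full.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capture:
    """One fetched resource (hypothetical, simplified)."""
    url: str
    payload: bytes

    @property
    def digest(self) -> str:
        return "sha1:" + hashlib.sha1(self.payload).hexdigest()


@dataclass
class DedupIndex:
    """Index of what earlier harvests stored in full (hypothetical)."""
    by_url: dict = field(default_factory=dict)     # url -> digest
    by_digest: dict = field(default_factory=dict)  # digest -> first url seen

    def add(self, capture: Capture) -> None:
        self.by_url[capture.url] = capture.digest
        self.by_digest.setdefault(capture.digest, capture.url)

    def is_duplicate(self, capture: Capture, method: str) -> bool:
        if method == "URL":
            # Same URL must have been stored with the same content.
            return self.by_url.get(capture.url) == capture.digest
        if method == "DIGEST":
            # Same content stored under any URL is enough.
            return capture.digest in self.by_digest
        raise ValueError(f"unknown matching method: {method}")


def harvest(captures, index: DedupIndex, method: str):
    """Decide per capture: full 'response' record or 'revisit' record."""
    return [
        (c.url, "revisit" if index.is_duplicate(c, method) else "response")
        for c in captures
    ]


# First harvest against an empty index (a "restart"): everything is stored.
first = [Capture("http://a/", b"x"), Capture("http://b/", b"y")]
index = DedupIndex()
records1 = harvest(first, index, "URL")
for c in first:
    index.add(c)

# Second harvest: a/ unchanged, c/ is new but has b/'s old content, b/ changed.
second = [
    Capture("http://a/", b"x"),
    Capture("http://c/", b"y"),
    Capture("http://b/", b"z"),
]
by_url = harvest(second, index, "URL")
by_digest = harvest(second, index, "DIGEST")
```

In this sketch the two methods differ only for http://c/: URL matching stores it in full because that URL was never seen, while DIGEST matching writes a revisit record pointing back at content first stored for http://b/. Each revisit record depends on a WARC file from an earlier harvest, which is the dependency that a restart cuts.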

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

