[Netarchivesuite-users] Questions about deduplication (and reduplication)
sara.aubry at bnf.fr
Mon Jan 13 17:47:04 CET 2020
2) you probably mean focused crawls: yes
4) Only when we have a major change in the crawler or the data format.
In other words, as rarely as possible.
Because it really saves a lot of space, and also because we don't care
about intervals between WARC files: that's why WARC revisit records were
Deduplication also sometimes incidentally restarts when the previous
capture of a harvest is not finished (either at crawl stage or
post-processing stage) or has crashed.
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date: 13/01/2020 17:31
Subject: [Netarchivesuite-users] Questions about deduplication (and
reduplication)
Sent by: "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>
I’m trying to understand how NAS and Heritrix handle deduplication, which
led to an internal discussion about its overall pros and cons. I
then found Kristinn Sigurðsson’s interesting web archiving blog articles.
He has written about de- and reduplication:
Some short questions about the deduplication in NAS
(is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick
answers on (from all NAS user sites):
1) Do you use deduplication for snapshot harvests (broad crawls)?
2) Do you use deduplication for snapshot harvests?
3) Which matching method do you use – DIGEST or URL?
4) Do you “restart” the deduplication at intervals? How long
By (4) I mean you do a harvest with no deduplication, limiting the number
of dependencies between WARC files. (Somewhat like total and incremental
backups.) Maybe you just do deduplication between the 2–3 steps in a
broad crawl? Or between the last X broad crawls?
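To make the terms in questions 3 and 4 concrete, here is a minimal Python sketch of the idea. It is not the actual is.hi.bok.deduplicator.DeDuplicator code: all names (`DedupIndex`, `capture`, the record dictionaries) are hypothetical, and the way the DIGEST and URL methods are modelled is a simplifying assumption. A capture whose payload is already in the index from an earlier harvest is written as a small revisit record pointing at the earlier WARC file. "Restarting" then just means harvesting with an empty index, so no new record depends on older files.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DedupIndex:
    """Hypothetical index of earlier captures (not the NAS/Heritrix index format)."""
    by_digest: dict = field(default_factory=dict)  # payload digest -> WARC file name
    by_url: dict = field(default_factory=dict)     # URL -> (payload digest, WARC file name)

    def lookup(self, url, digest, method):
        """Return the WARC file holding an identical earlier payload, or None."""
        if method == "DIGEST":
            # Assumed behaviour: the same payload seen under any URL counts.
            return self.by_digest.get(digest)
        if method == "URL":
            # Assumed behaviour: only this same URL with the same payload counts.
            previous = self.by_url.get(url)
            if previous is not None and previous[0] == digest:
                return previous[1]
            return None
        raise ValueError(f"unknown matching method: {method}")

    def add(self, url, digest, warc_file):
        self.by_digest.setdefault(digest, warc_file)
        self.by_url[url] = (digest, warc_file)


def capture(index, url, payload, warc_file, method="DIGEST"):
    """Return a revisit record for a known payload, else a full response record."""
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    earlier = index.lookup(url, digest, method)
    if earlier is not None:
        # Duplicate: store only a pointer, creating a dependency on the older file.
        return {"type": "revisit", "url": url, "refers_to": earlier}
    index.add(url, digest, warc_file)
    return {"type": "response", "url": url, "stored_in": warc_file}


index = DedupIndex()
first = capture(index, "http://example.org/logo.png", b"PNG", "harvest1.warc")
second = capture(index, "http://example.org/logo.png", b"PNG", "harvest2.warc")
# "Restart": an empty index, so the payload is stored in full again.
restarted = capture(DedupIndex(), "http://example.org/logo.png", b"PNG", "harvest3.warc")
```

In this sketch `second` depends on harvest1.warc while `restarted` does not, which is the trade-off between saved space and dependencies between WARC files that question 4 asks about.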
National Library of Sweden
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se