[Netarchivesuite-users] Questions about deduplication (and reduplication)

Peter Svanberg Peter.Svanberg at kb.se
Mon Jan 13 17:50:57 CET 2020

Sorry, the second question should obviously have been:

2)      Do you use deduplication for SELECTIVE harvests?


On 13 Jan 2020, at 17:31, Peter Svanberg <Peter.Svanberg at kb.se> wrote:


I’m trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about its overall pros and cons. I then found Kristinn Sigurðsson’s interesting web archiving blog articles. He has written about de- and reduplication: https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html

Here are some short questions about deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers to (from all NAS user sites):

1)      Do you use deduplication for snapshot harvests (broad crawls)?

2)      Do you use deduplication for snapshot harvests?

3)      Which matching method do you use – DIGEST or URL?

4)      Do you “restart” the deduplication at intervals? If so, how long are the intervals?

By (4) I mean that you periodically do a harvest with no deduplication, which limits the number of dependencies between WARC files (somewhat like full and incremental backups). Or maybe you only deduplicate between the 2–3 steps of a broad crawl, or against the last X broad crawls?
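
To make the backup analogy in (4) concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function name dependencies and the parameter restart_every are hypothetical and are not part of NAS, Heritrix or the DeDuplicator module. Harvests are numbered from 0, every restart_every-th harvest runs without deduplication, and every other harvest may refer back to any harvest since the latest such "full" one.

```python
def dependencies(harvest_no, restart_every):
    """Return the earlier harvests that harvest `harvest_no` may refer back to.

    Hypothetical model: harvests are numbered from 0, and every
    `restart_every`-th harvest is a "full" one with deduplication off.
    All other harvests deduplicate against every harvest since the
    latest full one (inclusive).
    """
    # Number of the latest full harvest at or before this one.
    last_full = harvest_no - (harvest_no % restart_every)
    # A full harvest depends on nothing; the others depend on the
    # harvests from the latest full one up to the one just before them.
    return list(range(last_full, harvest_no))


if __name__ == "__main__":
    for n in range(6):
        print(n, dependencies(n, restart_every=3))
```

With restart_every=3, harvests 0 and 3 depend on no earlier WARC files, while harvest 5 depends only on harvests 3 and 4, so the chain of dependencies never grows longer than the restart interval.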


Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org