[Netarchivesuite-users] Questions about deduplication (and reduplication)

Mon Jan 13 17:31:01 CET 2020

Hello!

I'm trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about the overall pros and cons of deduplication. I then found Kristinn Sigurðsson's interesting web archiving blog, where he has written about de- and reduplication: https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html

I have some short questions about deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers to (from all NAS user sites):

1)      Do you use deduplication for snapshot harvests (broad crawls)?

2)      Do you use deduplication for selective harvests?

3)      Which matching method do you use - DIGEST or URL?

4)      Do you "restart" the deduplication at intervals? If so, how long are the intervals?

By (4) I mean that you occasionally do a harvest with no deduplication, which limits the number of dependencies between WARC files (somewhat like full and incremental backups). Or maybe you only deduplicate between the 2-3 steps of a broad crawl? Or only against the last X broad crawls?
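
To make (4) concrete, here is a minimal sketch of the idea in Python. It is not how NAS or the DeDuplicator module is implemented; the names harvest, run_harvests and restart_every are hypothetical, and it assumes matching on payload digest only, with an in-memory dictionary standing in for the deduplication index.

```python
import hashlib


def harvest(pages, index, harvest_id):
    """Simulate one harvest.

    pages: dict mapping URL -> payload bytes.
    index: dict mapping payload digest -> id of the harvest holding the full copy
           (a stand-in for a real deduplication index; an assumption of this sketch).
    Returns a list of (url, record_type, refers_to_harvest) tuples.
    """
    records = []
    for url, payload in pages.items():
        digest = hashlib.sha1(payload).hexdigest()
        if digest in index:
            # Duplicate: store only a pointer to the harvest with the full copy.
            records.append((url, "revisit", index[digest]))
        else:
            # New content: store it in full and remember where it lives.
            index[digest] = harvest_id
            records.append((url, "response", harvest_id))
    return records


def run_harvests(harvests, restart_every=0):
    """Run harvests in order; restart_every=0 means the index is never reset."""
    index = {}
    results = []
    for i, pages in enumerate(harvests):
        if restart_every and i % restart_every == 0:
            index = {}  # "restart": this harvest stores everything in full
        results.append(harvest(pages, index, i))
    return results


if __name__ == "__main__":
    same_page = {"http://example.org/": b"unchanged content"}
    for records in run_harvests([same_page] * 4, restart_every=2):
        print(records)
```

With restart_every=2, harvests 0 and 2 store the page in full and harvests 1 and 3 only point back to the nearest full copy, so no WARC file depends on a harvest older than the last restart. With restart_every=0, every later harvest depends on harvest 0.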

Regards,
-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se
