[Netarchivesuite-users] Questions about deduplication (and reduplication)
Peter Svanberg
Peter.Svanberg at kb.se
Mon Jan 13 17:31:01 CET 2020
Hello!
I'm trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about its overall pros and cons. I then found Kristinn Sigurðsson's interesting web archiving blog articles. He has written about de- and reduplication: https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html
Some short questions about deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers to (from all NAS user sites):
1) Do you use deduplication for snapshot harvests (broad crawls)?
2) Do you use deduplication for selective harvests?
3) Which matching method do you use - DIGEST or URL?
4) Do you "restart" the deduplication at intervals? How long intervals?
By (4) I mean that you periodically do a harvest with no deduplication, which limits the number of dependencies between WARC files (somewhat like total and incremental backups). Or maybe you only deduplicate between the 2-3 steps of a broad crawl? Or only against the last X broad crawls?
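To make (4) a bit more concrete, here is a minimal sketch in Python of the "total and incremental" idea. It is only an illustration: the function names and the restart_interval parameter are made up and are not NAS or Heritrix settings, and it assumes that a deduplicating crawl may refer back to every crawl since the last one run without deduplication.

```python
def is_full_harvest(crawl_number: int, restart_interval: int) -> bool:
    """True if this crawl runs with no deduplication (the 'total backup')."""
    return crawl_number % restart_interval == 0


def crawls_depended_on(crawl_number: int, restart_interval: int) -> list:
    """Earlier crawls whose WARC files this crawl's duplicate records may refer to.

    Assumption (hypothetical model): the deduplication index is built from
    all crawls since the most recent full harvest, and nothing older.
    """
    if is_full_harvest(crawl_number, restart_interval):
        return []  # a full harvest stores everything itself
    last_full = crawl_number - (crawl_number % restart_interval)
    return list(range(last_full, crawl_number))


if __name__ == "__main__":
    # With a restart every 3rd crawl, no crawl depends on more than 2 others.
    for n in range(7):
        print(n, crawls_depended_on(n, restart_interval=3))
```

With restart_interval=3, crawls 0, 3 and 6 are self-contained, and any other crawl depends on at most the two crawls before it. Without any restart, the chain of dependencies would keep growing with every crawl.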
Regards,
-----
Peter Svanberg
National Library of Sweden
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se
Web: www.kb.se