[Netarchivesuite-users] Questions about deduplication (and reduplication)
Peter.Svanberg at kb.se
Mon Jan 13 17:50:57 CET 2020
Sorry, the second question should obviously have been:
2) Do you use deduplication for SELECTIVE harvests?
On 13 Jan 2020 at 17:31, Peter Svanberg <Peter.Svanberg at kb.se> wrote:
I’m trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about the overall pros and cons of deduplication. I then found Kristinn Sigurðsson’s interesting web archiving blog, where he has written about de- and reduplication: https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html
I have some short questions about deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers to, from all NAS user sites:
1) Do you use deduplication for snapshot harvests (broad crawls)?
2) Do you use deduplication for snapshot harvests?
3) Which matching method do you use – DIGEST or URL?
4) Do you “restart” the deduplication at intervals? If so, how long are the intervals?
By (4) I mean that you occasionally do a harvest with no deduplication, which limits the number of dependencies between WARC files (somewhat like full and incremental backups). Maybe you only deduplicate between the 2–3 steps in a broad crawl? Or only against the last X broad crawls?
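
To make the "restart" idea in (4) concrete, here is a toy model in Python. It is only a sketch under my own assumptions and not how NAS or the DeDuplicator module is implemented: the function names, the in-memory index, and the example URL are all hypothetical. It matches on content digest and records which earlier harvests each harvest would have to reference.

```python
import hashlib


def digest(payload: bytes) -> str:
    """Content digest used as the matching key (DIGEST-style matching)."""
    return hashlib.sha1(payload).hexdigest()


def run_harvests(harvests, restart_every=None):
    """Simulate a sequence of harvests (hypothetical toy model).

    harvests: list of dicts mapping URL -> payload bytes.
    restart_every: if set, harvests 0, n, 2n, ... start with an empty
        index, i.e. they are "full" harvests with no deduplication
        against earlier harvests.
    Returns one set per harvest with the numbers of the earlier
    harvests it depends on.
    """
    index = {}  # digest -> number of the harvest that stored the payload
    dependencies = []
    for n, harvest in enumerate(harvests):
        if restart_every and n % restart_every == 0:
            index = {}  # restart: forget everything seen so far
        deps = set()
        for url, payload in harvest.items():
            key = digest(payload)
            if key in index:
                # Duplicate: only a reference to the earlier copy is kept.
                if index[key] != n:
                    deps.add(index[key])
            else:
                # New content: stored in full in this harvest.
                index[key] = n
        dependencies.append(deps)
    return dependencies


# Four harvests of one unchanged page (example URL is made up).
harvests = [{"http://example.org/": b"same content"}] * 4
no_restart = run_harvests(harvests)
with_restart = run_harvests(harvests, restart_every=2)
```

Without a restart, harvests 1, 2 and 3 all depend on harvest 0. With a restart every second harvest, harvest 2 stores the page again and harvest 3 depends only on harvest 2, so no dependency chain reaches back further than one restart interval.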
National Library of Sweden
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se