[Netarchivesuite-users] Questions about deduplication (and reduplication)
Peter Svanberg
Peter.Svanberg at kb.se
Mon Jan 13 17:50:57 CET 2020
Sorry, second question should obviously have been
2) Do you use deduplication for SELECTIVE harvests?
/Peter
On 13 Jan 2020, at 17:31, Peter Svanberg <Peter.Svanberg at kb.se> wrote:
Hello!
I’m trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about the overall pros and cons of deduplication. I then found Kristinn Sigurðsson’s interesting web archiving blog articles. He has written about de- and reduplication: https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html
Some short questions about the deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers to (from all NAS user sites):
1) Do you use deduplication for snapshot harvests (broad crawls)?
2) Do you use deduplication for snapshot harvests?
3) Which matching method do you use – DIGEST or URL?
4) Do you “restart” the deduplication at intervals? How long intervals?
By (4) I mean that at intervals you do a harvest with no deduplication at all, which limits the number of dependencies between WARC files (somewhat like full and incremental backups). Maybe you only deduplicate between the 2–3 steps of a broad crawl? Or against the last X broad crawls?
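To make question (4) concrete, here is a minimal sketch of the idea in Python. It is a toy model only, not the NAS or Heritrix DeDuplicator API: the function name harvest, the parameter restart_every and the record layout are all made up for illustration, and it matches on digest alone.

```python
# Toy model only: this is NOT the NAS or Heritrix DeDuplicator API.
# All names (harvest, restart_every, the record layout) are hypothetical.

def harvest(crawls, restart_every):
    """Simulate digest-based deduplication with periodic restarts.

    crawls: list of crawls, each a list of (url, digest) pairs.
    restart_every: every N-th crawl (0, N, 2N, ...) runs with no
        deduplication and starts a fresh index, like a full backup.
    Returns one list per crawl of (url, kind, refers_to) tuples, where
    kind is "response" (payload stored) or "revisit" (payload lives in
    the WARC files of crawl number refers_to).
    """
    index = {}   # digest -> number of the crawl that stored the payload
    output = []
    for n, crawl in enumerate(crawls):
        full = (n % restart_every == 0)
        if full:
            index = {}  # forget older crawls: no new dependencies on them
        records = []
        for url, digest in crawl:
            if not full and digest in index:
                # Duplicate: store only a pointer to the earlier crawl.
                records.append((url, "revisit", index[digest]))
            else:
                # New content, or a full crawl: store the payload itself.
                records.append((url, "response", None))
                index.setdefault(digest, n)
        output.append(records)
    return output


# Four identical crawls of the same two (made-up) pages.
page = [("http://example.org/", "sha1:AAA"),
        ("http://example.org/logo.png", "sha1:BBB")]
result = harvest([page] * 4, restart_every=2)
```

With restart_every=2, crawls 0 and 2 store everything, crawl 1 refers back only to crawl 0, and crawl 3 refers back only to crawl 2, so no WARC file ever depends on one older than the latest restart.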
Regards,
-----
Peter Svanberg
National Library of Sweden
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se
Web: www.kb.se
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users