<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body dir="auto">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">Thanks, Sara!<br>
<br>
<div dir="ltr">
<div><span class="Apple-style-span" style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.292969); -webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); ">So, when reduplicating, e.g.
for replay in Wayback or Pyweb, all potential revisit target files must be reachable. Is that not a problem? Kristinn mentioned that generating indexes (of content) can take much longer, as it has to look up URL indexes and open a lot of files. Is that something you (or others)
have experienced?</span></div>
<div><span class="Apple-style-span" style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.292969); -webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); "><br>
</span></div>
<div><span class="Apple-style-span" style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.292969); -webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); ">Do you have any idea how old
the oldest revisit target referenced from recent WARC files could be? Five, maybe ten years?</span></div>
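<div>As a rough illustration of both questions (are all revisit targets reachable, and how old is the oldest one), here is a minimal sketch. It assumes the revisit records carry the WARC-Refers-To-Target-URI and WARC-Refers-To-Date headers; the index layout, file names, URLs and the function check_revisits are hypothetical and hand-made for the example, not taken from NAS, Heritrix or any replay tool.</div>
<pre>
```python
from datetime import datetime

# Hypothetical index of stored response records:
# (target URI, capture date) maps to the WARC file holding the payload.
response_index = {
    ("http://example.org/logo.png", "2012-03-01T10:00:00Z"): "crawl-2012-0001.warc.gz",
    ("http://example.org/style.css", "2018-06-15T08:30:00Z"): "crawl-2018-0042.warc.gz",
}

# Revisit records from a recent harvest, reduced to the two headers
# that point back to the original capture (made-up values).
revisits = [
    {"WARC-Refers-To-Target-URI": "http://example.org/logo.png",
     "WARC-Refers-To-Date": "2012-03-01T10:00:00Z"},
    {"WARC-Refers-To-Target-URI": "http://example.org/style.css",
     "WARC-Refers-To-Date": "2018-06-15T08:30:00Z"},
    {"WARC-Refers-To-Target-URI": "http://example.org/gone.js",
     "WARC-Refers-To-Date": "2016-01-01T00:00:00Z"},
]


def parse_date(value):
    """Parse a WARC-style UTC timestamp such as 2012-03-01T10:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def check_revisits(records, index, harvest_date):
    """Return (unreachable target URIs, age in years of the oldest reachable target).

    The age is None when no revisit target could be found in the index.
    """
    missing = []
    oldest = None
    for record in records:
        key = (record["WARC-Refers-To-Target-URI"], record["WARC-Refers-To-Date"])
        if key not in index:
            # The payload this revisit refers to cannot be located.
            missing.append(key[0])
            continue
        captured = parse_date(key[1])
        oldest = captured if oldest is None else min(oldest, captured)
    if oldest is None:
        return missing, None
    return missing, (harvest_date - oldest).days / 365.25


missing, age_years = check_revisits(revisits, response_index, datetime(2020, 1, 13))
print("unreachable targets:", missing)
print("age of oldest reachable target in years:", round(age_years, 1))
```
</pre>
<div>Run against a real index, the list of unreachable targets would show which revisit records cannot be replayed, and the age would give an answer to the "five, maybe ten years" question for one's own archive.</div>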
<div><span class="Apple-style-span" style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.292969); -webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); "><br>
</span></div>
<div><span style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.294118);">And let me add a fifth question:</span></div>
<div><span style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.294118);"><br>
</span></div>
<div><span style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.294118);">5) How much space do you save, approximately?</span></div>
<div><span class="Apple-style-span" style="-webkit-tap-highlight-color: rgba(26, 26, 26, 0.292969); -webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); "><br>
</span></div>
<div> Peter</div>
<div><br>
</div>
</div>
<div dir="ltr"><br>
<blockquote type="cite">On 13 Jan 2020, at 17:47, "sara.aubry@bnf.fr" &lt;sara.aubry@bnf.fr&gt; wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr"><font size="2" face="sans-serif">Hi Peter,</font><br>
<br>
<font size="2" face="sans-serif">For BnF,</font><br>
<font size="3" face="Calibri">1) yes</font><br>
<font size="3" face="Calibri">2) you probably mean focused crawls: yes </font><br>
<font size="3" face="Calibri">3) URL</font><br>
<font size="3" face="Calibri">4) Only when we have a major change in the crawler or the data format, which means as rarely as possible.</font><br>
<font size="3" face="Calibri">Because it really saves a lot of space, and also because we don't care about intervals between WARC files: that's what WARC revisit records were made for.</font><br>
<font size="3" face="Calibri">Deduplication also sometimes incidentally restarts when the previous capture of a harvest did not finish (either at the crawl stage or the post-processing stage) or crashed.</font><br>
<br>
<font size="3" face="Calibri">Best,</font><br>
<br>
<font size="3" face="Calibri">Sara</font><br>
<br>
<br>
<br>
<br>
<font size="1" color="#5f5f5f" face="sans-serif">From: </font><font size="1" face="sans-serif">"Peter Svanberg" &lt;Peter.Svanberg@kb.se&gt;</font><br>
<font size="1" color="#5f5f5f" face="sans-serif">To: </font><font size="1" face="sans-serif">"netarchivesuite-users@ml.sbforge.org" &lt;netarchivesuite-users@ml.sbforge.org&gt;</font><br>
<font size="1" color="#5f5f5f" face="sans-serif">Date: </font><font size="1" face="sans-serif">13/01/2020 17:31</font><br>
<font size="1" color="#5f5f5f" face="sans-serif">Subject: </font><font size="1" face="sans-serif">[Netarchivesuite-users] Questions about deduplication (and reduplication)</font><br>
<font size="1" color="#5f5f5f" face="sans-serif">Sent by: </font><font size="1" face="sans-serif">"NetarchiveSuite-users" &lt;netarchivesuite-users-bounces@ml.sbforge.org&gt;</font><br>
<hr noshade="">
<br>
<br>
<br>
<font size="3" face="Calibri">Hello!</font><br>
<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri">I’m trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about its overall pros and cons. I then found Kristinn Sigurðsson’s interesting web archiving blog articles. He
has written about de- and reduplication: </font><a href="https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html"><font size="3" color="#0082bf" face="Calibri"><u>https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html</u></font></a><br>
<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri">Some short questions about the deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers on (from all NAS user sites):</font><br>
<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri">1) Do you use deduplication for snapshot harvests (broad crawls)?</font><br>
<font size="3" face="Calibri">2) Do you use deduplication for snapshot harvests?</font><br>
<font size="3" face="Calibri">3) Which matching method do you use – DIGEST or URL?</font><br>
<font size="3" face="Calibri">4) Do you “restart” the deduplication at intervals? If so, how long are the intervals?</font><br>
<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri">By (4) I mean that you periodically do a harvest with no deduplication, which limits the number of dependencies between WARC files. (Somewhat like full and incremental backups.) Maybe you just do deduplication between the 2–3 steps in a broad crawl?
Or between the last X broad crawls?</font><br>
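<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri">The backup analogy can be sketched as follows. This is only an illustration under an assumed policy (every N-th harvest runs without deduplication, and the others may refer back no further than the most recent such harvest); the function names plan_harvests and longest_reach are hypothetical and do not correspond to any NAS or Heritrix setting.</font><br>
<pre>
```python
def plan_harvests(count, restart_every):
    """Label each harvest as 'full' (no deduplication) or 'dedup'.

    Assumed policy: harvest 0 and every restart_every-th harvest after it
    stores everything again; a deduplicated harvest may write revisit
    records that refer to any harvest since the most recent full one.
    """
    plan = []
    last_full = None
    for number in range(count):
        if number % restart_every == 0:
            last_full = number
            plan.append({"harvest": number, "mode": "full", "depends_on": []})
        else:
            plan.append({"harvest": number, "mode": "dedup",
                         "depends_on": list(range(last_full, number))})
    return plan


def longest_reach(plan):
    """Largest distance, counted in harvests, from a harvest to its oldest dependency."""
    return max(
        (entry["harvest"] - min(entry["depends_on"])
         for entry in plan if entry["depends_on"]),
        default=0,
    )


for entry in plan_harvests(6, 3):
    print(entry["harvest"], entry["mode"], entry["depends_on"])
print("longest reach:", longest_reach(plan_harvests(6, 3)))
```
</pre>
<font size="3" face="Calibri">With a restart every third harvest, no WARC file depends on anything more than two harvests back; without restarts, the reach grows with every harvest, which is the trade-off behind question (4).</font><br>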
<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri">Regards, </font><br>
<font size="3" face="Arial">-----<br>
<br>
Peter Svanberg</font><font size="3" face="Calibri"><br>
</font><font size="3" face="Arial"><br>
National Library of Sweden<br>
Phone: +46 10 709 32 78</font><font size="3" face="Calibri"><br>
</font><font size="3" face="Arial"><br>
E-mail</font><font size="3" face="Calibri">: </font><font size="3" face="Arial">peter.svanberg@kb.se<br>
Web</font><font size="3" face="Calibri">: </font><a href="https://www.kb.se"><font size="3" face="Arial">www.kb.se</font></a><font size="3" face="Calibri"><br>
</font><br>
<font size="3" face="Calibri"> </font><br>
<font size="3" face="Calibri"> </font><tt><font size="2">_______________________________________________<br>
NetarchiveSuite-users mailing list<br>
NetarchiveSuite-users@ml.sbforge.org<br>
</font></tt><a href="https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users"><tt><font size="2">https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users</font></tt></a><tt><font size="2"><br>
</font></tt><br>
<font face="sans-serif">
<hr>
</font>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</body>
</html>