<font size=2 face="Arial">Hi Peter,</font><br><br><font size=2 face="Arial">If you set deduplication to true both in the NAS harvesting
settings and at profile level, Heritrix writes revisit records
(not separate revisit files) during the harvesting workflow, alongside the other WARC
request, response and metadata records.</font><br><font size=2 face="Arial">Each time the crawler fetches a binary
web component, it looks it up in the Lucene duplicates index; if it is found there,
the crawler marks it in the crawl log and writes a complete WARC revisit record.
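</font><br><br><font size=2 face="Arial">As a rough sketch only, and not the actual DeDuplicator code, the lookup described above can be pictured as follows in Python. A plain dictionary stands in for the Lucene index, URL matching followed by a digest comparison is assumed, and the names IndexEntry and decide_record_type are hypothetical.</font><br>
<pre>
```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index entry: where an earlier capture of a URL is stored."""
    digest: str
    warc_file: str
    offset: int
    timestamp: str


def decide_record_type(url: str, digest: str,
                       index: Dict[str, IndexEntry]) -> str:
    """Return "revisit" if the payload was already archived, else "response".

    Assumption: URL matching, i.e. look the URL up first and then
    compare payload digests.
    """
    entry: Optional[IndexEntry] = index.get(url)
    if entry is not None and entry.digest == digest:
        return "revisit"
    return "response"


# Values taken from the crawl log line quoted in this message.
URL = "https://img.lemde.fr/2015/10/01/0/123/3253/2169/110/74/60/0/a55eb3e_25814-1pls9ni.jpg"
DIGEST = "sha1:WPRRSOTVFZNNIDJVPHMT5LDDNDGIMPRR"
index = {
    URL: IndexEntry(
        digest=DIGEST,
        warc_file="BnF-32274-28-20191212105654-00003-ciblee_2019_fogg120.bnf.fr.warc.gz",
        offset=295532254,
        timestamp="20191212110808000",
    )
}

print(decide_record_type(URL, DIGEST, index))        # revisit
print(decide_record_type(URL, "sha1:OTHER", index))  # response
```
</pre>
<font size=2 face="Arial">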
</font><br><br><tt><font size=3>2020-01-14T11:15:47.302Z 200
2065 </font></tt><a href="https://img.lemde.fr/2015/10/01/0/123/3253/2169/110/74/60/0/a55eb3e_25814-1pls9ni.jpg"><tt><font size=3>https://img.lemde.fr/2015/10/01/0/123/3253/2169/110/74/60/0/a55eb3e_25814-1pls9ni.jpg</font></tt></a><tt><font size=3>LE </font></tt><a href=https://www.lemonde.fr/services/><tt><font size=3>https://www.lemonde.fr/services/</font></tt></a><tt><font size=3>image/jpeg #118 20200114111547158+32 sha1:WPRRSOTVFZNNIDJVPHMT5LDDNDGIMPRR
</font></tt><a href=https://www.lemonde.fr/afrique/><tt><font size=3>https://www.lemonde.fr/afrique/</font></tt></a><tt><font size=3><b>duplicate:"BnF-32274-28-20191212105654-00003-ciblee_2019_fogg120.bnf.fr.warc.gz,295532254,20191212110808000</b>",content-size:2579</font></tt><br><br><font size=2 face="Arial">OpenWayback CDX indexer creates CDX lines
for these records, and OpenWayback plays them back very well.</font><br><font size=2 face="Arial">I imagine that pywb also plays them without
any problem.</font><br><br><font size=2 face="Arial">The oldest revisit target can be very old indeed if
the file hasn't changed and is still being crawled.</font><br><font size=2 face="Arial">I don't know how old our oldest one is (not before
late 2016, since it came with NAS 5.2:</font><br><a href="https://sbforge.org/display/NAS/NetarchiveSuite+5.2.x+Release+Notes"><font size=2 face="Arial">https://sbforge.org/display/NAS/NetarchiveSuite+5.2.x+Release+Notes</font></a><font size=2 face="Arial">)</font><br><font size=2 face="Arial"> </font><br><font size=2 face="Arial">Regarding space savings, we have precise numbers
:</font><br><font size=2 face="Arial">For our 2019 focused crawls, we harvested
107.49 TB of uncompressed data and did not have to harvest a further 34.77 TB that we "saved"
through deduplication (i.e. 24.5%).</font><br><font size=2 face="Arial">For our 2019 broad crawl, we harvested 234.64 TB
of uncompressed data and did not have to harvest a further 66.53 TB that we "saved" through
deduplication (i.e. 22%).</font><br><font size=2 face="Arial">So at our scale, deduplication saves about a quarter
of our storage, which is huge!</font><br><br><font size=2 face="Arial">Sara</font><br><br><br><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:
</font><font size=1 face="sans-serif">"Peter Svanberg"
&lt;Peter.Svanberg@kb.se&gt;</font><br><font size=1 color=#5f5f5f face="sans-serif">To:
</font><font size=1 face="sans-serif">"netarchivesuite-users@ml.sbforge.org"
&lt;netarchivesuite-users@ml.sbforge.org&gt;</font><br><font size=1 color=#5f5f5f face="sans-serif">Date:
</font><font size=1 face="sans-serif">13/01/2020 23:05</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:
</font><font size=1 face="sans-serif">Re: [Netarchivesuite-users]
Questions about deduplication (and reduplication)</font><br><font size=1 color=#5f5f5f face="sans-serif">Sent by:
</font><font size=1 face="sans-serif">"NetarchiveSuite-users"
&lt;netarchivesuite-users-bounces@ml.sbforge.org&gt;</font><br><hr noshade><br><br><br><font size=3>Thanks, Sara!<br></font><br><font size=3>So, when reduplicating, e.g. for playback in Wayback or pywb,
all potential revisit target files must be reachable – is that not a problem?
Kristinn mentioned that generating indexes (of content) can take much longer,
as the indexer has to look up the URL indexes and open a lot of files. Is that something
you (or others) have experienced?</font><br><br><font size=3>Do you have any idea how old the oldest revisit target
of recent WARC files could be? Five, maybe ten years, then?</font><br><br><font size=3>And I add a fifth question:</font><br><br><font size=3>5) How much space do you save – just approximately?</font><br><br><font size=3> Peter</font><br><br><br><font size=3>13 Jan. 2020 at 17:47, from "sara.aubry@bnf.fr"
&lt;sara.aubry@bnf.fr&gt;:<br></font><br><font size=3></font><font size=2 face="sans-serif">Hi Peter,</font><font size=3><br></font><font size=2 face="sans-serif"><br>For BnF,</font><font size=3 face="Calibri"><br>1) yes<br>2) you probably mean focused crawls: yes <br>3) URL<br>4) Only when we have a major change in the crawler or the data format,
which means as rarely as possible.<br>That is because it really saves a lot of space, and also because we are not concerned about
intervals between WARC files: that is what WARC revisit records were made
for.<br>Deduplication also sometimes restarts incidentally, when the previous capture
of a harvest did not finish (either at the crawl stage or at the post-processing
stage) or crashed.</font><font size=3><br></font><font size=3 face="Calibri"><br>Best,</font><font size=3><br></font><font size=3 face="Calibri"><br>Sara</font><font size=3><br><br><br><br></font><font size=1 color=#5f5f5f face="sans-serif"><br>From: </font><font size=1 face="sans-serif">"Peter
Svanberg" &lt;Peter.Svanberg@kb.se&gt;</font><font size=1 color=#5f5f5f face="sans-serif"><br>To: </font><font size=1 face="sans-serif">"netarchivesuite-users@ml.sbforge.org"
&lt;netarchivesuite-users@ml.sbforge.org&gt;</font><font size=1 color=#5f5f5f face="sans-serif"><br>Date: </font><font size=1 face="sans-serif">13/01/2020
17:31</font><font size=1 color=#5f5f5f face="sans-serif"><br>Subject: </font><font size=1 face="sans-serif">[Netarchivesuite-users]
Questions about deduplication (and reduplication)</font><font size=1 color=#5f5f5f face="sans-serif"><br>Sent by: </font><font size=1 face="sans-serif">"NetarchiveSuite-users"
&lt;netarchivesuite-users-bounces@ml.sbforge.org&gt;</font><font size=3><br></font><hr noshade><font size=3><br><br></font><font size=3 face="Calibri"><br>Hello!<br> <br>I’m trying to understand how NAS and Heritrix handle deduplication, which
led to an internal discussion about its overall pros and cons.
I then found Kristinn Sigurðsson’s interesting web archiving blog articles.
He has written about de- and reduplication: </font><a href="https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html"><font size=3 color=#0082bf face="Calibri"><u>https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html</u></font></a><font size=3 face="Calibri"><br> <br>Some short questions about the deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator)
that I would appreciate quick answers on (from all NAS user sites):<br> <br>1) Do you use deduplication for snapshot harvests (broad
crawls)?<br>2) Do you use deduplication for snapshot harvests?<br>3) Which matching method do you use – DIGEST or URL?<br>4) Do you “restart” the deduplication at intervals?
How long intervals?<br> <br>By (4) I mean you do a harvest with no deduplication, limiting the number
of dependencies between WARC files. (Somewhat like total and incremental
backups.) Maybe you just do deduplication between the 2–3 steps
in a broad crawl? Or between the last X broad crawls?<br> <br>Regards, </font><font size=3 face="Arial"><br>-----<br><br>Peter Svanberg<br><br>National Library of Sweden<br>Phone: +46 10 709 32 78<br><br>E-mail</font><font size=3 face="Calibri">: </font><font size=3 face="Arial">peter.svanberg@kb.se<br>Web</font><font size=3 face="Calibri">: </font><a href=www.kb.se><font size=3 color=blue face="Arial"><u>www.kb.se</u></font></a><font size=3><br></font><font size=3 face="Calibri"><br> <br> </font><tt><font size=2>_______________________________________________<br>NetarchiveSuite-users mailing list<br>NetarchiveSuite-users@ml.sbforge.org</font></tt><font size=3 color=blue><u><br></u></font><a href="https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users"><tt><font size=2 color=blue><u>https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users</u></font></tt></a><font size=3><br></font><font size=3 face="sans-serif"><br></font><hr><p><font size=3 face="sans-serif">Exposition </font><a href="https://www.bnf.fr/fr/agenda/tolkien-voyage-en-terre-du-milieu"><font size=3 color=blue face="sans-serif"><b><i><u>Tolkien,
voyage en Terre du Milieu</u></i></b></font></a><font size=3 face="sans-serif"> - 22 October 2019 to 16 February 2020 - BnF - François-Mitterrand</font><p><font size=3 color=#008000 face="sans-serif"><b>Before printing, think of
the environment.</b></font>
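<br><br><font size=2 face="Arial">A note on the space-saving figures in the reply of 14 January: they appear consistent with dividing the volume "saved" by the total that would have been harvested without deduplication (harvested plus saved). That formula is an assumption, since the thread does not state it. A minimal Python check under that assumption gives roughly 24.4% and 22.1%, in line with the 24.5% and 22% quoted; the function name saved_share is hypothetical.</font><br>
<pre>
```python
def saved_share(harvested_tb: float, saved_tb: float) -> float:
    """Percentage of the would-be total that deduplication avoided storing."""
    total_without_dedup = harvested_tb + saved_tb
    return 100.0 * saved_tb / total_without_dedup


# Figures from the reply of 14 January, in TB of uncompressed data.
focused_2019 = saved_share(harvested_tb=107.49, saved_tb=34.77)
broad_2019 = saved_share(harvested_tb=234.64, saved_tb=66.53)

print(f"focused crawls 2019: {focused_2019:.1f}%")  # 24.4%
print(f"broad crawl 2019:    {broad_2019:.1f}%")    # 22.1%
```
</pre>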