[Netarchivesuite-users] Questions about deduplication (and reduplication)
sara.aubry at bnf.fr
Tue Jan 14 12:35:38 CET 2020
Hi Peter,
If you set deduplication to true in the NAS harvesting settings and at
profile level, Heritrix creates revisit records (not separate revisit files)
during the harvesting workflow, alongside the other WARC request, response
and metadata records.
Each time the crawler is about to fetch a binary web component, it looks it
up in the Lucene duplicates index; if it is found there, it marks the URL as
a duplicate in the crawl log and writes a complete WARC revisit record.
Here is an example crawl log entry:
2020-01-14T11:15:47.302Z 200 2065 https://img.lemde.fr/2015/10/01/0/123/3253/2169/110/74/60/0/a55eb3e_25814-1pls9ni.jpg LE https://www.lemonde.fr/services/ image/jpeg #118 20200114111547158+32 sha1:WPRRSOTVFZNNIDJVPHMT5LDDNDGIMPRR https://www.lemonde.fr/afrique/ duplicate:"BnF-32274-28-20191212105654-00003-ciblee_2019_fogg120.bnf.fr.warc.gz,295532254,20191212110808000",content-size:2579
The OpenWayback CDX indexer creates CDX lines for these records, and
OpenWayback plays them back very well.
I imagine pywb also plays them back without any problem.
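If anyone wants to check what the playback side actually sees, here is a
minimal sketch using the warcio Python library (the same parsing layer pywb
builds on). The file name is just a placeholder, and this is only an
illustration, not something NAS itself runs:

from warcio.archiveiterator import ArchiveIterator

# Placeholder file name: point this at any WARC produced by a
# deduplicated crawl.
with open('new-crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Deduplicated captures are stored as records of type 'revisit'.
        if record.rec_type == 'revisit':
            h = record.rec_headers
            print(h.get_header('WARC-Target-URI'),
                  h.get_header('WARC-Payload-Digest'),
                  # These two headers point back at the original capture;
                  # older writers may omit them.
                  h.get_header('WARC-Refers-To-Target-URI'),
                  h.get_header('WARC-Refers-To-Date'))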
The oldest revisit target can be very, very old if the file hasn't changed
and is still being crawled.
I don't know how old our oldest one is (not from before late 2016, since
this came with NAS 5.2:
https://sbforge.org/display/NAS/NetarchiveSuite+5.2.x+Release+Notes)
Regarding space savings, we have precise numbers:
For our 2019 focused crawls, we harvested 107.49 TB of uncompressed data and
avoided harvesting another 34.77 TB "saved" by deduplication (i.e. 24.5%).
For our 2019 broad crawl, we harvested 234.64 TB of uncompressed data and
avoided harvesting another 66.53 TB "saved" by deduplication (i.e. 22%).
So at our scale, deduplication saves about a quarter of our storage, which
is huge!
Sara
De : "Peter Svanberg" <Peter.Svanberg at kb.se>
A : "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date : 13/01/2020 23:05
Objet : Re: [Netarchivesuite-users] Questions about deduplication (and
reduplication)
Envoyé par : "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>
Thanks, Sara!
So, when reduplicating, e.g. for Wayback or pywb playback, all potential
revisit target files must be reachable – is that not a problem? Kristinn
mentioned that generating (content) indexes can take much longer, as it has
to look things up in URL indexes and open a lot of files. Is that something
you (or others) have experienced?
Do you have any idea how old the oldest revisit target referenced from
recent WARC files could be? Five, maybe ten years?
And I'll add a fifth question:
5) How much space do you save – just approximately?
Peter
On 13 Jan. 2020 at 17:47, "sara.aubry at bnf.fr" <sara.aubry at bnf.fr> wrote:
Hi Peter,
For BnF,
1) yes
2) you probably mean focused crawls: yes
3) URL
4) Only when we have a major change in the crawler or the data format,
which means as rarely as possible, because deduplication really saves a lot
of space, and also because we don't care about dependencies between WARC
files: that's what WARC revisit records were made for.
Deduplication also sometimes restarts unintentionally, when the previous
capture of a harvest has not finished (either at the crawl stage or the
post-processing stage) or has crashed.
Best,
Sara
De : "Peter Svanberg" <Peter.Svanberg at kb.se>
A : "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date : 13/01/2020 17:31
Objet : [Netarchivesuite-users] Questions about deduplication (and
reduplication)
Envoyé par : "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>
Hello!
I'm trying to understand how NAS and Heritrix handle deduplication, which
led to an internal discussion about its overall pros and cons. I then found
Kristinn Sigurðsson's interesting web archiving blog posts. He has written
about de- and reduplication:
https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html
Some short questions about the deduplication in NAS
(is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick
answers on (from all NAS user sites):
1) Do you use deduplication for snapshot harvests (broad crawls)?
2) Do you use deduplication for snapshot harvests?
3) Which matching method do you use – DIGEST or URL?
4) Do you “restart” the deduplication at intervals? If so, how long
are the intervals?
By (4) I mean that you do a harvest with no deduplication, to limit the
number of dependencies between WARC files (somewhat like full and
incremental backups). Maybe you only deduplicate between the 2–3 steps in a
broad crawl? Or between the last X broad crawls?
Regards,
-----
Peter Svanberg
National Library of Sweden
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se
Web: www.kb.se
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users