[Netarchivesuite-users] Questions about deduplication (and reduplication)

Tue Jan 14 11:36:06 CET 2020

(Should have been “how old the oldest revisit target VALUES IN recent WARC files” could be, i.e. are there long-term dependencies?)

I repeat the questions, corrected and gathered:

1)      Do you use deduplication for snapshot harvests (broad crawls)?
2)      Do you use deduplication for selective (focused) harvests?
3)      Which matching method do you use – DIGEST or URL?
4)      Do you “restart” the deduplication at intervals? How long intervals?
5)     How much space do you save – just approximately?

When looking at the code for the choice of URL or DIGEST as matching method, it seems as if the choice shouldn’t affect the result – the difference is only which field is used for the index lookup; both must still match (or almost match, for URL). But maybe I’ve missed something? Otherwise, why offer this choice? Does it depend on the choice of index database? (Not important, I’m just curious.)
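
The reading above can be sketched as a toy model. This is an illustration only, not the actual is.hi.bok.deduplicator.DeDuplicator code: the in-memory index, the `find_duplicate` function and the record fields are all hypothetical, and the "almost match" case for URLs is left out.

```python
from collections import defaultdict

# Hypothetical list of earlier captures (not the real DeDuplicator index).
EARLIER = [
    {"url": "http://example.org/a.png", "digest": "sha1:AAA", "warc": "old-1.warc"},
    {"url": "http://example.org/b.png", "digest": "sha1:AAA", "warc": "old-2.warc"},
    {"url": "http://example.org/a.png", "digest": "sha1:BBB", "warc": "old-3.warc"},
]


def build_index(records, key):
    """Group records by the field used for the index lookup ("url" or "digest")."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return index


def find_duplicate(records, url, digest, method):
    """Look up candidates by the chosen field, then require both fields to match."""
    key = "url" if method == "URL" else "digest"
    wanted = {"url": url, "digest": digest}
    for candidate in build_index(records, key)[wanted[key]]:
        if candidate["url"] == url and candidate["digest"] == digest:
            return candidate
    return None


# Compare the two lookup methods for the same new capture.
by_url = find_duplicate(EARLIER, "http://example.org/a.png", "sha1:AAA", "URL")
by_digest = find_duplicate(EARLIER, "http://example.org/a.png", "sha1:AAA", "DIGEST")
print(by_url == by_digest)
```

In this model the method only changes which field narrows the candidates. Whether a real index or an "almost match" rule for URLs makes the two methods differ is exactly the open question.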

Regards,

Peter

From: Peter Svanberg <Peter.Svanberg at kb.se>
Sent: 13 January 2020 23:06
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Questions about deduplication (and reduplication)

Thanks, Sara!
So, when reduplicating, e.g. for Wayback or pywb usage, all potential revisit target files must be reachable – not a problem? Kristinn mentioned that generating indexes (of content) can take much longer, as it has to look up in URL indexes and open a lot of files. Is that something you (or others) have experienced?

Do you have any idea of how old the oldest revisit target to recent warc files could be? Five, maybe ten years, then?

And I add a fifth question:

5) How much space do you save – just approximately?

      Peter

On 13 Jan 2020 at 17:47, sara.aubry at bnf.fr wrote:
Hi Peter,

For BnF,
1) yes
2) you probably mean focused crawls: yes
3) URL
4) Only when we have a major change in the crawler or the data format. Which means, the least possible.
Because it really saves a lot of space, and also because we don't care about intervals between WARC files: that's what WARC revisit records were made for.
Deduplication also sometimes incidentally restarts when the previous capture of a harvest is not finished (either at crawl stage or post-processing stage) or has crashed.

Best,

Sara

From:        "Peter Svanberg" <Peter.Svanberg at kb.se>
To:          netarchivesuite-users at ml.sbforge.org
Date:        13/01/2020 17:31
Subject:     [Netarchivesuite-users] Questions about deduplication (and reduplication)
Sent by:     "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________

Hello!

I’m trying to understand how NAS and Heritrix handle deduplication, which led to an internal discussion about its overall pros and cons. I then found Kristinn Sigurðsson’s interesting web archiving blog articles. He has written about de- and reduplication: https://kris-sigur.blogspot.com/2015/01/the-downside-of-web-archive.html

Some short questions about the deduplication in NAS (is.hi.bok.deduplicator.DeDuplicator) that I would appreciate quick answers on (from all NAS user sites):

1)      Do you use deduplication for snapshot harvests (broad crawls)?
2)      Do you use deduplication for snapshot harvests?
3)      Which matching method do you use – DIGEST or URL?
4)      Do you “restart” the deduplication at intervals? How long intervals?

By (4) I mean that you do a harvest with no deduplication, limiting the number of dependencies between WARC files. (Somewhat like full and incremental backups.) Maybe you just do deduplication between the 2–3 steps in a broad crawl? Or between the last X broad crawls?
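
The backup analogy in (4) can be made concrete with a small sketch. It assumes a hypothetical series of numbered harvests in which every `restart_every`-th harvest runs without deduplication, and in which a revisit record for an unchanged resource points to the capture in the most recent non-deduplicated harvest. The function names and parameters are illustrative only, not NAS settings.

```python
def revisit_target(harvest, restart_every):
    """Return the harvest holding the payload that `harvest` would refer to.

    Harvests are numbered from 0. Every `restart_every`-th harvest is a
    "full" one (no deduplication); the others deduplicate against it.
    Assumes the resource never changes, which is the worst case for
    dependency length.
    """
    if restart_every < 1:
        raise ValueError("restart_every must be at least 1")
    return harvest - (harvest % restart_every)


def max_dependency_span(n_harvests, restart_every):
    """Largest distance, in harvests, between a revisit record and its target."""
    return max(h - revisit_target(h, restart_every) for h in range(n_harvests))


# Show which harvest each of eight harvests depends on when restarting every fourth.
for h in range(8):
    print(h, "->", revisit_target(h, restart_every=4))
```

With no restart at all (a `restart_every` larger than the number of harvests), the span in this model grows with the length of the series, which is the long-term dependency the question is about.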

Regards,
-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

 _______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users