[Netarchivesuite-users] Generating CDX for duplicate entries
nicolas.giraud at bnf.fr
Tue May 5 10:16:47 CEST 2009
Hi,
Here at BnF, the librarians have a requirement: if we harvest a given
website, say, every week, then the end user browsing the archive should
see that periodicity. We are using the Wayback Machine to access the
archive, and I currently generate CDX files from the ARC files located on
the main BitArchive. However, these CDX files do not contain entries for
the "revisits" of resources whose checksum has not changed, so the
requirement is not met.
Hence I am looking for a way to generate complementary "revisit" CDX
files. To keep things manageable, I intend to do it on a per-job basis. I
was expecting some kind of job report that would list which duplicate
entries were not harvested, but I see no such report.
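
To make the idea concrete, here is a rough sketch of what I have in mind
for one job. It assumes that, for each skipped duplicate, I can somehow
obtain the URL, the time of the revisit, and the ARC file and offset of
the originally stored copy. That input is exactly what I am missing, so
the DuplicateRecord type and all function names below are hypothetical.
The field order follows the common " CDX N b a m s k r V g" header, and
the URL key is only a crude stand-in for Wayback's real canonicalizer.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Common 9-field CDX header: url key, date, original url, mime type,
# status, checksum, redirect, offset, file name.
CDX_HEADER = " CDX N b a m s k r V g"


@dataclass
class DuplicateRecord:
    """Hypothetical description of one resource skipped as a duplicate."""
    url: str                # URL as requested during the revisit
    revisit_timestamp: str  # 14-digit date of the revisit (YYYYMMDDhhmmss)
    mime_type: str
    status: int
    digest: str             # checksum shared with the original capture
    arc_file: str           # ARC file holding the original capture
    offset: int             # offset of the original record in that file


def rough_url_key(url: str) -> str:
    """Crude URL key: drop scheme and leading 'www.', lowercase the rest.

    Assumption: this only approximates what Wayback's canonicalizer does.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.lower()
    if parts.query:
        key += "?" + parts.query.lower()
    return key


def revisit_cdx_line(rec: DuplicateRecord) -> str:
    """Build one CDX line dated at the revisit, pointing at the original."""
    fields = [
        rough_url_key(rec.url),
        rec.revisit_timestamp,
        rec.url,
        rec.mime_type,
        str(rec.status),
        rec.digest,
        "-",                 # no redirect information
        str(rec.offset),
        rec.arc_file,
    ]
    return " ".join(fields)


def revisit_cdx(records: list[DuplicateRecord]) -> str:
    """Return a complete CDX file body with its lines sorted."""
    lines = sorted(revisit_cdx_line(r) for r in records)
    return "\n".join([CDX_HEADER] + lines) + "\n"
```

The lines are sorted because Wayback expects sorted CDX files. Each entry
carries the date of the revisit but points at the ARC record written by
the earlier harvest, which is what would make the weekly periodicity
visible to the end user.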
What information on deduplication is available once the job is complete?
I know that you are also looking into interfacing NAS with Wayback; I
would be very interested in taking part in the discussions and development.
Best,
Nicolas
Consider the environment before printing this mail.