[Netarchivesuite-users] Generating CDX for duplicate entries

nicolas.giraud at bnf.fr nicolas.giraud at bnf.fr
Tue May 5 10:16:47 CEST 2009


Hi,

Here at BnF, the librarians require that if we harvest a given website, 
say, every week, the end user browsing the archive should see that 
periodicity. We are using the Wayback Machine to access the archive, and 
currently I am generating CDX files from the ARC files located on the main 
BitArchive. However, these CDX files do not contain entries for the 
"revisits" of resources whose checksum has not changed, which does not 
meet the requirement.

Hence I am looking for a way to generate complementary "revisit" CDX 
files. To keep things manageable, I intend to do it on a per-job basis. I 
was expecting some kind of job report listing the duplicate entries that 
were not harvested, but I see no such report.

What information on deduplication is available once the job is complete? 
I know that you are also looking into interfacing NAS with Wayback; I 
would be very interested in taking part in the discussions and development.

Best,

Nicolas



