[Netarchivesuite-users] Generating CDX for duplicate entries

Søren Vejrup Carlsen svc at kb.dk
Tue May 5 18:39:15 CEST 2009


Hi Nicolas.

Information about deduplicated objects is written to the crawl.log, together with a comment (format: "deduplicate:arcfile,offset") indicating where the existing copy of the object can be found in the archive.
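As a rough illustration of how such a comment could be picked out of a crawl.log line, here is a minimal Python sketch. It assumes the usual Heritrix layout, where the URI is the fourth whitespace-separated field, and that the comment appears literally as deduplicate:arcfile,offset. The function name, the sample line and the ARC file name in it are made up for illustration and are not taken from NetarchiveSuite.

```python
import re

# Assumed shape of the comment: deduplicate:<arcfile>,<offset>
# (optionally wrapped in double quotes). This follows the description
# above, not a format specification.
DEDUP_RE = re.compile(r'deduplicate:"?(?P<arcfile>[^,"\s]+),(?P<offset>\d+)"?')


def parse_dedup_line(line):
    """Return (uri, arcfile, offset) for a deduplicated entry, else None."""
    match = DEDUP_RE.search(line)
    if match is None:
        return None
    fields = line.split()
    if len(fields) < 4:
        return None
    uri = fields[3]  # assumption: the URI is the 4th field of a crawl.log line
    return uri, match.group("arcfile"), int(match.group("offset"))


# Hypothetical sample line, for illustration only.
sample = (
    "2009-05-05T08:17:00.000Z 200 1234 http://example.org/logo.png LE "
    "http://example.org/ image/png #012 20090505081700123+45 "
    "sha1:ABCDEF - deduplicate:1-1-20090428-00001.arc,5678"
)
print(parse_dedup_line(sample))
```

Lines without the comment yield None, so the same function can be used to filter a whole crawl.log down to the deduplicated entries of one job.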

Later, when creating indexes for browsing the harvested material, we merge the information in the CDX files with the deduplication information.
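One way to picture this merge step is sketched below in Python. It is not the NetarchiveSuite implementation: the dictionary keys, the function name and the sample values are hypothetical, and a real CDX line carries more fields. The idea is only that each deduplicated fetch gets an index entry that reuses the location (arcfile, offset) of the copy already stored, but with the date of the later fetch.

```python
def merge_dedup_into_index(cdx_entries, dedup_records):
    """Return cdx_entries plus one extra entry per deduplicated fetch.

    Both arguments are lists of dicts with the (hypothetical) keys
    url, timestamp, arcfile and offset. In a dedup record, arcfile and
    offset point at the stored copy and timestamp is the later fetch.
    """
    by_location = {(e["arcfile"], e["offset"]): e for e in cdx_entries}
    merged = list(cdx_entries)
    for rec in dedup_records:
        original = by_location.get((rec["arcfile"], rec["offset"]))
        if original is None:
            continue  # stored copy is not in this index; skip rather than guess
        revisit = dict(original)  # same location, same url
        revisit["timestamp"] = rec["timestamp"]  # but the date of the later fetch
        merged.append(revisit)
    merged.sort(key=lambda e: (e["url"], e["timestamp"]))
    return merged


# Hypothetical data, for illustration only.
cdx = [{"url": "http://example.org/logo.png", "timestamp": "20090428081700",
        "arcfile": "1-1-20090428-00001.arc", "offset": 5678}]
dedup = [{"url": "http://example.org/logo.png", "timestamp": "20090505081700",
          "arcfile": "1-1-20090428-00001.arc", "offset": 5678}]
for entry in merge_dedup_into_index(cdx, dedup):
    print(entry["url"], entry["timestamp"], entry["arcfile"], entry["offset"])
```

With the sample data the index ends up with two entries for the same URL, one per weekly fetch, both pointing at the single stored copy.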

 

As for the work on integrating NAS with Wayback, please contact Colin Rosenthal (csr at statsbiblioteket.dk) for further information. I know that he has been working on a connector between Wayback and a NAS-style archive.

I hope this answers your question. Otherwise we can discuss it further at the workshop.

 

Søren

 

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr
Sent: 5 May 2009 10:17
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Generating CDX for duplicate entries


Hi,

Here at BnF I have a requirement from the librarians: if we harvest a given website, say, every week, then the end user browsing the archive should see that periodicity. We are using the Wayback Machine to access the archive, and currently I am generating CDX files from the ARC files located on the main BitArchive. However, these CDX files do not contain entries for the "revisits" of resources whose checksum has not changed, so they do not meet the requirement.

Hence I am looking for a way to generate complementary "revisit" CDX files. To keep things manageable, I intend to do it on a per-job basis. I was expecting some kind of job report listing which duplicate entries were not harvested, but I see no such report.

What information is available on deduplication once the job is complete? I know that you are also looking into interfacing NAS with Wayback, and I would be very interested in taking part in the discussions and development.

Best,

Nicolas


