[Netarchivesuite-devel] Questions regarding WARC

Mikis Seth Sørensen mss at statsbiblioteket.dk
Wed Apr 23 12:50:30 CEST 2014


Hi Sara 

Please find my answers to your questions below. Feel free to create JIRA
issues for the new features.

Best 
Mikis

On 4/16/14 3:00 PM, "sara.aubry at bnf.fr" <sara.aubry at bnf.fr> wrote:

>Hello everyone,
>
>As I mentioned yesterday, we're testing NAS with WARC extensively to
>prepare our big move to WARC.
>We have a few questions regarding the configuration you chose to
>implement:
>- are you using the default WARCArchiver from Heritrix
>(org.archive.crawler.writer.WARCWriterProcessor) or the one from NAS
>(dk.netarkivet.harvester.harvesting.WARCWriterProcessor) ?
>- from our tests, neither one is producing revisit records for
>duplicates: 
>is that correct? Would it be complicated to change the NAS WARCWriter to
>produce them?
According to Søren, it is not the WARCWriter that needs to be changed to
enable generation of revisit records, but the deduplication module. The
estimated effort for this is 1-2 weeks. If the indexer should also be
changed to use the revisit records, the work needed would be much greater,
around 1-2 months.
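
For reference, a revisit record notes a duplicate capture and points back to
the earlier record instead of storing the payload again. Below is a minimal
sketch of the header fields such a record would carry, assuming the
identical-payload-digest profile from WARC 1.0. The class name, method name
and sample values are hypothetical; this is not code from the NAS
deduplication module or from Heritrix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of NAS or Heritrix. */
public final class RevisitHeaderSketch {

    /** Profile URI defined by WARC 1.0 for payload-digest based revisits. */
    static final String PROFILE =
            "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest";

    private RevisitHeaderSketch() {}

    /** Builds the named header fields of a revisit record, in insertion order. */
    public static Map<String, String> revisitHeaders(String targetUri,
                                                     String warcDate,
                                                     String recordId,
                                                     String refersToRecordId,
                                                     String payloadDigest) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "revisit");
        headers.put("WARC-Target-URI", targetUri);
        headers.put("WARC-Date", warcDate);
        headers.put("WARC-Record-ID", recordId);
        // Points at the record that holds the original payload.
        headers.put("WARC-Refers-To", refersToRecordId);
        headers.put("WARC-Profile", PROFILE);
        headers.put("WARC-Payload-Digest", payloadDigest);
        return headers;
    }

    public static void main(String[] args) {
        // All values below are placeholders made up for the example.
        Map<String, String> headers = revisitHeaders(
                "http://example.org/",
                "2014-04-23T10:50:30Z",
                "<urn:uuid:00000000-0000-0000-0000-000000000002>",
                "<urn:uuid:00000000-0000-0000-0000-000000000001>",
                "sha1:PLACEHOLDERDIGEST");
        for (Map.Entry<String, String> e : headers.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

The sketch only shows what the deduplication step would have to supply (the
digest and a reference to the earlier record). How those values are looked
up is not covered here.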

>- we would also like to have a prefix in metadata files (either
>BnF-1-1-metadata-1.warc or 1-1-metadata-BnF-1.warc). Would that be
>easy/possible to implement?
Yes, that should be quite easy. Should this setting be the same one used
for the (WARC) file naming?
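
To make the two naming options concrete, here is a small sketch that builds
both forms of the file name. It is an illustration only: the class and
method names are hypothetical, and reading the two leading numbers as a job
ID and a harvest ID is an assumption, not something taken from the NAS code.

```java
import java.util.Locale;

/** Hypothetical illustration only; not the NAS implementation. */
public final class MetadataFileNames {

    private MetadataFileNames() {}

    /** Option 1: prefix first, e.g. BnF-1-1-metadata-1.warc */
    public static String withLeadingPrefix(String prefix, long jobId,
                                           long harvestId, int serial) {
        // jobId/harvestId are assumed meanings for the two leading numbers.
        return String.format(Locale.ROOT, "%s-%d-%d-metadata-%d.warc",
                prefix, jobId, harvestId, serial);
    }

    /** Option 2: prefix after "metadata", e.g. 1-1-metadata-BnF-1.warc */
    public static String withInnerPrefix(String prefix, long jobId,
                                         long harvestId, int serial) {
        return String.format(Locale.ROOT, "%d-%d-metadata-%s-%d.warc",
                jobId, harvestId, prefix, serial);
    }

    public static void main(String[] args) {
        System.out.println(withLeadingPrefix("BnF", 1, 1, 1));
        System.out.println(withInnerPrefix("BnF", 1, 1, 1));
    }
}
```

If the prefix is read from the same setting as the ordinary WARC file
prefix, only the place where the string is assembled would differ between
the two options.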
>Best,
>
>Sara
>
>



