[Netarchivesuite-devel] Questions regarding WARC
Mikis Seth Sørensen
mss at statsbiblioteket.dk
Wed Apr 23 12:50:30 CEST 2014
Hi Sara
Please find my answers to your questions below. Feel free to create JIRA
issues for the new features.
Best
Mikis
On 4/16/14 3:00 PM, "sara.aubry at bnf.fr" <sara.aubry at bnf.fr> wrote:
>Hello everyone,
>
>As I mentioned yesterday, we're testing NAS with WARC extensively to
>prepare our big move to WARC.
>We have a few questions regarding the configuration you chose to
>implement:
>- are you using the default WARCArchiver from Heritrix
>(org.archive.crawler.writer.WARCWriterProcessor) or the one from NAS
>(dk.netarkivet.harvester.harvesting.WARCWriterProcessor) ?
>- from our tests, neither one is producing revisit records for
>duplicates:
>is that correct? Would it be complicated to change the WARCWriter from
>NAS to produce them?
According to Søren, it is not the WarcWriter that needs to be changed to
enable generation of revisit records, but the deduplication module. The
estimated effort to do this is 1-2 weeks. If the indexer should also be
changed to use the revisit records, the work needed would be much greater,
around 1-2 months.
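For reference, a revisit record in WARC 1.0 is distinguished by a handful of
named header fields. The sketch below only assembles those fields to show what
the deduplication module would have to supply. The class and method names are
hypothetical and are not NAS or Heritrix API; a complete record would also need
Content-Length and the record block, which are left out here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NAS or Heritrix.
final class RevisitHeaderSketch {
    // Revisit profile URI defined by WARC 1.0 for identical payload digests.
    static final String PROFILE =
        "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest";

    private RevisitHeaderSketch() {}

    // Builds the named fields that mark a record as a revisit of an
    // earlier capture. All argument values are supplied by the caller.
    static Map<String, String> revisitHeaders(String targetUri,
                                              String warcDate,
                                              String recordId,
                                              String refersToRecordId,
                                              String payloadDigest) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "revisit");
        headers.put("WARC-Target-URI", targetUri);
        headers.put("WARC-Date", warcDate);
        headers.put("WARC-Record-ID", recordId);
        // Points at the record that holds the original payload.
        headers.put("WARC-Refers-To", refersToRecordId);
        headers.put("WARC-Profile", PROFILE);
        headers.put("WARC-Payload-Digest", payloadDigest);
        return headers;
    }
}
```

The point of the sketch is that WARC-Refers-To and WARC-Payload-Digest come
from the duplicate lookup, which is why the change sits in the deduplication
module rather than in the writer.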
>- we would also like to have a prefix in metadata files (either
>BnF-1-1-metadata-1.warc or 1-1-metadata-BnF-1.warc). Would that be
>easy/possible to implement?
Yes, that should be quite easy. Should this setting be the same one used
for the (warc) file naming?
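As a rough illustration of the two naming patterns mentioned above, here is a
minimal Java sketch. The class and method names are hypothetical and not part
of NAS, and since the thread does not say what the numeric fields stand for,
they are passed through as plain numbers.

```java
// Hypothetical helper, not part of NAS.
final class MetadataFileNames {
    private MetadataFileNames() {}

    // Pattern 1: prefix at the front, e.g. BnF-1-1-metadata-1.warc
    static String prefixFirst(String prefix, long first, long second, int serial) {
        String base = first + "-" + second + "-metadata-" + serial + ".warc";
        return isBlank(prefix) ? base : prefix + "-" + base;
    }

    // Pattern 2: prefix before the serial, e.g. 1-1-metadata-BnF-1.warc
    static String prefixBeforeSerial(String prefix, long first, long second, int serial) {
        String middle = isBlank(prefix) ? "" : prefix + "-";
        return first + "-" + second + "-metadata-" + middle + serial + ".warc";
    }

    // An absent or empty prefix falls back to the unprefixed name.
    private static boolean isBlank(String s) {
        return s == null || s.isEmpty();
    }
}
```

With these definitions, prefixFirst("BnF", 1, 1, 1) gives
BnF-1-1-metadata-1.warc and prefixBeforeSerial("BnF", 1, 1, 1) gives
1-1-metadata-BnF-1.warc; an empty prefix yields 1-1-metadata-1.warc in both
cases.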
>Best,
>
>Sara