[Netarchivesuite-devel] Very large batch files

Søren Vejrup Carlsen svc at kb.dk
Wed Jun 15 10:33:45 CEST 2011

Hi Nicolas.
The full crawl log is used for both the deduplication indices and the crawl-log indices (the latter are used by the viewer proxies).
The deduplication indices also use the CDX data from the metadata ARC files.

Unless your harvest jobs grow bigger, these files will not grow any bigger.
I think the problem here is that the data is uncompressed. We should maybe think about compressing the data before transmitting it.
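
A minimal sketch of that idea, in Python and not taken from the NetarchiveSuite code base: gzip the batch output on the sending side and unpack it on the receiving side. The function names, file names and the sample log line are all invented for illustration; real crawl log lines carry more fields.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # copy in 1 MiB pieces so large files never sit fully in memory


def compress_file(src: Path, dst: Path) -> int:
    """Gzip src into dst and return the compressed size in bytes."""
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, CHUNK_SIZE)
    return dst.stat().st_size


def decompress_file(src: Path, dst: Path) -> None:
    """Restore the original bytes on the receiving side."""
    with gzip.open(src, "rb") as fin, dst.open("wb") as fout:
        shutil.copyfileobj(fin, fout, CHUNK_SIZE)


def roundtrip_demo(line_count: int = 10000):
    """Compress made-up log lines, restore them, and report the sizes."""
    # Invented stand-in for a crawl log line; not real harvest data.
    lines = [
        f"2011-06-14T09:10:00Z 200 1234 http://example.org/page{i} text/html\n"
        for i in range(line_count)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "batch_output.txt"
        packed = Path(tmp) / "batch_output.txt.gz"
        restored = Path(tmp) / "batch_output.restored.txt"
        raw.write_text("".join(lines), encoding="utf-8")
        packed_size = compress_file(raw, packed)
        decompress_file(packed, restored)
        identical = raw.read_bytes() == restored.read_bytes()
        return raw.stat().st_size, packed_size, identical
```

Calling `roundtrip_demo()` returns the uncompressed size, the compressed size, and whether the restored file matches the original byte for byte. Log lines are highly repetitive, so the compressed file should be much smaller, but how much smaller on real batch output would have to be measured.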

Best regards
PS: Please use netarchivesuite-devel at ml.sbforge.org instead of netarchivesuite-devel at lists.gforge.statsbiblioteket.dk

From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk On behalf of Nicolas Giraud
Sent: 14 June 2011 11:10
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: [Netarchivesuite-devel] Very large batch files

Hi all,

We are experiencing a problem in our production environment at BnF. We have two very large (3.8 GB) files generated by ARC repository batches that saturate the temporary directory (see attached screenshot).

These files contain crawl log lines, but I have no way to know whether they are generated for the full crawl log index or the deduplication index. Furthermore, I don't clearly understand what the full crawl log is intended for (the viewer proxy?).

Should we expect these files to grow in size as the contents of the ARC repository grow? To fix our problem, can we safely delete these files?

Best regards,

Nicolas Giraud
Développeur Archives du Web - Bibliothèque Nationale de France
Web Archiving Developer - National Library of France