[Netarchivesuite-devel] Very large batch files

Wed Jun 15 11:15:13 CEST 2011

Hi Søren,

Many thanks for your answer. Just to make things clearer for us, could 
you confirm the following points?

- Are the crawl logs used for the deduplication indices first generated 
in /tmp and then moved to cache/crawllog?
- Are the crawl logs used by the viewer proxy first generated in /tmp and 
then moved to cache/FULL_CRAWL_LOG?
- Why do they have odd names such as batch5383846097867971133TWO?
- We have two of these files, batch5383846097867971133TWO and 
batch4763126150062911212TWO, which are part of the same crawl.log extracted 
from a fairly large job (15 million URLs and 478 GB). Why two?
- Can we safely delete these batch* files, which have saturated our /tmp, 
without restarting the 4 pilot applications?
- If not, which of the 4 needs to be restarted?

Sara

From: Søren Vejrup Carlsen <svc at kb.dk>
Date: 15/06/2011 10:33
Sent by: <netarchivesuite-devel-bounces at ml.sbforge.org>
Please reply to: <netarchivesuite-devel at ml.sbforge.org>
To: "netarchivesuite-devel at ml.sbforge.org" 
<netarchivesuite-devel at ml.sbforge.org>
Cc:
Subject: Re: [Netarchivesuite-devel] Very large batch files

Hi Nicolas.
The full crawl log is used for both the deduplication indices and the 
crawl-log indices (used by the viewer proxies).
The deduplication indices also use the cdxdata from the metadata 
ARC files.

Unless your harvest jobs grow bigger, these files will not grow any bigger. 
I think the problem here is that the data is uncompressed. We should 
maybe think about compressing the data before transmitting it.
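
As a rough illustration of the compression idea above, here is a minimal Python sketch that gzips a text file in chunks and reports the resulting size. It is not NetarchiveSuite code: the function name `gzip_file`, the sample log line and the file names are all hypothetical, and how much real crawl-log data would shrink is not something this sketch can tell.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Compress src_path into dst_path with gzip, streaming in chunks.

    Returns the size of the compressed file in bytes.
    """
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Stream the data so that a multi-gigabyte file is never held in memory.
        shutil.copyfileobj(src, dst, chunk_size)
    return os.path.getsize(dst_path)


if __name__ == "__main__":
    # Hypothetical sample line standing in for crawl-log content.
    line = b"2011-06-14T09:10:00Z 200 1234 http://example.org/page text/html\n"
    workdir = tempfile.mkdtemp()
    raw = os.path.join(workdir, "sample.log")
    packed = raw + ".gz"
    with open(raw, "wb") as f:
        f.write(line * 10000)
    print(os.path.getsize(raw), "bytes ->", gzip_file(raw, packed), "bytes")
    shutil.rmtree(workdir)
```

Streaming through `shutil.copyfileobj` keeps memory use flat regardless of file size, which matters for files in the gigabyte range.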

Best regards
Søren
PS: Please use netarchivesuite-devel at ml.sbforge.org instead of 
netarchivesuite-devel at lists.gforge.statsbiblioteket.dk

From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk 
[mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On 
behalf of Nicolas Giraud
Sent: 14 June 2011 11:10
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: [Netarchivesuite-devel] Very large batch files

Hi all,

We are experiencing a problem in our production environment at BnF. We 
have two very large (3.8 GB) files generated by ARC repository batches 
that saturate the temporary directory (see attached screenshot). 

These files contain crawl log lines, but I have no way of knowing whether 
they are generated for the full crawl log index or the deduplication 
index. Furthermore, I don't clearly understand what the full crawl log is 
intended for (the viewer proxy?). 

Should we expect these files to grow in size as the contents of the 
ARC repository grow? To fix our problem, can we safely delete 
these files?
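
To see which files are taking up the space before deciding anything, a small read-only Python sketch along these lines could list them by size. The function name `list_batch_files` is hypothetical, and the `batch*` pattern and the use of the system temporary directory are assumptions based on the file names mentioned in this thread. It only inspects files and deletes nothing.

```python
import glob
import os
import tempfile


def list_batch_files(directory, pattern="batch*"):
    """Return (path, size_in_bytes) pairs for matching files, largest first."""
    entries = []
    for path in glob.glob(os.path.join(directory, pattern)):
        # Skip directories and anything that is not a regular file.
        if os.path.isfile(path):
            entries.append((path, os.path.getsize(path)))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    # Assumption: the batch files live in the system temporary directory.
    for path, size in list_batch_files(tempfile.gettempdir()):
        print(f"{size / 1024 ** 3:8.2f} GiB  {path}")
```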

Best regards,

Nicolas Giraud

-- 
Nicolas Giraud
---------------------------------------------------------------------------------------------
Développeur Archives du Web - Bibliothèque Nationale de France
Web Archiving Developer - National Library of France
---------------------------------------------------------------------------------------------
