[Netarchivesuite-devel] Very large batch files

Søren Vejrup Carlsen svc at kb.dk
Wed Jun 15 13:32:38 CEST 2011


Hi Sara.
Some of the functionality of the indexserver is also a black box for me, alas.

However, the batch*TWO files are temporary files for batch jobs in progress, started by the LocalArcRepositoryClient.batch method, which runs a batch job on the arcrepository directories declared in the settings (common.arcrepositoryClient.fileDir).

The indexserver gets its raw data from the arcrepository by sending batch jobs from classes extending RawMetadataCache (which embeds the GetMetadataArcbatchjob).

The raw crawl-log data is placed only in FULL_CRAWL_LOG.
The other crawl-log caches seem to be Lucene indices.
 
I believe you can safely delete them now, provided nobody is reading from them or writing to them.
In any case, you don't need to restart any of the pilot applications.
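
A minimal sketch of such a check, in Python, assuming the temporary files sit directly in one directory and match the batch*TWO pattern seen in this thread. The function name and the one-hour threshold are illustrative only. A file's modification time is just a heuristic for "nobody is writing to it" and says nothing about readers, so it is still worth checking for open handles (for instance with lsof) before deleting anything.

```python
import time
from pathlib import Path


def find_stale_batch_files(tmp_dir, min_age_seconds=3600, now=None):
    """Return batch*TWO files in tmp_dir not modified for min_age_seconds.

    The name pattern comes from the file names quoted in this thread; the
    age threshold is an assumption, not a NetarchiveSuite setting.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(tmp_dir).glob("batch*TWO")):
        # Skip directories and anything modified too recently.
        if path.is_file() and now - path.stat().st_mtime >= min_age_seconds:
            stale.append(path)
    return stale
```

Calling find_stale_batch_files("/tmp") only lists candidates; the deletion itself is left as a deliberate manual step.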


I hope this helps.
If not, please say so.

/Søren
-----Original message-----
From: netarchivesuite-devel-bounces at ml.sbforge.org [mailto:netarchivesuite-devel-bounces at ml.sbforge.org] On behalf of sara.aubry at bnf.fr
Sent: 15 June 2011 11:15
To: netarchivesuite-devel at ml.sbforge.org
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: Re: [Netarchivesuite-devel] Very large batch files

Hi Søren,

Many thanks for your answer. Just to make things clearer in our heads, can
you confirm these points?

- are crawl logs used by deduplication indices, first generated in /tmp 
and then moved to cache/crawllog?
- are crawl logs used by the viewer proxy, first generated in /tmp and 
then moved to cache/FULL_CRAWL_LOG?
- why do they have weird names such as batch5383846097867971133TWO?
- we have two of these files, batch5383846097867971133TWO and
batch4763126150062911212TWO, which are parts of the same crawl.log extracted
from quite a big job (15 million URLs and 478 GB). Why two?
- can we safely delete these batch* files that saturated our /tmp without 
restarting the 4 pilot applications?
- if not, which one of the 4 needs to be restarted?


Sara








From: Søren Vejrup Carlsen <svc at kb.dk>
Sent: 15 June 2011 10:33
Sent by: <netarchivesuite-devel-bounces at ml.sbforge.org>
Reply to: <netarchivesuite-devel at ml.sbforge.org>
To: "netarchivesuite-devel at ml.sbforge.org" <netarchivesuite-devel at ml.sbforge.org>
Subject: Re: [Netarchivesuite-devel] Very large batch files



Hi Nicolas.
The full crawl-log is used for both the deduplication indices and the
crawl-log indices (used by the viewerproxies).
The deduplication indices also use the cdx data from the metadata
arc-files.
 
Unless your harvest jobs grow bigger, these files will not grow any bigger.
I think the problem here is that the data is uncompressed. Maybe we should
think about compressing the data before transmitting it.
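
To give a feel for what compression could save, here is a small Python sketch using the standard gzip module. The sample lines are made up and only loosely imitate crawl-log records, and gzip_ratio is a hypothetical helper. This does not reflect how NetarchiveSuite transfers batch results, and real ratios depend on the data.

```python
import gzip


def gzip_ratio(data: bytes) -> float:
    """Return gzip-compressed size divided by original size."""
    if not data:
        raise ValueError("data must not be empty")
    return len(gzip.compress(data)) / len(data)


# Made-up lines that loosely imitate repetitive crawl-log records.
sample = "".join(
    f"2011-06-14T09:00:{i % 60:02d}Z 200 1234 http://example.org/page{i} text/html\n"
    for i in range(10_000)
).encode("utf-8")

print(f"original: {len(sample)} bytes, ratio: {gzip_ratio(sample):.2f}")
```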
 
Best regards
Søren
PS: Please use netarchivesuite-devel at ml.sbforge.org instead of 
netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
 
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk
[mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On
behalf of Nicolas Giraud
Sent: 14 June 2011 11:10
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: [Netarchivesuite-devel] Very large batch files
 
Hi all,

We are experiencing a problem in our production environment at BnF. We
have two very large (3.8 GB) files generated by ARC repository batches
that saturate the temporary directory (see attached screenshot).

These files contain crawl log lines, but I have no way to know whether 
they are generated for the full crawl log index or the deduplication 
index. Furthermore I don't clearly understand what the full crawl log is 
intended for (the viewer proxy?). 

Should we expect these files to grow in size as the contents of the
ARC repository grow? To fix our problem, can we safely delete
these files?

Best regards,

Nicolas Giraud

-- 
Nicolas Giraud
---------------------------------------------------------------------------------------------
Web Archiving Developer - Bibliothèque nationale de France (National Library of France)
---------------------------------------------------------------------------------------------
_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at ml.sbforge.org
http://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel



