[Netarchivesuite-users] err=java.util.ConcurrentModificationException : a problem with de-duplication ?

sara.aubry at bnf.fr
Wed Jul 24 14:29:20 CEST 2013


Hello everyone,

While doing some QA on content we are collecting behind paywalls,
we noticed that we sometimes (it does not happen regularly) get an error
message in the crawl.log.
It does not break the job, but it mixes up the seeds, and thus the object
counts at the end of the crawl.

Here is an example: the job had 2 domains:
- republicain-lorrain.fr => with 
http://www.republicain-lorrain.fr/e-services/Login as a seed
- leprogres.fr => with http://www.leprogres.fr/e-services/Login as a seed

At the very beginning, everything goes fine:
2013-07-14T10:05:10.092Z   200      27850 
http://www.republicain-lorrain.fr/fr/images/get.aspx?imedia=46892303 RLE 
http://www.republicain-lorrain.fr/fr/editions-numeriques/nos-supplements.html 
image/jpeg #117 20130714100509974+117 
sha1:7V3VST4HJSJNV5ZD46I37CBSXOKHGNCO 
http://www.republicain-lorrain.fr/e-services/Login 
duplicate:"6212-33-20130701100514-00001-BnF_ciblee_2013_gulliver120.bnf.fr.arc.gz,1787267",content-size:27985

Then we get this message:
2013-07-14T10:18:19.317Z    -5        208 
http://www.leprogres.fr/e-services/Login - - text/html #193 
20130714101819091+223 sha1:PPVEKMAZQDFXRVELWWZW4G2DLLZZM53J 
http://www.leprogres.fr/e-services/Login 
err=java.util.ConcurrentModificationException,4t
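
To see how often this happens in a given job, the error entries can be pulled out of the crawl.log with a small script. The following is only a minimal sketch: it assumes the usual 12 whitespace-separated Heritrix crawl.log fields (timestamp, status, size, URL, discovery path, referrer, MIME type, thread, fetch time, digest, source tag, annotations) and that each entry sits on a single line in the real file (the entries in this mail are wrapped by the mail client). The names CrawlLogEntry, parse_line and error_entries are made up for the illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class CrawlLogEntry(NamedTuple):
    """One crawl.log entry, assuming the usual 12-field layout."""
    timestamp: str
    status: int
    size: str
    url: str
    path: str
    referrer: str
    mime: str
    thread: str
    fetch_time: str
    digest: str
    source_tag: str
    annotations: str


def parse_line(line: str) -> Optional[CrawlLogEntry]:
    """Parse a single (unwrapped) crawl.log line; return None if it does not fit."""
    parts = line.strip().split(None, 11)
    if len(parts) < 11:
        return None
    if len(parts) == 11:
        # Assumption: a missing annotations field is treated like "-".
        parts.append("-")
    try:
        status = int(parts[1])
    except ValueError:
        return None
    return CrawlLogEntry(parts[0], status, *parts[2:])


def error_entries(lines: Iterable[str]) -> Iterator[CrawlLogEntry]:
    """Yield the entries whose annotations field carries an err=... note."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and "err=" in entry.annotations:
            yield entry
```

Used as `for e in error_entries(open("crawl.log")): print(e.timestamp, e.url, e.annotations)`, this lists every fetch that ended with an exception annotation, which makes it easier to compare the timestamps with the point where the seeds start to be mixed up.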

Then http://www.republicain-lorrain.fr/e-services/Login appears as a seed 
for leprogres.fr:
2013-07-14T10:18:22.722Z   200     218227 http://www.leprogres.fr/ RL 
http://www.republicain-lorrain.fr/ text/html #193 20130714101821322+860 
sha1:2QOEEA4FDQFHJ2UV2FPS5ND3ED7G3Y54 
http://www.republicain-lorrain.fr/e-services/Login content-size:218397
2013-07-14T10:18:26.228Z   200      24038 
http://www.leprogres.fr/fr/images/favicon.ico RLE http://www.leprogres.fr/ 
image/vnd.microsoft.icon #195 20130714101826166+60 
sha1:M6A3E6GBDPPTIZYK7PYRDT2BT24IIHFH 
http://www.republicain-lorrain.fr/e-services/Login 
duplicate:"6212-33-20130701100514-00001-BnF_ciblee_2013_gulliver120.bnf.fr.arc.gz,40045",content-size:24278
2013-07-14T10:18:28.481Z   200     170061 
http://www.leprogres.fr/fr/javascript/v3/jquery.pack.js RLE 
http://www.leprogres.fr/ application/javascript #195 20130714101828234+179 
sha1:QKMCFUEY5ZO4FWXZXNG4T6ZHCYNLSBQ3 
http://www.republicain-lorrain.fr/e-services/Login 
duplicate:"6212-33-20130701100514-00001-BnF_ciblee_2013_gulliver120.bnf.fr.arc.gz,47536",content-size:170316
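
The mix-up shown above can also be searched for automatically. The sketch below flags entries whose URL belongs to the site of one seed of the job while the source tag (11th field) names a different seed. It is an illustration under assumptions, not a NetarchiveSuite or Heritrix tool: the "site" is naively taken as the last two labels of the host name (wrong for suffixes like co.uk), each entry is expected on one unwrapped line, and the names site_of and misattributed are invented here.

```python
from typing import Iterable, Iterator, Tuple
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Naive site key: last two labels of the host name (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def misattributed(lines: Iterable[str],
                  seeds: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (url, logged_seed, expected_seed) for entries whose URL is on
    the site of one seed but whose source tag is a different seed."""
    seed_by_site = {site_of(seed): seed for seed in seeds}
    for line in lines:
        parts = line.strip().split(None, 11)
        if len(parts) < 11:
            continue  # not a complete crawl.log entry
        url, source_tag = parts[3], parts[10]
        expected = seed_by_site.get(site_of(url))
        if expected is not None and source_tag != expected:
            yield url, source_tag, expected
```

Called with the two seeds of this job, it would report http://www.leprogres.fr/ as logged under the republicain-lorrain.fr seed. URLs on third-party hosts are ignored on purpose, since an embedded resource from another site legitimately keeps the seed it was discovered from.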

Have you ever experienced this problem? Is it a NetarchiveSuite or a 
Heritrix problem?
As I said, this problem does not occur regularly.

Best,

Sara



