[Netarchivesuite-users] err=java.util.ConcurrentModificationException : a problem with de-duplication ?
sara.aubry at bnf.fr
Wed Jul 24 14:29:20 CEST 2013
Hello everyone,
While doing some QA on content we are collecting behind paywalls,
we noticed that we sometimes (not regularly) get an error message in the
crawl.log.
It does not break the job, but it mixes up the seeds, and therefore the
object counts per domain at the end of the crawl.
Here is an example: the job had 2 domains:
- republicain-lorrain.fr => with
http://www.republicain-lorrain.fr/e-services/Login as a seed
- leprogres.fr => with http://www.leprogres.fr/e-services/Login as a seed
At the very beginning, everything goes fine:
2013-07-14T10:05:10.092Z 200 27850
http://www.republicain-lorrain.fr/fr/images/get.aspx?imedia=46892303 RLE
http://www.republicain-lorrain.fr/fr/editions-numeriques/nos-supplements.html
image/jpeg #117 20130714100509974+117
sha1:7V3VST4HJSJNV5ZD46I37CBSXOKHGNCO
http://www.republicain-lorrain.fr/e-services/Login
duplicate:"6212-33-20130701100514-00001-BnF_ciblee_2013_gulliver120.bnf.fr.arc.gz,1787267",content-size:27985
Then we get this message:
2013-07-14T10:18:19.317Z -5 208
http://www.leprogres.fr/e-services/Login - - text/html #193
20130714101819091+223 sha1:PPVEKMAZQDFXRVELWWZW4G2DLLZZM53J
http://www.leprogres.fr/e-services/Login
err=java.util.ConcurrentModificationException,4t
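To see how often this happens over a whole crawl, the err= annotations can be counted per exception class. This is only a minimal sketch: it assumes the exception is always written as err=<class name> somewhere in the log line, as in the entry above, and count_errors is a hypothetical helper name, not part of NetarchiveSuite or Heritrix.

```python
import re
from collections import Counter

# Matches the "err=<exception class>" annotation seen in the entry above.
# Assumption: the class name ends at the first comma or whitespace.
ERR_RE = re.compile(r"err=([A-Za-z_][\w.$]*)")


def count_errors(lines):
    """Count err= annotations per exception class over crawl.log lines."""
    counts = Counter()
    for line in lines:
        match = ERR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be called as count_errors(open("crawl.log")), with each log entry on a single line as Heritrix writes it (the entries in this mail are wrapped by the mail client).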
Then http://www.republicain-lorrain.fr/e-services/Login is recorded as the
seed for leprogres.fr URLs:
2013-07-14T10:18:22.722Z 200 218227 http://www.leprogres.fr/ RL
http://www.republicain-lorrain.fr/ text/html #193 20130714101821322+860
sha1:2QOEEA4FDQFHJ2UV2FPS5ND3ED7G3Y54
http://www.republicain-lorrain.fr/e-services/Login content-size:218397
2013-07-14T10:18:26.228Z 200 24038
http://www.leprogres.fr/fr/images/favicon.ico RLE http://www.leprogres.fr/
image/vnd.microsoft.icon #195 20130714101826166+60
sha1:M6A3E6GBDPPTIZYK7PYRDT2BT24IIHFH
http://www.republicain-lorrain.fr/e-services/Login
duplicate:"6212-33-20130701100514-00001-BnF_ciblee_2013_gulliver120.bnf.fr.arc.gz,40045",content-size:24278
2013-07-14T10:18:28.481Z 200 170061
http://www.leprogres.fr/fr/javascript/v3/jquery.pack.js RLE
http://www.leprogres.fr/ application/javascript #195 20130714101828234+179
sha1:QKMCFUEY5ZO4FWXZXNG4T6ZHCYNLSBQ3
http://www.republicain-lorrain.fr/e-services/Login
duplicate:"6212-33-20130701100514-00001-BnF_ciblee_2013_gulliver120.bnf.fr.arc.gz,47536",content-size:170316
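The mixed-up entries can be listed by comparing the domain of the fetched URL with the domain of the seed recorded on the same line. This is a rough sketch under assumptions taken from the excerpts above: fields are whitespace-separated, the URL is the 4th field and the seed (source tag) the 11th. The names base_domain and seed_mismatches are hypothetical, and the "last two labels" rule for the domain is naive (it works for .fr hosts but not for suffixes such as .co.uk).

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Naive domain: the last two labels of the host name (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def seed_mismatches(lines):
    """Yield (url, seed) for entries whose seed is on another domain."""
    for line in lines:
        # Keep the annotations (12th field) in one piece.
        fields = line.split(None, 11)
        if len(fields) < 11:
            continue  # not a complete crawl.log entry
        url, seed = fields[3], fields[10]
        # Skip dns: entries and entries without a recorded seed ("-").
        if not url.startswith("http") or not seed.startswith("http"):
            continue
        if base_domain(url) != base_domain(seed):
            yield url, seed
```

A mismatch is only a hint: resources legitimately embedded from another host also differ from their seed. It is most telling when, as here, the URL's domain has its own seed in the same job.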
Have you ever experienced this problem? Is it a NetarchiveSuite or a
Heritrix problem?
As I said, this problem does not occur regularly.
Best,
Sara