[Netarchivesuite-users] Troubleshooting second stage and deduplication

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Feb 24 18:38:04 CET 2010


Hello everyone,

We just launched the second stage of our broad crawl (still testing).

At 16:30, the IndexServer started to build the deduplication index. 
It is based on:
1349 jobs, 
3.7 TB of data (ARC files), 
87 GB of metadata (metadata ARC files).

1.4 million domains were completed in the first stage,
176,000 domains reached the max object limit,
74,000 domains have "harvest aborted" as their stop reason.

1) Will all of my 176,000 + 74,000 domains be included in the second stage? 
I think the answer is yes, but I would like to be sure.

2) How much disk space do we need to store the working cache files and the 
target deduplication index?
For now, we have a 70 GB partition which is 90% full...
Could you explain again how this index is created: from which jobs and 
for which jobs it is built, whether it is one index or several different 
indices, and where it is stored (centrally or locally on the crawlers)?
Do you have statistics on the size of your index? 
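
To put numbers on questions 1 and 2, here is a rough sketch of the 
arithmetic. It only uses the figures quoted in this message; the variable 
names are illustrative and nothing here is NetarchiveSuite code.

```python
# Figures quoted in this message; names are illustrative only.
max_object_limit_domains = 176_000
harvest_aborted_domains = 74_000

# Domains we expect to see again in the second stage (question 1).
second_stage_candidates = max_object_limit_domains + harvest_aborted_domains

# Space left on the cache partition (question 2).
partition_gb = 70
used_percent = 90
free_gb = partition_gb * (100 - used_percent) / 100

print(second_stage_candidates)  # 250000
print(free_gb)                  # 7.0
```

So roughly 250,000 domains are concerned, and only about 7 GB remain free 
on the partition.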

3) We are running into many different errors:
- NetarchiveSuite lost its connection to the database system,
- some jobs started and are still running,

- lots of jobs are failing for the following reason:

Tr : Netarkivet error: Trouble during postprocessing of files in 
'/bnf/netarchivesuite/bin/MAB2/jobs/current/low/1390_1267026208952'. 
Errors accumulated during the postprocessing: IOFailure occurred, while 
trying to upload files
atlas507.bnf.fr
dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer.processHarvestInfoFile(HarvestControllerServer.java:569)
Trouble during postprocessing of files in 
'/bnf/netarchivesuite/bin/MAB2/jobs/current/low/1390_1267026208952'. 
Errors accumulated during the postprocessing: IOFailure occurred, while 
trying to upload files

dk.netarkivet.common.exceptions.IOFailure: IOFailure occurred, while trying to upload files
        at dk.netarkivet.harvester.harvesting.HarvestController.storeFiles(HarvestController.java:289)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer.processHarvestInfoFile(HarvestControllerServer.java:561)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer.processOldJobs(HarvestControllerServer.java:282)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer.access$400(HarvestControllerServer.java:87)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:674)
Caused by: dk.netarkivet.common.exceptions.IOFailure: jobs/current/low/1390_1267026208952/arcs is not a directory
        at dk.netarkivet.harvester.harvesting.IngestableFiles.getArcFiles(IngestableFiles.java:244)
        at dk.netarkivet.harvester.harvesting.HarvestController.storeFiles(HarvestController.java:274)
        ... 4 more
--------------------------------------------------------
- other jobs are failing for another reason:

Tr : Netarkivet error: Fatal error while operating job 'Job 1390 (state = 
SUBMITTED, HD = 4, priority = LOWPRIORITY, forcemaxcount = 10000, 
forcemaxbytes = -1, orderxml = default, numconfigs = 98)'
atlas507.bnf.fr
dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:670)
Fatal error while operating job 'Job 1390 (state = SUBMITTED, HD = 4, 
priority = LOWPRIORITY, forcemaxcount = 10000, forcemaxbytes = -1, 
orderxml = default, numconfigs = 98)'
dk.netarkivet.common.exceptions.IOFailure: Timeout waiting for reply of 
index request for jobs 
3,4,5,6,7,8,9,10,11,12,13,14,15,17,16,19,18,21,20,23,22,25,24,27,26,29,28,31,30,34,35,32,33,38,39,36,37,42,43,40,41,46,47,44,45,51,50,49,48,55,54,53,52,59,58,57,56,63,62,61,60,68,69,70,71,64,65,66,67,76,77,78,79,72,73,74,75,85,84,87,86,81,80,83,82,93,92,95,94,89,88,91,90,102,103,100,101,98,99,96,97,110,111,108,109,106,107,104,105,119,118,117,116,115,114,113,112,127,126,125,124,123,122,121,120,137,136,139,138,141,140,143,142,129,128,131,130,133,132,135,134,152,153,154,155,156,157,158,159,144,145,146,147,148,149,150,151,171,170,169,168,175,174,173,172,163,162,161,160,167,166,165,164,186,187,184,185,190,191,188,189,178,179,176,177,182,183,180,181,205,204,207,206,201,200,203,202,197,196,199,198,193,192,195,194,220,221,222,223,216,217,218,219,212,213,214,215,208,209,210,211,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225,224,254,255,252,253,250,251,248,249,246,247,244,245,242,243,240,241,275,274,273,272,[...],1058,1059,1056,1057,1062,1063,1060,1061,1083,1082,1081,1080,1087,1086,1085,1084,1075,1074,1073,1072,1079,1078,1077,1076,1221,1220,1223,1222,1217,1216,1219,1218,1229,1228,1231,1230,1225,1224,1227,1226,1236,1237,1238,1239,1232,1233,1234,1235,1244,1245,1246,1247,1240,1241,1242,1243,1255,1254,1253,1252,1251,1250,1249,1248,1263,1262,1261,1260,1259,1258,1257,1256,1270,1271,1268,1269,1266,1267,1264,1265,1278,1279,1276,1277,1274,1275,1272,1273,1153,1152,1155,1154,1157,1156,1159,1158,1161,1160,1163,1162,1165,1164,1167,1166,1168,1169,1170,1171,1172,1173,1174,1175,1176,1177,1178,1179,1180,1181,1182,1183,1187,1186,1185,1184,1191,1190,1189,1188,1195,1194,1193,1192,1199,1198,1197,1196,1202,1203,1200,1201,1206,1207,1204,1205,1210,1211,1208,1209,1214,1215,1212,1213,1350,1351,1348,1349,1346,1347,1344,1345,1307,1306,1305,1304,1311,1310,1309,1308,1299,1298,1297,1296,1303,1302,1301,1300,1290,1291,1288,1289,1294,1295,1292,1293,1282,1283,1280,1281,1286,1287,1284,1285,1337,1336,1339,1338,1341,1340,1343,1342,1329,1328,1331,1330,1333,1332,1335,1334,1320,1321,1322,1323,1324,1325,
1326,1327,1312,1313,1314,1315,1316,1317,1318,1319
        at dk.netarkivet.archive.indexserver.distribute.IndexRequestClient.checkMessageValid(IndexRequestClient.java:298)
        at dk.netarkivet.archive.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:180)
        at dk.netarkivet.archive.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:59)
        at dk.netarkivet.archive.indexserver.FileBasedCache.cache(FileBasedCache.java:164)
        at dk.netarkivet.archive.indexserver.FileBasedCache.getIndex(FileBasedCache.java:232)
        at dk.netarkivet.archive.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:59)
        at dk.netarkivet.harvester.harvesting.HarvestController.fetchDeduplicateIndex(HarvestController.java:416)
        at dk.netarkivet.harvester.harvesting.HarvestController.writeHarvestFiles(HarvestController.java:163)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:642)

Any help would be great!

Sara




