[Netarchivesuite-users] Problem with QA

Meelis Mihhailov meelis at nlib.ee
Mon Dec 3 16:17:20 CET 2012


Hi Søren and thank you for the reply.

I've added the seeds.txt and crawler configuration as attachments to 
this mail. I hope this mailing list supports attachments.
I must add that the configuration is not 100% the same as the one that 
came with NAS. I migrated our previous domainscope configuration to it, 
and it has worked well for the last half year.
I have scheduled a NAS restart for later this evening.

We are indeed using 3.21.0 so that we can start archiving in WARC format. 
We have been waiting a long time for this support and just haven't 
updated our configurations yet. It's on the list of things to do while 
moving over to WARC files :)

Meelis Mihhailov
National Library Of Estonia
meelis at nlib.ee




On 3.12.2012 16:43, Søren Vejrup Carlsen wrote:
> Hi Meelis.
> Without more information about your crawl setup (order.xml and seeds.txt) and the domains being harvested,
> we cannot really help you much. Could you send us a couple of the domains (with start seeds) that you're harvesting? Then we could see whether or not the problem is a configuration problem.
>
> BTW, why are you running a development release of NetarchiveSuite? The only reason for running 3.21.0 is its support for WARC as the archival format instead of ARC. If you're not interested in that, you would probably be
> better off running the 3.20 release.
>
> Best Regards
> Søren Vejrup Carlsen
> Developer and QA of NetarchiveSuite
>
> -----Original message-----
> From: netarchivesuite-users-bounces at ml.sbforge.org [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Meelis Mihhailov
> Sent: 3 December 2012 13:40
> To: Netarchive Suite Users
> Subject: [Netarchivesuite-users] Problem with QA
>
> Hi all!
>
> I have a problem with NAS 3.21.0 QA indexing.
> We use two configurations for our crawl, one with max-hops=25 and the other with max-hops=0.
>
> Everything worked fine until now. When we create an index for the crawl in order to do QA, all the main addresses return "not found" errors. That is, addresses such as www.server.com are not found, but all others that point to a resource (.js, .css, images, and files) are displayed OK.
>
> This does not affect the links that are crawled with max-hops=0.
>
> Can anyone help me figure out what is wrong? All logs show that the main domain is crawled. All ARC files contain the content that is fetched when www.server.com is crawled and index segments show that the resource is there and points to a correct ARC file.
>
> At the moment I haven't restarted NAS, as we are currently in the middle of a crawl.
>
>
> Meelis Mihhailov
> National Library Of Estonia
> meelis at nlib.ee
>
> _______________________________________________
> NetarchiveSuite-users mailing list
> NetarchiveSuite-users at ml.sbforge.org
> http://ml.sbforge.org/mailman/listinfo/netarchivesuite-users

-------------- next part --------------
http://www.nyc.estemb.org/
http://www.estemb.it/
http://www.embest.pt/
http://www.estemb.de/
http://www.estemb.ie/
http://www.estemb.ru/
http://www.estemb.hu/
http://www.peterburg.estemb.ru/
http://www.estemb.es/
http://www.estemb.kiev.ua/
http://www.estemb.by/
http://www.estemb.pl/
http://www.estemb.or.jp/
http://www.estemb.cz/
http://www.estemb.be/
http://www.estemb.se/
http://www.estemb.lt/
http://www.estemb.org/
http://www.estemb.lv/
http://www.ceest.sdv.fr/
http://www.estemb.org.tr/
http://www.est-emb.fr/
http://un.estemb.org/
http://www.estemb.no/
http://www.estemb.ca/
http://www.estonia.gov.uk/
http://www.pihkva.estemb.ru/
http://www.estemb.gr/
http://www.estemb.fi/
http://www.estemb.nl/
http://www.estemb.at/
http://www.estemb.dk/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PEAMINE_TIRIMISE_PROFIIL.xml
Type: text/xml
Size: 28846 bytes
Desc: not available
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20121203/df8afaa7/attachment-0001.xml>

