[Netarchivesuite-devel] FW: [netarchivesuite-Bugs][1172] password protected domain was not harvested
Colin Samuel Rosenthal
csr at statsbiblioteket.dk
Wed Jun 17 11:10:30 CEST 2009
I'm forwarding this from the bug mailing list because I'd like comments on the proposed fix. It
appears (subject to sanity testing) that this bug is caused by max-retries being set too low in our order.xml files. I therefore
propose raising it from 3 to 5 in all order.xml files. If anyone can think of any problems this might cause, please let me know.
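For anyone who wants to try the change on a copy of a template first, here is a minimal sketch of how max-retries could be raised in bulk. It assumes the setting appears as an <integer name="max-retries"> element, as in a Heritrix 1.x crawl order; the sample XML fragment and the function name raise_max_retries are made up for illustration and are not taken from our actual templates.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed order.xml fragment. The element layout is an
# assumption modelled on Heritrix 1.x crawl orders, not a real template.
SAMPLE_ORDER_XML = """<crawl-order>
  <controller>
    <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
      <integer name="max-retries">3</integer>
    </newObject>
  </controller>
</crawl-order>"""


def raise_max_retries(xml_text, new_value=5):
    """Return (updated_xml, number_of_elements_changed).

    Only raises the value: elements already at or above new_value are
    left untouched.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for elem in root.iter("integer"):
        if elem.get("name") != "max-retries":
            continue
        current = int((elem.text or "0").strip())
        if current < new_value:
            elem.text = str(new_value)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


updated, changed = raise_max_retries(SAMPLE_ORDER_XML)
print(changed)
print(updated)
```

Running the function a second time on its own output changes nothing, so it should be safe to apply repeatedly across a directory of templates.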
--
Colin
________________________________________
From: netarchivesuite-bugs at gforge.statsbiblioteket.dk [netarchivesuite-bugs at gforge.statsbiblioteket.dk]
Sent: Wednesday, June 17, 2009 10:46 AM
To: noreply at gforge.statsbiblioteket.dk
Subject: [netarchivesuite-Bugs][1172] password protected domain was not harvested
Bugs item #1172, was opened at 2007-12-18 15:49
You can respond by visiting:
http://gforge.statsbiblioteket.dk/tracker/?func=detail&atid=105&aid=1172&group_id=7
Or by replying to this e-mail entering your response between the following markers:
#+#+#+#+#+#+#+#+#+#+#+#+#+#+#+#+#+
(enter your response here)
#+#+#+#+#+#+#+#+#+#+#+#+#+#+#+#+#+
Status: Open
Priority: 4
Submitted By: Eld Zierau (elzi)
Assigned to: Colin Rosenthal (csrster)
Summary: password protected domain was not harvested
Module: Harvester
Version: None
Duplicate Of:
Status: In progress
Initial Comment:
TEST 1 in "Browse in data from the first event harvest only"
could not see the password-protected page
----------------------------------------------------------------------
>Comment By: Colin Rosenthal (csrster)
Date: 2009-06-17 10:46
Message:
This seems to be a configuration problem with our order.xml files:
See
http://webarchive.jira.com/browse/HER-1376?focusedCommentId=20547&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_20547
----------------------------------------------------------------------
Comment By: Colin Rosenthal (csrster)
Date: 2009-06-15 14:31
Message:
I have added some more information to Soeren's bug description in Heritrix. This is almost certainly a Heritrix problem, so this bug should probably be marked invalid.
What I don't understand is why authentication fails when we give the URL to secret.txt as a seed.
----------------------------------------------------------------------
Comment By: Soeren Vejrup Carlsen (svc)
Date: 2007-12-18 18:49
Message:
It turned out that we had ordered Heritrix to harvest only the front page of the seeds, and then added the secret page as one of the seeds. But Heritrix does not crawl a password-protected seed, even if it has the proper credentials!
In this case, the seeds report states that the seed has been crawled with status 401!
http://webteam.archive.org/jira/browse/HER-1376
With this bug in Heritrix, we can't harvest and show the URL www.kaarefc.dk/private/secret.txt as suggested by the release test.
This is because Heritrix can't parse the link to secret.txt on the /private page properly.
But we can show the page www.kaarefc.dk/private/
----------------------------------------------------------------------
Comment By: Soeren Vejrup Carlsen (svc)
Date: 2007-12-18 15:54
Message:
Maybe an error in the releasetest description.
Needs to be investigated further.
----------------------------------------------------------------------