[Netarchivesuite-users] NAS and HTTP redirections

Mon Jan 25 16:45:39 CET 2010

Hi Sara.

You're right about how NAS currently works. When a domain redirects to another domain, NAS records the redirect as the only object for the original domain and therefore gives it the status "Domain Completed".

For our broad crawls we don't actually allow redirects as new seeds (a Heritrix setting), for two reasons. First, we realized that too many domains redirected to domains outside ".dk" that are of no interest to us. Second, if you allow redirects as new seeds, every redirect to a domain already known to the system (which was the case for most of our redirecting domains) would cause that domain to be harvested more than once, because it was also started as a seed in its own right.

To avoid missing domains that are still important, we crawled only the front pages of all known domains (actually done with a simple bash script and the wget tool), which identified exactly those that redirect to other domains.
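As an illustration only, the front-page check could look roughly like the Python sketch below. It is not the bash/wget script we used. The function names are made up for this example, and treating "www." and other subdomains as the same domain is an assumption; adjust that to your own definition of a domain.

```python
import http.client
import urllib.request
from typing import Optional
from urllib.parse import urlsplit


def normalise_host(host: str) -> str:
    """Lower-case a host name and drop a trailing dot and a leading 'www.'."""
    host = host.lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def external_redirect(seed_domain: str, final_url: str) -> Optional[str]:
    """Return the host the front page ended up on if it lies outside the
    seed domain, otherwise None (assumption: subdomains count as the same domain)."""
    final_host = urlsplit(final_url).hostname
    if final_host is None:
        return None
    final_host = normalise_host(final_host)
    seed = normalise_host(seed_domain)
    if final_host == seed or final_host.endswith("." + seed):
        return None
    return final_host


def fetch_final_url(seed_domain: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch the front page, following HTTP redirects, and return the final URL.
    Returns None on DNS failures, HTTP errors and similar problems."""
    try:
        url = "http://" + seed_domain + "/"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        return None
```

For each known domain you would call fetch_final_url() and pass a non-None result to external_redirect(); the non-None hosts form the list of "other domains".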

The complete list of "other domains" was analysed with GeoIP to find the geographical location of each web server, and those matching "DK Denmark" were added as new domains to our system.
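A minimal sketch of that filtering step, again in Python and again only an illustration. The GeoIP lookup itself is deliberately left as a function you supply (country_of_ip is a hypothetical parameter that maps an IP address to a two-letter country code), since the sketch does not assume any particular GeoIP library or database.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve_ip(host: str) -> Optional[str]:
    """Resolve a host name to an IPv4 address, or None if it does not resolve."""
    try:
        return socket.gethostbyname(host)
    except (OSError, UnicodeError):
        return None


def hosts_in_country(
    hosts: Iterable[str],
    country_of_ip: Callable[[str], Optional[str]],
    wanted: str = "DK",
    resolve: Callable[[str], Optional[str]] = resolve_ip,
) -> List[str]:
    """Keep only the hosts whose web server is located in the wanted country.
    country_of_ip is a placeholder for whatever GeoIP lookup is available."""
    selected = []
    for host in hosts:
        ip = resolve(host)
        if ip is not None and country_of_ip(ip) == wanted:
            selected.append(host)
    return selected
```

The hosts returned here are the candidates to add as new domains; hosts that do not resolve or that are located elsewhere are simply dropped.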

Of course, this exercise should be repeated, most likely before each new broad crawl, and I actually don't think we have done that ourselves.

It would be quite complicated to make NAS aware of redirects and to keep statistics on them, especially because the domains redirected to are most likely not known to the system.

-
Bjarne

-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr
Sent: 25 January 2010 16:38
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk; netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] NAS and HTTP redirections

Hi all,

We're moving forward and getting close to our broad crawl.
We spent a while analyzing stats and stop reasons linked to specific domains within a job and found that HTTP redirects (like bikealot.fr, which goes to bikealot.eu), DNS no-reply (cesar-et-ses-cartes.fr) and HTTP errors (criminologic.fr) are all given "Domain Completed" as the stop reason.

It makes sense for DNS no-reply and HTTP errors, but it's quite different for HTTP redirects, which we want to collect beyond the first step, using the steps system (the "Harvest only domains that were not completely harvest in a previous harvest:" checkbox).

How do you manage crawls for these specific domains?
How do you gather stats on these domains?

Thanks for your help!

Sara

Consider the environment before printing this mail.