[Netarchivesuite-users] Cope with product catalogs, large forums; calendars, file structure loops

Peter Svanberg Peter.Svanberg at kb.se
Tue Feb 4 11:22:02 CET 2020


After a couple of days of crawling: the trick of limiting the number of objects didn't work. We have several jobs which remain for days working with just one domain, until they reach the limit (now 50,000 objects).

How do you solve that? Or does NAS solve it after the first crawl by putting all those domains in the same job (through the ambitious sorting and selection algorithms)?

Regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

From: Peter Svanberg
Sent: 31 January 2020 11:25
To: netarchivesuite-users at ml.sbforge.org
Subject: Cope with product catalogs, large forums; calendars, file structure loops

Hello,

There are two sorts of sites with a lot of potentially unwanted URLs:

1) Large product catalogues, large forums etc.

2) Calendars (of events, for booking etc.) with no limits, file structure loops

For the first category you can choose to just crawl it, as long as it is finite. But do you do that, or do you try to limit such sites? How? Another problem with crawling it is that a domain with such content often becomes the only domain left in a job, occupying it for hours or days at a speed of just one URL per second.
The second category must be stopped in some way. You can't reliably tell from the URL what is a calendar, and PathologicalPathDecideRule doesn't recognize loops spanning several steps (.../a/b/c/a/b/c...). TooManyPathSegmentsDecideRule can be used to stop it eventually, but much can happen before that. The least bad way to stop it that I can think of is the limit on the number of objects. Or do you have other hints?
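
To make the multi-step loop case concrete, here is a minimal sketch in Python of the kind of check I have in mind. It is not a Heritrix rule and not something NAS provides; the function name and the parameters min_repeats and max_cycle_len are my own assumptions, and it would have to be ported to a Java DecideRule to be usable in a crawl.

```python
from urllib.parse import urlsplit


def has_repeating_path_cycle(url, min_repeats=3, max_cycle_len=4):
    """Return True if the URL path contains a block of 1..max_cycle_len
    consecutive segments that occurs at least min_repeats times in a row.

    Hypothetical helper for illustration only; the thresholds are guesses.
    """
    # Split the path into non-empty segments, e.g. /a/b/c/ -> ["a", "b", "c"].
    segments = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segments)
    for cycle_len in range(1, max_cycle_len + 1):
        # Only start positions that leave room for min_repeats full blocks.
        for start in range(0, n - cycle_len * min_repeats + 1):
            block = segments[start:start + cycle_len]
            repeats = 1
            pos = start + cycle_len
            # Count how many times the same block follows immediately.
            while segments[pos:pos + cycle_len] == block:
                repeats += 1
                pos += cycle_len
            if repeats >= min_repeats:
                return True
    return False


if __name__ == "__main__":
    print(has_repeating_path_cycle("http://example.com/a/b/c/a/b/c/a/b/c/page.html"))
    print(has_repeating_path_cycle("http://example.com/shop/category/item/42"))
```

A check like this only catches loops where the path segments literally repeat, so it would not help with calendars, where each URL typically differs by a date.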

Regards,
-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

