[Netarchivesuite-users] Cope with product catalogs, large forums; calendars, file structure loops

Peter Svanberg Peter.Svanberg at kb.se
Fri Jan 31 11:25:18 CET 2020


Two sorts of sites with a lot of potentially unwanted URLs:

1)      Large product catalogues, large forums etc.

2)      Calendars (of events, for booking etc.) with no limits, file structure loops

For the first category you can choose to just crawl it all, as long as it is finite. But do you, or do you try to limit such sites? How? Another problem with crawling everything is that domains with such content often become the only domain left in a job, occupying it for hours or days at a speed of just one URL per second.
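One partial mitigation for a single slow domain hogging a job might be Heritrix 3's per-queue budgeting, which retires a host queue once it has spent its URI budget. A sketch for crawler-beans.cxml, assuming the stock BdbFrontier and its queueTotalBudget/balanceReplenishAmount properties (the values here are arbitrary examples; check the property names against your Heritrix version):

```xml
<!-- Hypothetical values: retire any one host's queue after ~20000 URIs
     so it cannot occupy the job indefinitely. -->
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
  <!-- Each fetched URI costs against the queue's budget; once total
       spend exceeds this, the queue is retired. -1 means no limit. -->
  <property name="queueTotalBudget" value="20000"/>
  <!-- Budget replenished each time a queue becomes active. -->
  <property name="balanceReplenishAmount" value="3000"/>
</bean>
```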
The second category must be stopped in some way. You can't safely tell from the URL alone what is a calendar, and PathologicalPathDecideRule doesn't recognize loops spanning several segments (.../a/b/c/a/b/c...). TooManyPathSegmentsDecideRule can be used to stop it eventually, but much can happen before that. The least bad way I can think of to stop it is the limit on the number of objects. Or do you have other hints?
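To make the multi-step loop concrete: PathologicalPathDecideRule only counts the same single segment repeated consecutively, whereas the case above repeats a whole run of segments. A minimal standalone sketch (not Heritrix code; the class and method names are made up) of detecting such a trailing cycle in a URL path:

```java
import java.util.Arrays;

/** Hypothetical helper: flag URL paths whose trailing segments repeat as
 *  a cycle of one or more segments, e.g. /a/b/c/a/b/c/a/b/c. */
public class PathLoopDetector {

    /** Return true if the path ends with at least minRepeats consecutive
     *  copies of some sequence of path segments. */
    public static boolean looksLikeLoop(String path, int minRepeats) {
        String[] seg = path.replaceAll("^/|/$", "").split("/");
        int n = seg.length;
        // Try every candidate cycle length that could repeat minRepeats times.
        for (int len = 1; len <= n / minRepeats; len++) {
            int repeats = 1;
            // Count consecutive copies of the last `len` segments at the tail.
            while ((repeats + 1) * len <= n
                    && Arrays.equals(
                        Arrays.copyOfRange(seg, n - len, n),
                        Arrays.copyOfRange(seg, n - (repeats + 1) * len,
                                                n - repeats * len))) {
                repeats++;
            }
            if (repeats >= minRepeats) {
                return true;
            }
        }
        return false;
    }
}
```

A custom DecideRule wrapping a check like this could reject such URIs much earlier than TooManyPathSegmentsDecideRule would, at the cost of occasionally rejecting a legitimately repetitive path.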


Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se
