[Netarchivesuite-users] Cope with product catalogs, large forums; calendars, file structure loops
Peter.Svanberg at kb.se
Fri Jan 31 11:25:18 CET 2020
Two sorts of sites with a lot of potentially unwanted URLs:
1) Large product catalogues, large forums etc.
2) Calendars (of events, for booking etc.) with no limits, file structure loops
For the first category you can choose to just crawl it, as long as it is finite. But do you, or do you try to limit such sites? How? Another problem with crawling them is that domains with such content often become the only domain left in a job, occupying it for hours or days at a speed of just one URL per second.
The second category must be stopped in some way. You can't reliably tell from the URL alone what is a calendar, and PathologicalPathDecideRule doesn't recognize loops spanning several path segments (.../a/b/c/a/b/c...). TooManyPathSegmentsDecideRule can be used to stop it eventually, but much can happen before that limit is reached. The least bad way to stop it that I can think of is the limit on the number of objects. Or do you have other hints?
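To illustrate the gap being described: PathologicalPathDecideRule catches a single segment repeated in a row (/a/a/a...), but not a repeating cycle of several segments. A minimal sketch of a check that would catch such multi-segment cycles (plain Python, not Heritrix code; function name, cycle length, and repeat threshold are all illustrative choices):

```python
def has_segment_loop(path, max_cycle_len=5, min_repeats=3):
    """Return True if the URL path contains min_repeats consecutive
    repetitions of any segment sequence up to max_cycle_len segments
    long, e.g. /a/b/c/a/b/c/a/b/c."""
    segs = [s for s in path.split("/") if s]
    for n in range(1, max_cycle_len + 1):
        # Slide a window of n segments and test whether the next
        # (min_repeats - 1) windows repeat it back-to-back.
        for start in range(len(segs) - n * min_repeats + 1):
            window = segs[start:start + n]
            if all(segs[start + i * n: start + (i + 1) * n] == window
                   for i in range(min_repeats)):
                return True
    return False

# A three-segment cycle repeated three times is flagged;
# an ordinary deep path is not.
print(has_segment_loop("/a/b/c/a/b/c/a/b/c/page.html"))  # True
print(has_segment_loop("/docs/2020/01/31/page.html"))    # False
```

Inside Heritrix one might approximate the same idea with a reject rule built on MatchesRegexDecideRule and a backreference regex such as `^.*?(/[^/]+(?:/[^/]+)*?)\1\1.*$` — untested here, so treat it as a starting point rather than a recipe.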
National Library of Sweden
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se