[Netarchivesuite-users] About the crawler traps

bert.wendland at bnf.fr bert.wendland at bnf.fr
Tue Aug 12 10:43:33 CEST 2025


Hello Miguel,

It seems that you tried to add every single URL that returned a 404 as a 
separate crawler trap. That is far too many entries. Use regular expressions 
instead, so that one expression covers a whole group of similar URLs.
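
To illustrate the idea, here is a minimal sketch in Python. It is not part of 
NetarchiveSuite: the URLs, the function names and the grouping rule (host plus 
first path segment) are all assumptions made up for this example. It only 
shows how a long list of 404 URLs might collapse into a handful of 
expressions, and how to check that every URL is still covered.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_prefix(urls, depth=1):
    """Group URLs by host plus the first `depth` path segments."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = "/".join(segments[:depth])
        groups[(parts.netloc, prefix)].append(url)
    return groups


def trap_regex(host, prefix):
    """Build one expression matching host/prefix and anything below it."""
    return (r"^https?://" + re.escape(host) + "/" + re.escape(prefix)
            + r"([/?].*)?$")


# Hypothetical 404 URLs, invented for illustration only.
urls_404 = [
    "http://www.example.org/calendar/2031/01/",
    "http://www.example.org/calendar/2031/02/",
    "http://www.example.org/calendar/2032/01/",
    "http://www.example.org/old-shop/item?id=1",
    "http://www.example.org/old-shop/item?id=2",
]

groups = group_by_prefix(urls_404)
patterns = [trap_regex(host, prefix) for (host, prefix) in sorted(groups)]

# Every original URL should be matched by at least one expression.
uncovered = [u for u in urls_404
             if not any(re.match(p, u) for p in patterns)]

print(len(urls_404), "URLs reduced to", len(patterns), "expressions")
for p in patterns:
    print(p)
print("uncovered:", uncovered)
```

Expressions this broad also exclude any live pages under the same prefix, so 
each one should be reviewed by hand before it is entered as a trap, and the 
regular expression syntax accepted by your installation should be checked.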

Best regards,
  Bert
-- 
Production engineer for web archiving
Information Systems Department
Bibliothèque nationale de France
Quai François-Mauriac
75706 Paris Cedex 13
Tel.: 01 53 79 45 58




From:    "Soleto Ruiz de Clavijo, Miguel" <miguel.soleto at externos.bne.es>
To:      "netarchivesuite-users at ml.sbforge.org" 
<netarchivesuite-users at ml.sbforge.org>
Date:    12/08/2025 09:18
Subject: [Netarchivesuite-users] About the crawler traps
Sent by: "NetarchiveSuite-users" 
<netarchivesuite-users-bounces at ml.sbforge.org>



Dear all,
 
I have a question about crawler traps. We have identified thousands of URLs 
that return a 404 code in our crawls and want to add them as traps in the 
harvest. However, there are over 23,000 of them, and when I try to save 
them, I get a 502 error. Is there any way to add all these traps?
 
Thank you very much in advance for your help.
 
Best regards,
 
Miguel.
 