[Netarchivesuite-users] About the crawler traps
bert.wendland at bnf.fr
Tue Aug 12 10:43:33 CEST 2025
Hello Miguel,
It seems that you tried to add every single URL that returned a 404 as a
separate crawler trap. A list of 23,000 entries is far too long. Use a few
regular expressions that cover the common URL patterns instead.
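As a rough illustration of the idea, here is a minimal Python sketch that
collapses a long URL list into a handful of candidate expressions. It assumes
that crawler traps are regular expressions matched against the whole URL, and
that the failing URLs cluster under a few directories. The function name
traps_from_urls and the min_group threshold are hypothetical, chosen only for
this example; they are not part of NetarchiveSuite.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def traps_from_urls(urls, min_group=3):
    """Group URLs by host and first path segment and return one
    candidate regular expression per group with at least min_group URLs."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        first = segments[0] if segments else ""
        groups[(parts.netloc, first)].append(url)

    patterns = []
    for (host, first), members in sorted(groups.items()):
        # Skip small groups and URLs that sit directly at the site root.
        if len(members) < min_group or not first:
            continue
        # Escape literal parts so that "." in the host name is not a wildcard.
        patterns.append(".*" + re.escape(host) + "/" + re.escape(first) + "/.*")
    return patterns


if __name__ == "__main__":
    sample = [
        "http://example.org/calendar/1",
        "http://example.org/calendar/2",
        "http://example.org/calendar/3",
        "http://example.org/about",
    ]
    for pattern in traps_from_urls(sample):
        print(pattern)
```

Each generated expression excludes an entire directory, including pages that
do exist, so review every candidate by hand before adding it as a trap.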
Best regards,
Bert
--
Production engineer for web archiving
Information Systems Department
Bibliothèque nationale de France
Quai François-Mauriac
75706 Paris Cedex 13
Tel.: 01 53 79 45 58
From: "Soleto Ruiz de Clavijo, Miguel" <miguel.soleto at externos.bne.es>
To: "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date: 12/08/2025 09:18
Subject: [Netarchivesuite-users] About the crawler traps
Sent by: "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>
Dear all,
I have a question about crawler traps. Our crawls have returned thousands of
404 responses, and we want to add the affected URLs as traps in the harvest.
However, there are over 23,000 of them, and when I try to save the list, I
get a 502 error. Is there any way to add all these traps?
Thank you very much in advance for your help.
Best regards,
Miguel.