[Netarchivesuite-curator] Netarchive NAS update for August - October
Pérez Morillo, Mar
mar.perez at bne.es
Tue Oct 3 12:57:04 CEST 2017
Dear all,
Here is the update from the BNE:
Our IT team has been working on the implementation of NAS 5 and has installed a complete preproduction environment (connected to CWeb) of NAS 5.3. They ran several tests and confirmed that some problems we had experienced with NAS 4 (especially related to security certificates) have been solved in NAS 5.
We expect to have a complete production installation of NAS 5.3 by the week of October 23rd. Once this version is installed in a production environment, our first task will be to run a domain crawl of the .gal domain (the domain for Galicia). We expect it to finish in about 3 days.
We have also been concentrating on curating our Catalan Politics collection, which was set up as a thematic collection but is in fact a mixture of a thematic and an event collection. We decided to keep it as it was (a thematic collection), while adding new seeds, launching it more frequently and tuning some configurations.
We finally opened access to our web archive at the beginning of July. Online access only shows which captures we hold for each site; the archived content itself can only be consulted on our premises and at the regional libraries with legal deposit competencies. Some of those libraries have also opened this access to their users. Downloading or copying any part of the web archive is not allowed, due to the limitations of our copyright law.
We are also preparing our annual workshop with regional web curators, scheduled for November 20th, to review the state of the art of our collaborative project of web archiving and non-print legal deposit.
Talk to you in a couple of minutes.
Best,
Mar Pérez Morillo
Head of the Online Publications Deposit Management Area
Digital Processes and Services Division
Tel.: 91 516 89 92
Biblioteca Nacional de España
From: Netarchivesuite-curator [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On behalf of Sabine Schostag
Sent: Tuesday, 3 October 2017 10:06
To: netarchivesuite-curator at ml.sbforge.org
Subject: Re: [Netarchivesuite-curator] Netarchive NAS update for August - October
Dear all.
Here is a brief update from KB DK:
We are preparing a two-day workshop for Netarchive curators on harvesting social media. We hope the outcome will be useful for our upcoming event harvest of the local and regional elections on 21 November. We also aim to use BCWeb with external partners for the election event harvest.
The developers are going to have a workshop in the middle of October. The curator wishes are as follows (in order of priority):
· Replay of https-pages in Wayback
· Improvement of Heritrix and integration of supplementary collection tools (e.g. brozzler)
· Introduction of a (technical) collection concept. This will give us the ability to integrate data collected before and without NAS.
· Improvement of Access
· More automated QA
Most likely we will not be able to perform a full two-step broad crawl this year (our last full broad crawl dates from the beginning of 2016), because of our problems with Heritrix 3 Remote Access. We expect to be able to solve this problem with NAS 5.4, which will be implemented after the compression of the archive is finished, at the beginning of 2018.
Since January 2017 we have only harvested about 25 TB.
At the beginning of September 2017, Netarchive was blocked by about 54,000 domains (out of 1.32 million domains).
The implementation of “Web Danica” (automated identification of Danish web content outside .dk) is ongoing.
The migration of documentation from the old “MediaWiki” to Jira is finished.
Talk to you later today [😊]
Best, Sabine
________________________________
From: Netarchivesuite-curator <netarchivesuite-curator-bounces at ml.sbforge.org> on behalf of peter.stirling at bnf.fr
Sent: 2 October 2017 13:55
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for October
Hello all,
There have been several changes in the team over the summer. Pascal Tanésie has arrived as assistant head of the digital legal deposit team, and Vladimir Tybin has joined the team as digital curator. Sophie Derrot has left the BnF to take up a post at the Institut national d'histoire de l'art.
Our second test broad crawl, with the complete seed list, is nearly finished. The amount of data crawled in this test has proved to be higher than our budget estimates, mainly because there is no deduplication for this first broad crawl with H3. We will analyze the figures in detail and adapt the budget accordingly.
We are also using our new infrastructure for the tests: the crawlers are more powerful and faster, but they use more bandwidth. We will therefore need to reduce the number of crawlers from 40 to 35. We had set the duration of each job to 3 days, but this has proved to be too long; for the real crawl it will be between 2 and 2.5 days.
This week we aim to transfer all our crawls onto the new infrastructure, and next week the real broad crawl will start.
Best regards,
The BnF digital legal deposit team