[Netarchivesuite-curator] BnF NAS update for July

alexandre.chautemps at bnf.fr
Mon Jul 6 15:04:28 CEST 2020


Dear all, 

At the end of June, we put the new version of NAS (6.0.0) into production, 
together with the official IIPC version of Heritrix (3.4.0-20200518). With 
this upgrade, we intend to improve the quality and completeness of our 
crawls. The new version of Heritrix includes contributions from the BnF IT 
team's developers: handling of the "data" attribute in image tags, and 
harvesting of files hosted on SFTP servers, not only on FTP servers. With 
the new JavaScript extractor and the inclusion of "data" attributes, we 
expect a significant improvement in the harvesting of images, especially 
on responsive websites. In addition, the new version of Heritrix allows 
parallelization of queues, so we expect faster and more complete 
harvesting of social network accounts, Twitter in particular. In the 
coming weeks, we plan to compare jobs run with the previous and the new 
versions of Heritrix, to assess whether these improvements materialize.

The second round of the local elections was held on 28 June. Since the 
beginning of June, our elections crawl has been running on its initial 
schedule again: social networks are crawled twice a day and other websites 
twice a month. The crawl will continue until mid-July to cover the setup 
of the new city councils and the investiture of the mayors.

Best regards,

The BnF digital legal deposit team
