[Netarchivesuite-curator] BnF NAS update for February
geraldine.camile at bnf.fr
Mon Feb 4 16:26:12 CET 2019
Hello,
Our broad crawl finished on 23 December. It represents 2.1 billion
URLs and 106.46 TB. Due to technical difficulties it took a long time: 11
weeks (compared to 6 weeks in 2017). The difficulties came from the new
computer architecture and hardware, the broker and the version of NAS,
which resulted in multiple jobs being created that then failed, and thus
in an overall slowdown of the crawl. We will discuss this subject during
the NAS workshop. The percentage of domains that are fully crawled has
also decreased. We haven't finished analysing this collection, but we've
chosen to focus on the websites published for young people.
We have finished analysing all the 2018 crawl reports. Over the year we
crawled 2.6 billion URLs and 136.15 TB. This is 9 TB less than in 2017,
owing to deduplication: we crawled more in 2018, but with deduplication
applied, especially for the broad crawl. The proportion of the broad crawl
compared to the selective crawls is still growing: the broad crawl
represents 78% of the 2018 collections, compared with 70% in 2017. Our collections
now represent more than 1 Petabyte (1 074.73 TB).
From mid-December to mid-January, we organised an internal workshop to
improve the harvesting of social media (Facebook, Instagram, Twitter). We
are able to crawl Facebook with the same Heritrix template we used for
Twitter, but the quality of the crawl isn't guaranteed: it is
significantly degraded when there are more than 500 accounts in the job,
and it varies widely from one crawl to another (sometimes we crawl
nothing). We basically crawl the homepage, the posts and a lot of
images; it's difficult to know exactly which images we crawl because many
of them are not visible in the Wayback. During the workshop, we also tried
to crawl social media with Umbra. Umbra is very complex to install, and
there's no information exchange between Umbra and NAS: sometimes Umbra
failed and Heritrix continued to collect. However, Umbra allows us to crawl
the images on Instagram that we couldn't crawl with Heritrix. We also
compared the rendering of the web archives in Python Wayback with that in
OpenWayback. The rendering is better with Python Wayback, especially for
Instagram: the images are displayed, whereas in OpenWayback we just get
a white page. For Twitter, scrolling down seems to work in the access
tool (but we must do more tests). For Facebook, however, we hardly noticed
any change.
Best regards,
The BnF digital legal deposit team