[Netarchivesuite-users] ExtractorJS and double slashes
Meelis Mihhailov
meelis at nlib.ee
Mon Jun 17 16:56:57 CEST 2013
Hi everyone
I have a problem with websites that generate links to resources with
JavaScript. The crawler ends up extracting URLs like this:
http://www.aki.ut.ee/et/instituut/koostoo/eesti-akadeemiline-ajakirjanduse-selts/sites//all//modules//contrib//jquery_update//replace//jquery//jquery.min.js
(This is just one example of such a link.)
The problem is caused by source like this:
jQuery.extend(Drupal.settings,{"basePath":"\/","pathPrefix":"et\/","ajaxPageState":{"theme":"ut_sh","theme_token":"juS8I_9E4cvz29b2VXNpqasMg1rcCylj6mhtnjl1UO4","js":{"sites\/all\/modules\/contrib\/jquery_update\/replace\/jquery\/jquery.min.js":1,"misc\/jquery.once.js":1,"misc\/drupal.js":1,"sites\/all\/modules\/contrib\/jquery_update\/replace\/ui\/external\/jquery.cookie.js":1,"misc\/form.js":1,"sites\/all\/modules\/contrib\/fb\/fb.js":1,"sites\/all\/modules\/contrib\/administrative\/admin_menu\/admin_devel\/admin_devel.js"
etc.
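To illustrate how the doubled slashes could arise, here is a minimal Java sketch. It rests on an assumption that is not confirmed anywhere in this message: that each backslash in the JSON-escaped "\/" ends up being turned into a forward slash instead of the escape being removed. The class and method names are hypothetical and are not part of Heritrix or NetarchiveSuite.

```java
// Hypothetical illustration only; not Heritrix code.
class EscapedSlashDemo {

    // ASSUMPTION: the escaped path is treated as if every backslash
    // were a path separator, so "\/" becomes "//".
    static String backslashesToSlashes(String escaped) {
        return escaped.replace('\\', '/');
    }

    // What a JSON-aware reader would do instead: "\/" is just "/".
    static String unescapeJsonSlashes(String escaped) {
        return escaped.replace("\\/", "/");
    }

    public static void main(String[] args) {
        // The Java literal below represents the text sites\/all\/modules
        String escaped = "sites\\/all\\/modules";
        System.out.println(backslashesToSlashes(escaped)); // sites//all//modules
        System.out.println(unescapeJsonSlashes(escaped));  // sites/all/modules
    }
}
```

The first output has the same shape as the broken link above, which is why the escaped slashes look like the likely cause.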
As far as my research took me, the problem resides in the ExtractorJS
module, and a fix has been added to the 1.14.5 code. However, 1.14.5
has not been released and never will be :(
For now I have found this workaround (I changed the name and class,
as the original example does not have them):
<newObject name="RemoveSlashes"
    class="org.archive.crawler.extractor.ExtractorImpliedURI">
  <boolean name="enabled">true</boolean>
  <string name="trigger-regexp">(^http.*://.*)//(.*$)</string>
  <string name="build-pattern">$1/$2</string>
  <boolean name="remove-trigger-uris">false</boolean>
</newObject>
but it does not work, and from the regex I can see that it only handles a
single '//' occurrence. (We get a varying number of '//' sequences.)
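The single-occurrence behaviour can be reproduced outside the crawler with plain java.util.regex. The sketch below applies the trigger-regexp and build-pattern from the configuration once, in the way I assume ExtractorImpliedURI does (not verified against its source), and compares the result with a hypothetical alternative pattern that collapses every run of slashes not directly preceded by a colon. The class name, method names, and the shortened test URL are made up for illustration; whether the alternative pattern can be plugged into ExtractorImpliedURI is not tested here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Heritrix code.
class DoubleSlashDemo {

    // Trigger and build patterns copied from the configuration above.
    static final Pattern TRIGGER = Pattern.compile("(^http.*://.*)//(.*$)");
    static final String BUILD = "$1/$2";

    // One match-and-replace pass (ASSUMED to mirror the extractor).
    static String applyOnce(String uri) {
        Matcher m = TRIGGER.matcher(uri);
        return m.matches() ? m.replaceFirst(BUILD) : uri;
    }

    // Alternative: collapse each run of two or more slashes unless the
    // run starts right after ':', which keeps the "http://" prefix intact.
    static String collapseAll(String uri) {
        return uri.replaceAll("(?<!:)/{2,}", "/");
    }

    public static void main(String[] args) {
        String uri = "http://www.aki.ut.ee/et/sites//all//modules//jquery.min.js";
        // Greedy matching means only the LAST '//' is collapsed:
        System.out.println(applyOnce(uri));
        // http://www.aki.ut.ee/et/sites//all//modules/jquery.min.js
        System.out.println(collapseAll(uri));
        // http://www.aki.ut.ee/et/sites/all/modules/jquery.min.js
    }
}
```

Because the first group is greedy, one pass removes only the final '//' in the path, so a URL with several of them still contains double slashes afterwards.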
Has anyone experienced the same problem and can provide me with a
working solution?
Using NetarchiveSuite Version: 3.21.0
Thanks
Meelis Mihhailov
National Library Of Estonia
meelis at nlib.ee