[Netarchivesuite-users] ExtractorJS and double slashes

Meelis Mihhailov meelis at nlib.ee
Mon Jun 17 16:56:57 CEST 2013


Hi everyone

I have a problem with websites that generate links to resources with 
JavaScript. The harvester ends up with URLs like this:

http://www.aki.ut.ee/et/instituut/koostoo/eesti-akadeemiline-ajakirjanduse-selts/sites//all//modules//contrib//jquery_update//replace//jquery//jquery.min.js

(This is just one example of such a link.)

The problem is caused by this source:

jQuery.extend(Drupal.settings,{"basePath":"\/","pathPrefix":"et\/","ajaxPageState":{"theme":"ut_sh","theme_token":"juS8I_9E4cvz29b2VXNpqasMg1rcCylj6mhtnjl1UO4","js":{"sites\/all\/modules\/contrib\/jquery_update\/replace\/jquery\/jquery.min.js":1,"misc\/jquery.once.js":1,"misc\/drupal.js":1,"sites\/all\/modules\/contrib\/jquery_update\/replace\/ui\/external\/jquery.cookie.js":1,"misc\/form.js":1,"sites\/all\/modules\/contrib\/fb\/fb.js":1,"sites\/all\/modules\/contrib\/administrative\/admin_menu\/admin_devel\/admin_devel.js" 
etc.

As far as I could find out, the problem lies in the ExtractorJS module, 
and a fix has been added to the Heritrix 1.14.5 code. However, 1.14.5 
has not been released and never will be :(

So far I have found this workaround (I filled in the name and class 
attributes myself, as the original example does not have them):

         <newObject name="RemoveSlashes" class="org.archive.crawler.extractor.ExtractorImpliedURI">
             <boolean name="enabled">true</boolean>
             <string name="trigger-regexp">(^http.*://.*)//(.*$)</string>
             <string name="build-pattern">$1/$2</string>
             <boolean name="remove-trigger-uris">false</boolean>
         </newObject>

However, it does not work: judging by the regex, it can only fix a single 
'//' occurrence per URL, whereas our URLs contain a varying number of them.
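
To illustrate the single-occurrence limitation, here is a minimal Java 
sketch. It assumes the trigger regex is applied once per URI with ordinary 
java.util.regex semantics; I have not checked how ExtractorImpliedURI 
applies it internally. The class name SlashCollapseDemo, the shortened 
URL, and the collapseAll() alternative are hypothetical and only for 
illustration; the alternative has not been tested inside Heritrix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlashCollapseDemo {
    // Trigger regex and build pattern copied from the configuration above.
    static final Pattern TRIGGER = Pattern.compile("(^http.*://.*)//(.*$)");
    static final String BUILD = "$1/$2";

    // Assumption: the regex is applied once per URI (a single pass).
    // The greedy ".*" in group 1 runs up to the LAST "//", so only that
    // one occurrence is rewritten and earlier ones stay untouched.
    static String applyOnce(String uri) {
        Matcher m = TRIGGER.matcher(uri);
        return m.matches() ? m.replaceFirst(BUILD) : uri;
    }

    // Hypothetical alternative: collapse every run of two or more slashes
    // that is not directly preceded by ':' (so the scheme's "://" survives).
    static String collapseAll(String uri) {
        return uri.replaceAll("(?<!:)/{2,}", "/");
    }

    public static void main(String[] args) {
        // Shortened, illustrative version of the problem URL.
        String uri = "http://www.aki.ut.ee/et/sites//all//modules//jquery.min.js";
        System.out.println(applyOnce(uri));
        // http://www.aki.ut.ee/et/sites//all//modules/jquery.min.js
        System.out.println(collapseAll(uri));
        // http://www.aki.ut.ee/et/sites/all/modules/jquery.min.js
    }
}
```

In this sketch a single pass removes only the last '//', which matches 
what the regex suggests. Whether a pattern like the one in collapseAll() 
can be expressed through trigger-regexp and build-pattern is a separate 
question, since build-pattern only substitutes captured groups.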

Has anyone experienced the same problem and found a working solution?
We are using NetarchiveSuite version 3.21.0.

Thanks

Meelis Mihhailov
National Library Of Estonia
meelis at nlib.ee



