[Netarchivesuite-users] ExtractorJS and double slashes

Søren Vejrup Carlsen svc at kb.dk
Thu Jul 11 15:17:56 CEST 2013


Hi Meelis.
No, I don't think we have experienced the same problem here at Netarkivet.
But I have been thinking that it might be quite easy to embed the ExtractorJS module from 1.14.5 in the NetarchiveSuite code base. Could you send me a link to the 1.14.5 codebase so I can see whether that is possible?

If it is, we could make it available in the upcoming NetarchiveSuite 4.2 release in August.

Best Regards

Søren Vejrup Carlsen, Developer of NetarchiveSuite

-----Original message-----
From: netarchivesuite-users-bounces at ml.sbforge.org [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Meelis Mihhailov
Sent: 17 June 2013 16:57
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] ExtractorJS and double slashes

Hi everyone

I have a problem with websites that generate links to resources with JavaScript. The result is URIs like this:

http://www.aki.ut.ee/et/instituut/koostoo/eesti-akadeemiline-ajakirjanduse-selts/sites//all//modules//contrib//jquery_update//replace//jquery//jquery.min.js

(Just an example of the link ... )

The problem is caused by this source:

jQuery.extend(Drupal.settings,{"basePath":"\/","pathPrefix":"et\/","ajaxPageState":{"theme":"ut_sh","theme_token":"juS8I_9E4cvz29b2VXNpqasMg1rcCylj6mhtnjl1UO4","js":{"sites\/all\/modules\/contrib\/jquery_update\/replace\/jquery\/jquery.min.js":1,"misc\/jquery.once.js":1,"misc\/drupal.js":1,"sites\/all\/modules\/contrib\/jquery_update\/replace\/ui\/external\/jquery.cookie.js":1,"misc\/form.js":1,"sites\/all\/modules\/contrib\/fb\/fb.js":1,"sites\/all\/modules\/contrib\/administrative\/admin_menu\/admin_devel\/admin_devel.js" 
etc.

As far as my research took me, I found that the problem resides in the ExtractorJS module and that a fix has been added to the 1.14.5 code. However, 1.14.5 has not been released and never will be :(

At the moment I have found this solution (I made some changes to the name and class, as the original example does not have them):

         <newObject name="RemoveSlashes" class="org.archive.crawler.extractor.ExtractorImpliedURI">

             <boolean name="enabled">true</boolean>

             <string name="trigger-regexp">(^http.*://.*)//(.*$)</string>

             <string name="build-pattern">$1/$2</string>

             <boolean name="remove-trigger-uris">false</boolean>

         </newObject>

but it does not work, and from the regex I can see that it only handles a single '//' occurrence. (We get a random number of '//'.)
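
The single-match limitation of that regex can be illustrated outside the crawler with plain java.util.regex. This is only a minimal sketch: the class name SlashCollapse, the example.org URI and the EXTRA_SLASHES pattern are assumptions made for illustration and are not part of Heritrix or NetarchiveSuite, and nothing here shows how ExtractorImpliedURI applies its settings internally.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix or NetarchiveSuite.
class SlashCollapse {
    // The trigger pattern quoted above. Both ".*" parts are greedy, so one
    // application rewrites only the last "//" in the path.
    static final Pattern TRIGGER = Pattern.compile("(^http.*://.*)//(.*$)");

    // Assumed alternative: any run of two or more slashes that is not
    // directly preceded by ':', so the "//" after the scheme is kept.
    static final Pattern EXTRA_SLASHES = Pattern.compile("(?<!:)/{2,}");

    static String applyOnce(String uri) {
        return TRIGGER.matcher(uri).replaceFirst("$1/$2");
    }

    static String collapseAll(String uri) {
        return EXTRA_SLASHES.matcher(uri).replaceAll("/");
    }

    public static void main(String[] args) {
        String uri = "http://example.org/sites//all//modules//jquery.min.js";
        // prints http://example.org/sites//all//modules/jquery.min.js
        System.out.println(applyOnce(uri));
        // prints http://example.org/sites/all/modules/jquery.min.js
        System.out.println(collapseAll(uri));
    }
}
```

Note that collapseAll would also collapse a '//' inside a query string, which may or may not be wanted.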

Has anyone experienced the same problem and can provide me with a working solution?
I am using NetarchiveSuite version 3.21.0.

Thanks

Meelis Mihhailov
National Library Of Estonia
meelis at nlib.ee


_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
http://ml.sbforge.org/mailman/listinfo/netarchivesuite-users


