[Netarchivesuite-users] ExtractorJS and double slashes

Søren Vejrup Carlsen svc at kb.dk
Mon Jul 22 13:52:22 CEST 2013


Hi Meelis.
I did find a version in the distribution called 1.15.5, and have embedded the ExtractorJS found there in the upcoming NetarchiveSuite 4.2 release:

https://sbforge.org/fisheye/browse/NetarchiveSuite/trunk/src/dk/netarkivet/harvester/harvesting/extractor/ExtractorJS.java?r=2723

Best Regards
Søren

-----Original message-----
From: netarchivesuite-users-bounces at ml.sbforge.org [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Meelis Mihhailov
Sent: 13 July 2013 00:22
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] ExtractorJS and double slashes

Hi Søren and thanks for the reply

I looked for the source, but it seems to have been taken down. As I remember, I was not actually looking for the source and found the instructions by accident. During some cleanup on the server I removed the copy that I had somehow downloaded from the svn repository (I do not know how to write applications in Java) ... and now I can't find the web page that included the needed instructions.

After some searching I managed to find this link:
https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/branches/
but I am not sure whether it includes the 1.14.5 version. :(

Meelis Mihhailov
National Library Of Estonia
meelis at nlib.ee





On 11.07.2013 16:17, Søren Vejrup Carlsen wrote:
> Hi Meelis.
> No, I don't think we have experienced the same problem here at Netarkivet.
> But I have been thinking that it might be quite easy to embed the ExtractorJS module from 1.14.5 in the NetarchiveSuite code base. Could you send me a link to the 1.14.5 codebase so I can see whether that is possible?
>
> If it is, we could make it available in the upcoming release of NetarchiveSuite 4.2 in August.
>
> Best Regards
>
> Søren Vejrup Carlsen, Developer of NetarchiveSuite
>
> -----Original message-----
> From: netarchivesuite-users-bounces at ml.sbforge.org [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Meelis Mihhailov
> Sent: 17 June 2013 16:57
> To: netarchivesuite-users at ml.sbforge.org
> Subject: [Netarchivesuite-users] ExtractorJS and double slashes
>
> Hi everyone
>
> I have a problem with websites that generate links to resources with JavaScript. The crawler ends up with URLs like this:
>
> http://www.aki.ut.ee/et/instituut/koostoo/eesti-akadeemiline-ajakirjanduse-selts/sites//all//modules//contrib//jquery_update//replace//jquery//jquery.min.js
>
> (Just an example of the link ... )
>
> The problem is caused by this source:
>
> jQuery.extend(Drupal.settings,{"basePath":"\/","pathPrefix":"et\/","ajaxPageState":{"theme":"ut_sh","theme_token":"juS8I_9E4cvz29b2VXNpqasMg1rcCylj6mhtnjl1UO4","js":{"sites\/all\/modules\/contrib\/jquery_update\/replace\/jquery\/jquery.min.js":1,"misc\/jquery.once.js":1,"misc\/drupal.js":1,"sites\/all\/modules\/contrib\/jquery_update\/replace\/ui\/external\/jquery.cookie.js":1,"misc\/form.js":1,"sites\/all\/modules\/contrib\/fb\/fb.js":1,"sites\/all\/modules\/contrib\/administrative\/admin_menu\/admin_devel\/admin_devel.js"
> etc.
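
The paths in the Drupal settings above are JavaScript string literals in which every `/` is escaped as `\/`. One plausible way to end up with `//` is that the backslash is later turned into a forward slash instead of the escape being undone first. The Java sketch below contrasts the two treatments. It is only an illustration under that assumption: the class and method names (`JsSlashDemo`, `backslashToSlash`, `unescapeSlashes`) are hypothetical and are not taken from Heritrix or NetarchiveSuite.

```java
/** Hypothetical illustration; not the actual Heritrix or NetarchiveSuite code. */
public class JsSlashDemo {

    /** Assumed faulty treatment: every backslash becomes a forward slash. */
    public static String backslashToSlash(String literal) {
        return literal.replace('\\', '/');
    }

    /** Undo the JavaScript escape "\/" before the string is treated as a path. */
    public static String unescapeSlashes(String literal) {
        return literal.replace("\\/", "/");
    }

    public static void main(String[] args) {
        // The Java literal below is the text sites\/all\/modules as it appears in the page source.
        String fromPage = "sites\\/all\\/modules";
        System.out.println(backslashToSlash(fromPage));
        System.out.println(unescapeSlashes(fromPage));
    }
}
```

Undoing the escape first yields a path with single slashes, so no later clean-up of `//` would be needed for strings of this kind.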
>
> As far as my research took me, the problem resides in the ExtractorJS module, and a fix has been added to the 1.14.5 code. However, 1.14.5 has not been released and never will be :(
>
> At the moment I have found this solution (I changed the name and class, as the original example does not have them):
>
>           <newObject name="RemoveSlashes" class="org.archive.crawler.extractor.ExtractorImpliedURI">
>               <boolean name="enabled">true</boolean>
>               <string name="trigger-regexp">(^http.*://.*)//(.*$)</string>
>               <string name="build-pattern">$1/$2</string>
>               <boolean name="remove-trigger-uris">false</boolean>
>           </newObject>
>
> but it does not work, and from the regex I can see that it only handles a single '//' occurrence. (We get a random number of '//'.)
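
A pattern of the form `(^http.*://.*)//(.*$)` with the build pattern `$1/$2` rewrites one `//` per match, which is why it falls short when a URI contains several. As a rough sketch of the string transformation that would be needed, the Java snippet below collapses every run of slashes in the path while leaving the `://` after the scheme and any query or fragment untouched. The names `SlashCollapser` and `collapse` are hypothetical; the snippet stands alone and does not show how to plug such logic into a crawl order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Heritrix or NetarchiveSuite. */
public class SlashCollapser {

    // Group 1: scheme, "://" and host; group 2: everything after the host.
    private static final Pattern URI_PARTS =
            Pattern.compile("^(https?://[^/?#]*)(.*)$", Pattern.CASE_INSENSITIVE);

    /** Collapses every run of '/' in the path; query and fragment are left alone. */
    public static String collapse(String uri) {
        Matcher m = URI_PARTS.matcher(uri);
        if (!m.matches()) {
            return uri; // not an http(s) URI: return it unchanged
        }
        String rest = m.group(2);
        // Find where the path ends (first '?' or '#'), if anywhere.
        int cut = rest.length();
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c == '?' || c == '#') {
                cut = i;
                break;
            }
        }
        String path = rest.substring(0, cut).replaceAll("/{2,}", "/");
        return m.group(1) + path + rest.substring(cut);
    }

    public static void main(String[] args) {
        System.out.println(collapse(
                "http://www.aki.ut.ee/et/instituut/koostoo/eesti-akadeemiline-ajakirjanduse-selts/"
                + "sites//all//modules//contrib//jquery_update//replace//jquery//jquery.min.js"));
    }
}
```

Collapsing the slashes only cleans up the path. It does not change which page the relative path was resolved against, so the resulting URL may still not be the one the site intends.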
>
> Has anyone experienced the same problem and can provide me with a working solution?
> Using NetarchiveSuite Version: 3.21.0
>
> Thanks
>
> Meelis Mihhailov
> National Library Of Estonia
> meelis at nlib.ee
>
>
> _______________________________________________
> NetarchiveSuite-users mailing list
> NetarchiveSuite-users at ml.sbforge.org
> http://ml.sbforge.org/mailman/listinfo/netarchivesuite-users
>



