[Netarchivesuite-users] ExtractorJS and double slashes

Meelis Mihhailov meelis at nlib.ee
Mon Jul 22 15:26:46 CEST 2013


Hi Søren and thank you! Looking forward to the 4.2 release :)

Meelis

On 22.07.2013 14:52, Søren Vejrup Carlsen wrote:
> Hi Meelis.
> I did find a version in a distribution called 1.15.5, and have embedded the ExtractorJS found there in the upcoming NetarchiveSuite 4.2 release:
>
> https://sbforge.org/fisheye/browse/NetarchiveSuite/trunk/src/dk/netarkivet/harvester/harvesting/extractor/ExtractorJS.java?r=2723
>
> Best Regards
> Søren
>
> -----Original message-----
> From: netarchivesuite-users-bounces at ml.sbforge.org [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Meelis Mihhailov
> Sent: 13 July 2013 00:22
> To: netarchivesuite-users at ml.sbforge.org
> Subject: Re: [Netarchivesuite-users] ExtractorJS and double slashes
>
> Hi Søren and thanks for the reply
>
> I looked for the source, but it seems that it has been taken down. I remember that I was not actually looking for the source and found the instructions by accident. During some cleanup on the server I removed the copy that I had somehow downloaded from the svn repository (I do not know how to write applications in Java) ... and now I can't find the web page that included the needed instructions.
>
> After some searching I managed to find this link:
> https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/branches/
> but I am not sure if it includes the 1.14.5 version. :(
>
> Meelis Mihhailov
> National Library Of Estonia
> meelis at nlib.ee
>
>
>
>
>
> On 11.07.2013 16:17, Søren Vejrup Carlsen wrote:
>> Hi Meelis.
>> No, I don't think we have experienced the same problem here at Netarkivet.
>> But I have been thinking that it might be quite easy to embed the ExtractorJS module from 1.14.5 in the NetarchiveSuite code base. Could you send me a link to the 1.14.5 codebase so I can see if that is possible?
>>
>> If it is, we could make it available in the upcoming NetarchiveSuite 4.2 release in August.
>>
>> Best Regards
>>
>> Søren Vejrup Carlsen, Developer of NetarchiveSuite
>>
>> -----Original message-----
>> From: netarchivesuite-users-bounces at ml.sbforge.org [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Meelis Mihhailov
>> Sent: 17 June 2013 16:57
>> To: netarchivesuite-users at ml.sbforge.org
>> Subject: [Netarchivesuite-users] ExtractorJS and double slashes
>>
>> Hi everyone
>>
>> I have a problem with websites that generate links to resources with javascript. The result:
>>
>> http://www.aki.ut.ee/et/instituut/koostoo/eesti-akadeemiline-ajakirjanduse-selts/sites//all//modules//contrib//jquery_update//replace//jquery//jquery.min.js
>>
>> (Just an example of the link ... )
>>
>> The problem is caused by this source:
>>
>> jQuery.extend(Drupal.settings,{"basePath":"\/","pathPrefix":"et\/","ajaxPageState":{"theme":"ut_sh","theme_token":"juS8I_9E4cvz29b2VXNpqasMg1rcCylj6mhtnjl1UO4","js":{"sites\/all\/modules\/contrib\/jquery_update\/replace\/jquery\/jquery.min.js":1,"misc\/jquery.once.js":1,"misc\/drupal.js":1,"sites\/all\/modules\/contrib\/jquery_update\/replace\/ui\/external\/jquery.cookie.js":1,"misc\/form.js":1,"sites\/all\/modules\/contrib\/fb\/fb.js":1,"sites\/all\/modules\/contrib\/administrative\/admin_menu\/admin_devel\/admin_devel.js"
>> etc.
>>
>> As far as my research took me, the problem resides in the ExtractorJS
>> module and a fix has been added to the 1.14.5 code. However, 1.14.5
>> has not been released and never will be :(
>>
>> So far I have found this solution (I added the name and class attributes, as the original example does not have them):
>>
>>            <newObject name="RemoveSlashes" class="org.archive.crawler.extractor.ExtractorImpliedURI">
>>                <boolean name="enabled">true</boolean>
>>                <string name="trigger-regexp">(^http.*://.*)//(.*$)</string>
>>                <string name="build-pattern">$1/$2</string>
>>                <boolean name="remove-trigger-uris">false</boolean>
>>            </newObject>
>>
>> but it does not work, and from the regex I can see that it only handles
>> one '//' occurrence. (We get a random number of '//'.)
>>
>> Has anyone experienced the same problem and can provide me with a working solution?
>> Using NetarchiveSuite Version: 3.21.0
>>
>> Thanks
>>
>> Meelis Mihhailov
>> National Library Of Estonia
>> meelis at nlib.ee
>>
>>
>> _______________________________________________
>> NetarchiveSuite-users mailing list
>> NetarchiveSuite-users at ml.sbforge.org
>> http://ml.sbforge.org/mailman/listinfo/netarchivesuite-users
>>
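The trigger-regexp quoted in the original message is anchored at both ends, so a single application can remove at most one '//' (the last one in the URI). A minimal Java sketch of that behaviour, next to a pattern that collapses every run of slashes in the path while leaving the '://' after the scheme alone, might look like the following. The class and method names (SlashCollapser, applyOnce, collapsePathSlashes) are made up for illustration and are not part of Heritrix or NetarchiveSuite. The sketch only exercises the patterns with plain java.util.regex and does not claim to reproduce how ExtractorImpliedURI applies them.

```java
import java.util.regex.Pattern;

class SlashCollapser {
    // Two or more consecutive slashes that are not directly preceded by ':'.
    // The lookbehind keeps the "//" of "http://" untouched.
    private static final Pattern REPEATED_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // The trigger/build pair from the mail, applied once with plain regex replacement.
    static String applyOnce(String uri) {
        return uri.replaceFirst("(^http.*://.*)//(.*$)", "$1/$2");
    }

    // Collapse every run of slashes, but only before the query string or fragment,
    // so that a URL embedded in a parameter value is left alone.
    static String collapsePathSlashes(String uri) {
        int cut = uri.length();
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (c == '?' || c == '#') {
                cut = i;
                break;
            }
        }
        String head = uri.substring(0, cut);
        String tail = uri.substring(cut);
        return REPEATED_SLASHES.matcher(head).replaceAll("/") + tail;
    }

    public static void main(String[] args) {
        String uri = "http://h/a//b//c";
        System.out.println(applyOnce(uri));           // prints http://h/a//b/c
        System.out.println(collapsePathSlashes(uri)); // prints http://h/a/b/c
    }
}
```

Whether the collapsed URI is the one the site actually intended is a separate question. The sketch only normalises the string, it does not fix the way the links are extracted from the escaped JavaScript source.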


