[Netarchivesuite-devel] ExtractorJS vs IcelandicExtractorJS

aponb at gmx.at aponb at gmx.at
Sun Jul 12 13:20:00 CEST 2020


Hi Clara,

Thanks for your answer.
I am going to start our domain crawl with this setting.

Best regards
a.
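
PS: For anyone reading this in the archive, here is a minimal sketch of what the rejectRelativeMatchingRegexList patterns quoted below would filter out. It is only an illustration in Python, not the extractor's actual Java code: the helper name is_rejected is hypothetical, and I am assuming each pattern is matched against the whole candidate string.

```python
import re

# Patterns copied from the rejectRelativeMatchingRegexList quoted in this thread.
REJECT_PATTERNS = [
    r"^text/javascript$",
    r"^text/css$",
    r"^a\.[^/]+$",
    r"^div\.[^/]+$",
    r"^[0-9]\.([0-9]\.)[0-9]$",
    r"^Microsoft\.XMLHTTP$",
]

_COMPILED = [re.compile(p) for p in REJECT_PATTERNS]


def is_rejected(candidate: str) -> bool:
    """Hypothetical helper: True if any pattern matches the whole candidate.

    Assumption: whole-string matching; the real extractor may differ.
    """
    return any(rx.fullmatch(candidate) for rx in _COMPILED)


if __name__ == "__main__":
    # Strings that look like relative URLs to a speculative JS extractor.
    for s in ["3.5.0", "text/css", "div.header", "Microsoft.XMLHTTP", "js/app.js"]:
        print(s, "rejected" if is_rejected(s) else "kept")
```

Under these assumptions, version strings such as 3.5.0 and MIME types are dropped, while a path like js/app.js is kept. Note that the version pattern only covers single-digit components, so a string like 10.2.1 would not be rejected by it.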

> Hi Andreas,
>
> At BnF, we use the new ExtractorJS without properties:
> <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
> </bean>
>
> instead of the Icelandic one:
> <bean id="icelandicExtractorJs"
> class="dk.netarkivet.harvester.harvesting.extractor.IcelandicExtractorJS">
>          <!-- Possible to define this value in NetarchiveSuite GUI -->
>          <property name="enabled" value="true" />
>          <property name="rejectRelativeMatchingRegexList">
>              <list>
>                  <value>^text/javascript$</value>
>                  <value>^text/css$</value>
>                  <value>^a\.[^/]+$</value>
>                  <value>^div\.[^/]+$</value>
>                  <!-- E.g. 3.5.0. Very common in some JS libraries for strings of
>                  this nature but very unlikely to be a relative URL -->
>                  <value>^[0-9]\.([0-9]\.)[0-9]$</value>
>                  <value>^Microsoft\.XMLHTTP$</value>
>              </list>
>          </property>
>      </bean>
>
> I don't know whether ExtractorJS has a rejectRelativeMatchingRegexList
> property.
>
> Best,
> Clara
>
> On Wed, 8 Jul 2020 at 10:04, <aponb at gmx.at> wrote:
>
>> Getting back to yesterday's call: the suggestion for NAS 6.0 is now to go
>> back to the default ExtractorJS, i.e. to use
>>
>> <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
>>   </bean>
>>
>>
>> instead of
>>
>> <bean id="icelandicExtractorJs" class="dk.netarkivet.harvester.harvesting.extractor.IcelandicExtractorJS">
>> </bean>
>>
>> Does the default ExtractorJS have the same properties as the
>> IcelandicExtractorJS, such as rejectRelativeMatchingRegexList?
>>
>> This would be a perfect sample for a NAS-Knowledge-DB!
>>
>> Regards
>>
>> a.
>>
>> _______________________________________________
>> Netarchivesuite-devel mailing list
>> Netarchivesuite-devel at ml.sbforge.org
>> https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel
>>