[Netarchivesuite-devel] ExtractorJS vs IcelandicExtractorJS

Clara Wiatrowski clara.wiatrowski at gmail.com
Thu Jul 9 17:19:03 CEST 2020


Hi Andreas,

At BnF, we use the new ExtractorJS without any properties:
<bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
</bean>

instead of the Icelandic one:
<bean id="icelandicExtractorJs"
class="dk.netarkivet.harvester.harvesting.extractor.IcelandicExtractorJS">
        <!-- Possible to define this value in NetarchiveSuite GUI -->
        <property name="enabled" value="true" />
        <property name="rejectRelativeMatchingRegexList">
            <list>
                <value>^text/javascript$</value>
                <value>^text/css$</value>
                <value>^a\.[^/]+$</value>
                <value>^div\.[^/]+$</value>
                <!-- E.g. 3.5.0. Very common in some JS libraries for strings of
                     this nature but very unlikely to be a relative URL -->
                <value>^[0-9]\.([0-9]\.)[0-9]$</value>
                <value>^Microsoft\.XMLHTTP$</value>
            </list>
        </property>
    </bean>
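
To illustrate what the reject list above filters out, here is a minimal Java sketch. It only compiles the patterns from the configuration and tests a few sample strings against them. The class name RejectListDemo, the method isRejected and the sample strings are hypothetical, and the assumption that a candidate is dropped when any pattern matches the whole string is mine; the actual logic inside IcelandicExtractorJS may differ.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RejectListDemo {
    // Patterns copied verbatim from the rejectRelativeMatchingRegexList above.
    static final List<Pattern> REJECT = List.of(
        Pattern.compile("^text/javascript$"),
        Pattern.compile("^text/css$"),
        Pattern.compile("^a\\.[^/]+$"),
        Pattern.compile("^div\\.[^/]+$"),
        Pattern.compile("^[0-9]\\.([0-9]\\.)[0-9]$"),
        Pattern.compile("^Microsoft\\.XMLHTTP$"));

    // Assumption: a candidate is rejected if any pattern matches the whole string.
    static boolean isRejected(String candidate) {
        for (Pattern p : REJECT) {
            if (p.matcher(candidate).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical strings of the kind a JS extractor might pick up.
        List<String> samples = List.of(
            "3.5.0", "text/css", "a.button", "Microsoft.XMLHTTP",
            "js/app.js", "10.2.1");
        for (String s : samples) {
            System.out.println(s + " -> " + (isRejected(s) ? "rejected" : "kept"));
        }
    }
}
```

Note that the version-number pattern as written only matches three single-digit components, so a string like 10.2.1 would not be rejected by it.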

I don't know whether the default ExtractorJS has a
rejectRelativeMatchingRegexList property.

Best,
Clara

On Wed, 8 Jul 2020 at 10:04, <aponb at gmx.at> wrote:

> Getting back to yesterday's call: the suggestion for NAS 6.0 is now to go
> back to the default ExtractorJS, i.e. to use
>
> <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
>  </bean>
>
>
> instead of
>
> <bean id="icelandicExtractorJs" class="dk.netarkivet.harvester.harvesting.extractor.IcelandicExtractorJS">
> </bean>
>
> Does the default ExtractorJS have the same properties as the
> IcelandicExtractorJS, such as rejectRelativeMatchingRegexList?
>
> This would be a perfect sample for a NAS-Knowledge-DB!
>
> Regards
>
> a.