nutch-dev mailing list archives

From "Hadjiat Souad (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2074) Javascript link not parsed by JSParseFilter
Date Mon, 03 Aug 2015 10:23:04 GMT
Hadjiat Souad created NUTCH-2074:
------------------------------------

             Summary: Javascript link not parsed by JSParseFilter
                 Key: NUTCH-2074
                 URL: https://issues.apache.org/jira/browse/NUTCH-2074
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.10
            Reporter: Hadjiat Souad
            Priority: Minor


JSParseFilter fails to extract the URL from this link:

javascript:tb_show('','http://dummy.url/3S/FRA/contenus/ext/endeca/html/dummy-page.html?TB_iframe=true&height=310&width=600','');

I ran a JUnit test in debug mode, and it seems that the regular expression JSParseFilter.STRING_PATTERN
matches only ',' (the comma between two quote characters) and never captures the URL.
As I'm not very good with regular expressions, I can't propose a patch.
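
The behaviour described above can be reproduced outside Nutch with a small sketch. Both patterns below are hypothetical illustrations and are not copied from the Nutch source: NON_EMPTY assumes a string-literal pattern that requires at least one character between the quotes, which is one plausible way to end up matching ',' only; ALLOW_EMPTY is a possible alternative that also accepts empty literals such as ''. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStringDemo {

    // Hypothetical pattern: a quote, one or more non-quote/non-space chars, the same quote.
    // Because the body must be non-empty, '' can never match as a literal.
    static final Pattern NON_EMPTY = Pattern.compile("([\"'])([^\\s\"']+?)\\1");

    // Hypothetical alternative: the body may be empty, so '' is consumed as a literal
    // and the following quote correctly opens the URL string.
    static final Pattern ALLOW_EMPTY = Pattern.compile("([\"'])([^\"']*)\\1");

    // Collects the text captured between the quotes for every match.
    static List<String> extract(Pattern pattern, String js) {
        List<String> literals = new ArrayList<>();
        Matcher m = pattern.matcher(js);
        while (m.find()) {
            literals.add(m.group(2));
        }
        return literals;
    }

    public static void main(String[] args) {
        String js = "tb_show('','http://dummy.url/3S/FRA/contenus/ext/endeca/html/"
                + "dummy-page.html?TB_iframe=true&height=310&width=600','');";
        System.out.println("non-empty body required: " + extract(NON_EMPTY, js));
        System.out.println("empty body allowed:      " + extract(ALLOW_EMPTY, js));
    }
}
```

With NON_EMPTY, the closing quote of the first '' is taken as an opening quote, so the only captures are the two commas. With ALLOW_EMPTY, the captures are an empty string, the URL, and another empty string. This sketch ignores escaped quotes and literals that contain the other quote character, so it is not a drop-in patch.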

The complete HTML element is:
<a class="last" href="javascript:tb_show('','http://dummy.url/3S/FRA/contenus/ext/endeca/html/dummy-page.html?TB_iframe=true&height=310&width=600','');">Dummy url</a>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
