manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: web connector : links extraction issues
Date Thu, 15 Nov 2018 11:57:09 GMT
Hi Olivier,

You can create a ticket but I don't have a good solution for you in any
case.

Karl


On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <
olivier.tavard@francelabs.com> wrote:

> Hi Karl,
>
> Do you think I should create a Jira issue for this bug, i.e. that link
> extraction does not work when code inside script tags contains special
> characters such as '>' or '<'?
>
> Thanks,
> Best regards,
>
> Olivier
>
>
>
> On 30 Oct 2018, at 12:05, Olivier Tavard <olivier.tavard@francelabs.com>
> wrote:
>
> Hi Karl,
>
> Thanks for your answer.
> I kept looking into this and found the problem. The Javascript code
> inside the <script></script> tags contained the character '<'. When it
> does, link extraction does not work with the web connector.
>
> To reproduce it, I created this page, hosted on a local Apache server, then
> indexed it with MCF 2.11 out of the box.
>
> In the first example the page was:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> *<script type="text/javascript"></script>*
>
> </head>
> <body>
>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
>
> Link extraction was correct, as the debug log shows:
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an
> HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For
> http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an
> HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for
> '/testjs/test.html'
>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|
> http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html'
> is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document
> 'http://localhost:8888/testjs/test.html', found link to
> 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content
> exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to
> ingest 'http://localhost:8888/testjs/test.html'
> —
> In the second example, the code was the same except that I included the
> character '<' in the content of the script tags:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> *<script type="text/javascript">a<b</script>*
>
> </head>
> <body>
>
>     <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>
> </body>
>
> Link extraction failed; the debug log contains no "found link" entry:
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an
> HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For
> http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an
> HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for
> '/testjs/test.html'
>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|
> http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html'
> is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content
> exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to
> ingest 'http://localhost:8888/testjs/test.html'
> —
> So special characters such as the less-than sign should be escaped in the
> web connector's code so that link extraction still works.
>
> Thanks,
> Best regards,
>
>
> Olivier
>
> On 29 Oct 2018, at 19:39, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Olivier,
>
> Javascript inclusion in the Web Connector is not evaluated.  In fact, no
> Javascript is executed at all.  Therefore it should not matter what is
> included via javascript.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <
> olivier.tavard@francelabs.com> wrote:
>
>> Hi,
>>
>> Regarding the web connector, I noticed that on specific websites some
>> Javascript code can prevent the web connector from correctly fetching all
>> the links present on the page. Specifically, this happens on websites that
>> include a deprecated version of the New Relic browser agent, such as
>> js-agent.newrelic.com/nr-1071.min.js.
>> After downloading the page locally and removing the reference to the New
>> Relic browser agent, the web connector fetched the links on the page
>> correctly. So it seems that the Javascript injected by the New Relic agent
>> was the reason the links were not fetched.
>> This case is rare and concerns only old versions of the New Relic agent.
>> But more generally, would it be possible to block Javascript injection at
>> the connector level during indexing?
>>
>> Thanks,
>> Best regards,
>> Olivier
>>
>>
>>
>
>
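
For reference: under the HTML specification, the body of a <script> element
is raw text, so a '<' inside it does not open a new tag and the anchor that
follows should still be found. The sketch below checks that expectation
against the two test pages from this thread. It uses Python's standard
html.parser as a stand-in reference parser; this is an assumption made for
illustration only. It is not the ManifoldCF parser and says nothing about
why the web connector fails. The names LinkCollector, extract_links and
PAGE_TEMPLATE are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags.

    html.parser treats the body of <script> as raw text, so a '<'
    inside the script is reported as data rather than as a tag.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value.strip())


def extract_links(html_text):
    """Return all anchor hrefs found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Same structure as the two test pages in the thread; only the
# script body differs.
PAGE_TEMPLATE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">{script_body}</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

empty_script_page = PAGE_TEMPLATE.format(script_body="")
less_than_page = PAGE_TEMPLATE.format(script_body="a<b")

print(extract_links(empty_script_page))
print(extract_links(less_than_page))
```

With this stand-in parser both pages yield the same single link, which is
the result one would expect from link extraction on the second test page.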
