manifoldcf-user mailing list archives

From Olivier Tavard <olivier.tav...@francelabs.com>
Subject Re: web connector : links extraction issues
Date Thu, 15 Nov 2018 12:21:53 GMT
Hi Karl,

Thanks for your answer.
Could you elaborate, please? Just so I understand correctly: do you mean that there is no
way to escape these special characters in the MCF code in this case, i.e. the website itself
has to escape the special characters, otherwise link extraction will not work in MCF?
Is that right?
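
To make the question concrete, here is a minimal sketch in Python (standard library only, not MCF code and not how the web connector is implemented) of a link extractor that treats the content of <script> as raw text, so a bare '<' inside the script does not hide the links that follow. The names LinkExtractor, extract_links and PAGE are hypothetical; PAGE is the second test page quoted further down in this thread.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags.

    html.parser handles the content of <script> and <style> as raw
    character data, so 'a<b' inside a script is not read as a tag.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value).
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag found in the given HTML text."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Second test page from this thread: the script body contains a bare '<'.
PAGE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

if __name__ == "__main__":
    print(extract_links(PAGE))
```

With a parser of this kind the page would not need to escape anything itself. Whether the MCF parser can be changed to behave this way is exactly what I am asking.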

Best regards,

Olivier



> On 15 Nov 2018, at 12:57, Karl Wright <daddywri@gmail.com> wrote:
> 
> Hi Olivier,
> 
> You can create a ticket but I don't have a good solution for you in any case.
> 
> Karl
> 
> 
>> On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
>> Hi Karl,
>> 
>> Do you think I should create a Jira issue for this bug, i.e. that link extraction
>> does not work if the code inside JavaScript tags contains special characters
>> such as '>' or '<'?
>> 
>> Thanks,
>> Best regards,
>> 
>> Olivier
>> 
>> 
>> 
>>> On 30 Oct 2018, at 12:05, Olivier Tavard <olivier.tavard@francelabs.com> wrote:
>>> 
>>> Hi Karl,
>>> 
>>> Thanks for your answer.
>>> I kept looking into this and found the problem. The JavaScript code inside
>>> the <script></script> tags contained the character '<'. When it does, link
>>> extraction does not work with the web connector.
>>> 
>>> To reproduce it, I created the page below, hosted it on a local Apache server,
>>> and indexed it with MCF 2.11 out of the box.
>>> 
>>> In the first example, the page was:
>>> <!DOCTYPE html>
>>> 
>>> <head>
>>> <title>test</title>
>>> <meta charset="utf-8" />
>>> <script type="text/javascript"></script>
>>> 
>>> </head>
>>> <body>
>>> 
>>> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>>> </body>
>>> 
>>> Link extraction worked correctly, as the debug log shows:
>>> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
>>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
>>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
>>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
>>>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
>>> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
>>> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
>>> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
>>> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
>>> —
>>> In the second example, the code was almost the same, except that I included
>>> the character '<' in the content of the script tags:
>>> <!DOCTYPE html>
>>> 
>>> <head>
>>> <title>test</title>
>>> <meta charset="utf-8" />
>>> <script type="text/javascript">a<b</script>
>>> 
>>> </head>
>>> <body>
>>> 
>>>     <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>>>     
>>> </body>
>>> 
>>> Link extraction failed; the debug log contains no "found link to" line:
>>> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
>>> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
>>> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
>>> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
>>>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
>>> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
>>> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
>>> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
>>> —
>>> So special characters such as the less-than sign should be escaped in the
>>> web connector code so that link extraction keeps working.
>>> 
>>> Thanks,
>>> Best regards,
>>> 
>>> 
>>> Olivier 
>>> 
>>>> On 29 Oct 2018, at 19:39, Karl Wright <daddywri@gmail.com> wrote:
>>>> 
>>>> Hi Olivier,
>>>> 
>>>> Javascript inclusion in the Web Connector is not evaluated.  In fact, no
>>>> Javascript is executed at all.  Therefore it should not matter what is
>>>> included via javascript.
>>>> 
>>>> Thanks,
>>>> Karl
>>>> 
>>>> 
>>>>> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <olivier.tavard@francelabs.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> Regarding the web connector, I noticed that on some websites, JavaScript
>>>>> code can prevent the web connector from correctly fetching all the links
>>>>> present on the page. Specifically, this happens on websites that include a
>>>>> deprecated version of the New Relic web agent, such as
>>>>> js-agent.newrelic.com/nr-1071.min.js.
>>>>> After downloading the page locally and removing the reference to the
>>>>> New Relic browser agent, the web connector fetched the links on the page
>>>>> correctly. So it seems that the JavaScript injected by the New Relic agent
>>>>> was the reason the links on the page were not fetched.
>>>>> This case is rare and only concerns old versions of the New Relic agent.
>>>>> More generally, though, would it be possible to block JavaScript injection
>>>>> at the connector level during indexing?
>>>>>  
>>>>> Thanks,
>>>>> Best regards,
>>>>> Olivier 
>>>>> 
>>>>> 
>>> 
>> 
