tika-dev mailing list archives

From "Vivek (Jira)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-3024) Extra whitespace appended within a tag element's text
Date Thu, 09 Jan 2020 13:33:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek updated TIKA-3024:
-------------------------
    Description: 
Website: [http://www.thevanitycase.com/about-us.php]

When the Tika parser processes this page, it splits the text of a single tag (a span tag)
into two chunks and sends each chunk separately to crawler4j for content handling. The
content handler appends extra whitespace ("  ") after every chunk of text it receives, as
it normally does, so the whitespace ends up inside the span's text.

Text: "Tel: +91-22-61801700"

Expected text: "<text before this>Tel: +91-22-61801700<text after this>"

Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"

The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span

  was:
Website: [http://www.thevanitycase.com/about-us.php]

When the Tika parser processes this page, it splits the text of a single tag (a span tag)
into two chunks and sends each chunk separately to crawler4j for content handling. The
content handler appends extra whitespace ("  ") after every chunk of text it receives, as
it normally does, so the whitespace ends up inside the span's text.

Text: "Tel: +91-22-61801700"

Expected text: "<text before this>Tel: +91-22-61801700<text after this>"

Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"

The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span

 

Usually, double whitespace is appended between the text of adjacent tag elements. Here,
however, it is appended inside a single element's text, because the parser treats that
text as the content of two different HTML tags.


> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
>                 Key: TIKA-3024
>                 URL: https://issues.apache.org/jira/browse/TIKA-3024
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16, 1.20
>            Reporter: Vivek 
>            Priority: Major
>
> Website: [http://www.thevanitycase.com/about-us.php]
> When the Tika parser processes this page, it splits the text of a single tag (a span tag) into two chunks and sends each chunk separately to crawler4j for content handling. The content handler appends extra whitespace ("  ") after every chunk of text it receives, as it normally does, so the whitespace ends up inside the span's text.
> Text: "Tel: +91-22-61801700"
> Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span
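
The behaviour reported above is consistent with how SAX-style content handlers work: under the SAX ContentHandler contract, a parser may deliver one contiguous run of character data in several characters() calls. The Java sketch below uses only JDK classes to show how a handler that appends two spaces after every call would produce the reported text, and how deferring the separator until endElement() would avoid it. The names ChunkedTextDemo, PerChunkHandler, PerElementHandler and feed are hypothetical, the chunk boundary is taken from the "Actual text" in the report, and none of this is Tika's or crawler4j's actual code. Whether chunked delivery is the real cause of TIKA-3024 is an assumption.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: hypothetical handlers, not Tika's or crawler4j's classes.
class ChunkedTextDemo {

    // Appends two spaces after every characters() call (the suspected behaviour).
    static class PerChunkHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length).append("  ");
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    // Collects all chunks and appends the separator once, when the element closes.
    static class PerElementHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("  ");
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    // Simulates a parser that delivers the text of one <span> in the given chunks.
    static String feed(DefaultHandler handler, String... chunks) {
        try {
            handler.startElement("", "span", "span", new AttributesImpl());
            for (String chunk : chunks) {
                char[] ch = chunk.toCharArray();
                handler.characters(ch, 0, ch.length);
            }
            handler.endElement("", "span", "span");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // Prints "[Tel: +91-22-6180170  0  ]": the separator lands inside the number.
        System.out.println("[" + feed(new PerChunkHandler(), "Tel: +91-22-6180170", "0") + "]");
        // Prints "[Tel: +91-22-61801700  ]": the number stays intact.
        System.out.println("[" + feed(new PerElementHandler(), "Tel: +91-22-6180170", "0") + "]");
    }
}
```

Appending the separator per element rather than per chunk keeps the output independent of where the parser happens to split the text, which is why the second handler is unaffected by the chunk boundary.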



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
