tika-dev mailing list archives

From "Vivek (Jira)" <j...@apache.org>
Subject [jira] [Created] (TIKA-3024) Extra whitespace appended within a tag element's text
Date Thu, 09 Jan 2020 09:42:00 GMT
Vivek created TIKA-3024:
----------------------------

             Summary: Extra whitespace appended within a tag element's text
                 Key: TIKA-3024
                 URL: https://issues.apache.org/jira/browse/TIKA-3024
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.20
            Reporter: Vivek 


Website: [http://www.thevanitycase.com/about-us.php]



When the content of the page is parsed with the Tika parser, extra whitespace ("  ") is inserted
inside the text "Tel: +91-22-61801700". That is:

Expected text: "<text before this>Tel: +91-22-61801700<text after this>"

Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"

The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span

 

Usually, a double whitespace is inserted between the text of adjacent tag elements. Here, however,
the double whitespace is inserted within the text of a single tag element.
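
To make the symptom easier to check, here is a minimal sketch in plain Java (no Tika
dependency) that looks for the expected text in an already extracted string while
tolerating whitespace between any two of its characters. The class name SplitTokenCheck
and the method findWithInteriorWhitespace are hypothetical names chosen for illustration;
they are not part of Tika. If the returned span differs from the expected text, the
token was split by whitespace as described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Tika API.
class SplitTokenCheck {

    /**
     * Searches `extracted` for `expected`, allowing any run of whitespace
     * between consecutive characters of `expected`.
     * Returns the matched span exactly as it appears in `extracted`,
     * or null if the characters do not occur in that order.
     */
    static String findWithInteriorWhitespace(String extracted, String expected) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < expected.length(); i++) {
            if (i > 0) {
                regex.append("\\s*"); // tolerate whitespace between characters
            }
            // Quote each character so '+', '-' and similar are matched literally.
            regex.append(Pattern.quote(String.valueOf(expected.charAt(i))));
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(extracted);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String expected = "Tel: +91-22-61801700";
        // Sample string modelled on the "Actual text" in this report.
        String extracted = "<text before this>Tel: +91-22-6180170  0<text after this>";

        String found = findWithInteriorWhitespace(extracted, expected);
        if (found == null) {
            System.out.println("expected text not found");
        } else if (found.equals(expected)) {
            System.out.println("expected text found intact");
        } else {
            System.out.println("expected text found, but split by whitespace: [" + found + "]");
        }
    }
}
```

In practice the extracted string would come from the Tika output for the page above; the
hard-coded sample only mirrors the strings quoted in this report.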



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
