tika-dev mailing list archives

From "Piotr B. (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-343) some parsers produces glued words
Date Mon, 07 Dec 2009 11:51:18 GMT
some parsers produces glued words
---------------------------------

                 Key: TIKA-343
                 URL: https://issues.apache.org/jira/browse/TIKA-343
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.5, 0.6
            Reporter: Piotr B.


Some parsers ignore word/line delimiters.

Document:
"<html><head></head><body>test<br>test</body></html>"
is decoded by HtmlParser to "testtest".

I think the HtmlParser.mapSafeElement method should be extended with:

        if ("BR".equals(name)) return "br";
        if ("DIV".equals(name)) return "div";
        if ("HR".equals(name)) return "hr";
        if ("ADDRESS".equals(name)) return "address";
        if ("FIELDSET".equals(name)) return "fieldset";
        if ("FORM".equals(name)) return "form";
        if ("NOSCRIPT".equals(name)) return "noscript";
        if ("NOFRAMES".equals(name)) return "noframes";
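
The same mapping could also be expressed as a table lookup. This is only a sketch: it assumes, as the lines above suggest, that mapSafeElement receives an upper-case element name and returns its lower-case form, or null when the element is not considered safe. The class SafeElementSketch and the method mapProposedElement are hypothetical names used for illustration, not part of Tika.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SafeElementSketch {
    // The eight element names proposed in this report (upper case, as in the snippet above).
    private static final Set<String> PROPOSED = new HashSet<String>(Arrays.asList(
            "BR", "DIV", "HR", "ADDRESS", "FIELDSET", "FORM", "NOSCRIPT", "NOFRAMES"));

    // Hypothetical helper: returns the lower-case name for a proposed element,
    // or null for any other name (assumed to mean "not mapped").
    static String mapProposedElement(String name) {
        return PROPOSED.contains(name) ? name.toLowerCase(Locale.ENGLISH) : null;
    }

    public static void main(String[] args) {
        System.out.println(mapProposedElement("BR"));
        System.out.println(mapProposedElement("SPAN"));
    }
}
```

A set keeps the list of delimiter-producing elements in one place, so adding another element is a one-word change.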

Also, application/xml documents are parsed by removing unknown tags instead of replacing them
with spaces.
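
The difference between dropping tags and replacing them with spaces can be illustrated with a small standalone sketch. It uses a regular expression rather than a real XML parser and is not Tika code; TagStripSketch, dropTags and tagsToSpaces are hypothetical names, and the sample input is made up for illustration.

```java
import java.util.regex.Pattern;

public class TagStripSketch {
    // Naive tag matcher; good enough for the illustration, not for real XML.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Dropping tags outright glues neighbouring words together.
    static String dropTags(String markup) {
        return TAG.matcher(markup).replaceAll("");
    }

    // Replacing each tag with a space, then collapsing whitespace runs,
    // keeps the words apart.
    static String tagsToSpaces(String markup) {
        return TAG.matcher(markup).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String xml = "<doc><a>test</a><b>test</b></doc>";
        System.out.println(dropTags(xml));     // prints "testtest"
        System.out.println(tagsToSpaces(xml)); // prints "test test"
    }
}
```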



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

