tika-dev mailing list archives

From "Andrey Barhatov (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-394) Missing spaces on html parsing
Date Thu, 25 Mar 2010 15:57:27 GMT
Missing spaces on html parsing

                 Key: TIKA-394
                 URL: https://issues.apache.org/jira/browse/TIKA-394
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.6
         Environment: Tomcat 6, Windows XP (Russian locale)
            Reporter: Andrey Barhatov

When parsing HTML code such as:


the resulting text is:


But it should be:

yet city1 city2
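
The expectation above can be turned into a small check on whatever text the parser returns. The following is only a sketch: SpacingCheck and its methods are hypothetical helpers, not part of Tika, and the two sample strings in main are made up for illustration (they are not output captured from Tika). It assumes any run of whitespace counts as a valid separator between words.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: checks that extracted text keeps
// the word boundaries the report expects ("yet city1 city2").
class SpacingCheck {

    // Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    static String normalize(String extracted) {
        return extracted.trim().replaceAll("\\s+", " ");
    }

    // Split the normalized text into whitespace-separated tokens.
    static List<String> tokens(String extracted) {
        String normalized = normalize(extracted);
        if (normalized.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(normalized.split(" "));
    }

    // True when "city1" and "city2" appear as separate, adjacent tokens.
    static boolean optionsSeparated(String extracted) {
        List<String> t = tokens(extracted);
        int i = t.indexOf("city1");
        return i >= 0 && i + 1 < t.size() && t.get(i + 1).equals("city2");
    }

    public static void main(String[] args) {
        // Made-up sample strings, assumed for illustration only.
        System.out.println(optionsSeparated("text\nmore\nyet city1 city2")); // true
        System.out.println(optionsSeparated("text\nmore\nyetcity1city2"));   // false
    }
}
```

Comparing tokens rather than the raw string keeps the check independent of whether the parser emits a space or a newline between elements.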

Code sample:

import java.io.*;
import org.apache.tika.metadata.*;
import org.apache.tika.parser.*;

public class test {

   public static void main(String[] args) throws Exception {
      // Give the auto-detecting parser a content type hint.
      Metadata metadata = new Metadata();
      metadata.set(Metadata.CONTENT_TYPE, "text/html");
      String content = "text<p>more<br>yet<select><option>city1<option>city2</select>";

      InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
      AutoDetectParser parser = new AutoDetectParser();
      Reader reader = new ParsingReader(parser, in, metadata, new ParseContext());
      char[] buf = new char[10000];
      int len;
      StringBuilder text = new StringBuilder();
      // read() returns -1 at end of stream.
      while ((len = reader.read(buf)) != -1) {
         text.append(buf, 0, len);
      }
      reader.close();
      // The original listing is cut off after the append; printing the
      // result here is an assumed ending.
      System.out.println(text);
   }
}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
