tika-dev mailing list archives

From "Olivier M (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1794) TXTParser removes form feed characters
Date Mon, 16 Nov 2015 10:05:11 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier M updated TIKA-1794:
----------------------------
    Description: 
Just noticed that Apache Tika does not preserve form feed characters (U+000C, byte 0C in UTF-8)
when parsing a text file.

Comparing the hex bytes of the original file with those of the extracted text shows that the
0C byte is replaced by EF BF BD, the UTF-8 encoding of the Unicode replacement character (U+FFFD).

{code:title=Test.java|borderStyle=solid}
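import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.apache.commons.codec.binary.Hex;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class Test {
	public static void main(String[] args) throws Exception {
		// try-with-resources closes the stream, so IOUtils.closeQuietly is not needed
		try (InputStream is = new FileInputStream("form_feed.txt")) {
			AutoDetectParser parser = new AutoDetectParser();
			Writer stringWriter = new StringWriter();
			ContentHandler handler = new BodyContentHandler(stringWriter);
			Metadata metadata = new Metadata();
			parser.parse(is, handler, metadata);

			String extractedText = stringWriter.toString();
			System.out.println(extractedText);

			// Hex-encode the extracted text to inspect what became of the form feed byte
			String hex = Hex.encodeHexString(extractedText.getBytes(StandardCharsets.UTF_8));
			System.out.println(hex); // 0C replaced by EFBFBD
		}
	}
}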
{code}
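
The byte-level comparison described above can be reproduced without Tika. The following is a minimal sketch using only the JDK; `FormFeedHex`, `utf8Hex`, and `containsReplacementChar` are hypothetical names, and the sample strings are made up for illustration rather than taken from form_feed.txt. It shows how a form feed and the replacement character each appear in UTF-8 hex output, and how extracted text could be checked for U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class FormFeedHex {
    // Hex-encode the UTF-8 bytes of a string (lower case, no separators).
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so bytes above 0x7F are not sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical helper: true if the text contains the replacement character.
    static boolean containsReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Assumed sample input: a form feed between two words.
        String original = "page1\fpage2";
        // Assumed stand-in for the extracted text described in the report.
        String observed = "page1\uFFFDpage2";

        System.out.println(utf8Hex(original));   // contains the single byte 0c
        System.out.println(utf8Hex(observed));   // contains efbfbd in its place
        System.out.println(containsReplacementChar(observed));
    }
}
```

Running the same check on the string returned by `stringWriter.toString()` in Test.java would confirm whether the substitution happens before the text is encoded back to UTF-8.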


  was:
Just noticed that Apache Tika does not preserve form feed characters (U+000C, byte 0C in UTF-8)
when parsing a text file.

Comparing the hex bytes of the original file with those of the extracted text shows that the
0C byte is replaced by EF BF BD, the UTF-8 encoding of the Unicode replacement character (U+FFFD).



> TXTParser removes form feed characters
> --------------------------------------
>
>                 Key: TIKA-1794
>                 URL: https://issues.apache.org/jira/browse/TIKA-1794
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>            Reporter: Olivier M
>            Priority: Minor
>              Labels: parser, txt
>         Attachments: form_feed.txt
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
