[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrés Aguilar-Umaña updated TIKA-1373:
---------------------------------------
Description:
When using the AutoDetectParser in Java code, and the SourceCodeParser is selected (e.g. for
Java files), the handler receives no text.
I have this test program:
{code}
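import java.io.ByteArrayInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
// Write limit of 50 characters; the input is shorter than that.
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Content type hint intended to select the SourceCodeParser.
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());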
{code}
It returns (using the SourceCodeParser):
{code} > Text extracted: {code}
But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Only difference from the first program: the content type hint.
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
The Text Parser is used and I get:
{code} > Text extracted: public class HelloWorld {} {code}
I have also tested this command:
{code}
> java -jar tika-app-1.5.jar -t D:\text.java
(no text)
{code}
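BodyContentHandler only passes on what arrives inside the XHTML body element, so an empty result could mean either that the parser emits no character events at all, or that it emits them somewhere the handler does not capture. A minimal sketch of a diagnostic handler that would distinguish the two cases is below. `SaxEventProbe` is a hypothetical name and not part of Tika; it uses only the SAX classes shipped with the JDK, and the `main` method simulates the events by hand rather than calling a real parser.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler (not part of Tika): records every
// element start and every characters() call it receives, unfiltered.
class SaxEventProbe extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder elements = new StringBuilder();
    private int characterEvents = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Keep element names in arrival order, separated by spaces.
        if (elements.length() > 0) {
            elements.append(' ');
        }
        boolean noLocalName = localName == null || localName.isEmpty();
        elements.append(noLocalName ? qName : localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Count the event and keep the text, wherever it occurs in the document.
        characterEvents++;
        text.append(ch, start, length);
    }

    String text() { return text.toString(); }
    String elements() { return elements.toString(); }
    int characterEvents() { return characterEvents; }

    public static void main(String[] args) throws Exception {
        // Simulated event stream; a real run would pass the probe to a parser instead.
        SaxEventProbe probe = new SaxEventProbe();
        probe.startDocument();
        probe.startElement("", "body", "body", new AttributesImpl());
        char[] chars = "public class HelloWorld {}".toCharArray();
        probe.characters(chars, 0, chars.length);
        probe.endElement("", "body", "body");
        probe.endDocument();
        System.out.println("Elements seen: " + probe.elements());
        System.out.println("characters() calls: " + probe.characterEvents());
        System.out.println("Text seen: " + probe.text());
    }
}
```

In the test program, an instance of this probe could be passed as the second argument of `autoDetectParser.parse(...)` in place of `bch`. If `characterEvents()` stays at zero for "text/x-java-source", the parser is producing no text events; if it is non-zero, the text is being produced but not reaching the BodyContentHandler.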
was:
When using the AutoDetectParser in java code, and the SourceCodeParser is selected (i.e. java
files), the handler gets no text:
I have this test program:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("Text extracted: "+bch.toString())
{code}
It returns (using the SourceCodeParser):
{code} > Text extracted: {code}
But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); }
System.out.println("Text extracted: "+bch.toString())
{code}
The Text Parser is used and I get:
{code} > Text extracted: public class HelloWorld {} {code}
I have also tested this command:
{code}
> java -jar tika-app-1.5.jar -t D:\text.java
(no text)
{code}
> AutoDetectParser extracts no text when SourceCodeParser is selected
> -------------------------------------------------------------------
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Andrés Aguilar-Umaña
--
This message was sent by Atlassian JIRA
(v6.2#6252)