tika-dev mailing list archives

From "Anirban Mitra (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB
Date Thu, 17 Nov 2011 21:16:52 GMT

    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152329#comment-13152329 ]

Anirban Mitra commented on TIKA-734:
------------------------------------

Hello,

I am using the following code.

		constructor()
		{
			this.context = new ParseContext();
			this.parser = new AutoDetectParser();
			this.context.set(Parser.class, parser);
			this.outputStream = argOutputStream;
			this.fileInputStream = argIp;
		}

		function convert()
		{
			Metadata metadata = new Metadata();
			metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
			// outputStream is a PipedOutputStream
			BodyContentHandler contentHandler = new BodyContentHandler(this.outputStream);
			parser.parse(fileInputStream, contentHandler, metadata, context);
		}

I parse this way because I want to attach a PipedInputStream to the
PipedOutputStream so that parsing and processing can overlap. While Tika reads
the file, it passes the parsed content to the piped stream, and another thread
picks up the text from the pipe and starts processing it. The idea is that to
parse a 30 MB file I do not need to wait for Tika to parse the complete file;
instead, it can keep parsing small chunks of the file and send them on for
processing by other threads.
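
For reference, the producer/consumer arrangement described above can be sketched with the JDK classes alone. This is only an illustration under stated assumptions: the class name PipedParseSketch and the methods produce() and runPipeline() are hypothetical, and produce() merely stands in for the parser.parse(...) call shown earlier, so that the sketch runs without Tika on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PipedParseSketch {

    // Hypothetical stand-in for the parser: writes "extracted text" to the pipe.
    // In the real code this is where parser.parse(...) with a
    // BodyContentHandler wrapping the PipedOutputStream would run.
    static void produce(PipedOutputStream out, List<String> chunks) throws IOException {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            for (String chunk : chunks) {
                writer.write(chunk);
                writer.write('\n');
                writer.flush(); // make each chunk visible to the reader right away
            }
        } // closing the writer closes the pipe and signals end-of-stream
    }

    // Runs the producer on its own thread and consumes the text on the caller's thread.
    static List<String> runPipeline(List<String> chunks) throws Exception {
        PipedInputStream in = new PipedInputStream(64 * 1024); // pipe buffer size
        PipedOutputStream out = new PipedOutputStream(in);

        Thread producer = new Thread(() -> {
            try {
                produce(out, chunks);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }, "parser-thread");
        producer.start();

        List<String> received = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                received.add(line); // downstream processing would start here
            }
        }
        producer.join();
        return received;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runPipeline(List.of("row 1", "row 2", "row 3")));
    }
}
```

The writer and the reader must run on different threads, since a piped stream pair used from a single thread can deadlock once the pipe buffer fills. The arrangement can only overlap parsing with processing if the producer emits text incrementally; if a parser has to read most of the file before it emits any text, the consumer will still wait for roughly the whole parse.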

Still, I am not seeing much improvement in elapsed time. Do you have any
suggestions on the way I am using Tika? Is this a correct way of using it?

I am not using tika.parseToString() because it returns the whole parsed text at
once, and until then the other threads would be blocked.

I hope this explains my issue. I would appreciate a response.


Thanks
Anirban

		


                
> Out of memory exception with Xlsx file less than 5 MB
> -----------------------------------------------------
>
>                 Key: TIKA-734
>                 URL: https://issues.apache.org/jira/browse/TIKA-734
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows Vista, JUnit test cases running in RAD, JVM heap memory - 500MB
>            Reporter: Anirban Mitra
>         Attachments: Sample BIG Excel 2007 File.xls
>
>
> I am trying to parse and extract a pattern from Xlsx files. I tried using a 5 MB file,
> and when I run my JUnit test cases, they fail with a heap out-of-memory exception.
> Is there any resolution for this?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
