tika-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: OutOfMemory exception
Date Tue, 23 Mar 2010 02:34:00 GMT
Hi Sangri,

How big is the XML file you're trying to parse?

If it's large (on the order of hundreds of MB up to GBs), it can certainly take a
while to parse, depending on your underlying machine. If you need to increase the
heap size for Tika, you do it the same way as for any other program running on
the JVM, e.g.:

java -Xms<initial heap size> -Xmx<maximum heap size> -jar tika-app-0.6.jar -g theXMLfile.xml

For example:

java -Xms256m -Xmx512m -jar tika-app-0.6.jar -g theXMLfile.xml


This sets the initial heap to 256 MB and lets it grow to a maximum of 512 MB.
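
If you want to confirm that the heap flags are actually being picked up, a small check like the one below may help. It is only a sketch: the class name HeapCheck and the helper toMegabytes are made up for illustration and are not part of Tika. It relies only on the standard java.lang.Runtime API.

```java
public class HeapCheck {
    // Hypothetical helper: convert a byte count to whole megabytes
    // (1 MB is taken as 1024 * 1024 bytes here).
    public static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will attempt to use; it follows -Xmx.
        System.out.println("Max heap (MB):     " + toMegabytes(rt.maxMemory()));
        // totalMemory() is what the JVM has currently reserved for the heap.
        System.out.println("Current heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("Free in heap (MB): " + toMegabytes(rt.freeMemory()));
    }
}
```

Compile it with javac HeapCheck.java and run it with the same flags you plan to give Tika, e.g. java -Xms256m -Xmx512m HeapCheck. The reported maximum can come out slightly below the -Xmx value depending on the JVM and garbage collector, so treat it as a sanity check rather than an exact figure.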

HTH,
Chris


On 3/22/10 6:53 PM, "sangri" <snaggle.sanga@gmail.com> wrote:

Hello

I'm using Tika for my final year project. I want to parse an XML document
that is very large, around 90 MB. I have Apache Tika 0.6, and when I run the
command:

java -jar tika-app-0.6.jar -g theXMLfile.xml

I see the output on the command prompt, showing the data extracted from the
XML file. But after about 30 minutes, Tika crashes with an OutOfMemory
exception. Can someone help me with this issue? How can I fix this? Is there
a way to set the heap size when running Tika?

Thanks in advance.



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

