tika-dev mailing list archives

From Paul Jakubik <p...@purediscovery.com>
Subject Faster charset detection or turn off charset detection?
Date Thu, 12 Aug 2010 19:37:55 GMT
Hi,

I'm wondering if there is a way to turn off character set detection when
parsing with the AutoDetectParser, or if there is a way to speed up
character set detection.

I ran a test that converted 52,717 documents to text. The documents were
emails embedded in a .tar file.

With character set detection, the test took 220 seconds. Without character set
detection, the test took 21 seconds and only 6% of that time was spent in
Tika.

According to a profiler, the following methods took the bulk of the runtime
when character set detection was used:
 61.7%  org.apache.tika.parser.txt.CharsetRecog_sbcs$NGramParser.parse
  4.3%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.isLamAlef
  3.1%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.unshapeLamAlef
  2.6%  org.apache.tika.parser.txt.CharsetDetector.setText(byte[])
  2.3%  org.apache.tika.parser.txt.CharsetRecog_mbcs.match

One problem that seems to contribute to this is that every character set is
tested for each document, instead of starting with common character sets and
stopping as soon as an adequate character set is found.
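
To illustrate the "common character sets first, stop early" idea, here is a minimal sketch that uses only the JDK. It is not how Tika's CharsetDetector works: it swaps the n-gram statistics for a much cruder test (does the input decode without errors?), and the class name, the candidate list, and its order are assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

public class EarlyExitCharsetGuess {

    // Hypothetical priority order, cheapest and most common first.
    // ISO-8859-1 accepts every byte sequence, so it acts as a catch-all.
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    /** Returns the first candidate that decodes the data cleanly, if any. */
    public static Optional<Charset> guess(byte[] data) {
        for (Charset candidate : CANDIDATES) {
            if (decodesCleanly(data, candidate)) {
                return Optional.of(candidate); // stop here, skip the rest
            }
        }
        return Optional.empty();
    }

    /** True if every byte sequence in data is valid in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(ascii));
        System.out.println(guess(utf8));
        System.out.println(guess(latin1));
    }
}
```

The trade-off is accuracy: a clean decode only rules charsets out, so this ordering is cheap for mostly-ASCII mail but cannot tell similar single-byte encodings apart the way a statistical detector can.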

To turn off character set detection, I created a new class that is
essentially the TXTParser with character set detection removed. I then
replaced every instance of TXTParser in AutoDetectParser's map of parsers
with a text parser that does not determine the character set.
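
The shape of that workaround can be sketched without depending on Tika. Everything named below is a stand-in invented for the example: `TextExtractor` is not Tika's Parser interface, `DetectingTextExtractor` only plays the role of TXTParser, and the map keyed by MIME type string is an assumption about how the parser registry looks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class FixedCharsetSwap {

    /** Stand-in for a parser interface (hypothetical, not Tika's Parser). */
    public interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** Plays the role of the detecting text parser; detection is stubbed out. */
    public static class DetectingTextExtractor implements TextExtractor {
        @Override
        public String extract(InputStream in) throws IOException {
            // A real implementation would run charset detection here.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Decodes with one fixed charset and never inspects the bytes. */
    public static class FixedCharsetTextExtractor implements TextExtractor {
        private final Charset charset;

        public FixedCharsetTextExtractor(Charset charset) {
            this.charset = charset;
        }

        @Override
        public String extract(InputStream in) throws IOException {
            return new String(in.readAllBytes(), charset);
        }
    }

    /** Replaces every detecting extractor in the map; returns how many were swapped. */
    public static int replaceDetecting(Map<String, TextExtractor> parsers,
                                       TextExtractor replacement) {
        int replaced = 0;
        for (Map.Entry<String, TextExtractor> entry : parsers.entrySet()) {
            if (entry.getValue() instanceof DetectingTextExtractor) {
                entry.setValue(replacement); // value update, not a structural change
                replaced++;
            }
        }
        return replaced;
    }

    public static void main(String[] args) throws IOException {
        Map<String, TextExtractor> parsers = new HashMap<>();
        parsers.put("text/plain", new DetectingTextExtractor());
        parsers.put("application/x-other", in -> "other");

        int swapped = replaceDetecting(
                parsers, new FixedCharsetTextExtractor(StandardCharsets.ISO_8859_1));
        System.out.println("replaced " + swapped + " extractor(s)");

        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(
                parsers.get("text/plain").extract(new ByteArrayInputStream(latin1)));
    }
}
```

Swapping by instance rather than by key catches every MIME type that was mapped to the detecting parser, which matches the "every instance" wording above.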

I'm left with the following questions:
- Can character set detection be sped up?
- If character set detection can't be sped up, is there an easier way to
turn it off?
- If it can't be sped up and there is currently no easier way to turn it
off, could one be added?

Thanks for your help,
Paul
