tika-dev mailing list archives

From Oleg Tikhonov <o...@apache.org>
Subject Re: Having Problem in Word Count and Language Detection
Date Sat, 26 Oct 2013 19:09:52 GMT
This one is better:
https://issues.apache.org/jira/browse/TIKA-546
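
For readers unfamiliar with the term "N-gram profile" used in the quoted reply below, here is a minimal sketch of the idea, assuming character trigrams and plain frequency counts. It is not Tika's actual implementation; the class and method names (NgramProfileSketch, buildProfile) are hypothetical and only for illustration. A language profile of this kind is only as good as the text it was built from, which is why a poorly trained Chinese profile could give poor results.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Apache Tika.
class NgramProfileSketch {

    // Count every run of n consecutive characters (Unicode code points) in the text.
    static Map<String, Integer> buildProfile(String text, int n) {
        int[] codePoints = text.codePoints().toArray();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            String gram = new String(codePoints, i, n);
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "banana" contains the trigrams ban, ana, nan, ana.
        System.out.println(buildProfile("banana", 3));
        // Works on code points, so Chinese text is handled character by character.
        System.out.println(buildProfile("中文中文", 3));
    }
}
```

Comparing such a table for an unknown document against tables built from known-language training text is the general approach behind N-gram language detection.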



On Sat, Oct 26, 2013 at 10:05 PM, Oleg Tikhonov <oleg@apache.org> wrote:

> Hi Animesh,
> My wild guess is that the N-gram profile for Chinese wasn't trained very
> well. Try recreating the Chinese language profile.
>
> Have a look here:
>
> http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html
>
> Hope it helps.
>
>
> On Sat, Oct 26, 2013 at 8:48 PM, Chris Mattmann <mattmann@apache.org> wrote:
>
>> Hi Animesh,
>>
>> Please detail your issue here on dev@tika.apache.org and I'm sure
>> someone can help.
>>
>> Cheers,
>> Chris
>>
>>
>> -----Original Message-----
>> From: Animesh Kumar <animesh.sarag@gmail.com>
>> Date: Wednesday, October 23, 2013 9:15 PM
>> To: "dev-owner@tika.apache.org" <dev-owner@tika.apache.org>
>> Subject: Fwd: Having Problem in Word Count and Language Detection
>>
>> >
>> >
>> >Sir/Madam,
>> >I am developing web-based software that uses Apache Tika to get the
>> >language and word count of an uploaded file. It works fine for English,
>> >Japanese, Hindi, etc., but gives the wrong word count for Chinese. I am
>> >using tika-app-1.4.jar.
>> >There is also another problem with word counting for file formats other
>> >than doc and docx.
>> >
>> >
>> >--
>> >With Thanks & Regards
>> >Animesh Kumar
>> >+918927992397 <tel:%2B918927992397>
>> >
>>
>>
>>
>
