tika-dev mailing list archives

From "Ahmad Ajiloo (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-713) Tika can not parse all of the persian pdf files
Date Tue, 13 Sep 2011 06:02:09 GMT

     [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmad Ajiloo updated TIKA-713:
------------------------------

    Attachment: ebrat.pdf

This is a Persian PDF file that Tika cannot parse.

> Tika can not parse all of the persian pdf files
> -----------------------------------------------
>
>                 Key: TIKA-713
>                 URL: https://issues.apache.org/jira/browse/TIKA-713
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Ahmad Ajiloo
>             Fix For: 0.9
>
>         Attachments: ebrat.pdf
>
>
> Hello,
> I used Tika (through Nutch) to parse some Persian PDF files. Some of the files were converted cleanly to plain text, but for others the output was corrupted. I used the ICU4J v4 library, and the text changed to right-to-left mode, but the problem was not resolved: Tika cannot recognize any character of the input Persian PDF file.
> {quote}
> I copied this text from my PDF file via Document Viewer in Linux; it is clearly Persian text:
> --------------------------
> ‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا
"فباي آلاء ربكما تكذبان" بخواند.‬
> ‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط
"عثمانطه" تقريبا يك نصف صفحه است. (‬
> ‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار
)ع( آمده كه چند چيز براي قوت حافظه مفيد است:‬
> ‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا
آيه الكرسي‬
> ‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک
گردن
> --------------------------
> Tika returns this output:
> --------------------------
>  92   @A   8 * B
>    C9D  !D       ) (?)   =/
>    >
>  
>  (<) ,    8 ;  
>  8 #
>    +  9!: 
>      L
>   #)    4   M() * 0>
>  * -3    IA J  
>   - 2   (+   G
>  H  -1
>  (+ J 5#+C     0T J (+  O - 6    R . (+  O - 5     PH. (+  O -4
> --------------------------
> {quote}
> Thanks a lot.
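
The failure described above (extracted text that contains no Persian characters at all) can be detected automatically. The following is only a minimal sketch, not part of Tika or Nutch: the helper name `arabic_script_ratio` is hypothetical, and it assumes that counting letters whose Unicode name starts with "ARABIC" is a good enough signal for Persian text. The two sample strings are short excerpts from the report.

```python
import unicodedata


def arabic_script_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic-script letters.

    Hypothetical helper for illustration; it is not a Tika or Nutch API.
    Persian letters such as PEH and FARSI YEH also carry Unicode names
    that start with "ARABIC", so they are counted here.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


# Excerpt of the text copied from the PDF viewer (expected output).
expected = "هر روز پس از نماز صبح"
# Excerpt of the text returned by the parser (corrupted output).
extracted = "92   @A   8 * B"

print(arabic_script_ratio(expected))
print(arabic_script_ratio(extracted))
```

A ratio near zero for a document known to be Persian suggests that extraction failed at the font or encoding level rather than at the text-direction level, which would be consistent with ICU4J changing the direction but not the characters.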

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
