tika-dev mailing list archives

From "Shayan Tabrizi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files
Date Wed, 13 Mar 2013 19:34:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601533#comment-13601533 ]

Shayan Tabrizi commented on TIKA-713:
-------------------------------------

As far as I know, extracting Persian text from PDFs involves some complexity. For example,
text selected in Foxit Reader and other PDF readers is corrupted in most cases. The only
reader I have used that overcomes this problem is Adobe Acrobat. I don't know exactly what
the source of the problem is, but solving it is very important for the Persian community.
I see many people looking for a solution to this problem.
                
> Tika can not parse all of the persian pdf files
> -----------------------------------------------
>
>                 Key: TIKA-713
>                 URL: https://issues.apache.org/jira/browse/TIKA-713
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Ahmad Ajiloo
>         Attachments: Complex.pdf, ebrat.pdf, Simple2.pdf, Simple3.pdf
>
>
> Hello
> I used Tika (within Nutch, of course) to parse some Persian PDF files. Some of the files were
> converted cleanly to plain text, but for others the output was corrupted. I used the ICU4J v4
> library, and the text changed to right-to-left mode, but the problem was not resolved, to the
> point that Tika cannot recognize any character of the input Persian PDF file!
> {quote}
> I copied this text from my PDF file via Document Viewer in Linux; it is clearly Persian text:
> --------------------------
> ‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا
"فباي آلاء ربكما تكذبان" بخواند.‬
> ‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط
"عثمانطه" تقريبا يك نصف صفحه است. (‬
> ‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار
)ع( آمده كه چند چيز براي قوت حافظه مفيد است:‬
> ‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا
آيه الكرسي‬
> ‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک
گردن
> --------------------------
> Tika returns this output:
> --------------------------
>  92   @A   8 * B
>    C9D  !D       ) (?)   =/
>    >
>  
>  (<) ,    8 ;  
>  8 #
>    +  9!: 
>      L
>   #)    4   M() * 0>
>  * -3    IA J  
>   - 2   (+   G
>  H  -1
>  (+ J 5#+C     0T J (+  O - 6    R . (+  O - 5     PH. (+  O -4
> --------------------------
> {quote}
> thanks a lot
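
The corruption reported above (Persian input, output made of Latin letters, digits, and
punctuation) can be detected automatically. The following is a minimal sketch, not part of
Tika or of the original report: the function names `arabic_script_ratio` and
`looks_corrupted` and the 0.5 threshold are assumptions chosen for illustration. It only
measures how many of the extracted letters belong to the Arabic script, which Persian uses.

```python
import unicodedata


def arabic_script_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic-script letters.

    Hypothetical helper: a character counts as Arabic script when its Unicode
    name starts with "ARABIC" (this also covers Persian-specific letters).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


def looks_corrupted(text: str, threshold: float = 0.5) -> bool:
    """Flag extracted text as suspicious when too few letters are Arabic script.

    The threshold is an assumed value; text with no letters at all is flagged too.
    """
    return arabic_script_ratio(text) < threshold


if __name__ == "__main__":
    expected = "خوردن عسل"           # fragment of the text copied from the PDF
    extracted = "(+ J 5#+C 0T J (+ O - 6"  # fragment of the output reported above
    print(looks_corrupted(expected), looks_corrupted(extracted))
```

A check like this only shows that extraction failed for a given file; it does not
identify the cause or repair the text.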

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
