lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Unicode Character Problem
Date Mon, 12 Dec 2016 19:18:47 GMT
> I don't see any weird character when I manually copy it into any text editor.

That's a good diagnostic step, but there's a chance that Adobe (or your viewer) renders the
text correctly while Tika or PDFBox does not extract it correctly.

If you run tika-app on the file [0], do you get the same problem?  See our stub on common
text extraction challenges with PDFs [1] and how to run PDFBox's ExtractText against your
file [2].

[0] java -jar tika-app.jar -i <input_dir> -o <output_dir>
[1] https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
[2] https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems 
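
If the extracted text does show the problem, one way to pin down the odd character is to
dump its code points. The following is a minimal Python sketch, not part of Tika or PDFBox.
The function names, the set of Unicode categories that get flagged, and the idea of scanning
a UTF-8 text file written by tika-app are all assumptions made for illustration; which
character is actually in the file is not known from this thread.

```python
import unicodedata

# Whitespace that is expected in ordinary extracted text and should not be flagged.
ORDINARY_WHITESPACE = {" ", "\n", "\r", "\t"}

# General categories treated as suspicious here (an assumption, adjust as needed):
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def describe_codepoints(text):
    """Return one (char, 'U+XXXX', name) tuple per code point in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def find_suspicious(text):
    """Return (index, 'U+XXXX', category) for characters that look out of place.

    Flags control, format, private-use, and unassigned code points, plus any
    space separator (category Zs) other than the plain ASCII space.
    """
    flagged = []
    for index, ch in enumerate(text):
        if ch in ORDINARY_WHITESPACE:
            continue
        category = unicodedata.category(ch)
        if category in SUSPICIOUS_CATEGORIES or category == "Zs":
            flagged.append((index, f"U+{ord(ch):04X}", category))
    return flagged


def scan_file(path, encoding="utf-8"):
    """Read an extracted-text file (hypothetical path) and return flagged characters."""
    with open(path, encoding=encoding) as handle:
        return find_suspicious(handle.read())
```

Calling describe_codepoints() on the two indexed forms of the word and comparing the results
side by side should show exactly where they differ. If find_suspicious() reports nothing, the
difference may instead be a legitimate but unexpected letter, which the side-by-side dump
would still reveal.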

-----Original Message-----
From: Furkan KAMACI [mailto:furkankamaci@gmail.com] 
Sent: Monday, December 12, 2016 10:55 AM
To: solr-user@lucene.apache.org; Ahmet Arslan <iorixxx@yahoo.com>
Subject: Re: Unicode Character Problem

Hi Ahmet,

I don't see any weird character when I manually copy it into any text editor.

On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan <iorixxx@yahoo.com.invalid>
wrote:

> Hi Furkan,
>
> I am pretty sure this is a PDF extraction issue.
> Turkish characters have caused us trouble in the past when extracting
> text from PDF files.
> You can confirm this by manually copying and pasting from the original PDF file.
>
> Ahmet
>
>
> On Friday, December 9, 2016 8:44 PM, Furkan KAMACI 
> <furkankamaci@gmail.com>
> wrote:
> Hi,
>
> I'm trying to index Turkish characters. This is what I see in my
> index (I see both forms in different places in my content):
>
> aç  klama
> açıklama
>
> These are the same word but they are indexed differently (there is a weird
> character in the first one). I don't see any weird character when I check the
> original PDF file.
>
> What do you think? Is it related to Solr or Tika?
>
> PS: I use text_general as the analyzer for the content field.
>
> Kind Regards,
> Furkan KAMACI
>