lucene-solr-user mailing list archives

From "Allison, Timothy B." <>
Subject RE: Unicode Character Problem
Date Mon, 12 Dec 2016 19:18:47 GMT
> I don't see any weird character when I manually copy it to any text editor.

That's a good diagnostic step, but there's a chance that Adobe (or your viewer) got it right,
and Tika or PDFBox isn't getting it right.

If you run tika-app on the file [0], do you get the same problem?  See our stub on common
text extraction challenges with PDFs [1] and how to run PDFBox's ExtractText against your
file [2].

[0] java -jar tika-app.jar -i <input_dir> -o <output_dir>
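
Once tika-app has written the extracted text, one way to see exactly which character sits in the gap is to list the code points of anything that is not ordinary visible text. The following is a minimal Python sketch and not part of Tika or PDFBox; the function name find_suspect_chars, the set of Unicode categories treated as suspect, and the sample string (which uses a no-break space purely for illustration, since the actual character in the index is unknown) are all assumptions.

```python
import unicodedata

# Unicode general categories treated as "suspect" here (an assumption):
# control, format, private-use, unassigned, and separator characters.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "Zs", "Zl", "Zp"}


def find_suspect_chars(text):
    """Return (index, code point, Unicode name) for each suspect character.

    Ordinary ASCII space, newline, carriage return, and tab are ignored.
    """
    findings = []
    for index, ch in enumerate(text):
        if ch in " \n\r\t":
            continue
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings


# Hypothetical sample: a no-break space where the dotless i should be.
sample = "a\u00e7\u00a0klama"
print(find_suspect_chars(sample))  # [(2, 'U+00A0', 'NO-BREAK SPACE')]
```

To check real output, read one of the files from <output_dir> with open(path, encoding="utf-8").read() and pass the result to find_suspect_chars. If the same code point shows up in the Tika output, the problem is in extraction rather than in Solr's analysis chain.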

-----Original Message-----
From: Furkan KAMACI [] 
Sent: Monday, December 12, 2016 10:55 AM
To:; Ahmet Arslan <>
Subject: Re: Unicode Character Problem

Hi Ahmet,

I don't see any weird character when I manually copy it to any text editor.

On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan <> wrote:

> Hi Furkan,
> I am pretty sure this is a PDF extraction issue.
> Turkish characters have caused us trouble in the past when extracting
> text from PDF files.
> You can confirm by manually copying and pasting from the original PDF file.
> Ahmet
> On Friday, December 9, 2016 8:44 PM, Furkan KAMACI 
> <>
> wrote:
> Hi,
> I'm trying to index Turkish characters. This is what I see in my
> index (both forms appear in different places in my content):
> aç  klama
> açıklama
> These are the same word but indexed differently (there is a weird
> character in the first one). I don't see a weird character when I check
> the original PDF file.
> What do you think about it? Is it related to Solr or Tika?
> PS: I use text_general as the analyzer for the content field.
> Kind Regards,
> Furkan KAMACI