manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1
Date Thu, 31 Mar 2011 12:35:32 GMT

It might be worth cross-posting this to the Tika user or dev list.
Jukka Zitting is one of the principal Tika developers and he's also a
committer for MCF, but I'm not sure he'll notice it go by otherwise.

In case you're wondering how to update the MCF FAQ, it's in the Wiki
so all you need to do is sign up and you'll be able to update it.
https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ

Karl

On Thu, Mar 31, 2011 at 6:59 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
> Oh, there's more, unfortunately. Some of the Tika dependencies need to be
> updated further. I couldn't parse the date from PDF documents correctly. I'm
> not quite sure which of the extraction libraries is causing this problem
> (probably pdfbox). Anyway, I can now extract content from the following
> document formats without any problems:
> - HTML
> - RTF
> - DOC
> - DOCX
> - ODT
> - XLSX
> - XLS
> - SXW
> - PDF
>
> I'm using the following jars:
> apache-solr-cell-1.4.2-dev.jar
> asm-3.1.jar
> bcmail-jdk15-1.45.jar
> bcprov-jdk15-1.45.jar
> boilerpipe-1.1.0.jar
> commons-compress-1.1.jar
> commons-logging-1.1.1.jar
> dom4j-1.6.1.jar
> fontbox-1.3.1.jar
> geronimo-stax-api_1.0_spec-1.0.1.jar
> icu4j-4_6.jar
> jempbox-1.3.1.jar
> metadata-extractor-2.4.0-beta-1.jar
> netcdf-4.2.jar
> pdfbox-1.3.1.jar
> poi-3.7.jar
> poi-ooxml-3.7.jar
> poi-ooxml-schemas-3.7.jar
> poi-scratchpad-3.7.jar
> rome-0.9.jar
> tagsoup-1.2.jar
> tika-core-0.8.jar
> tika-parsers-0.8.jar
> xercesImpl-2.8.1.jar
> xml-apis-1.0.b2.jar
> xmlbeans-2.3.0.jar
>
> But I still have some problems with PDF documents[1]. I'm not sure whether
> it is a pdfbox bug, but Norwegian characters like æ, ø and å cannot be
> displayed correctly after Solr has indexed the document. The characters are
> replaced by a question mark.
>
> [1] http://ridder.uio.no/dokument.pdf
>
> Erlend
>
> On 30.03.11 18.09, Karl Wright wrote:
>>
>> Certainly it makes sense to start with the FAQ, especially for places
>> where you are tripping over known bugs.  We can always do a site page
>> later.
>>
>> Thanks!
>> Karl
>>
>> On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>>
>>> On 30.03.11 18.00, Karl Wright wrote:
>>>>
>>>> It would be great if this information went at least into the FAQ, and
>>>> even better if we added a page to the site documentation.  I'm
>>>> thinking maybe a whole page titled "Integrating with Solr", which
>>>> would walk you through the process and the pitfalls.  What do you
>>>> think?
>>>
>>> Yes, I think so.
>>>
>>> The next version of Solr will probably be released soon, and then it will be
>>> much easier to integrate Solr. Maybe it is sufficient to add the information
>>> into the FAQ since the problem mentioned only affects 1.4.1?
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
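
Editorial note on the question marks Erlend reports for æ, ø and å: one common way for such characters to turn into "?" is that the text passes through a charset that cannot represent them somewhere between extraction and indexing. Whether that is what happens here is not established in this thread; the cause may just as well lie in pdfbox, as Erlend suspects. The Java sketch below only illustrates the general mechanism with the standard library. The class and method names (CharsetRoundTrip, roundTrip) are made up for illustration and are not part of Tika, pdfbox or Solr.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika, pdfbox or Solr.
final class CharsetRoundTrip {

    // Encode the text to bytes in the given charset, then decode those
    // bytes again with the same charset. Characters the charset cannot
    // represent are replaced during encoding by String.getBytes(Charset).
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for æ, ø, å so the source file encoding does not matter.
        String norwegian = "\u00e6\u00f8\u00e5";

        // UTF-8 can represent all three characters, so the text survives.
        System.out.println(roundTrip(norwegian, StandardCharsets.UTF_8).equals(norwegian)); // true

        // US-ASCII cannot, so each character is replaced by '?'.
        System.out.println(roundTrip(norwegian, StandardCharsets.US_ASCII)); // ???
    }
}
```

If a check like this, applied to the text at each stage (after extraction, before posting to Solr, after retrieval), shows the characters intact up to some point and replaced after it, that narrows down where the loss occurs.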
