lucene-solr-user mailing list archives

From Antonio Calò <anton.c...@gmail.com>
Subject Re: Tika trouble
Date Mon, 16 Nov 2009 11:06:56 GMT
If you want to index a PDF, you should use a PDF extractor, which can pull
both the text content and the metadata out of the file. I suspect you have
simply opened the PDF and indexed it as is, so what got stored is the raw
binary data and nothing more. For my application I used PdfExtractor, but
the PDFBox project could also be used.
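
To check whether that is what happened, a small script can test the stored
text for PDF file syntax. The following is only a minimal sketch in Python
using the standard library; the function names and the threshold of three
markers are assumptions made for illustration, not part of Solr, Tika or
PDFBox. It treats "endstream", "endobj" and "N N obj" as signs that raw PDF
content was indexed, since those tokens show up in the third result quoted
below.

```python
import re

# Tokens that belong to PDF file syntax rather than to readable text.
# The third quoted result contains "endstream", "endobj" and "137 0 obj".
PDF_SYNTAX = re.compile(r"\bendstream\b|\bendobj\b|\b\d+ \d+ obj\b")


def is_pdf_file(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header '%PDF-'.

    This is a strict check: a file with stray bytes before the header
    would be rejected here even if some PDF readers accept it.
    """
    return data.startswith(b"%PDF-")


def looks_like_raw_pdf(text: str, min_hits: int = 3) -> bool:
    """Return True if supposedly extracted text still holds PDF syntax.

    min_hits is an arbitrary threshold chosen for this sketch so that a
    document which merely mentions 'endobj' once is not flagged.
    """
    return len(PDF_SYNTAX.findall(text)) >= min_hits


def has_no_text(text: str) -> bool:
    """Return True if the text is empty or whitespace only (first result)."""
    return not text.strip()
```

A file that passes is_pdf_file but whose indexed text trips
looks_like_raw_pdf was most likely stored without any extraction step.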

Antonio

2009/11/16 Markus Jelsma - Buyways B.V. <markus@buyways.nl>

> Does anyone have a clue?
>
>
>
> > List,
> >
> >
> > I somehow fail to index certain PDF files using the
> > ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but
> > a modified schema. I have a very simple schema for this case, using only
> > an ID field, a timestamp field and two dynamic fields, ignored_* and
> > attr_*, both indexed, stored and multivalued strings. They are
> > multivalued simply because some HTML files fail when storing multiple
> > hyperlinks.
> >
> > I have posted multiple files to
> > http://.../update/extract?literal.id=doc1 including:
> > 1. the whitepaper at
> > http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
> > 2. the HTML file of the front page of http://nu.nl/
> > 3. another PDF at
> > http://www.google.nl/url?sa=t&source=web&ct=res&cd=1&ved=0CAcQFjAA&url=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdf&rct=j&q=2007.cmp_mapreduce.hpca.pdf&ei=PPz7SpiiOM6l4QbZjKjRAw&usg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A
> >
> > For each document I have a corresponding select/?q=*:*:
> >
> >
> > 1. No text? Should I see something?
> >
> > <doc><str name="id">doc1</str>
> > <arr name="ignored_content_type">
> > <str>application/octet-stream</str>
> > </arr>
> > <arr name="ignored_stream_content_type">
> > <str>
> > text/xml; charset=UTF-8;
> > boundary=----------------------------cf57b4ad644d
> > </str>
> > </arr>
> > <arr name="ignored_stream_size">
> > <str>491238</str>
> > </arr>
> > <arr name="ignored_text">
> > <str>        </str>
> > </arr>
> > <date name="timestamp">2009-11-12T12:17:23.016Z</date>
> > </doc>
> >
> >
> > 2. Plenty of data; this seems to be OK
> >
> > <doc>
> > <str name="id">doc1</str>
> > <arr name="ignored_content_type">
> > <str>application/xhtml+xml</str>
> > </arr>
> > <arr name="ignored_links">
> > <str>http://www.nu.nl/</str>
> > <str>http://www.nu.nl/</str>
> > <str>http://www.nu.nl/algemeen/</str>
> > <str>http://www.nu.nl/economie/</str>
> > ....
> > <arr name="ignored_stream_content_type">
> > <str>
> > text/xml; charset=UTF-8;
> > boundary=----------------------------b6e44d087bdd
> > </str>
> > </arr>
> > <arr name="ignored_stream_size">
> > <str>36991</str>
> > </arr>
> > <arr name="ignored_text">
> > <str>
> > A LOT OF TEXT HERE
> > </str>
> > </arr>
> > <date name="timestamp">2009-11-12T12:19:15.415Z</date>
> > </doc>
> >
> >
> > 3. a lot of garbage
> >
> > <doc>
> > <str name="id">doc1</str>
> > <arr name="ignored_content_encoding">
> > <str>windows-1252</str>
> > </arr>
> > <arr name="ignored_content_language">
> > <str>fr</str>
> > </arr>
> > <arr name="ignored_content_type">
> > <str>text/plain</str>
> > </arr>
> > <arr name="ignored_language">
> > <str>fr</str>
> > </arr>
> > <arr name="ignored_stream_content_type">
> > <str>
> > text/xml; charset=UTF-8;
> > boundary=----------------------------83df0fd4d358
> > </str>
> > </arr>
> > <arr name="ignored_stream_size">
> > <str>361458</str>
> > </arr>
> > <arr name="ignored_text">
> > <str>
> > A LOT OF GARBAGE HERE including
> >
> > ió½·Þp™ó 4­0›
> > š©xÓ ^ CøùI3람š³î¨V ÚÜ¡yS4 ¹£ ² ›H 6õɨ5¤ ÅÜ磩bädÒøŸ\
�s%OîÐÙIÑYRäŠ ;4
> > ¢9"r "—!rEôˆÌ {SìûD²à £©ïœ«{‘ínÆ N÷ô¥F»�™ ±¡Ë'ú\³=·m„Þ
»ý)³Å=j¶B¢)`  Ñ
> > „Ï™hjCu{£É5{¢¯ç6½Ñhr¢ºÃ=J M- AqsøtÜì ÿ^Rl S?¿óšM‰—lv‘Ø›Qüãý´
þžŽ
> > $S;¾¦wze³Ù)qÉú§ ‰› ãqó…Ó ‰ª"U:šBÝ‘GuŠ"ë
> > MM±Òv �~ ‚N‹t¢ä§~Ì ÞŒS—Êòö¼ÊÄQaº¸¿7tñ ¾Áç œãØŒ58$O
3Å~�8¿L  ‡ëŽó©pk _
> > Ša Â=u×; (ä<¹@.œ÷ä ù° µk+ÿ PP~ ¨*ݤ¿Œ™¡D»   @fI$0°�Î
Ù·p“Œ,Øâ  †¶v
> > ¤v1#8¼0 ›  èð€-†šZ 6¾  ! ñb ˆbˆ¤v)LS)T X² ¬ l!@€  6E$Q
> > endstream
> > endobj
> > 137 0 obj<</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p]>>
> > endobj
> > 138 0 obj<</Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942
> > 728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV
> > 141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0>>
> > endobj
> > 139 0 obj<</Count 12/Kids[140 0 R 141 0 R]/Type/Pages>>
> > endobj
> > 140 0 obj<</Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0
> > R]/Type/Pages/Parent 139 0 R>>
> > endobj
> > 141 0 obj<</Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0
> > R]/Type/Pages/Parent
> >
> > ....
> >
> > </str>
> > </arr>
> > <date name="timestamp">2009-11-12T12:21:28.306Z</date>
> > </doc>
> >
> >
> > Any ideas? Why doesn't the whitepaper produce any text, and why is the
> > other PDF full of garbage? At least I'm happy that HTML works
> > fine.
> >
> >
> >
> > Regards,
> >
> > -
> > Markus Jelsma          Buyways B.V.
> > Technisch Architect    Friesestraatweg 215c
> > http://www.buyways.nl  9743 AD Groningen
> >
> >
> > Alg. 050-853 6600      KvK  01074105
> > Tel. 050-853 6620      Fax. 050-3118124
> > Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
> >
>



-- 
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail anton.calo@gmail.com
Tel. 011-56.90.429
------------------------------------------
