tika-dev mailing list archives

From Alejandro Caceres <acace...@hyperiongray.com>
Subject Re: Tika Issue?
Date Thu, 16 Jul 2015 19:56:01 GMT
Gah, I messed up the bug story. You're right, that text is categorized as
en; I screwed up with the file. Here is a more accurate summary of what
I'm seeing, with some examples. Pretend the previous email was all a
terrible dream.

There appear to be two potential issues going on. Let's start with the
language categorization, since I already brought it up. I've attached 3
files below for reference. language_no_1 and language_no_2 are both picked
up as Norwegian; I suspect this is because there's only a small amount of
text. language_lt_2 is probably the most interesting to me: this text is
picked up as Lithuanian and seems to have a good amount of text, but has a
footer that is in various languages. I suspected that was throwing it off;
however, most of the text is definitely English, so perhaps something else
is going on.
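
To make the "small amount of text" suspicion concrete, here is a toy sketch of
character-trigram profile matching. It is not Tika's actual implementation;
the function names trigram_profile and cosine_similarity are made up for
illustration, and the only point is that a short sample yields very few
trigrams, so a handful of foreign-language tokens can carry a lot of weight.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text.

    Runs of whitespace are collapsed to a single space first.
    (Hypothetical helper, for illustration only.)
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Comparing the profile of a page against reference profiles built from known
English and known Lithuanian text, once with and once without the footer,
would be one way to see how much the footer moves the score.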

The other issue I'm seeing is with the parser, but maybe I've misunderstood
something. Here is some code:

import requests
from tika import parser
from tika import language

r = requests.get("http://ferretspatternu.ucoz.com/")
string_parsed = parser.from_buffer(r.text)
lang = language.from_buffer(string_parsed["content"])
print(string_parsed["content"])
print(lang)

The language is picked up as Lithuanian, and I can see why: the "content"
field does not look like plain text, but like raw HTML. In other
documents this field looks like it contains sanitized text... or am I
missing something?
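
One way to check whether the markup is what confuses the detector would be to
strip the HTML locally before handing the text to language detection. The
sketch below uses only the Python 3 standard library; visible_text is a
made-up helper name, not part of tika-python, and it makes no claim about why
the server returned markup in the first place.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML string with tags removed (hypothetical helper)."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)
```

Passing visible_text(r.text) to language.from_buffer instead of
string_parsed["content"] would show whether the result changes once the tags
are gone.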

Anyway, hope that's all a little bit clearer! Let me know what you think.

Alex

PS: this is with the latest version of tika-python running Tika server 1.9.





On Thu, Jul 16, 2015 at 2:58 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Alex, what version of Tika Python are you using? And moreover
> what version of Tika? I’m CC’ing folks on dev@tika.a.o hope you
> don’t mind.
>
> I took the file you attached and saved it as blah.txt and ran
> tika-python (with 1.9 tika) against it:
>
> [mattmann-0420740:~] mattmann% tika-python detect type blah.txt
> tika.py: Retrieving
> http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar to
> /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
> [(200, u'text/plain')]
> [mattmann-0420740:~] mattmann% tika-python language file blah.txt
> [(200, u'en')]
> [mattmann-0420740:~] mattmann%
>
> Is that what you would expect? In general the language detection uses
> N-grams and gets better when there is more text as a sample, but it can
> get fooled sometimes too.
>
> Let me know what you think.
>
> Cheers,
> Chris
>
> CC / memex-jpl@
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Alejandro Caceres <acaceres@hyperiongray.com>
> Date: Thursday, July 16, 2015 at 11:53 AM
> To: jpluser <chris.a.mattmann@jpl.nasa.gov>
> Cc: Amanda Towler <atowler@hyperiongray.com>
> Subject: Tika Issue?
>
> >Hey Chris,
> >
> >
> >I was about to submit this as a bug, but figured I'd run it by you first.
> >Maybe you've encountered a similar issue.
> >
> >
> >I'm doing some basic language categorization of websites. I saw that the
> >Tika server/tika-python returns content as plain text, which is great to
> >send to Tika language categorization (and just generally useful). However,
> >it seemed to get very confused with sites that have footers in various
> >languages, which is actually really common in the results we've found. For
> >example, we have a totally English site and at the bottom are some links
> >to the same site in other languages. This page, even though it's mostly
> >English, gets categorized as a seemingly random language (like Lithuanian).
> >
> >
> >As a workaround we tried running the web pages through a text
> >summarization algo using lxml-readability, which gives us back a subset
> >of the text on a page. My thinking was this would most likely strip
> >footers and headers and give us back a decent representative sample of
> >text on the page. The results seem to have improved a bit, but we're
> >still getting some funky results where English pages are categorized as
> >a seemingly random language; in many cases these pages seem pretty
> >obviously (to the human eye) to be English.
> >
> >
> >I wonder if someone at JPL (I don't see anyone from JPL here right now)
> >could shed some light on why this might be happening. I've attached a
> >couple of samples below. Also let me know if you'd like me to file any
> >bugs anywhere to better track this; I just wanted to shoot this to you
> >first to see if perhaps I was missing something obvious.
> >
> >
> >
> >Alex
> >
> >
> >--
> >___
> >
> >Alejandro Caceres
> >Hyperion Gray, LLC
> >Owner/CTO
> >
>
>


-- 
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO
