tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Parser does not produce proper sentence breaks?
Date Mon, 03 Jun 2013 15:53:33 GMT
First off, those 3 *'s you see are annoying :)  They are coming from
the master slide, due to this issue:

    https://issues.apache.org/jira/browse/TIKA-1067

Would be nice to figure out how to stop this "false text" from coming out.

Second, I think the PPT/X parsers do not put any information about the
breaks between bullets today ... I think it would make sense to
somehow preserve this information (separate <div/>s maybe?).  I'm not
sure how easy this is to fix (I don't know if we get the info out of
POI today...).
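
Until the parsers preserve bullet boundaries, one possible client-side
workaround is to treat every line break in the extracted text as a hard
boundary and run the sentence BreakIterator only within each line. The
sketch below is illustrative only: the class name LineAwareSentences and
its split method are hypothetical, not part of Tika or Lucene, and it
assumes each bullet arrives on its own line, as in the BodyContentHandler
output quoted below.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika or Lucene.
public class LineAwareSentences {

    /**
     * Splits text into sentences, treating every line break as a sentence
     * boundary in addition to whatever the BreakIterator detects.
     */
    public static List<String> split(String content) {
        List<String> sentences = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ROOT);
        // \R matches any line terminator (\n, \r\n, ...); requires Java 8+.
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between blocks of text
            }
            iterator.setText(trimmed);
            for (int start = iterator.first(), end = iterator.next();
                    end != BreakIterator.DONE;
                    start = end, end = iterator.next()) {
                String sentence = trimmed.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
        }
        return sentences;
    }
}
```

Calling LineAwareSentences.split(content) on the extracted text returns
one entry per line, further divided wherever the BreakIterator finds a
sentence end. Two caveats: a bullet that contains hard line breaks in the
extracted text would be cut into several pieces, and a highlighter that
expects a BreakIterator would need this logic wrapped in a custom
BreakIterator subclass rather than a list of strings.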

Can you open an issue and attach your test PPT/X?  Thanks.

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 29, 2013 at 4:43 PM, Shai Erera <serera@gmail.com> wrote:
> Hi
>
> I started using Tika a couple of days ago, so it could very well be
> that I'm using the wrong ContentHandler, Parser configuration, and whatnot.
> I hope I am, and that there's a simple fix to the following problem:
>
> I index documents (for this discussion PPT) and then search and produce
> search highlights (using Lucene). I've noticed that the PowerPoint
> documents produce rather longish highlights. I use Lucene's
> PostingsHighlighter which breaks the content using
> BreakIterator.sentenceInstance (
> http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html), and
> for PPT documents, which often (I guess) do not contain sentence breaks
> (e.g. '.') at the end of bullets, this results in very long sentences.
>
> I wrote a simple program which parses a PPT file with one slide that looks
> like this:
>
> Slide title
>
>    - Short bullet
>    - Long bullet which will eventually end with a dot, but not just yet.
>    - Long bullet which doesn't end with a dot, not now and not ever
>    - A bullet which
>    is split into
>    multiple lines
>
> That's it, very simple. What I expected (or hoped!) was that 5 sentences
> would be output: one for the slide's title and one for each bullet.
> Instead, if I parse the file with BodyContentHandler and then invoke the
> sentence BreakIterator, I get this:
>
> *
> *
> *
>
> Slide Title
> Short bullet
> Long bullet which will end eventually with a dot, but not just yet.
>
> ++++++
> Long  bullet which doesn’t end with a dot, not now and not ever
> A bullet which is split into multiple lines
>
>
>
>
>
> ++++++
>
> The '++++++' are marks that I print after each sentence the BreakIterator
> detects. Here's the code which invokes the iterator:
>
>     BreakIterator iterator = BreakIterator.getSentenceInstance();
>     iterator.setText(content);
>     for (int start = iterator.first(), end = iterator.next();
>             end != BreakIterator.DONE;
>             start = end, end = iterator.next()) {
>         System.out.println(content.substring(start, end));
>         System.out.println("++++++");
>     }
>
> As you can see, the bullet which ends with a dot '.' is what triggers a
> sentence break. If I remove the '.', that end-of-sentence mark disappears
> as well.
>
> I then thought perhaps I should get the "raw" output from Tika, and
> followed TikaCLI code to use TransformerHandler (with method "xml") in
> order to get the output XML. I thought that perhaps by doing that I can
> replace whatever markers Tika puts with sentence breaks, be it <br/> or
> </p>, but I don't see such markers:
>
> ...
> <body><div class="slideShow"><div class="slide"><p
> class="slide-master-content">*<br/>
> *<br/>
> *<br/>
> </p>
> <p class="slide-content">Slide Title<br/>
> Short bullet
> Long bullet which will eventually end with a dot, but not just yet.
>
> ++++++
> Long  bullet which doesn’t end with a dot, not now and not never
> A bullet which is split into multiple lines<br/>
> </p>
> </div>
> </div>
> <div class="slideNotes"/>
> </body>
>
> Is there a way I can make Tika output sentence boundaries for such bullets?
> Or maybe output a marker which I can then replace with a valid sentence break
> (there are a few I can pick from, according to
> http://www.unicode.org/reports/tr29/#Sentence_Boundaries).
>
> I did notice there are \n characters in the output text, but I don't think
> replacing every \n with a '.' is a very generic solution, as the multi-line
> bullet shows.
>
> Shai
