tika-dev mailing list archives

From Luís Filipe Nassif <lfcnas...@gmail.com>
Subject Re: 1.20?
Date Thu, 13 Dec 2018 18:33:47 GMT
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files
in the likely-broken folder, but it seems Tika was able to extract some text
from them before. Do you know whether those files are really broken, and why
Tika extracted text from them before?

Thank you,
Luis

On Thu, Dec 13, 2018 at 13:02, Tim Allison <tallison@apache.org> wrote:

> Reports are here:
>
> http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>
> I'm going to revert the mp4 parser, and commit the few dependency
> upgrades I ran.
>
> The _major_ difference in content for ppt is explained by the
> duplication of header/footer info.  To confirm this, note that the
> values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> identical for nearly all ppt->ppt, but there are far more tokens in
> "num_tokens_a" vs "num_tokens_b".
>
> I also see that we're losing content in x-java and x-groovy, etc., but
> that's because we're now suppressing the style markup that our parser
> was (incorrectly, IMHO) inserting -- check the values in
> "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> weight: 3 | family: 2
>
> In short, I think we're good to go.  Will roll rc1 later today or
> (more likely) tomorrow unless there are objections.
> On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <tallison@apache.org> wrote:
> >
> > Any blockers on 1.20?  I'm going to kick off the regression tests
> > shortly.
> > On Fri, Nov 30, 2018 at 7:39 PM <loompa@gmail.com> wrote:
> > >
> > > Hi,
> > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <tallison@apache.org> wrote:
> > >
> > > > Dave,
> > > >   Should I try to get the Docker plugin working again?
> > > >
> > >
> > > That would be great. I think I may have gone down the wrong path
> > > building an image at package time, as there doesn't seem to be an easy
> > > way to publish it under an Apache-labelled org on Docker Hub unless it
> > > builds from source.
> > >
> > > I have some time over the weekend, so I could update to where I got to
> > > and see what you think.
> > >
> > > Cheers,
> > > Dave
>
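
The ppt check Tim describes above (matching "num_unique_tokens_a" and "num_unique_tokens_b", but differing "num_tokens_a" and "num_tokens_b") can be sketched as follows. This is only a minimal illustration: it assumes naive whitespace tokenization, and the sample strings and the helper names token_stats and looks_like_duplication are hypothetical. The tokenization behind the actual report columns may well differ.

```python
def token_stats(text):
    """Return (num_tokens, num_unique_tokens) using naive whitespace tokenization."""
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


def looks_like_duplication(extract_a, extract_b):
    """True when both extracts have the same number of unique tokens
    but a different total number of tokens."""
    tokens_a, unique_a = token_stats(extract_a)
    tokens_b, unique_b = token_stats(extract_b)
    return unique_a == unique_b and tokens_a != tokens_b


# Hypothetical extracts: "a" repeats the footer text on every slide,
# "b" emits it only once.
extract_a = "slide one footer acme slide two footer acme"
extract_b = "slide one two footer acme"

print(token_stats(extract_a))
print(token_stats(extract_b))
print(looks_like_duplication(extract_a, extract_b))
```

Equal unique-token counts do not strictly prove the vocabularies are identical; comparing the token sets themselves would be a stronger check, at the cost of needing the raw extracts rather than just the report columns.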
