tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: 1.20?
Date Thu, 13 Dec 2018 15:02:09 GMT
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".
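
For anyone who wants to run the same check over the comparison report, here is a minimal sketch. It assumes the rows have already been loaded as dicts keyed by the column names quoted above; the "file" key and the sample numbers are made up for illustration and are not taken from the report.

```python
# Sketch only: flag rows where both extracts share the same number of unique
# tokens but extract A has more tokens overall, which is the pattern expected
# when the same text (e.g. header/footer info) is simply repeated.

def looks_like_duplication(row):
    """Return True if the row matches the 'same vocabulary, more tokens in A' pattern."""
    same_vocab_size = row["num_unique_tokens_a"] == row["num_unique_tokens_b"]
    more_tokens_in_a = row["num_tokens_a"] > row["num_tokens_b"]
    return same_vocab_size and more_tokens_in_a

# Hypothetical sample rows; the "file" key and all values are invented.
rows = [
    {"file": "slides1.ppt", "num_unique_tokens_a": 120, "num_unique_tokens_b": 120,
     "num_tokens_a": 900, "num_tokens_b": 600},
    {"file": "slides2.ppt", "num_unique_tokens_a": 80, "num_unique_tokens_b": 75,
     "num_tokens_a": 300, "num_tokens_b": 280},
]

flagged = [row["file"] for row in rows if looks_like_duplication(row)]
print(flagged)
```

Equal unique-token counts are only a proxy for identical vocabularies, so any flagged file is worth a spot check by hand.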

I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was (incorrectly, IMHO) inserting -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2
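
If it helps to inspect these cells programmatically, a rough sketch for parsing one follows. It assumes the cell is a single string in the "token: count | token: count" form shown above; the function name is my own and not part of any Tika tooling.

```python
# Sketch only: turn a "top_10_unique_token_diffs_a"-style cell into a dict.
# Assumes entries are separated by "|" and each entry ends with ": <count>".

def parse_token_diffs(cell):
    """Parse 'token: count | token: count' into {token: count}."""
    counts = {}
    for part in cell.split("|"):
        # Split on the last ": " so tokens such as "0,0,0" stay intact.
        token, _, count = part.strip().rpartition(": ")
        counts[token] = int(count)
    return counts

cell = ("rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | "
        "147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2")

parsed = parse_token_diffs(cell)
print(parsed)
```

Every token in this example is a CSS-style keyword or a color value, which is consistent with the lost content being style markup rather than source text.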

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.
On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <tallison@apache.org> wrote:
>
> Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
> On Fri, Nov 30, 2018 at 7:39 PM <loompa@gmail.com> wrote:
> >
> > Hi,
> > On Wed, 21 Nov 2018 at 13:00, Tim Allison <tallison@apache.org> wrote:
> >
> > > Dave,
> > >   Should I try to get the Docker plugin working again?
> > >
> >
> > That would be great. I think I may have gone down the wrong path building
> > an image at package time, as there doesn't seem to be an easy way to
> > publish it as an Apache labelled org on Dockerhub unless it builds from
> > source.
> >
> > I have some time over the weekend, so could update to where I got to and
> > see what you think.
> >
> > Cheers,
> > Dave
