tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: 1.20?
Date Fri, 14 Dec 2018 22:33:38 GMT
Thank you, again, Luís Filipe Nassif!  There's no point in having
reports unless we pay attention to them :P.  I reverted junrar to
where it was in 1.19.1. I also reverted jackcess based on the reports.

All,
  On the theory that it isn't a great idea to push to production on a
Friday, I'm going to let the recent changes rest over the weekend.
I'll rerun some tests on a subset of the regression corpus on Monday
and then roll rc1.  If anyone wants to kick the tires on the recent
version changes, including parsers that depend on the upgraded guava,
that'd be great!

Onward!

Cheers,

           Tim

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison <tallison@apache.org> wrote:
>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <tallison@apache.org> wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <lfcnassif@gmail.com> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in the likely broken folder, but it seems Tika was able to extract some text
>> > from them before. Do you know if those files are really broken and why Tika
>> > extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > On Thu, Dec 13, 2018 at 13:02, Tim Allison <tallison@apache.org> wrote:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was (incorrectly, IMHO) inserting -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <tallison@apache.org> wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM <loompa@gmail.com> wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <tallison@apache.org> wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have gone down the wrong path
>> > > > > building an image at package time, as there doesn't seem to be an
>> > > > > easy way to publish it as an Apache labelled org on Docker Hub
>> > > > > unless it builds from source.
>> > > > >
>> > > > > I have some time over the weekend, so could update to where I got to
>> > > > > and see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >
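
The ppt check described in the thread (equal "num_unique_tokens_a" and
"num_unique_tokens_b", but more tokens in "num_tokens_a" than in
"num_tokens_b") can be sketched roughly as below. This is only an
illustration under assumptions: the four column names come from the message,
but the function name, the row layout, and the sample numbers are
hypothetical and are not taken from the actual reports. The thread also does
not say which Tika version "a" and "b" refer to, so the sketch does not
assume either. Equal unique-token counts suggest, but do not prove, that the
two extracts share the same vocabulary.

```python
from typing import Dict, List


def looks_like_duplication(row: Dict[str, int]) -> bool:
    """Return True when extracts A and B have the same number of unique
    tokens but A has more tokens overall, which is the pattern the
    thread attributes to duplicated header/footer text."""
    same_vocab_size = row["num_unique_tokens_a"] == row["num_unique_tokens_b"]
    more_tokens_in_a = row["num_tokens_a"] > row["num_tokens_b"]
    return same_vocab_size and more_tokens_in_a


# Hypothetical rows, invented for illustration only.
rows: List[Dict[str, int]] = [
    # Same unique count, more total tokens in A: flagged.
    {"num_tokens_a": 480, "num_tokens_b": 320,
     "num_unique_tokens_a": 150, "num_unique_tokens_b": 150},
    # Identical counts everywhere: not flagged.
    {"num_tokens_a": 200, "num_tokens_b": 200,
     "num_unique_tokens_a": 90, "num_unique_tokens_b": 90},
    # Unique counts differ, so this is a real content change: not flagged.
    {"num_tokens_a": 90, "num_tokens_b": 140,
     "num_unique_tokens_a": 40, "num_unique_tokens_b": 75},
]

flags = [looks_like_duplication(r) for r in rows]
print(flags)
```

Rows that are not flagged but still differ in token counts would need a
closer look, since they may reflect genuinely lost or gained content rather
than repetition.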
