tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: Resource Sharing Tika Corpus with Any23
Date Tue, 11 Dec 2018 01:22:13 GMT
Sorry for my delay. Send me the usernames and email addresses
privately and I'll grant access.  We're coming up on a release cycle.
On Fri, Nov 30, 2018 at 8:14 PM Lewis John McGibbney <lewismc@apache.org> wrote:
>
> Hi Tim,
> Thanks for the reply... answer inline
>
> On 2018/11/30 19:22:23, Tim Allison <tallison@apache.org> wrote:
> > I think that'd be great.  Some questions:
> >
> > 1) Would you use the same input docs that we're using or would you
> > need/want a new TB drive for your input/output?
>
> The same docs I suspect. We *could* contribute the documents we use in
> our test suite as well
> https://github.com/apache/any23/tree/master/test-resources/src/test/resources
> however this is not really necessary for us to run Any23. Any23 will only
> attempt extractions on a small subset of the documents in the corpus.
>
> > How much space will
> > you need for your eval framework including outputs?
>
> I wouldn't imagine any more than maybe 5GB disk space in all. Any23 has
> the ability to run Open Information Extraction (smart relationship
> extraction from text) and this tends to generate more triples. If we
> decided to turn this on, then it would probably get towards the 5GB mark.
> I wouldn't imagine any more than that though, Tim.
>
> > 2) Would you be willing to coordinate with us and PDFBox and POI
> > around release times?
>
> I think so yes. If anything this would be an excellent thing for Any23.
> I think improved coordination and communication between the communities
> would be a very positive step.
>
> > 3) Would you be running your processing every so often (around your
> > releases) or would it be constant aside from our releases?
>
> Most likely the former. I am aware that the service is billed to
> someone's (your) card. So we would be looking to do only what is polite
> and acceptable. Prior to releases, e.g. during review of a release
> candidate, would be really cool.
>
> >  I ask
> > because I'd like @Tobias Ospelt to have cycles for his fuzzing work
> > when we're not getting ready for a release.
> >
>
> That sounds fine to me.
> Thank you for the response.
