tika-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: [DISCUSS] Integrate Apache Any23 into Apache Tika
Date Fri, 18 Oct 2013 15:25:56 GMT

I had a look at Any23 some time ago and found that it did indeed overlap
with quite a few other projects. It could (should?) have either relied on
those projects (e.g. Tika for parsing and mimetype detection) or delegated
the functionality altogether (e.g. crawling to Nutch) instead of
reinventing the wheel and spreading itself thin.

I am not familiar with the history of the project, where the code comes
from and who was behind it but I am a bit surprised that the project was
allowed to graduate from incubation without these points being addressed.

Migrating the code to Tika as a whole would not be a good idea I think.
However from a Tika point of view, it could be interesting to have the meta
parsers to convert the semantic information into a neutral representation
as a ContentHandler as in TIKA-980. Most people would probably be
interested in that more than the generation side of Any23 (what is referred
to as output format) which I think is not so relevant for Tika. From an
Any23 perspective, the project could then focus on the generation side and
just rely on Tika for pretty much everything else.
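
To make the "neutral representation as a ContentHandler" idea concrete,
here is a minimal sketch. It is not the TIKA-980 patch and not Any23 code:
ItempropHandler and extract are hypothetical names, the handler only
understands flat microdata itemprop attributes, and the JDK SAX parser
stands in for the SAX events a Tika parser would emit.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects itemprop name -> text content pairs
// from a stream of SAX events into a plain map.
class ItempropHandler extends DefaultHandler {
    private final Map<String, String> properties = new LinkedHashMap<>();
    private String currentProp;   // property being captured, or null
    private StringBuilder text;   // text collected for currentProp
    private int depth;            // child elements open inside currentProp

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (currentProp != null) {
            depth++;              // nested element: keep collecting its text
            return;
        }
        String prop = atts.getValue("itemprop");
        if (prop != null) {
            currentProp = prop;
            text = new StringBuilder();
            depth = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentProp != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentProp == null) {
            return;
        }
        if (depth > 0) {
            depth--;              // closing a nested element
            return;
        }
        properties.put(currentProp, text.toString().trim());
        currentProp = null;
    }

    Map<String, String> getProperties() {
        return properties;
    }

    // Convenience wrapper: parses well-formed XHTML with the JDK SAX parser
    // (an assumption for the sketch; Tika would supply the events itself).
    static Map<String, String> extract(String xhtml) {
        try {
            ItempropHandler handler = new ItempropHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getProperties();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

The point of the sketch is only that the consumer sees a neutral map of
properties and never needs to know which markup syntax carried them;
serializing that map to RDF or anything else would stay on the Any23 side.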

I haven't looked into Any23 in great detail and there could be other
interesting things to take from it.


On 18 October 2013 15:46, Ken Krugler <kkrugler_lists@transpac.com> wrote:

> Hi Lewis,
> I haven't had much time to look into Any23, which includes reviewing
> Markus's patch for integrating some portions of that into Tika (see
> https://issues.apache.org/jira/browse/TIKA-980)
> The main challenge I see is that Tika seems to do best as a wrapper for
> other parsers, versus outright ownership of parsers.
> Which isn't to say that rolling Any23 into Tika wouldn't work, but without
> at least one active developer it seems likely that the code would languish.
> But maybe that's OK…
> -- Ken
> On Oct 18, 2013, at 7:30am, Lewis John Mcgibbney wrote:
> > Hi Tika Dev's/PMC,
> >
> > This thread is aimed at recognizing common ground shared by Any23 and
> > Tika in an attempt to possibly integrate Any23 into Tika.
> > First however it will serve a purpose for me to put this into context and
> > also provide some rationale behind this initiative.
> >
> > It is my understanding that the Tika PMC sponsored Any23 through the
> > Apache Incubator until we (the Any23 PMC) were ready to graduate having
> > made an incubating release and having grown the community somewhat. Post
> > graduation, we made a 0.8.0 release in July 2013.
> >
> > It is also my understanding that the logical justification for the Tika
> > sponsoring us, was that it was envisaged (by numerous dev's) that there
> > was already some common ground between the aim and objectives of both
> > projects e.g. mime type detection, parsing, extraction of metadata,
> > serialization, etc. therefore with a little positive thinking and
> > understanding of both projects, one can clearly see the shared interests.
> >
> > I am speaking on behalf of the Any23 community here when I say that we
> > have however come to a realization that the community is not as vibrant
> > as we would like. This is combined with the fact that initial/original
> > project dev's are not around right now to keep the project moving in a
> > forward direction.
> >
> > It is therefore of interest to us, to approach the Tika community with
> > the intention of discussing a proposal to integrate Any23 code into
> > Apache Tika.
> >
> > For those interested, the Any23 project URL is http://any23.apache.org,
> > we also have a live service which you can use to get a feel for what
> > Any23 actually does. It can be found at http://any23.org.
> >
> > Any feedback from this community would be really appreciated, as it looks
> > like the alternative would be for us to take the code into the Apache
> > Attic... which is always a last resort.
> >
> > Thanks in advance.
> >
> > Lewis
> >
> > --
> > *Lewis*
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

Open Source Solutions for Text Engineering

