tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika 0.2 Release
Date Sat, 29 Nov 2008 00:43:58 GMT

On Sat, Nov 29, 2008 at 12:02 AM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:
> My comments on RC1 are below.  i don't feel comfortable voting for it in
> its current state...

Thanks for the review, much appreciated!

I think it's fair to say that with the 0.2 release we're still pretty
much in the transition from the Incubator to Lucene (and from a
developer-only product to a general end user product). The main drive
(at least from my side) for the 0.2 release was just to get whatever
we had at the moment released as soon as possible for interested users
(release early, release often), and then focus in 0.3 on getting all
the extra stuff like documentation and extra build artifacts in place.

I should also note that Chris Mattmann did call (see
http://markmail.org/message/ux3uc72zlwarow5i) for the release to be
made clearly either as an Incubator release, or as a Lucene release
once all the project migration is done. I guess I was the main
proponent of pushing the 0.2 release out while the Lucene migration
was still incomplete.

> 1) release naming: should probably be apache-tika-0.2-src.jar  i seem to
> recall someone somewhere saying that was important for apache releases
> (and it's more consistent with the 0.1 release)

Good point, we probably should do that. Dave, can you take care of this?

> 2) release file format: the 0.1 release seems to have been a tar.gz ...
> was a conscious choice made by the community to switch to distributing as a
> src jar? otherwise you may want to publish both, or stick with tar.gz for
> consistency (the docs on the website refer to the tarball when giving
> examples of downloading and verifying)

I was pretty vocal about switching to the jar format for our
source releases, see most notably
http://markmail.org/message/mwi4w2odztsxlcgi and
http://markmail.org/message/jnthn2q4pghqxjlc. Unless the PMC prefers a
tarball, I would rather fix the documentation than change the
packaging format.

> 3) incubator refs: as mentioned before, there are a lot of references to
> the incubator that should be switched to point to lucene...
> hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ grep -lir incubator .
> ./pom.xml
> ./src/site/apt/download.apt
> ./src/site/apt/index.apt
> ./README.txt

Fair point, and it goes with my statement above about getting the
release out as soon as possible after graduation. In Tika trunk we've
now updated all Incubator references, so any new release will have
this issue fixed. Given the PMC pushback, perhaps we should just scrap
the 0.2 release and go directly to 0.3 based on the current trunk?
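
A minimal sketch of how that check could be repeated against a trunk
checkout before cutting a new release candidate. It mirrors the
`grep -lir incubator .` command quoted above; the function name
`files_mentioning` is hypothetical and Python is an assumption here,
not something the Tika build uses.

```python
import os


def files_mentioning(root, needle="incubator"):
    """Return sorted paths (relative to root) of files whose content
    contains `needle`, ignoring ASCII case. Hypothetical helper."""
    needle_bytes = needle.lower().encode("utf-8")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so binary test documents do not raise
                # decoding errors.
                with open(path, "rb") as handle:
                    if needle_bytes in handle.read().lower():
                        hits.append(os.path.relpath(path, root))
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue
    return sorted(hits)
```

Calling `files_mentioning(".")` from the root of the source tree
returns an empty list once no file mentions the Incubator any more.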

> 4) user docs: (I think grant may have already mentioned this) The
> README.txt file talks about building Tika, but there doesn't seem to be
> anything in the release that describes how to use Tika ... has any thought
> been given to including more docs in the release itself? --
> gettingstarted.html perhaps? ... at the very least a paragraph should be
> added to the README referring to the gettingstarted.html page.
> Personally, i think including documentation.html and formats.html in the
> release are also important -- they're going to change between releases,
> probably more than the "getting started" type info, and should be
> "versioned" so moving forward people with older versions won't get
> misled by the docs on the site.

The available documentation is already included in the source release
in src/site and can be generated with "mvn site". The fact that the
documentation isn't complete (e.g. the Getting Started guide didn't
yet exist in the 0.2 release candidate) shouldn't IMHO be a blocker for a
release (especially for a 0.x one). In any case it's an area where we
are clearly getting better during the 0.x release cycle.

The README could mention "mvn site" as the command to generate the
official documentation for that release and we could include a static
snapshot of that in http://lucene.apache.org/tika/ for reference. This
is something we should look at.

> 5) artifacts missing: i tried following along with the gettingstarted.html
> (my first time using maven BTW so i may have messed something up) and ran
> into a snag... "mvn install" downloaded a bunch of dependencies (i think
> they were maven's own dependencies since i'd never used it before), ran
> some tests (these definitely had tika in the name) then downloaded some
> more things, then told me it was installing tika-0.2.jar in my ~/.m2
> directory.  When i looked at the next section "Build artifacts" it referred
> to 3 jars in my target directory -- but i only have one...
> hossman@coaster:~/tmp/tika-release/rc1/tika-0.2$ find target -name \*jar
> target/tika-0.2.jar
> ...is the gettingstarted.html wrong, or did the build not run correctly?

The Getting Started guide is wrong in claiming that the standalone jar
should be available in a 0.2 build. I've fixed this in revision
721589. Only the tika-0.2.jar is produced by the 0.2 build.

Currently the guide contains some forward-looking statements about the
potentially upcoming 0.3 release, mostly that the "standalone" and
"jdk14" artifacts are included in 0.3 (they are available in current
trunk and the related Jira issues are targeted for release in 0.3). In
general I think it's not a good idea to publish documents with such
forward-looking statements, but in this case there is pretty good
consensus about the contents of Tika 0.3, so when writing the
documentation I opted to publish the forward-looking information
rather than hold it back and revise the document later on.

> 6) RAT: Apache RAT noticed the following files missing license info...
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tika.svg
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tikaNoText.svg
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML.html
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML_utf8.html
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testRTF.rtf
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testTXT.txt
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXHTML.html
>  !????? /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXML.xml
> ...I don't know if i've ever heard an opinion on needing to include the
> ASL header in *.svg files (they are xml, but they are also clearly
> generated by inkscape), but I do remember someone pointing out that test
> data files in formats that are capable of containing comments in them (ie:
> xml, html, etc...) should include the ASL header, such as...
> http://svn.apache.org/repos/asf/lucene/solr/trunk/example/exampledocs/hd.xml

I think that having the license header in such test files disrupts the
main purpose of the test cases (i.e. you want to check whether the
extracted text contains some specific test phrase, not necessarily the
Apache license header), so at least I prefer to not include the
license header in those test files. See also
http://markmail.org/message/m7jmgl3qncsffygb for related discussion on

However, if the PMC so wishes, I don't see any big problem in us
adding the license headers in these test files. Note that in some
future test files this might be troublesome, but for existing tests I
don't see problems with this.
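
To illustrate the concern, here is a minimal sketch, assuming Python's
standard `html.parser` as a stand-in for a real text extractor (it is
not Tika, and Tika's parsers may behave differently). The test phrase
and helper names are hypothetical. A license notice placed in an HTML
comment stays out of the extracted text, while a format without
comment syntax, such as plain text, carries the notice into the
extracted text.

```python
from html.parser import HTMLParser

# First words of the ASF source header.
LICENSE_NOTICE = "Licensed to the Apache Software Foundation (ASF)"
# Hypothetical phrase a test case would look for in the extracted text.
TEST_PHRASE = "Test document body"


class TextCollector(HTMLParser):
    """Collects character data only; comments are ignored by default."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_html_text(markup):
    """Return whitespace-normalized text content of an HTML string."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.chunks).split())


def extract_plain_text(content):
    """Plain text has no comment syntax, so everything is content."""
    return " ".join(content.split())


html_doc = (
    "<!-- " + LICENSE_NOTICE + " -->\n"
    "<html><body><p>" + TEST_PHRASE + "</p></body></html>"
)
txt_doc = LICENSE_NOTICE + "\n\n" + TEST_PHRASE + "\n"

html_text = extract_html_text(html_doc)  # header is not part of the text
txt_text = extract_plain_text(txt_doc)   # header is part of the text
```

A test that only checks whether the test phrase is contained in the
output passes in both cases; a test that compares the whole extracted
text against an expected value would have to account for the header
in the plain-text case.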

> 7) javadocs: maybe this is something that is obvious to maven users, and
> as a non-maven user i just don't know the magic incantation, but i
> couldn't find any generated javadocs in the release (or in the "target"
> directory after running "mvn install") ... since Tika is primarily a
> library people will use in java apps, this seems kind of important.  If
> there is a magic maven incantation to build these, let's include the
> instructions somewhere (since the gettingstarted guide suggests that maven
> is necessary to build tika, but not to use it (per the Artifacts and Ant
> sections))

Good point. The README could point out "mvn site" as the way to
produce a browseable version of all documentation associated with the
release, and as an added service we could (should?) publish specific
per-version documentation also on the Tika web site.

On the other hand, I don't see documentation as being a valid blocker
for any 0.x release.


Jukka Zitting
