tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: TikaIO concerns
Date Fri, 22 Sep 2017 17:33:09 GMT
Nice!  Thank you!

-----Original Message-----
From: Ben Chambers [mailto:bchambers@apache.org] 
Sent: Friday, September 22, 2017 1:24 PM
To: dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming
BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here: https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how different failures
should be handled, with the option of just outputting them to a different PCollection that
is then processed in some other way.
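
As a rough illustration of the routing idea above, here is a minimal plain-Java sketch. It deliberately does not use the Beam or Tika APIs: the class DeadLetterSketch, the Routed holder, and the fake parser are all hypothetical names invented for this example, and the two lists merely stand in for the main output and the dead-letter PCollection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: not Beam's DoFn API and not Tika's parser API.
public class DeadLetterSketch {

    /** Holds the two outputs: parsed results and the inputs that failed. */
    public static final class Routed {
        public final List<String> parsed = new ArrayList<>();
        public final List<String> deadLetters = new ArrayList<>();
    }

    /**
     * Applies the parser to every document. A document whose parse throws
     * is routed to the dead-letter list instead of failing the whole run.
     */
    public static Routed route(List<String> documents,
                               Function<String, String> parser) {
        Routed out = new Routed();
        for (String doc : documents) {
            try {
                out.parsed.add(parser.apply(doc));
            } catch (RuntimeException e) {
                // Keep the original input so it can be handled separately later.
                out.deadLetters.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stand-in for a parse call: rejects names starting with "corrupt".
        Function<String, String> fakeParser = doc -> {
            if (doc.startsWith("corrupt")) {
                throw new IllegalStateException("cannot parse " + doc);
            }
            return "text of " + doc;
        };
        Routed result =
            route(Arrays.asList("a.pdf", "corrupt.doc", "b.txt"), fakeParser);
        System.out.println("parsed: " + result.parsed);
        System.out.println("dead letters: " + result.deadLetters);
    }
}
```

Note that this only covers failures that surface as exceptions. The hangs and process crashes discussed elsewhere in this thread would need something more, such as timeouts or process isolation.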

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <tallison@mitre.org>
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -----Original Message-----
> From: Ben Chambers [mailto:bchambers@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: dev@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some 
> other IO has used the concept of a "Dead Letter" side-output, where 
> documents that failed to process are side-output so the user can 
> handle them appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov 
> <kirpichov@google.com.invalid> wrote:
>
> > Hi Tim,
> > From what you're saying, it sounds like the Tika library has a big 
> > problem with crashes and freezes, and that applying it at scale (e.g. 
> > in the context of Beam) requires explicitly addressing this problem, 
> > e.g. accepting the fact that in many realistic applications some 
> > documents will just need to be skipped because they are unprocessable?
> > This would be the first example of a Beam IO that has this concern, so 
> > I'd like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > <tallison@mitre.org>
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to 
> > > integrate Tika with Beam so that people don't have to 1) 
> > > (re)discover the need to make their wrappers robust and then 2) 
> > > have to reinvent these wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > > with Hadoop [1].  He and other Tika users independently have wound 
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular 
> > > exceptions, OOMs and permanent hangs by asking Tika to parse a 
> > > <mock> xml [2].
> > >
> > > > However if processing the document causes the process to crash, 
> > > > then it will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >            Tim
> > >
> > > [1]
> > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > > [2]
> > > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> > >
> >
>