tika-dev mailing list archives

From Sergey Beryozkin <sberyoz...@gmail.com>
Subject Re: Integrating Tika with Apache Beam
Date Thu, 25 May 2017 20:30:16 GMT
Hi Tim

I just used 'mvn install -DskipTests=true' to quickly build it, and did 
'mvn clean install' inside the tika module.

I use Eclipse. The Beam docs on how to set it up are good, except that it 
has not quite worked for me yet for all of Beam; I only managed to import 
the individual Tika module.
Cheers, Sergey
On 25/05/17 19:30, Allison, Timothy B. wrote:
> Awesome!
> 
> Any tips on building Beam?  Should it work on (dare I say) Windows?
> 
> Intellij is complaining that it can't find jdk.tools:jdk.tools:1.6 as a dependency under much of the Hadoop modules.
> 
> mvn clean install is failing at Beam::SDKS::Java::Core
> 
> 
> [ERROR]   AvroIOTest.testWriteDisplayData:561
> Expected: display data with item: (with key is "filePrefix" and with type is <STRING> and with value is "/foo")
>       but: found 6 non-matching item(s):
> <[]org.apache.beam.sdk.io.AvroIO$Write:codec=snappy
> []org.apache.beam.sdk.io.AvroIO$Write:schema=org.apache.beam.sdk.io.AvroIOTest$GenericClass
> []org.apache.beam.sdk.io.AvroIO$Write:fileSuffix=bar
> []org.apache.beam.sdk.io.AvroIO$Write:numShards=100
> []org.apache.beam.sdk.io.AvroIO$Write:shardNameTemplate=-SS-of-NN-
> []org.apache.beam.sdk.io.AvroIO$Write:filePrefix=C:\foo>
> [ERROR]   FileBasedSinkTest.testRemoveWithTempFilename:148->testRemoveTemporaryFiles:261 temp file C:\Users\tallison\AppData\Local\Temp\junit5212433513605155196\temp\file0 exists
> Expected: is <false>
>       but: was <true>
> [ERROR]   FileBasedSourceTest.testSplittingFailsOnEmptyFileExpansion
> Expected: (an instance of java.io.FileNotFoundException and exception with message a string containing "No files found for spec: C:\Users\tallison\AppData\Local\Temp\junit1719865221821921346\junit7087025770573441186/missing.txt")
>       but: an instance of java.io.FileNotFoundException <java.lang.IllegalStateException: Unable to find registrar for c> is a java.lang.IllegalStateException
> Stacktrace was: java.lang.IllegalStateException: Unable to find registrar for c
>          at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
>          at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)
> 
> 
> among many other errors...
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, May 25, 2017 12:47 PM
> To: Allison, Timothy B. <tallison@mitre.org>; dev@tika.apache.org
> Subject: Re: Integrating Tika with Apache Beam
> 
> Hi Guys
> 
> The link to the initial code is available in JIRA, at this stage the focus is on preparing a solid initial PR, and then we can all improve Tika related code :-)
> 
> Cheers, Sergey
> On 24/05/17 11:41, Sergey Beryozkin wrote:
>> Hi Tim, All,
>>
>> I thought I'd start a dedicated thread.
>>
>> I added some initial comments to [1], I'm quite close now to creating
>> the initial PR.
>>
>> Thanks, Sergey
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>> On 23/05/17 17:42, Allison, Timothy B. wrote:
>>> Another idea...if you have any interest, it would be great to get
>>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for
>>> our regression tests?
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>>> Sent: Friday, May 19, 2017 4:21 PM
>>> To: user@tika.apache.org
>>> Subject: Re: Extracting Text from embedded images in PDF docs
>>>
>>> Hi Tim
>>>
>>> Sure, once I get an initial PR ready I'll send an update and I'll
>>> explain what I did for a start and we will discuss it further
>>>
> 
> 
> --
> Sergey Beryozkin
> 
> Talend Community Coders
> http://coders.talend.com/
> 


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
