tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Integrating Tika with Apache Beam
Date Thu, 25 May 2017 18:30:12 GMT
Awesome!

Any tips on building Beam?  Should it work on (dare I say) Windows?

IntelliJ is complaining that it can't find jdk.tools:jdk.tools:1.6 as a dependency under many
of the Hadoop modules.

mvn clean install is failing at Beam::SDKS::Java::Core


[ERROR]   AvroIOTest.testWriteDisplayData:561
Expected: display data with item: (with key is "filePrefix" and with type is <STRING>
and with value is "/foo")
     but: found 6 non-matching item(s):
<[]org.apache.beam.sdk.io.AvroIO$Write:codec=snappy
[]org.apache.beam.sdk.io.AvroIO$Write:schema=org.apache.beam.sdk.io.AvroIOTest$GenericClass
[]org.apache.beam.sdk.io.AvroIO$Write:fileSuffix=bar
[]org.apache.beam.sdk.io.AvroIO$Write:numShards=100
[]org.apache.beam.sdk.io.AvroIO$Write:shardNameTemplate=-SS-of-NN-
[]org.apache.beam.sdk.io.AvroIO$Write:filePrefix=C:\foo>
[ERROR]   FileBasedSinkTest.testRemoveWithTempFilename:148->testRemoveTemporaryFiles:261
temp file C:\Users\tallison\AppData\Local\Temp\junit5212433513605155196\temp\file0 exists
Expected: is <false>
     but: was <true>
[ERROR]   FileBasedSourceTest.testSplittingFailsOnEmptyFileExpansion
Expected: (an instance of java.io.FileNotFoundException and exception with message a string
containing "No files found for spec: C:\Users\tallison\AppData\Local\Temp\junit1719865221821921346\junit7087025770573441186/missing.txt")
     but: an instance of java.io.FileNotFoundException <java.lang.IllegalStateException:
Unable to find registrar for c> is a java.lang.IllegalStateException
Stacktrace was: java.lang.IllegalStateException: Unable to find registrar for c
        at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)


among many other errors...
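
A guess at the "Unable to find registrar for c" failure: it looks as if the drive letter in a path like C:\foo is being read as a URI scheme. I haven't checked what FileSystems.getFileSystemInternal actually does, so the sketch below is only an assumption. The class name, the method name, and the regex (based on the RFC 3986 scheme grammar) are all hypothetical and not copied from Beam.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemeGuess {
    // Hypothetical pattern: a letter, then letters/digits/+/-/., then a colon.
    // This is an assumption for illustration, not Beam's actual code.
    private static final Pattern SCHEME =
        Pattern.compile("(?<scheme>[a-zA-Z][-a-zA-Z0-9+.]*):.*");

    // Returns the lower-cased scheme if the spec looks like "scheme:rest",
    // otherwise falls back to "file".
    static String guessScheme(String spec) {
        Matcher m = SCHEME.matcher(spec);
        return m.matches() ? m.group("scheme").toLowerCase(Locale.ROOT) : "file";
    }

    public static void main(String[] args) {
        // A Windows drive letter satisfies the scheme grammar.
        System.out.println(guessScheme("C:\\foo"));
        // A Unix-style absolute path has no colon-terminated prefix.
        System.out.println(guessScheme("/foo"));
    }
}
```

If something like this is going on, any lookup keyed on the scheme would go looking for a filesystem registered under "c" on Windows, whereas the same test on Linux would fall through to the local filesystem.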
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyozkin@gmail.com] 
Sent: Thursday, May 25, 2017 12:47 PM
To: Allison, Timothy B. <tallison@mitre.org>; dev@tika.apache.org
Subject: Re: Integrating Tika with Apache Beam

Hi Guys

The link to the initial code is available in JIRA. At this stage the focus is on preparing
a solid initial PR, and then we can all improve the Tika-related code :-)

Cheers, Sergey
On 24/05/17 11:41, Sergey Beryozkin wrote:
> Hi Tim, All,
> 
> I thought I'd start a dedicated thread.
> 
> I added some initial comments to [1], and I'm quite close now to creating 
> the initial PR.
> 
> Thanks, Sergey
> 
> [1] https://issues.apache.org/jira/browse/BEAM-2328
> On 23/05/17 17:42, Allison, Timothy B. wrote:
>> Another idea...if you have any interest, it would be great to get 
>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for 
>> our regression tests.
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>> Sent: Friday, May 19, 2017 4:21 PM
>> To: user@tika.apache.org
>> Subject: Re: Extracting Text from embedded images in PDF docs
>>
>> Hi Tim
>>
>> Sure, once I get an initial PR ready I'll send an update and I'll 
>> explain what I did for a start and we will discuss it further
>>


--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/