tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Integrating Tika with Apache Beam
Date Thu, 25 May 2017 18:40:10 GMT
Ha....Beam doesn't work on Windows currently...

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Thursday, May 25, 2017 2:30 PM
To: Sergey Beryozkin <sberyozkin@gmail.com>; dev@tika.apache.org
Subject: RE: Integrating Tika with Apache Beam


Any tips on building Beam?  Should it work on (dare I say) Windows?

IntelliJ is complaining that it can't find jdk.tools:jdk.tools:1.6 as a dependency under many
of the Hadoop modules.

mvn clean install is failing at Beam::SDKS::Java::Core

[ERROR]   AvroIOTest.testWriteDisplayData:561
Expected: display data with item: (with key is "filePrefix" and with type is <STRING> and with value is "/foo")
     but: found 6 non-matching item(s):
[ERROR]   FileBasedSinkTest.testRemoveWithTempFilename:148->testRemoveTemporaryFiles:261 temp file C:\Users\tallison\AppData\Local\Temp\junit5212433513605155196\temp\file0 exists
Expected: is <false>
     but: was <true>
[ERROR]   FileBasedSourceTest.testSplittingFailsOnEmptyFileExpansion
Expected: (an instance of java.io.FileNotFoundException and exception with message a string containing "No files found for spec: C:\Users\tallison\AppData\Local\Temp\junit1719865221821921346\junit7087025770573441186/missing.txt")
     but: an instance of java.io.FileNotFoundException <java.lang.IllegalStateException: Unable to find registrar for c> is a java.lang.IllegalStateException Stacktrace was: java.lang.IllegalStateException: Unable to find registrar for c
        at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)

among many other errors...

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
Sent: Thursday, May 25, 2017 12:47 PM
To: Allison, Timothy B. <tallison@mitre.org>; dev@tika.apache.org
Subject: Re: Integrating Tika with Apache Beam

Hi Guys

The link to the initial code is available in JIRA. At this stage the focus is on preparing
a solid initial PR, and then we can all improve the Tika-related code :-)

Cheers, Sergey
On 24/05/17 11:41, Sergey Beryozkin wrote:
> Hi Tim, All,
> I thought I'd start a dedicated thread.
> I added some initial comments to [1]; I'm quite close now to creating 
> the initial PR.
> Thanks, Sergey
> [1] https://issues.apache.org/jira/browse/BEAM-2328
> On 23/05/17 17:42, Allison, Timothy B. wrote:
>> Another idea...if you have any interest, it would be great to get 
>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for 
>> our regression tests?
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>> Sent: Friday, May 19, 2017 4:21 PM
>> To: user@tika.apache.org
>> Subject: Re: Extracting Text from embedded images in PDF docs
>> Hi Tim
>> Sure, once I get an initial PR ready I'll send an update and I'll 
>> explain what I did for a start and we will discuss it further

Sergey Beryozkin

Talend Community Coders