spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: distribute work (files)
Date Wed, 07 Sep 2016 04:20:59 GMT
To access a local file, try a file:// URI.
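A minimal illustration of the URI point (hypothetical path; the commented call assumes pyspark with an existing SparkContext `sc`):

```python
def local_uri(path):
    # Prefix an absolute local path with the file:// scheme so Spark reads
    # it from each worker's local filesystem instead of the default
    # filesystem (often HDFS when no scheme is given).
    return "file://" + path

# e.g. rdd = sc.textFile(local_uri("/home/user/data"))  # `sc` assumed
```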

On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figliozzi@gmail.com>
wrote:

> This is a great question.  Basically, you don't have to worry about the
> details; just pass a wildcard in your call to textFile.  See the Programming
> Guide <http://spark.apache.org/docs/latest/programming-guide.html> section
> entitled "External Datasets".  The Spark framework will distribute your
> data across the workers.  Note that:
>
> *If using a path on the local filesystem, the file must also be accessible
>> at the same path on worker nodes. Either copy the file to all workers or
>> use a network-mounted shared file system.*
>
>
> In your case this would mean the directory of files.
>
> Curiously, I cannot get this to work when I mount a directory with sshfs
> on all of my worker nodes.  It says "file not found" even though the file
> clearly exists at the specified path on all workers.  Anyone care to try
> and comment on this?
>
> Thanks,
>
> Pete
>
> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <icklerly@googlemail.com>
> wrote:
>
>> Hi,
>>
>> maybe this is a stupid question:
>>
>> I have a list of files. Each file I want to use as input for an
>> ML algorithm. All files are independent of one another.
>> My question now is: how do I distribute the work so that each worker takes
>> a block of files and runs the algorithm on them one by one?
>> I hope somebody can point me in the right direction! :)
>>
>> Best regards,
>> Lydia
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
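For the original question of running an independent ML step per file, one common pattern is to parallelize the list of file paths so that each Spark task handles one whole file. A minimal sketch, assuming pyspark with an existing SparkContext `sc`; `run_model` and the file paths are placeholders for the real algorithm and data:

```python
def run_model(path):
    # Placeholder for the real per-file ML step; in a cluster run this
    # function (and anything it imports) must be available on every worker.
    return (path, "done")

file_paths = ["part-0001.txt", "part-0002.txt", "part-0003.txt"]

# On a cluster: one partition per file, so workers pull whole files
# off the queue and process them one by one:
#   results = (sc.parallelize(file_paths, len(file_paths))
#                .map(run_model)
#                .collect())
# sc.wholeTextFiles("dir/*") is an alternative that yields (path, content)
# pairs, when the files are small enough to read into memory whole.

# Local stand-in showing the shape of the cluster result:
results = [run_model(p) for p in file_paths]
```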


-- 
Best Regards,
Ayan Guha
