spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: File list read into single RDD
Date Sun, 18 May 2014 18:13:28 GMT
Spark's sc.textFile() <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456>
method delegates to sc.hadoopFile(), which uses Hadoop's
FileInputFormat.setInputPaths() <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L546>
call.  Spark does not implement a storage system of its own; for
.textFile(), it simply delegates to Hadoop.

Hadoop also supports multiple URI schemes, not just hdfs:/// paths, so
you can use Spark on data in S3 via s3:/// just the same as you would
with HDFS.  See Apache's documentation on
S3 <https://wiki.apache.org/hadoop/AmazonS3> for more details.
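
For example (the bucket and paths below are made up, and for the s3n:// scheme
you would also need credentials in the Hadoop configuration, e.g.
fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey):

// Same API, different URI scheme; only the path string changes.
val hdfsLines = sc.textFile("hdfs:///logs/2014/05/*.log")
val s3Lines = sc.textFile("s3n://my-bucket/logs/2014/05/*.log")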

As far as interacting with a FileSystem (HDFS or otherwise) to list files,
delete files, navigate paths, etc. from your driver program, you should be
able to just instantiate a FileSystem object and use the normal Hadoop APIs
from there.  The Apache getting-started docs on reading/writing from Hadoop
DFS <https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample> apply the
same way to non-HDFS filesystems.
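
Here is a minimal sketch of that pattern, tying it back to the original
question in this thread (the namenode URI and paths are hypothetical;
sc.hadoopConfiguration carries the same Hadoop settings Spark itself uses):

import org.apache.hadoop.fs.{FileSystem, Path}

// Get a handle on the filesystem that owns this URI, using Spark's Hadoop config.
val fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020/"), sc.hadoopConfiguration)

// Expand a glob, keep only plain files, and collect their full paths.
val paths = fs.globStatus(new Path("/data/2014/*/part-*"))
  .filter(status => !status.isDir)
  .map(_.getPath.toString)

// sc.textFile() accepts a comma-separated list of paths, so join the list and
// read everything as a single RDD.
val rdd = sc.textFile(paths.mkString(","))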

I do think we could use a little "recipe" in our documentation to make
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind
sharing, we can format it for inclusion in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI?
> Since Spark supports several FS schemes, I’m unclear about how much to
> assume about using the Hadoop filesystem APIs and conventions. Concretely,
> if I pass in a pattern with an HTTPS filesystem, will the pattern work?
>
> How does Spark implement its storage system? This seems to be an
> abstraction level beyond what is available in HDFS. In order to preserve
> that flexibility, what APIs should I be using? It would be easy to say HDFS
> only and use the HDFS APIs, but that would seem to limit things, especially
> where you would like to read from one cluster and write to another. That is
> not so easy to do with the HDFS APIs, or is advanced beyond my knowledge.
>
> If I can stick to passing URIs to sc.textFile() I’m OK, but if I need to
> examine the structure of the filesystem, I’m unclear how to do that
> without sacrificing Spark’s flexibility.
>
> On Apr 29, 2014, at 12:55 AM, Christophe Préaud <
> christophe.preaud@kelkoo.com> wrote:
>
>  Hi,
>
> You can also use any path pattern as defined here:
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> e.g.:
>
> sc.textFile("{/path/to/file1,/path/to/file2}")
>
> Christophe.
>
> On 29/04/2014 05:07, Nicholas Chammas wrote:
>
> Not that I know of. We were discussing it on another thread and it came
> up.
>
>  I think if you look up the Hadoop FileInputFormat API (which Spark uses)
> you'll see it mentioned there in the docs.
>
>
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
>
>  But that's not obvious.
>
>  Nick
>
> On Monday, April 28, 2014, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
>> Perfect.
>>
>>  BTW just so I know where to look next time, was that in some docs?
>>
>>   On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>  Yep, as I just found out, you can also provide sc.textFile() with a
>> comma-delimited string of all the files you want to load.
>>
>> For example:
>>
>> sc.textFile("/path/to/file1,/path/to/file2")
>>
>> So once you have your list of files, concatenate their paths like that
>> and pass the single string to textFile().
>>
>> Nick
>>
>>
>> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>
>>> sc.textFile(URI) supports reading multiple files in parallel, but only
>>> with a wildcard. I need to walk a dir tree, match a regex to create a list
>>> of files, and then read them into a single RDD in parallel. I understand
>>> these could go into separate RDDs and then be combined into a union RDD.
>>> Is there a way to create a single RDD from a list of URIs?
>>
>>
>>
>>
>
