spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: NEW to spark and sparksql
Date Thu, 20 Nov 2014 18:43:21 GMT
I believe functions like sc.textFile will also accept paths with globs, for
example "/data/*/", which would read all the directories into a single RDD.
Under the covers I think it is just using Hadoop's FileInputFormat, in case
you want to google for the full list of supported syntax.
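A minimal PySpark sketch of that (the HDFS paths and layout here are
hypothetical, just to illustrate the glob):

    from pyspark import SparkContext

    sc = SparkContext(appName="glob-example")

    # One glob covers every hourly subdirectory for the day.
    day_rdd = sc.textFile("hdfs:///data/2014/10/01/*/*")
    print(day_rdd.count())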

On Thu, Nov 20, 2014 at 7:27 AM, Sam Flint <sam.flint@magnetic.com> wrote:

> So you are saying that, to query an entire day of data, I would need to
> create one RDD for every hour and then union them into one RDD.  After I
> have the one RDD, I would be able to query for a=2 throughout the entire day.
> Please correct me if I am wrong.
>
> Thanks
>
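A minimal PySpark sketch of that hour-by-hour approach (paths are
hypothetical):

    from pyspark import SparkContext

    sc = SparkContext(appName="hourly-union")

    # One RDD per hour, then union them into a single RDD for the day.
    hourly = [sc.textFile("hdfs:///data/2014/10/01/%02d/*" % h)
              for h in range(24)]
    day_rdd = sc.union(hourly)  # sc.union accepts a list of RDDs
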
> On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> I would use just textFile unless you actually need a guarantee that you
>> will be seeing a whole file at a time (textFile splits on newlines).
>>
>> RDDs are immutable, so you cannot add data to them.  You can, however,
>> union two RDDs, returning a new RDD that contains all the data.
>>
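A small PySpark sketch of the difference (hypothetical paths):

    # textFile: one element per line, across all matching files.
    lines = sc.textFile("hdfs:///data/2014/10/01/00/*")

    # wholeTextFiles: one (filename, entire file contents) pair per file.
    files = sc.wholeTextFiles("hdfs:///data/2014/10/01/00/*")

    # union returns a new RDD; both inputs are left unchanged.
    two_hours = lines.union(sc.textFile("hdfs:///data/2014/10/01/01/*"))
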
>> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint <sam.flint@magnetic.com>
>> wrote:
>>
>>> Michael,
>>>     Thanks for your help.  I found wholeTextFiles(), which I can use to
>>> import all the files in a directory.  I believe that would work if all the
>>> files existed in the same directory.  Currently the files come in by the
>>> hour, in a layout somewhat like ../2014/10/01/00/filename, with one file
>>> per hour.
>>>
>>> Do I create an RDD and add to it? Is that possible?  My example query
>>> would be select count(*) from (entire day RDD) where a=2.  "a" would exist
>>> in all files multiple times with multiple values.
>>>
>>> I don't see anywhere in the documentation how to import a file, create an
>>> RDD, and then import another file into that RDD, kind of like in MySQL,
>>> where you create a table, import data, and then import more data.  This may
>>> be my ignorance because I am not that familiar with Spark, but I would
>>> expect to import data into a single RDD before performing analytics on it.
>>>
>>> Thank you for your time and help on this.
>>>
>>>
>>> P.S. I am using Python, if that makes a difference.
>>>
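A sketch of that count query with Spark SQL from Python, assuming the records
have already been parsed into rows with a field named "a" (the parsing step is
application-specific and only hinted at here):

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)

    # day_rdd is assumed to be an RDD of dicts with a key "a".
    rows = day_rdd.map(lambda rec: Row(a=rec["a"]))

    schema_rdd = sqlContext.inferSchema(rows)
    schema_rdd.registerTempTable("day")

    result = sqlContext.sql("SELECT COUNT(*) FROM day WHERE a = 2").collect()
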
>>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> In general you should be able to read full directories of files as a
>>>> single RDD/SchemaRDD.  For documentation I'd suggest the programming
>>>> guides:
>>>>
>>>> http://spark.apache.org/docs/latest/quick-start.html
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>>>
>>>> For Avro in particular, I have been working on a library for Spark
>>>> SQL.  It's very early code, but you can find it here:
>>>> https://github.com/databricks/spark-avro
>>>>
>>>> Bug reports welcome!
>>>>
>>>> Michael
>>>>
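With a later Spark release (1.4+) and the spark-avro package on the classpath,
reading a whole day of hourly Avro directories from Python looks roughly like
this (a sketch; the paths and package version are illustrative):

    # e.g. spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 ...
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    day = (sqlContext.read
                     .format("com.databricks.spark.avro")
                     .load("hdfs:///data/2014/10/01/*"))
    print(day.count())
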
>>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <sam.flint@magnetic.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>     I am new to Spark.  I have begun reading to understand Spark's RDDs
>>>>> as well as Spark SQL.  My question is more about how to build out the
>>>>> RDDs and best practices.  I have data that is broken down by hour into
>>>>> files on HDFS in Avro format.  Do I need to create a separate RDD for
>>>>> each file?  Or, using Spark SQL, a separate SchemaRDD?
>>>>>
>>>>> I want to be able to pull, let's say, an entire day of data into Spark
>>>>> and run some analytics on it.  Then possibly a week, a month, etc.
>>>>>
>>>>>
>>>>> If there is documentation on this procedure or best practices for
>>>>> building RDDs, please point me to them.
>>>>>
>>>>> Thanks for your time,
>>>>>    Sam
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> MAGNE+IC
>>>
>>> Sam Flint | Lead Developer, Data Analytics
>>>
>>>
>>>
>>
>
>
> --
>
> MAGNE+IC
>
> Sam Flint | Lead Developer, Data Analytics
>
>
>
