spark-user mailing list archives

From Everett Anderson <ever...@nuna.com.INVALID>
Subject Re: Best way to go from RDD<String> to DataFrame of StringType columns
Date Fri, 17 Jun 2016 20:02:15 GMT
On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Are these mainly in csv format?
>

Alas, no -- lots of different formats. Many are fixed-width files, where I
have outside information that tells me which byte ranges correspond to which
columns. Some have odd null representations or non-comma delimiters (though
many of those cases might fit within the configurability of the spark-csv
package).
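
For the fixed-width case, here is a minimal sketch of the kind of split
function I have in mind. The class name, the example ranges, and the sample
line are all made up for illustration; it also assumes a single-byte encoding
(so character offsets match byte offsets) and that trimming the padding is
acceptable, neither of which may hold for every file.

```java
import java.util.Arrays;

public class FixedWidthSplitter {
    // Hypothetical layout: each entry is {startInclusive, endExclusive}.
    // Offsets are in characters; this assumes a single-byte encoding so
    // they line up with the byte ranges known from outside information.
    static final int[][] EXAMPLE_RANGES = { {0, 5}, {5, 15}, {15, 18} };

    // Slice one line into string columns. Ranges that run past the end of
    // a short line are clamped, so missing trailing columns become "".
    static String[] split(String line, int[][] ranges) {
        String[] columns = new String[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            int start = Math.min(ranges[i][0], line.length());
            int end = Math.min(ranges[i][1], line.length());
            // Trimming the padding is an assumption; drop it if whitespace matters.
            columns[i] = line.substring(start, end).trim();
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "00042Jane Doe  NYC"; // made-up sample record
        System.out.println(Arrays.toString(split(line, EXAMPLE_RANGES)));
    }
}
```

The resulting String[] is what I would hand to RowFactory.create() inside the
map() call from my original plan, with one StringType field per range in the
schema.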





>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 June 2016 at 20:38, Everett Anderson <everett@nuna.com.invalid>
> wrote:
>
>> Hi,
>>
>> I have a system with files in a variety of non-standard input formats,
>> though they're generally flat text files. I'd like to dynamically create
>> DataFrames of string columns.
>>
>> What's the best way to go from an RDD<String> to a DataFrame of StringType
>> columns?
>>
>> My current plan is
>>
>>    - Call map() on the RDD<String> with a function that splits each String
>>    into columns and calls RowFactory.create() with the resulting array,
>>    creating an RDD<Row>
>>    - Construct a StructType schema using column names and StringType
>>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>
>> Does that make sense?
>>
>> I looked through the spark-csv package a little and noticed that it's
>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>> restricted developer API. Anyone know if it's recommended for use?
>>
>> Thanks!
>>
>> - Everett
>>
>>
>
