spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Databricks fails to read the csv file with blank line at the file header
Date Sun, 27 Mar 2016 13:53:12 GMT
Pretty simple as usual it is a combination of ETL and ELT.

Basically csv files are loaded into staging directory on host, compressed
before pushing into hdfs


   1. ETL --> Get rid of the header blank line on the csv files
   2. ETL --> Compress the csv files
   3. ETL --> Put the compressed CVF files  into hdfs staging directory
   4. ELT --> Use databricks to load the csv files
   5. ELT --> Spark FP to prcess the csv data
   6. ELT --> register it as a temporary table
   7. ELT --> Create an ORC table in a named database in compressed zlib2
   format in Hive database
   8. ELT --> Insert/select from temporary table to Hive table


So the data is stored in an ORC table and one can do whatever analysis
using Spark, Hive etc



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 27 March 2016 at 03:05, Koert Kuipers <koert@tresata.com> wrote:

> To me this is expected behavior that I would not want fixed, but if you
> look at the recent commits for spark-csv it has one that deals this...
> On Mar 26, 2016 21:25, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
> wrote:
>
>>
>> Hi,
>>
>> I have a standard csv file (saved as csv in HDFS) that has first line of
>> blank at the header
>> as follows
>>
>> [blank line]
>> Date, Type, Description, Value, Balance, Account Name, Account Number
>> [blank line]
>> 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN
>> AE","'638585-60125663",
>>
>> When I read this file using the following standard
>>
>> val df =
>> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
>> "true").option("header",
>> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>>
>> it crashes.
>>
>> java.util.NoSuchElementException
>>         at java.util.ArrayList$Itr.next(ArrayList.java:794)
>>
>>  If I go and manually delete the first blank line it works OK
>>
>> val df =
>> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
>> "true").option("header",
>> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>>
>> df: org.apache.spark.sql.DataFrame = [Date: string,  Type: string,
>> Description: string,  Value: double,  Balance: double,  Account Name:
>> string,  Account Number: string]
>>
>> I can easily write a shell script to get rid of blank line. I was
>> wondering if databricks does have a flag to get rid of the first blank line
>> in csv file format?
>>
>> P.S. If the file is stored as DOS text file, this problem goes away.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>

Mime
View raw message