spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Databricks fails to read the csv file with blank line at the file header
Date Mon, 28 Mar 2016 22:19:24 GMT
Could I ask which version are you using?

It looks the cause is the empty line right after header (because that case
is not being checked in tests).

However, for empty lines before the header or inside date, they are being
tested.

https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/ages.csv

https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv

So I think it might have to be able to read that case as well and this
might be an issue.

It can be simply done by ETL but I think the behaviour might have to be
consistent.

Maybe would this be better if this issue is open and discussed?
On 29 Mar 2016 6:54 a.m., "Ashok Kumar" <ashok34668@yahoo.com.invalid>
wrote:

> Thanks a ton sir. Very helpful
>
>
> On Monday, 28 March 2016, 22:36, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>
> Pretty straight forward
>
> #!/bin/ksh
> DIR="hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx"
> #
> ## Remove the blank header line from the spreadsheets and compress them
> #
> echo `date` " ""=======  Started Removing blank header line and
> Compressing all csv FILEs"
> for FILE in `ls *.csv`
> do
>   sed '1d' ${FILE} > ${FILE}.tmp
>   mv -f ${FILE}.tmp ${FILE}
>   /usr/bin/bzip2 ${FILE}
> done
> #
> ## Clear out hdfs staging directory
> #
> echo `date` " ""=======  Started deleting old files from hdfs staging
> directory ${DIR}"
> hdfs dfs -rm -r ${DIR}/*.bz2
> echo `date` " ""=======  Started Putting bz2 fileS to hdfs staging
> directory ${DIR}"
> for FILE in `ls *.bz2`
> do
>   hdfs dfs -copyFromLocal ${FILE} ${DIR}
> done
> echo `date` " ""=======  Checking that all files are moved to hdfs staging
> directory"
> hdfs dfs -ls ${DIR}
> exit 0
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
> On 28 March 2016 at 22:24, Ashok Kumar <ashok34668@yahoo.com> wrote:
>
> Hello Mich
>
> If you accommodate can you please share your approach to steps 1-3 above.
>
> Best regards
>
>
> On Sunday, 27 March 2016, 14:53, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>
> Pretty simple as usual it is a combination of ETL and ELT.
>
> Basically csv files are loaded into staging directory on host, compressed
> before pushing into hdfs
>
>
>    1. ETL --> Get rid of the header blank line on the csv files
>    2. ETL --> Compress the csv files
>    3. ETL --> Put the compressed CVF files  into hdfs staging directory
>    4. ELT --> Use databricks to load the csv files
>    5. ELT --> Spark FP to prcess the csv data
>    6. ELT --> register it as a temporary table
>    7. ELT --> Create an ORC table in a named database in compressed zlib2
>    format in Hive database
>    8. ELT --> Insert/select from temporary table to Hive table
>
>
> So the data is stored in an ORC table and one can do whatever analysis
> using Spark, Hive etc
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
> On 27 March 2016 at 03:05, Koert Kuipers <koert@tresata.com> wrote:
>
> To me this is expected behavior that I would not want fixed, but if you
> look at the recent commits for spark-csv it has one that deals this...
> On Mar 26, 2016 21:25, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
> wrote:
>
>
> Hi,
>
> I have a standard csv file (saved as csv in HDFS) that has first line of
> blank at the header
> as follows
>
> [blank line]
> Date, Type, Description, Value, Balance, Account Name, Account Number
> [blank line]
> 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN
> AE","'638585-60125663",
>
> When I read this file using the following standard
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>
> it crashes.
>
> java.util.NoSuchElementException
>         at java.util.ArrayList$Itr.next(ArrayList.java:794)
>
>  If I go and manually delete the first blank line it works OK
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>
> df: org.apache.spark.sql.DataFrame = [Date: string,  Type: string,
> Description: string,  Value: double,  Balance: double,  Account Name:
> string,  Account Number: string]
>
> I can easily write a shell script to get rid of blank line. I was
> wondering if databricks does have a flag to get rid of the first blank line
> in csv file format?
>
> P.S. If the file is stored as DOS text file, this problem goes away.
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
>
>
>
>

Mime
View raw message