spark-user mailing list archives

From Jörn Franke <>
Subject Re: Handling windows characters with Spark CSV on Linux
Date Thu, 17 Nov 2016 13:20:31 GMT
You can do the character set conversion (is this the issue?) as part of your loading process
in Spark.
As far as I know, the Spark CSV package is based on Hadoop's TextInputFormat, which to the
best of my knowledge supports only UTF-8. So you have to convert from the Windows encoding to
UTF-8. If you mean locale-specific settings (number and date formats etc.), these are also not
handled by the input format.

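A minimal sketch of such a conversion step, done in plain Python as a pre-ingest step (or wired into your load pipeline) — the file paths and the cp1252 source encoding are assumptions for illustration:

```python
# Hedged sketch: stream-convert a Windows-encoded (cp1252) CSV file to UTF-8
# before handing it to Spark. Paths and source encoding are illustrative.
import codecs

def reencode(src_path, dst_path, src_enc="cp1252", dst_enc="utf-8"):
    """Read src_path in src_enc and write it out in dst_enc, line by line."""
    with codecs.open(src_path, "r", encoding=src_enc) as src, \
         codecs.open(dst_path, "w", encoding=dst_enc) as dst:
        for line in src:
            dst.write(line)
```

After this, Spark's CSV reader sees plain UTF-8 text and the character-set problem disappears.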
I started to work on the HadoopOffice library (which you can use with Spark), where you can
read Excel files directly (there is no official release yet). There you can also specify the
locale in which data values, numbers etc. should be represented when the file is read.

> On 17 Nov 2016, at 14:11, Mich Talebzadeh <> wrote:
> Hi,
> In the past, with the Databricks CSV package, I occasionally had to do some cleaning
at the Linux directory level before ingesting a CSV file into the HDFS staging directory for
Spark to read.
> I have a more generic issue that may need to be addressed.
> Assume that a provider uses FTP to push CSV files into Windows directories. The whole
solution is built around Windows and .NET.
> Now you want to ingest those files into HDFS and process them with Spark CSV.
> One can create NFS directories visible to both the Windows server and HDFS. However, there
may be issues with character sets etc. What are the best ways of handling this? One way would
be to use scripts to make these spreadsheet-type files compatible with Linux and then
load them into HDFS. For example, I know that if I save an Excel spreadsheet file in DOS format,
that file will work OK with Spark CSV. Are there tools to do this as well?
> Thanks
> Dr Mich Talebzadeh
> LinkedIn
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
or destruction of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
