spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Handling windows characters with Spark CSV on Linux
Date Thu, 17 Nov 2016 13:11:21 GMT
Hi,

In the past, when using the Databricks package for CSV files, I occasionally
had to do some cleaning at the Linux directory level before ingesting a CSV
file into the HDFS staging directory for Spark to read.

I have a more generic issue that we may have to be ready for.

Assume that a provider uses FTP to push CSV files into Windows
directories. The whole solution is built around Windows and .NET.

Now you want to ingest those files into HDFS and process them with Spark
CSV.

One can create NFS directories visible to both the Windows server and HDFS.
However, there may be issues with character sets etc. What are the best
ways of handling this? One way would be to use some scripts to make these
spreadsheet-type files compatible with Linux and then load them into HDFS.
For example, I know that if I save an Excel spreadsheet file in DOS format,
that file works OK with Spark CSV. Are there tools to do this as well?
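
To make the kind of script I mean concrete, here is a rough sketch only. It
assumes the files arrive encoded as Windows-1252 with CRLF line endings; I
have not confirmed either for the real feed. The function names and the
default encoding are hypothetical choices for illustration. Note that it
also rewrites line breaks inside quoted fields, which may or may not be
wanted.

```python
from pathlib import Path


def windows_text_to_unix(data: bytes, src_encoding: str = "cp1252") -> bytes:
    """Decode Windows-encoded bytes, normalise line endings, return UTF-8.

    src_encoding is an assumption; pass the encoding the sender actually uses.
    """
    text = data.decode(src_encoding)
    # A byte-order mark survives decoding with plain "utf-8"; drop it if present.
    text = text.lstrip("\ufeff")
    # Convert CRLF (and any stray lone CR) to the Unix LF line ending.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


def convert_file(src: str, dst: str, src_encoding: str = "cp1252") -> None:
    """Read src as raw bytes, clean it, and write the result to dst."""
    cleaned = windows_text_to_unix(Path(src).read_bytes(), src_encoding)
    Path(dst).write_bytes(cleaned)
```

The cleaned file could then be copied into the HDFS staging directory as
usual.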

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
