spark-user mailing list archives

From Guillermo Ortiz Fernández
Subject Parse RDD[Seq[String]] to DataFrame with types.
Date Mon, 15 Jul 2019 22:52:36 GMT
I'm trying to convert an RDD[Seq[String]] to a DataFrame.
Although it's a Seq of Strings, the values could have a more specific type such as Int,
Boolean, Double, String and so on.
For example, two lines could be:
"hello", "1", "bye", "1.1"
"hello1", "11", "bye1", "2.1"

The first column is always a String, the second an Int, and so on, and
this does not change within an execution. On the other hand, one execution could
have sequences of five elements and another could have sequences of 2000, so it
depends on the execution, but in each execution I know the type of each
"column" or "elem" of the sequence.

To do it, I could have something like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// I could have a parameter to generate the StructType dynamically.
def getSchema(): StructType = {
  val schemaArray = scala.collection.mutable.ArrayBuffer[StructField]()
  schemaArray += StructField("col1", IntegerType, true)
  schemaArray += StructField("col2", StringType, true)
  schemaArray += StructField("col3", DoubleType, true)
  StructType(schemaArray.toSeq)
}

// A Seq of Any?? It doesn't seem the best option!!
val l1: Seq[Any] = Seq(1, "2", 1.1)
// parallelize needs a collection of rows, so wrap the single row in a Seq
val rdd1 = sc.parallelize(Seq(l1)).map(Row.fromSeq(_))

val schema = getSchema()
val df = sqlContext.createDataFrame(rdd1, schema)

I don't like having a Seq of Any at all, but it's really what I have.
Is there another option?
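
One possible way to keep the Seq[Any] out of sight is to cast each string according to the type declared for its position and build the Row from the result. The following is only a sketch under assumptions: schemaFor, castValue and toRow are hypothetical helper names, the column names col1, col2, ... are made up, and only a handful of types are handled.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical helper: build the schema from the per-execution list of types.
// Column names (col1, col2, ...) are an assumption for illustration.
def schemaFor(types: Seq[DataType]): StructType =
  StructType(types.zipWithIndex.map { case (t, i) =>
    StructField(s"col${i + 1}", t, nullable = true)
  })

// Hypothetical helper: convert one string to the JVM value Spark expects
// for the given type. Only a few types are covered; malformed input throws.
def castValue(raw: String, dataType: DataType): Any = dataType match {
  case StringType  => raw
  case IntegerType => raw.trim.toInt
  case LongType    => raw.trim.toLong
  case DoubleType  => raw.trim.toDouble
  case BooleanType => raw.trim.toBoolean
  case other       => throw new IllegalArgumentException(s"Unsupported type: $other")
}

// Convert one Seq[String] into a Row whose values match the schema.
def toRow(values: Seq[String], schema: StructType): Row =
  Row.fromSeq(values.zip(schema.fields).map { case (v, f) => castValue(v, f.dataType) })

// Usage, assuming `rdd: RDD[Seq[String]]` and `sqlContext` already exist:
// val schema = schemaFor(Seq(StringType, IntegerType, StringType, DoubleType))
// val df = sqlContext.createDataFrame(rdd.map(toRow(_, schema)), schema)
```

The values still travel as Any inside the Row, since that is what Row.fromSeq accepts, but the conversion is driven by the schema rather than written by hand for each column.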

On the other hand, I was thinking that what I have is similar to a CSV, so I
could create one. Spark has a library that reads a CSV and returns a
DataFrame with inferred types. Is it possible to call it if I already
have an RDD[String]?
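
A minimal sketch of the CSV idea, assuming Spark 2.2 or later, where DataFrameReader.csv also accepts a Dataset[String]. The function name csvLinesToDataFrame is hypothetical, and the sketch assumes a SparkSession is available rather than the older sqlContext.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}

// Hypothetical helper: let Spark's CSV reader infer column types from
// CSV lines that are already held in an RDD[String].
def csvLinesToDataFrame(spark: SparkSession, lines: RDD[String]): DataFrame = {
  // The CSV reader takes a Dataset[String], so wrap the RDD first.
  val ds: Dataset[String] = spark.createDataset(lines)(Encoders.STRING)
  spark.read
    .option("header", "false")      // the lines carry no header row
    .option("inferSchema", "true")  // inference needs an extra pass over the data
    .csv(ds)
}

// Usage, assuming `rdd: RDD[Seq[String]]` and `spark` already exist:
// val df = csvLinesToDataFrame(spark, rdd.map(_.mkString(",")))
```

Joining with mkString(",") is only safe if no value contains a comma, a quote or a newline; otherwise the values would need proper CSV quoting first. With no header the columns get the default names _c0, _c1, and so on, and since the types are already known per execution, passing an explicit schema to the reader instead of inferring would avoid the extra pass.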
