spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jaonary Rabarisoa <>
Subject Need some help to create user defined type for ML pipeline
Date Mon, 19 Jan 2015 08:59:55 GMT
Hi all,

I'm trying to implement a pipeline for computer vision based on the latest
ML package in spark. The first step of my pipeline is to decode image (jpeg
for instance) stored in a parquet file.
For this, I begin to create a UserDefinedType that represents a decoded
image stored in a array of byte. Here is my first attempt :

*@SQLUserDefinedType(udt = classOf[ByteImageUDT])class
ByteImage(channels: Int, width: Int, height: Int, data:
Array[Byte])private[spark] class ByteImageUDT extends
UserDefinedType[ByteImage] {  override def sqlType: StructType = {
// type: 0 = sparse, 1 = dense    // We only use "values" for dense
vectors, and "size", "indices", and "values" for sparse    // vectors.
The "values" field is nullable because we might want to add binary
vectors later,    // which uses "size" and "indices", but not
"values".    StructType(Seq(      StructField("channels", IntegerType,
nullable = false),      StructField("width", IntegerType, nullable =
false),      StructField("height", IntegerType, nullable = false),
 StructField("data", BinaryType, nullable = false)  }  override def
serialize(obj: Any): Row = {    val row = new GenericMutableRow(4)
val img = obj.asInstanceOf[ByteImage]*

*...  }  override def deserialize(datum: Any): Vector = {  *


*    }  }  override def pyUDT: String =
"pyspark.mllib.linalg.VectorUDT"  override def userClass:
Class[Vector] = classOf[Vector]}*

I take the VectorUDT as a starting point but there's a lot of thing
that I don't really understand. So any help on defining serialize and
deserialize methods will be appreciated.

Best Regards,


View raw message