spark-user mailing list archives

From Joseph Bradley <jos...@databricks.com>
Subject Re: Need some help to create user defined type for ML pipeline
Date Sun, 25 Jan 2015 01:26:32 GMT
Hi Jao,

You're right that defining serialize and deserialize is the main task in
implementing a UDT.  They basically translate between your native
representation (ByteImage) and SQL DataTypes.  The sqlType you defined
looks correct, and you're right to use a row of length 4.  Other than
that, it should just require copying data to and from SQL Rows.  There are
quite a few examples of that in the codebase; I'd recommend searching based
on the particular DataTypes you're using.
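
For reference, here is a minimal, self-contained sketch of that round trip.
I use a plain Array[Any] as a stand-in for Spark SQL's Row so it runs on its
own; in the real UDT you would fill a GenericMutableRow in serialize and read
the fields back with row.getInt / row.getAs in deserialize (the equivalent
calls are noted in the comments). The ByteImageCodec object and its method
names are just for this sketch, not part of any Spark API:

```scala
// Sketch of the serialize/deserialize round trip for an image UDT.
// Array[Any] stands in for Spark SQL's Row so the example is
// self-contained and runnable without Spark on the classpath.
object ByteImageCodec {
  case class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])

  // Corresponds to UserDefinedType.serialize: copy each field into the
  // row, in the same order as the StructFields declared in sqlType.
  def serialize(img: ByteImage): Array[Any] = {
    val row = new Array[Any](4)
    row(0) = img.channels   // real code: row.setInt(0, img.channels)
    row(1) = img.width      // real code: row.setInt(1, img.width)
    row(2) = img.height     // real code: row.setInt(2, img.height)
    row(3) = img.data       // real code: row.update(3, img.data)
    row
  }

  // Corresponds to UserDefinedType.deserialize: read the fields back
  // out of the row and rebuild the native representation.
  def deserialize(row: Array[Any]): ByteImage =
    ByteImage(
      row(0).asInstanceOf[Int],          // real code: row.getInt(0)
      row(1).asInstanceOf[Int],          // real code: row.getInt(1)
      row(2).asInstanceOf[Int],          // real code: row.getInt(2)
      row(3).asInstanceOf[Array[Byte]])  // real code: row.getAs[Array[Byte]](3)
}
```

Also note that in your snippet deserialize returns Vector and userClass is
Class[Vector]; since this UDT wraps images rather than vectors, those should
presumably be ByteImage (and pyUDT dropped unless you write a matching Python
class).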

Are there particular issues you're running into?

Joseph

On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa <jaonary@gmail.com>
wrote:

> Hi all,
>
> I'm trying to implement a pipeline for computer vision based on the latest
> ML package in spark. The first step of my pipeline is to decode image (jpeg
> for instance) stored in a parquet file.
> For this, I begin to create a UserDefinedType that represents a decoded
> image stored in a array of byte. Here is my first attempt :
>
> @SQLUserDefinedType(udt = classOf[ByteImageUDT])
> class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])
>
> private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {
>
>   override def sqlType: StructType = {
>     // type: 0 = sparse, 1 = dense
>     // We only use "values" for dense vectors, and "size", "indices", and "values"
>     // for sparse vectors. The "values" field is nullable because we might want to
>     // add binary vectors later, which uses "size" and "indices", but not "values".
>     StructType(Seq(
>       StructField("channels", IntegerType, nullable = false),
>       StructField("width", IntegerType, nullable = false),
>       StructField("height", IntegerType, nullable = false),
>       StructField("data", BinaryType, nullable = false)))
>   }
>
>   override def serialize(obj: Any): Row = {
>     val row = new GenericMutableRow(4)
>     val img = obj.asInstanceOf[ByteImage]
>     ...
>   }
>
>   override def deserialize(datum: Any): Vector = {
>     ....
>   }
>
>   override def pyUDT: String = "pyspark.mllib.linalg.VectorUDT"
>
>   override def userClass: Class[Vector] = classOf[Vector]
> }
>
>
> I took the VectorUDT as a starting point, but there are a lot of things that I
> don't really understand, so any help on defining the serialize and deserialize
> methods would be appreciated.
>
> Best Regards,
>
> Jao
>
>
