spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shubham Chaurasia <shubh.chaura...@gmail.com>
Subject DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state
Date Tue, 09 Oct 2018 06:31:25 GMT
Hi All,

--Spark built with *tags/v2.4.0-rc2*

Consider following DataSourceReader implementation:

public class MyDataSourceReader implements DataSourceReader,
SupportsScanColumnarBatch {

  StructType schema = null;
  Map<String, String> options;

  public MyDataSourceReader(Map<String, String> options) {
    System.out.println("MyDataSourceReader.MyDataSourceReader:
Instantiated...." + this);
    this.options = options;
  }

  @Override
  public List<InputPartition<ColumnarBatch>> planBatchInputPartitions() {
    //variable this.schema is null here since readSchema() was called
on a different instance
    System.out.println("MyDataSourceReader.planBatchInputPartitions: "
+ this + " schema: " + this.schema);
    //more logic......
    return null;
  }

  @Override
  public StructType readSchema() {
    //some logic to discover schema
    this.schema = (new StructType())
        .add("col1", "int")
        .add("col2", "string");
    System.out.println("MyDataSourceReader.readSchema: " + this + "
schema: " + this.schema);
    return this.schema;
  }
}

1) First readSchema() is called on MyDataSourceReader@instance1 which
sets class variable schema.
2) Now when planBatchInputPartitions() is called, it is being called
on a different instance of MyDataSourceReader and hence I am not
getting the value of schema in method planBatchInputPartitions().

How can I get value of schema which was set in readSchema() method, in
planBatchInputPartitions() method?

Console Logs:

scala> mysource.executeQuery("select * from movie").show

MyDataSourceReader.MyDataSourceReader:
Instantiated....MyDataSourceReader@59ea8f1b
MyDataSourceReader.readSchema: MyDataSourceReader@59ea8f1b schema:
StructType(StructField(col1,IntegerType,true),
StructField(col2,StringType,true))
MyDataSourceReader.MyDataSourceReader:
Instantiated....MyDataSourceReader@a3cd3ff
MyDataSourceReader.planBatchInputPartitions:
MyDataSourceReader@a3cd3ff schema: null


Thanks,
Shubham

Mime
View raw message