Well, I am not so sure about the use cases, but what about using StreamingContext.fileStream?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
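Untested sketch: fileStream needs a Hadoop InputFormat, and for whole PDFs you
would have to write one yourself. WholeFileInputFormat below is hypothetical (a
custom, non-splittable FileInputFormat[Text, BytesWritable] that emits one
(filename, file bytes) record per file); the paths are made up too:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("pdf-file-stream")
val ssc = new StreamingContext(conf, Seconds(60))

// WholeFileInputFormat is hypothetical: a custom, non-splittable
// FileInputFormat[Text, BytesWritable] returning each file as a single record.
val pdfs = ssc.fileStream[Text, BytesWritable, WholeFileInputFormat](
  "hdfs:///incoming/pdfs",                       // directory to monitor (made-up path)
  (path: Path) => path.getName.endsWith(".pdf"), // only pick up *.pdf files
  newFilesOnly = true)

// Turn each record into the <filename, byte_array> pair you asked for
pdfs.map { case (name, bytes) => (name.toString, bytes.copyBytes()) }
  .foreachRDD { rdd =>
    rdd.foreach { case (filename, content) =>
      println(s"$filename: ${content.length} bytes") // placeholder for real ingestion
    }
  }

ssc.start()
ssc.awaitTermination()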


On 19.11.2018 at 09:22, Nicolas Paris <nicolas.paris@riseup.net> wrote:

On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
Why does it have to be a stream?


Right now I manage the pipelines as Spark batch processing. Moving to
streaming would add some improvements, such as:
- simplification of the pipeline
- more frequent data ingestion
- better resource management (?)


On 18.11.2018 at 23:29, Nicolas Paris <nicolas.paris@riseup.net> wrote:

Hi

I have PDFs to load into Spark in at least <filename, byte_array>
format. I have considered some options:

- Spark Streaming does not provide a native file stream for binary files
  of variable size (binaryRecordsStream requires a constant record
  length), so I would have to write my own receiver.

- Structured Streaming can process avro/parquet/orc files containing
  PDFs, but this makes things more complicated than monitoring a simple
  folder of PDFs (a rough sketch follows this list).

- Kafka is not designed to handle messages larger than ~100 KB, so it is
  not a good option for this stream pipeline.
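To illustrate the Structured Streaming option: a rough sketch, assuming an
upstream job already packs the PDFs into parquet files with a "filename"
(string) and "content" (binary) column; paths and column names are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{BinaryType, StringType, StructType}

val spark = SparkSession.builder.appName("pdf-structured-stream").getOrCreate()

// File sources need an explicit schema by default in Structured Streaming
val schema = new StructType()
  .add("filename", StringType)
  .add("content", BinaryType)

val pdfs = spark.readStream
  .schema(schema)
  .parquet("hdfs:///incoming/parquet")  // monitored folder of parquet files

val query = pdfs.writeStream
  .format("parquet")
  .option("path", "hdfs:///warehouse/pdfs")                 // made-up sink path
  .option("checkpointLocation", "hdfs:///checkpoints/pdfs")
  .start()

query.awaitTermination()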

Does somebody have a suggestion?

Thanks,

--
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org