spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: streaming pdf
Date Mon, 19 Nov 2018 06:23:10 GMT
Why does it have to be a stream?
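
If a periodic batch job is acceptable, a minimal sketch could look like the one below. It relies on PySpark's SparkContext.binaryFiles, which reads each file in a directory as a single (path, content) record, so file size can vary. The helper names (is_pdf, to_record, load_pdfs) and the folder path are hypothetical and only for illustration. This is an untested outline, not a recommendation for any particular setup.

```python
def is_pdf(path):
    """True if the path or URI ends in .pdf (case-insensitive)."""
    return path.lower().endswith(".pdf")


def to_record(path, content):
    """Turn a (uri, content) pair into a (file name, raw bytes) pair."""
    return path.rsplit("/", 1)[-1], bytes(content)


def load_pdfs(sc, folder):
    """Read every PDF under `folder` as one (file name, bytes) record.

    `sc` is an existing pyspark SparkContext (assumption: created elsewhere);
    `folder` is a hypothetical location such as "hdfs:///data/pdfs".
    """
    return (
        sc.binaryFiles(folder)                     # RDD of (uri, file content)
        .filter(lambda kv: is_pdf(kv[0]))          # keep only PDF files
        .map(lambda kv: to_record(kv[0], kv[1]))   # (file name, bytes)
    )
```

Each whole file ends up in memory on one executor, so this only suits PDFs that fit comfortably there. The job would also need its own bookkeeping (for example, moving processed files to another folder) to avoid reading the same PDFs twice.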

> On 18.11.2018 at 23:29, Nicolas Paris <nicolas.paris@riseup.net> wrote:
> 
> Hi
> 
> I have PDFs to load into Spark, at least in <filename, byte_array>
> format. I have considered some options:
> 
> - Spark Streaming does not provide a native file stream for binary files
>  of variable size (binaryRecordsStream requires a constant record length),
>  and I would have to write my own receiver.
> 
> - Structured Streaming can process Avro/Parquet/ORC files
>  containing PDFs, but this makes things more complicated than
>  monitoring a simple folder containing PDFs.
> 
> - Kafka is not designed to handle messages larger than 100 KB, so it is
>  not a good option for the stream pipeline.
> 
> Does anybody have a suggestion?
> 
> Thanks,
> 
> -- 
> nicolas
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 
