spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: streaming pdf
Date Tue, 20 Nov 2018 07:07:48 GMT
You would also have to write your own input format, but that is not very complicated (and it is
probably recommended anyway for the PDF case).
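
A minimal sketch of what such an input format could look like, assuming the new Hadoop
mapreduce API and one record per file. The names WholeFileInputFormat, WholeFileRecordReader
and PdfStream, the directory path, and the 30-second batch interval are made up for
illustration. It also assumes each PDF fits in a single byte array (under 2 GB) and that files
are moved atomically into the monitored directory.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IOUtils, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical input format: never splits a file, so each PDF becomes one record.
class WholeFileInputFormat extends FileInputFormat[Text, BytesWritable] {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new WholeFileRecordReader
}

// Hypothetical record reader: emits a single <path, file contents> pair per split.
class WholeFileRecordReader extends RecordReader[Text, BytesWritable] {
  private var split: FileSplit = _
  private var context: TaskAttemptContext = _
  private val key = new Text()
  private val value = new BytesWritable()
  private var processed = false

  override def initialize(s: InputSplit, c: TaskAttemptContext): Unit = {
    split = s.asInstanceOf[FileSplit]
    context = c
  }

  override def nextKeyValue(): Boolean =
    if (processed) {
      false
    } else {
      val path = split.getPath
      val fs = path.getFileSystem(context.getConfiguration)
      // Assumption: the whole file fits in one array (length < 2 GB).
      val contents = new Array[Byte](split.getLength.toInt)
      val in = fs.open(path)
      try IOUtils.readFully(in, contents, 0, contents.length)
      finally IOUtils.closeStream(in)
      key.set(path.toString)
      value.set(contents, 0, contents.length)
      processed = true
      true
    }

  override def getCurrentKey: Text = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}

object PdfStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pdf-stream")
    val ssc = new StreamingContext(conf, Seconds(30)) // batch interval is an arbitrary choice

    // The directory is a placeholder. Writables are converted to plain
    // (String, Array[Byte]) pairs right away because they are not serializable.
    val pdfs = ssc
      .fileStream[Text, BytesWritable, WholeFileInputFormat]("hdfs:///incoming/pdfs")
      .map { case (name, bytes) => (name.toString, bytes.copyBytes()) }

    pdfs.foreachRDD { rdd =>
      rdd.foreach { case (name, bytes) => println(s"$name: ${bytes.length} bytes") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The map step right after fileStream is deliberate: it gives the <filename, byte_array> pairs
asked for in the original question and avoids carrying Hadoop Writable objects further down
the pipeline.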

> On 20.11.2018 at 08:06, Jörn Franke <jornfranke@gmail.com> wrote:
> 
> Well, I am not so sure about the use cases, but what about using StreamingContext.fileStream?
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
> 
> 
>> On 19.11.2018 at 09:22, Nicolas Paris <nicolas.paris@riseup.net> wrote:
>> 
>>> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
>>> Why does it have to be a stream?
>>> 
>> 
>> Right now I manage the pipelines as Spark batch processing. Moving to
>> streaming would bring some improvements, such as:
>> - simplification of the pipeline
>> - more frequent data ingestion
>> - better resource management (?)
>> 
>> 
>>> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
>>> Why does it have to be a stream?
>>> 
>>>> On 18.11.2018 at 23:29, Nicolas Paris <nicolas.paris@riseup.net> wrote:
>>>> 
>>>> Hi
>>>> 
>>>> I have PDFs to load into Spark, at least in <filename, byte_array>
>>>> format. I have considered some options:
>>>> 
>>>> - Spark Streaming does not provide a native file stream for binary files
>>>> of variable size (binaryRecordsStream requires a constant record length),
>>>> so I would have to write my own receiver.
>>>> 
>>>> - Structured Streaming can process avro/parquet/orc files
>>>> containing PDFs, but this makes things more complicated than
>>>> monitoring a simple folder containing PDFs.
>>>> 
>>>> - Kafka is not designed to handle messages > 100KB, and for this reason
>>>> it is not a good option to use in the stream pipeline.
>>>> 
>>>> Does anybody have a suggestion?
>>>> 
>>>> Thanks,
>>>> 
>>>> -- 
>>>> nicolas
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>> 
>>> 
>> 
>> -- 
>> nicolas
>> 
