spark-user mailing list archives

From Nicolas Paris
Subject streaming pdf
Date Sun, 18 Nov 2018 22:29:00 GMT

I have PDF files to load into Spark, at minimum as <filename, byte_array>
pairs. I have considered some options:

- Spark Streaming does not provide a native file stream for binary files
  of variable size (binaryRecordsStream requires a fixed record length),
  so I would have to write my own receiver.

- Structured Streaming can process Avro/Parquet/ORC files containing the
  PDFs, but packaging them that way is more complicated than monitoring
  a simple folder of PDFs.

- Kafka is not designed to handle messages larger than about 100 KB, so
  it is not a good fit for this streaming pipeline.
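
To make the "monitor a simple folder" idea concrete, here is a minimal
sketch in plain Python, independent of Spark. The helper name
poll_new_pdfs and the `seen` set are hypothetical, introduced only for
illustration; a custom receiver would presumably wrap similar logic, but
that wiring is an assumption and is not shown.

```python
import os
import tempfile
from typing import Iterator, Set, Tuple


def poll_new_pdfs(folder: str, seen: Set[str]) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, byte_array) for each PDF in `folder` not yet in `seen`.

    Hypothetical helper: `seen` is mutated so that a later call only
    returns files that appeared since the previous poll.
    """
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        is_pdf = name.lower().endswith(".pdf")
        if is_pdf and name not in seen and os.path.isfile(path):
            with open(path, "rb") as fh:
                data = fh.read()  # whole file in memory, variable size
            seen.add(name)
            yield name, data


if __name__ == "__main__":
    # Small demonstration on a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "a.pdf"), "wb") as fh:
            fh.write(b"%PDF-1.4 example")
        seen: Set[str] = set()
        for filename, payload in poll_new_pdfs(folder, seen):
            print(filename, len(payload))
```

This sketch does not guard against files that are still being written
when the folder is polled; a real pipeline would need some convention
for that (for example, moving completed files into the folder).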

Does anybody have a suggestion?


