spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sunil Parmar <sunilosu...@gmail.com>
Subject Re: [Beginner] How to save Kafka Dstream data to parquet ?
Date Tue, 06 Mar 2018 05:48:28 GMT
We use Impala to access parquet files in the directories. Any pointers on
achieving at least once semantic with spark streaming or partial files ?

Sunil Parmar

On Fri, Mar 2, 2018 at 2:57 PM, Tathagata Das <tathagata.das1565@gmail.com>
wrote:

> Structured Streaming's file sink solves these problems by writing a
> log/manifest of all the authoritative files written out (for any format).
> So if you run batch or interactive queries on the output directory with
> Spark, it will automatically read the manifest and only process files are
> that are in the manifest, thus skipping any partial files, etc.
>
>
>
> On Fri, Mar 2, 2018 at 1:37 PM, Sunil Parmar <sunilosunil@gmail.com>
> wrote:
>
>> Is there a way to get finer control over file writing in parquet file
>> writer ?
>>
>> We've an streaming application using Apache Apex ( on path of migration
>> to Spark ...story for a different thread). The existing streaming
>> application read JSON from Kafka and writes Parquet to HDFS. We're trying
>> to deal with partial files by writing .tmp files and renaming them as the
>> last step. We only commit offset after rename is successful. This way we
>> get at least once semantic and partial file write issue.
>>
>> Thoughts ?
>>
>>
>> Sunil Parmar
>>
>> On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das <
>> tathagata.das1565@gmail.com> wrote:
>>
>>> There is no good way to save to parquet without causing downstream
>>> consistency issues.
>>> You could use foreachRDD to get each RDD, convert it to
>>> DataFrame/Dataset, and write out as parquet files. But you will later run
>>> into issues with partial files caused by failures, etc.
>>>
>>>
>>> On Wed, Feb 28, 2018 at 11:09 AM, karthikus <aswin88us@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a Kafka stream data and I need to save the data in parquet format
>>>> without using Structured Streaming (due to the lack of Kafka Message
>>>> header
>>>> support).
>>>>
>>>> val kafkaStream =
>>>>       KafkaUtils.createDirectStream(
>>>>         streamingContext,
>>>>         LocationStrategies.PreferConsistent,
>>>>         ConsumerStrategies.Subscribe[String, String](
>>>>           topics,
>>>>           kafkaParams
>>>>         )
>>>>       )
>>>>     // process the messages
>>>>     val messages = kafkaStream.map(record => (record.key, record.value))
>>>>     val lines = messages.map(_._2)
>>>>
>>>> Now, how do I save it as parquet ? All the examples that I have come
>>>> across
>>>> uses SQLContext which is deprecated. ! Any help appreciated !
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>
>>>>
>>>
>>
>

Mime
View raw message