spark-user mailing list archives

From Eric Beabes <mailinglist...@gmail.com>
Subject Re: Naming files while saving a Dataframe
Date Sat, 17 Jul 2021 19:44:47 GMT
I am not sure if you've understood the question. Here's how we're saving
the DataFrame:

df
  .coalesce(numFiles)
  .write
  .partitionBy(partitionDate)
  .mode("overwrite")
  .format("parquet")
  .save(someDirectory)


Now where would I add a 'prefix' in this one?
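
A minimal sketch of the per-job subfolder idea suggested further down this thread, as one way around the missing prefix: rather than renaming files, each job saves under its own `job=<name>` directory, so the path itself records which job wrote a file. `JobOutput`, `jobDir`, and the `job=` naming are hypothetical names chosen for illustration, not Spark APIs; the `key=value` form is assumed here only because it matches the directory layout that `partitionBy` already produces.

```scala
object JobOutput {
  // Hypothetical helper (not part of Spark): returns "<baseDir>/job=<jobName>".
  // A trailing slash on baseDir is dropped so the result has no double slash.
  def jobDir(baseDir: String, jobName: String): String = {
    require(jobName.nonEmpty && !jobName.contains("/"),
      "jobName must be a single, non-empty path segment")
    s"${baseDir.stripSuffix("/")}/job=$jobName"
  }
}
```

The result would be passed to `.save(...)` in place of `someDirectory` in the snippet above, e.g. `.save(JobOutput.jobDir(someDirectory, sparkAppName))`, leaving the rest of the write chain unchanged.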


On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Try it and see if it works:
>
> fullyQualifiedTableName = appName+'_'+tableName
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 17 Jul 2021 at 18:02, Eric Beabes <mailinglists19@gmail.com>
> wrote:
>
>> I don't think Spark allows adding a 'prefix' to the file name, does it?
>> If it does, please tell me how. Thanks.
>>
>> On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Jobs have names in Spark. You could prefix the job name to the file name
>>> when writing to the directory, I guess.
>>>
>>>  val sparkConf = new SparkConf()
>>>                .setAppName(sparkAppName)
>>>
>>>
>>> On Sat, 17 Jul 2021 at 17:40, Eric Beabes <mailinglists19@gmail.com>
>>> wrote:
>>>
>>>> The reason we have two jobs writing to the same directory is that the
>>>> data is partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe
>>>> we need to create an hourly partition (/yyyymmdd/hh). Is that the only
>>>> way to solve this?
>>>>
>>>> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.ayan@gmail.com> wrote:
>>>>
>>>>> IMHO this is a bad idea, especially in failure scenarios.
>>>>>
>>>>> How about creating a subfolder each for the jobs?
>>>>>
>>>>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <mailinglists19@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We've two (or more) jobs that write data into the same directory via
>>>>>> a Dataframe.save method. We need to be able to figure out which job
>>>>>> wrote which file. Maybe provide a 'prefix' to the file names. I was
>>>>>> wondering if there's any 'option' that allows us to do this. Googling
>>>>>> didn't come up with any solution so thought of asking the Spark
>>>>>> experts on this mailing list.
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
