spark-user mailing list archives

From Eric Beabes <mailinglist...@gmail.com>
Subject Re: Naming files while saving a Dataframe
Date Sat, 17 Jul 2021 23:13:58 GMT
Mich - You're suggesting changing the "Path". The problem is that we have an
EXTERNAL table created on top of this path, so the "Path" CANNOT change. If
it could, this would be easy to solve. My question is about changing the
"Filename".

As Ayan pointed out, Spark doesn't seem to allow "prefixes" for the
filenames!
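
One workaround that leaves the path untouched is to rename the part files
after the write has finished. What follows is only a minimal sketch, not a
Spark feature: it assumes the output is on a local filesystem reachable through
java.nio (on HDFS the analogous call would be Hadoop's FileSystem.rename), and
the object name PrefixPartFiles, the method addPrefix and the "jobA" label are
all hypothetical. The prefix is rejected if it starts with '_' or '.', since
Hive and Spark generally treat such names as hidden and skip them.

```scala
import java.io.File
import java.nio.file.{Files, Path}

object PrefixPartFiles {
  // Recursively collect files whose names start with "part-", which is how
  // Spark names its output files. Markers such as _SUCCESS are not matched.
  private def partFiles(dir: File): List[File] = {
    val children = Option(dir.listFiles).map(_.toList).getOrElse(Nil)
    children.flatMap { f =>
      if (f.isDirectory) partFiles(f)
      else if (f.getName.startsWith("part-")) List(f)
      else Nil
    }
  }

  // Rename each part file under `root` to "<prefix>-<original name>" and
  // return the new paths. `prefix` is a hypothetical per-job label.
  def addPrefix(root: File, prefix: String): List[Path] = {
    require(
      prefix.nonEmpty && !prefix.startsWith("_") && !prefix.startsWith("."),
      "prefix must not start with '_' or '.'"
    )
    partFiles(root).map { f =>
      val src = f.toPath
      Files.move(src, src.resolveSibling(s"$prefix-${f.getName}"))
    }
  }
}
```

It would be called once the save has returned, for example
PrefixPartFiles.addPrefix(new File(someDirectory), "jobA"). The partition
directories keep their names, so the table's location does not change.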

On Sat, Jul 17, 2021 at 1:58 PM Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Using this
>
> df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")
>
> That will create a parquet table in the database test, which is
> essentially a hive partition in the format
>
> /user/hive/warehouse/test.db/abcd/000000_0
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 17 Jul 2021 at 20:45, Eric Beabes <mailinglists19@gmail.com>
> wrote:
>
>> I am not sure if you've understood the question. Here's how we're saving
>> the DataFrame:
>>
>> df
>>   .coalesce(numFiles)
>>   .write
>>   .partitionBy(partitionDate)
>>   .mode("overwrite")
>>   .format("parquet")
>>   .save(someDirectory)
>>
>>
>> Now where would I add a 'prefix' in this one?
>>
>>
>> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Try it and see if it works:
>>>
>>> fullyQualifiedTableName = appName+'_'+tableName
>>>
>>> On Sat, 17 Jul 2021 at 18:02, Eric Beabes <mailinglists19@gmail.com>
>>> wrote:
>>>
>>>> I don't think Spark allows adding a 'prefix' to the file name, does it?
>>>> If it does, please tell me how. Thanks.
>>>>
>>>> On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Jobs have names in spark. You can prefix it to the file name when
>>>>> writing to directory I guess
>>>>>
>>>>>  val sparkConf = new SparkConf().
>>>>>                setAppName(sparkAppName).
>>>>>
>>>>> On Sat, 17 Jul 2021 at 17:40, Eric Beabes <mailinglists19@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Reason we've two jobs writing to the same directory is that the data
>>>>>> is partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe the
>>>>>> only way to do this is to create an hourly partition (/yyyymmdd/hh).
>>>>>> Is that the only way to solve this?
>>>>>>
>>>>>> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.ayan@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> IMHO - this is a bad idea esp in failure scenarios.
>>>>>>>
>>>>>>> How about creating a subfolder each for the jobs?
>>>>>>>
>>>>>>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <
>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>
>>>>>>>> We've two (or more) jobs that write data into the same directory
>>>>>>>> via a Dataframe.save method. We need to be able to figure out which
>>>>>>>> job wrote which file. Maybe provide a 'prefix' to the file names. I
>>>>>>>> was wondering if there's any 'option' that allows us to do this.
>>>>>>>> Googling didn't come up with any solution so thought of asking the
>>>>>>>> Spark experts on this mailing list.
>>>>>>>>
>>>>>>>> Thanks in advance.
>>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>
