spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Naming files while saving a Dataframe
Date Sun, 18 Jul 2021 07:44:19 GMT
Spark depends heavily on Hadoop for writing files. You can try setting the Hadoop property mapreduce.output.basename:


https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--
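
Something along these lines might work - a rough, untested sketch (note that Spark's built-in parquet writer may still apply its own part-file naming depending on the output committer, so this property is not guaranteed to take effect for DataFrame writes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Set the property on the shared Hadoop configuration before the write.
// "jobA" is just a placeholder for whatever prefix identifies the job.
spark.sparkContext.hadoopConfiguration.set("mapreduce.output.basename", "jobA")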


> On 18.07.2021 at 01:15, Eric Beabes <mailinglists19@gmail.com> wrote:
> 
> 
> Mich - You're suggesting changing the "Path". The problem is that we have an EXTERNAL
> table created on top of this path, so the "Path" CANNOT change. If we could change it,
> this problem would be easy to solve. My question is about changing the "Filename".
> 
> As Ayan pointed out, Spark doesn't seem to allow "prefixes" for the filenames!
> 
>> On Sat, Jul 17, 2021 at 1:58 PM Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> Use this:
>> 
>> df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")
>> 
>> That will create a parquet table in the database test, which is essentially a Hive
>> table directory in the format
>> 
>> /user/hive/warehouse/test.db/abcd/000000_0
>> 
>>> On Sat, 17 Jul 2021 at 20:45, Eric Beabes <mailinglists19@gmail.com> wrote:
>>> I am not sure if you've understood the question. Here's how we're saving the DataFrame:
>>> 
>>> df
>>>   .coalesce(numFiles)
>>>   .write
>>>   .partitionBy(partitionDate)
>>>   .mode("overwrite")
>>>   .format("parquet")
>>>   .save(someDirectory)
>>> 
>>> Now where would I add a 'prefix' in this one?
>>> 
>>>> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>> Try this and see if it works:
>>>> 
>>>> fullyQualifiedTableName = appName+'_'+tableName
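>>>> 
>>>> and then, as a rough sketch, use the prefixed name when saving:
>>>> 
>>>> df.write.mode("overwrite").format("parquet").saveAsTable(fullyQualifiedTableName)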
>>>> 
>>>>> On Sat, 17 Jul 2021 at 18:02, Eric Beabes <mailinglists19@gmail.com> wrote:
>>>>> I don't think Spark allows adding a 'prefix' to the file name, does it?
>>>>> If it does, please tell me how. Thanks.
>>>>> 
>>>>>> On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>>> Jobs have names in Spark. You could prefix the app name to the file name when
>>>>>> writing to the directory, I guess:
>>>>>> 
>>>>>> import org.apache.spark.SparkConf
>>>>>> val sparkConf = new SparkConf().setAppName(sparkAppName)
>>>>>> 
>>>>>>> On Sat, 17 Jul 2021 at 17:40, Eric Beabes <mailinglists19@gmail.com> wrote:
>>>>>>> The reason we have two jobs writing to the same directory is that the data is
>>>>>>> partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe the only way to
>>>>>>> do this is to create an hourly partition (/yyyymmdd/hh). Is that the only way
>>>>>>> to solve this?
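>>>>>>> 
>>>>>>> Something like this, maybe - a sketch, assuming an event timestamp column (here
>>>>>>> called "eventTime") and the existing "yyyymmdd" partition column:
>>>>>>> 
>>>>>>> import org.apache.spark.sql.functions.{col, date_format}
>>>>>>> 
>>>>>>> // derive an hour-of-day column and partition by day and hour
>>>>>>> df.withColumn("hh", date_format(col("eventTime"), "HH"))
>>>>>>>   .write
>>>>>>>   .partitionBy("yyyymmdd", "hh")
>>>>>>>   .mode("overwrite")
>>>>>>>   .format("parquet")
>>>>>>>   .save(someDirectory)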
>>>>>>> 
>>>>>>>> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.ayan@gmail.com> wrote:
>>>>>>>> IMHO - this is a bad idea, especially in failure scenarios.
>>>>>>>> 
>>>>>>>> How about creating a subfolder for each job?
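>>>>>>>> 
>>>>>>>> e.g. a rough sketch, with jobName standing in for each job's identifier:
>>>>>>>> 
>>>>>>>> df.write
>>>>>>>>   .partitionBy(partitionDate)
>>>>>>>>   .mode("overwrite")
>>>>>>>>   .format("parquet")
>>>>>>>>   .save(s"$someDirectory/$jobName")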
>>>>>>>> 
>>>>>>>>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <mailinglists19@gmail.com> wrote:
>>>>>>>>> We have two (or more) jobs that write data into the same directory via a
>>>>>>>>> DataFrame.save method. We need to be able to figure out which job wrote which
>>>>>>>>> file, perhaps by providing a 'prefix' for the file names. I was wondering if
>>>>>>>>> there's any 'option' that allows us to do this. Googling didn't turn up any
>>>>>>>>> solution, so I thought of asking the Spark experts on this mailing list.
>>>>>>>>> 
>>>>>>>>> Thanks in advance.
>>>>>>>> -- 
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
