spark-dev mailing list archives

From Roland Johann <roland.joh...@phenetic.io.INVALID>
Subject Re: Please help view the problem of spark dynamic partition
Date Fri, 23 Aug 2019 07:57:24 GMT
It seems that column `link_crtd_date` is of type `timestamp`, so you are effectively partitioning
by date *and* time, which produces a huge number of directories. I assume your intent is to
partition by date (partition_date=yyyy-MM-dd or year=yyyy/month=MM/day=dd), so you need to
format/split your timestamp accordingly, for example:

-- partitioned by 'yyyy-MM-dd'; the dynamic partition column must come last in the SELECT
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (partition_date)
SELECT
    *,
    date_format(link_crtd_date, 'yyyy-MM-dd') AS partition_date
FROM bsl12.email_edge_lyh_mth1;

-- partitioned by year/month/day; dynamic partition columns come last, in partition order
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (year, month, day)
SELECT
    *,
    year(link_crtd_date)  AS year,
    month(link_crtd_date) AS month,
    day(link_crtd_date)   AS day
FROM bsl12.email_edge_lyh_mth1;
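
Regarding the file-count/quota issue: with dynamic partitioning, the number of (temporary)
files is roughly the number of write tasks times the number of partition values each task
touches. One common way to reduce it is to cluster the rows by the partition value before
the insert, so that each partition is ideally written by a single task. A sketch, using the
same hypothetical tables as above:

```sql
-- Sketch: DISTRIBUTE BY routes all rows with the same partition value to the
-- same task, so each partition directory is written by (ideally) one task
-- instead of one file per task per partition.
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (partition_date)
SELECT
    *,
    date_format(link_crtd_date, 'yyyy-MM-dd') AS partition_date
FROM bsl12.email_edge_lyh_mth1
DISTRIBUTE BY date_format(link_crtd_date, 'yyyy-MM-dd');
```

Note this can skew work toward a few tasks if some dates are much larger than others, so it
is a trade-off between file count and parallelism.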
Best Regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobil: +49 172 365 26 46
Mail: roland.johann@phenetic.io
Web: phenetic.io

Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann



> Am 23.08.2019 um 09:43 schrieb zhangliyun <kellyzly@126.com>:
> 
> Hi all:
> when I use the Spark dynamic partition feature, I met a problem with the HDFS quota. I found
> that it is very easy to hit the quota problem (exceeding the maximum quota of a directory).
> 
> I have generated an unpartitioned table 'bsl12.email_edge_lyh_mth1' which contains 584M
> records and will insert it into a partitioned table "bsl12.email_edge_lyh_partitioned2":
> --select count(*) from bsl12.email_edge_lyh_mth1; --584652128
> --INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2 PARTITION (link_crtd_date) SELECT * FROM bsl12.email_edge_lyh_mth1;
> 
> 
> When I viewed the temporary directory while the SQL was running, I saw multiple files with
> link_crtd_date=2018-01-01***; I guess one temporary file per record. As there are 584M records
> in the unpartitioned table, is there any parameter for us to control the temporary file count
> to avoid the quota problem?
> 
> ```
> 133    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A35%3A29
> 137    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A35%3A47
> 136    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A38%3A23
> 132    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A38%3A54
> 536    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A40%3A01
> ```
> 
> Best Regards
> 
> Kelly Zhang
> 

