Test environment: Hive on Spark
Spark version: 3.3.1
- I. Execution timing
- II. Effect on the result set
- III. Efficiency comparison
- 1. Inner join
- 1) on
- 2) where
- 2. Outer join
- 1) on
- 2) where
- IV. Summary
- PS
I. Execution timing
In a SQL join, where is a filter condition: it filters the result set produced by the join, so its theoretical execution point is after the join. on is a join condition: it determines which rows qualify to be matched together, so its theoretical execution point is during the join.
In practice, however, most database systems apply optimizations built on the same idea: push the filter conditions in where, or the join conditions in on, as close to the data source as possible, so that less data takes part in the join. The actual execution timing therefore usually differs from the theoretical one.
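As a toy illustration of the pushdown idea (tables a and b and column v are hypothetical), the optimizer conceptually turns the first query into the second:

-- as written: the filter nominally applies to the join result
select a.id, b.v
from a join b on a.id = b.id
where b.v > 100;

-- after pushdown: rows of b are filtered at the source, before the join
select a.id, b.v
from a join (select id, v from b where v > 100) b
on a.id = b.id;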
II. Effect on the result set
In an inner join, the result set is the same whether a condition is placed in where or in on.
In an outer join (take the left outer join as an example), the left table's records are fully preserved: on takes effect during the join, so the final result set still keeps every record of the left table. where, by contrast, operates on the result set after the join, so it filters rows out and the two result sets differ.
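A minimal example of the difference (hypothetical tables): suppose a has ids 1 and 2, and b has only id 1 with v = 50.

-- on: both rows of a survive with v = NULL
-- (row 1's only candidate match fails v > 100, row 2 has no match)
select a.id, b.v
from a left join b
on a.id = b.id and b.v > 100;

-- where: the NULL-extended rows are filtered after the join, so nothing remains
select a.id, b.v
from a left join b
on a.id = b.id
where b.v > 100;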
III. Efficiency comparison
The test data volumes are as follows:
poi_data.poi_res: 83 million+ rows.
bi_report.mon_ronghe_pv: a partitioned table with 12 billion+ rows in total; this test joins against partition 20240522, which holds 59 million+ rows, of which 1.2 million+ satisfy bid like '1%' and pv > 100.
The join key has no duplicate values in either table.
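The physical plans below look like Spark's EXPLAIN FORMATTED output; presumably each test statement was prefixed accordingly, e.g.:

explain formatted
select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;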
1.內(nèi)連接
1)on
select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]

(4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]

(9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=799]

(12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=759]

(20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=760]

(24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlan=true
The physical plan shows that the Filter in step (2) applies Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) while t1's source data is being read, and that in step (7) predicate pushdown applies PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] while scanning t2's source data. Both tables are filtered on the data-source side, reducing both the shuffle volume and the amount of data that takes part in the join.
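In other words, the optimizer behaves as if both sides had been pre-filtered by hand; a sketch of the equivalent query (not the literal rewritten plan):

select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from (select bid, name, point_x, point_y
      from poi_data.poi_res
      where bid like '1%') t1
join (select bid, pv
      from bi_report.mon_ronghe_pv
      where event_day='20240522' and bid like '1%' and pv>100) t2
on t1.bid=t2.bid;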
2) where
select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
where t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]

(4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]

(9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=933]

(12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=893]

(20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=894]

(24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlan=true
The physical plan is unchanged. So when the engine supports predicate pushdown, it makes no difference whether the filter is written in where or in on: either way the data is filtered at the source, reducing the amount of data that takes part in the join.
2. Outer join
1) on
select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin LeftOuter (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=36.5 MiB, rowCount=3.07E+5)
         :        +- Exchange (3)
         :           +- * LocalLimit (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin LeftOuter (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- LocalLimit (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []]

(2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1043]

(4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]

(9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1067]

(12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1029]

(20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1030]

(24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlan=true
Because this is a left join, the conditions in on are join conditions and the result must keep every record of the left table, so t1 is read in full while t2 is still filtered via predicate pushdown. (The LocalLimit (2) on t1's side presumably comes from the limit 1000 appended to the test queries, visible as CollectLimit in all four plans; since a left join keeps every left-table row, that limit can be pushed down to the t1 scan.)
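If the goal is to shrink the left side as well, a left-table condition has to be written where it actually filters, e.g. in a subquery. A sketch (note this is a different query: it drops poi_res rows whose bid does not start with '1', which the on variant keeps):

select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from (select bid, name, point_x, point_y
      from poi_data.poi_res
      where bid like '1%') t1
left join (select bid, pv
           from bi_report.mon_ronghe_pv
           where event_day='20240522' and bid like '1%' and pv>100) t2
on t1.bid=t2.bid;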
2) where
select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
where t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1176]

(4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]

(9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1201]

(12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1161]

(20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1162]

(24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlan=true
where is a filter condition and affects the final result of the left join. Note that step (15) is now SortMergeJoin Inner rather than LeftOuter: the where conditions on t2 reject NULLs, so any left-table row without a match would be discarded anyway, and the optimizer converts the left join into an inner join. That in turn lets it pull the where ahead of the join and filter t1 by bid in step (2), just as in the inner-join case.
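So if the left-join semantics are actually wanted, the t2 conditions belong in on (case 1 above) or inside the subquery, e.g.:

-- keeps every poi_res row; pv is NULL where no qualifying match exists
select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
left join (select bid, pv
           from bi_report.mon_ronghe_pv
           where event_day='20240522' and bid like '1%' and pv>100) t2
on t1.bid=t2.bid;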
IV. Summary
Assuming the database system supports predicate pushdown:

- Inner join: in both execution plans, t2 gets PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] and t1 gets Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)), so for an inner join there is no efficiency difference between where and on.
- Outer join: taking the left outer join again, conditions on the right table are pushed down to its source. Whether the left table is pre-filtered depends on where vs. on and on whether the conditions touch the left table: 1) with on, the left table must be read in full, so the efficiency gap mainly depends on the left table's size; 2) with where, the left table is pre-filtered if the conditions involve it (directly, or, as here, inferred through the join key once the outer join collapses to an inner join); otherwise it is still read in full.
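The four cases, schematically (t1, t2, key k and t2's condition c are placeholders):

-- inner join: on and where are interchangeable; both sides filtered at the source
select * from t1 join t2 on t1.k = t2.k and t2.c > 0;
select * from t1 join t2 on t1.k = t2.k where t2.c > 0;

-- left join + on: t1 read in full, t2 pushed down; all t1 rows kept
select * from t1 left join t2 on t1.k = t2.k and t2.c > 0;

-- left join + where on t2: null-rejecting, so the optimizer converts it to an inner join
select * from t1 left join t2 on t1.k = t2.k where t2.c > 0;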
PS
In the inner-join physical plan, the filter on poi_res appears as a standalone Filter step, (2) Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)), while the filter on mon_ronghe_pv sits inside the scan of step (7), PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]. What is the difference between the two? From the material I could find: PushedFilters can be understood as filtering while the data is being read, so rows that fail the condition are never read at all; a Filter first reads the data and then checks the condition to decide whether each row takes part in later computation.
Since both happen on the data-source side, my initial question was why the Filter cannot, like PushedFilters, discard rows at read time and so shrink the input. The answer, again per the material I found, is that whether data can be filtered during the scan depends mainly on the data source. Big-data storage is broadly row-oriented or column-oriented, and columnar formats, with their storage layout and rich metadata, support predicate pushdown far better. In this test, mon_ronghe_pv is stored as Parquet while poi_res is stored as text.
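A possible follow-up, if poi_res should also benefit from scan-time filtering, would be to materialize it in a columnar format; a sketch (the new table name is made up):

-- rewrite the text-format table as Parquet so scans can use PushedFilters
create table poi_data.poi_res_parquet stored as parquet
as select * from poi_data.poi_res;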