A Case Study: Speeding Up a Hive SQL Job by 10x

2020-09-10

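Ever since Spark SQL burst onto the scene, it has been hugely popular with big data developers. If Hive is the hardworking, uncomplaining old ox of the data warehouse world, Spark SQL is a swift chestnut steed. In early 2020 we launched a dedicated project to speed up offline computation and built a tool that migrates Hive SQL to Spark SQL almost transparently.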

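During the migration we came across a rather unusual case:

After the job was migrated from Hive to Spark, it ran more than 10x faster: historical Hive runs took 80+ minutes, while the Spark run takes about 7 minutes (under 10 minutes).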

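As a developer working on our in-house Hive offline storage and compute engine, I could not help asking: is Hive really that bad? So I dug into this case and wrote this article as a starting point for discussion: a salute to Hive and an embrace of Spark. The article uses a real production case to illustrate a methodology for troubleshooting Hive SQL jobs: how to locate the problematic fragment in a SQL script of hundreds or thousands of lines, and how to optimize it. I hope it is a useful reference for big data developers, data warehouse developers in particular.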

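Case Analysis

The SQL script in this case is several hundred lines long. The historical run logs show that the job takes 80+ minutes. An excerpt from one historical run: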

[2020-09-14 12:10:06] Starting environment setup...
[2020-09-14 12:10:11] Environment setup finished
[2020-09-14 12:10:11] *************** Start running ***************
USED ENVS AS FOLLOWS:
HADOOP_USER_NAME :  hadoop
HADOOP_QUEUE_NAME : root.root_test
[2020-09-14 12:10:12] start exec : HIVE_SQL 
20/09/14 12:10:13
Welcome~
hive> INSERT OVERWRITE TABLE test_tbl PARTITION (pt= '2020-09-13')
    > SELECT distinct a.order_id
...
-- The first job is submitted to YARN at 2020-09-14 12:11:01,644;
INFO  : Kill Command = /usr/local/hadoop-current/bin/hadoop job  -kill job_1599894556197_1323786
INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-09-14 12:11:01,644 Stage-1 map = 0%,  reduce = 0%
INFO  : Hadoop job information for Stage-20: number of mappers: 226; number of reducers: 230
-- Stage-20 starts running at 2020-09-14 12:11:04
INFO  : 2020-09-14 12:11:04,838 Stage-20 map = 0%,  reduce = 0%
INFO  : Hadoop job information for Stage-23: number of mappers: 1; number of reducers: 1
...
-- Stage-20 runs for over an hour, stalling at around 96%
INFO  : 2020-09-14 12:11:56,287 Stage-20 map = 96%,  reduce = 0%, Cumulative CPU 2144.24 sec
INFO  : 2020-09-14 12:34:25,110 Stage-20 map = 96%,  reduce = 0%, Cumulative CPU 16221.56 sec
INFO  : 2020-09-14 12:35:25,246 Stage-20 map = 96%,  reduce = 0%, Cumulative CPU 16759.57 sec
INFO  : 2020-09-14 12:36:26,078 Stage-20 map = 96%,  reduce = 0%, Cumulative CPU 17577.51 sec
INFO  : 2020-09-14 12:36:56,381 Stage-20 map = 97%,  reduce = 0%, Cumulative CPU 18029.13 sec
INFO  : 2020-09-14 12:37:56,748 Stage-20 map = 97%,  reduce = 0%, Cumulative CPU 18110.17 sec
...
-- Stage-20 finishes
INFO  : 2020-09-14 13:19:37,443 Stage-20 map = 100%,  reduce = 100%, Cumulative CPU 37094.77 sec
INFO  : MapReduce Total cumulative CPU time: 0 days 10 hours 18 minutes 14 seconds 770 msec
INFO  : Partition test_tbl{pt=2020-09-13} stats: [numFiles=1, numRows=326, totalSize=52397, rawDataSize=646132]
OK
Time taken: 5176.862 seconds
hive>
[2020-09-14 13:36:41] *************** Run succeeded [EXIT CODE: 0] ***************

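The log shows that the job writes 1 file containing 326 rows. For such a small output, the job ran from 12:10 until 13:36. Why is it so slow? Although the slow run did not affect the overall output of the production pipelines, for engine developers who are fussy about technical quality, this problem has to be solved.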

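Where Is It Slow?

When it comes to tuning, users are often at a loss about where to start. What is our handle on the problem? Logs, logs, logs: important things are worth saying three times.

Tips: Almost any problem leaves traces in the logs, so logs are the first tool for troubleshooting. If the info logs are not enough, consider temporarily enabling debug logging.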
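For a long log, eyeballing timestamps gets tedious. As a rough sketch (not part of the original investigation), the following Python helper extracts the first and last progress timestamp of each stage and reports the stage with the longest span. The function names are hypothetical, and the regular expression assumes progress lines formatted like the ones in the excerpt above.

```python
import re
from datetime import datetime

# Matches Hive MapReduce progress lines such as:
# INFO  : 2020-09-14 12:11:56,287 Stage-20 map = 96%,  reduce = 0%, Cumulative CPU 2144.24 sec
PROGRESS = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (Stage-\d+) map = \d+%,\s+reduce = \d+%"
)


def stage_spans(log_text):
    """Return {stage: seconds between its first and last progress line}."""
    first, last = {}, {}
    for match in PROGRESS.finditer(log_text):
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        stage = match.group(2)
        first.setdefault(stage, ts)  # keep only the earliest timestamp
        last[stage] = ts             # overwrite until the latest one remains
    return {stage: (last[stage] - first[stage]).total_seconds() for stage in first}


def slowest_stage(log_text):
    """Return the stage with the longest span, or None if no progress lines match."""
    spans = stage_spans(log_text)
    return max(spans, key=spans.get) if spans else None
```

Because stages can run in parallel, the span is wall-clock time per stage rather than a share of the total, but it is usually enough to point at the stage worth investigating.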

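Reading the run log carefully, we find that the job ran for about 87 minutes and that Stage-20 is the bottleneck, taking roughly an hour. So what does Stage-20 do? This is where the second tool comes in: EXPLAIN. We analyze the execution plan to locate the abnormal SQL fragment.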

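Tips: In Hive, one complete MapReduce phase corresponds to one stage. When a stage misbehaves, we can use EXPLAIN to view the execution plan and map the stage back to a specific fragment of the SQL script.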
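An EXPLAIN dump for a script this size runs to hundreds of lines, so it helps to cut out only the stage of interest. The sketch below is an illustration with hypothetical helper names: it assumes the plan text has been saved to a string and that each stage section starts with a `Stage: Stage-N` header line, as in Hive's text EXPLAIN output. It returns the lines of one stage and the table aliases that stage scans, which is usually enough to find the matching SQL fragment.

```python
import re


def stage_block(plan_text, stage):
    """Return the EXPLAIN lines belonging to one stage, e.g. 'Stage-20'."""
    lines, keep = [], False
    for line in plan_text.splitlines():
        header = re.match(r"\s*Stage: (Stage-\d+)\s*$", line)
        if header:
            # A new stage section starts; keep it only if it is the one requested.
            keep = header.group(1) == stage
        if keep:
            lines.append(line)
    return "\n".join(lines)


def scanned_aliases(plan_text, stage):
    """List the table aliases scanned by the given stage, in plan order."""
    return re.findall(r"alias: (\w+)", stage_block(plan_text, stage))
```

Searching the SQL script for the returned aliases then narrows hundreds of lines down to the one join or subquery that the slow stage executes.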

Below is a summary of the execution plan for this case:

OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2, Stage-15, Stage-19, Stage-23
  Stage-4 depends on stages: Stage-3
  Stage-10 depends on stages: Stage-4 , consists of Stage-7, Stage-6, Stage-8
  Stage-7
  Stage-0 depends on stages: Stage-7, Stage-6, Stage-9
  Stage-5 depends on stages: Stage-0
  Stage-6
  Stage-8
  Stage-9 depends on stages: Stage-8
  Stage-15 is a root stage
  Stage-20 is a root stage
  Stage-18 depends on stages: Stage-20
  Stage-19 depends on stages: Stage-18
  Stage-23 is a root stage
...
Stage: Stage-20
    Map Reduce
      Map Operator Tree:
          TableScan
            // table being scanned
            alias: test_tbl
            Statistics: Num rows: 67232440 Data size: 1229418456469 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              // filter condition
              predicate: (((sub_product_line = 11) and (commercial_status = 1)) and CAST( order_id AS decimal(38,19)) is not null) (type: boolean)
              Statistics: Num rows: 8404055 Data size: 153677307058 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: order_id (type: bigint)
                outputColumnNames: _col0
                Statistics: Num rows: 8404055 Data size: 153677307058 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: CAST( _col0 AS decimal(38,19)) (type: decimal(38,19))
                  sort order: +
                  Map-reduce partition columns: CAST( _col0 AS decimal(38,19)) (type: decimal(38,19))
                  Statistics: Num rows: 8404055 Data size: 153677307058 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: tb
            Statistics: Num rows: 84513319 Data size: 99556689782 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (((CAST( order_id AS decimal(38,19)) is not null and concat_ws('-', year, month, day) BETWEEN '2020-09-12' AND '2020-09-14') and id is not null) and (sid) IN (257)) (type: boolean)
              Statistics: Num rows: 5282082 Data size: 6222292596 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: CAST( order_id AS decimal(38,19)) (type: decimal(38,19))
                sort order: +
                Map-reduce partition columns: CAST( order_id AS decimal(38,19)) (type: decimal(38,19))
                Statistics: Num rows: 5282082 Data size: 6222292596 Basic stats: COMPLETE Column stats: NONE
                value expressions: id (type: bigint), oid (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 CAST( order_id AS decimal(38,19)) (type: decimal(38,19))
            1 CAST( _col0 AS decimal(38,19)) (type: decimal(38,19))
          outputColumnNames: _col0, _col2
          Statistics: Num rows: 9244460 Data size: 169045041427 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: default.order_decode(trim(_col2)) (type: string), _col0 (type: bigint)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 9244460 Data size: 169045041427 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (_col0 >= 1) (type: boolean)
              Statistics: Num rows: 3081486 Data size: 56348334951 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: true
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
...
Time taken: 1.335 seconds, Fetched: 519 row(s)

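Why Is It Slow?

Now that we know where it is slow, we need to trace the root cause: why is it slow?

Following the tracking URL of Stage-20, we check the corresponding RM logs: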

[Figure: mr-log]

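The task appears to hang while reading data, but the trail seems to go cold here. Users often give up at this point, but for engine developers who are fussy about technical quality, this problem has to be solved.

This is where another tool, jstack, makes its entrance.

Tips: When a task hangs, we can use jstack to inspect the current state of its threads.

Log in to the node where the stage's task is running (a worker node running the task container, not the NameNode), find the corresponding process by jobId, and use jstack to see what its threads are doing.

The thread stack in this case: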

"main" #1 prio=5 os_prio=0 tid=0x00007f130c011800 nid=0x1b30c runnable [0x00007f1315ecd000]
   java.lang.Thread.State: RUNNABLE
        at java.math.BigInteger.squareKaratsuba(BigInteger.java:1979)
        at java.math.BigInteger.square(BigInteger.java:1888)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2008)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2012)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2011)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2008)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2012)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2011)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2010)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.squareToomCook3(BigInteger.java:2008)
        at java.math.BigInteger.square(BigInteger.java:1890)
        at java.math.BigInteger.pow(BigInteger.java:2263)
        at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3543)
        at java.math.BigDecimal.bigDigitLength(BigDecimal.java:3820)
        at java.math.BigDecimal.precision(BigDecimal.java:2240)
        at org.apache.hadoop.hive.common.type.HiveDecimal.normalize(HiveDecimal.java:254)
        at org.apache.hadoop.hive.common.type.HiveDecimal.create(HiveDecimal.java:83)
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:989)
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:354)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFToDecimal.evaluate(GenericUDFToDecimal.java:71)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:87)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotNull.evaluate(GenericUDFOPNotNull.java:53)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:87)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd.evaluate(GenericUDFOPAnd.java:59)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:87)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd.evaluate(GenericUDFOPAnd.java:59)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:87)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd.evaluate(GenericUDFOPAnd.java:59)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
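A stack this deep is mostly JDK arithmetic frames and repeated evaluator frames. As a small sketch (the helper names are hypothetical), the Python below walks one thread's stack as printed by jstack, which lists the innermost frame first, and returns the first frame that belongs to a given package. For Hive that is the point where engine code hands off to the JDK.

```python
def frames(stack_text):
    """Yield the frames of one jstack thread stack, innermost first."""
    for line in stack_text.splitlines():
        line = line.strip()
        if line.startswith("at "):
            yield line[3:]


def first_frame_in(stack_text, prefix="org.apache.hadoop.hive."):
    """Return the innermost frame whose class name starts with `prefix`, or None."""
    return next((f for f in frames(stack_text) if f.startswith(prefix)), None)
```

Applied to the stack above, the first Hive frame is `HiveDecimal.normalize`, reached from `GenericUDFToDecimal.evaluate`, which matches the reading that follows.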

The thread is converting data to decimal. Where does that conversion happen? Let's go back to the execution plan:

TableScan
            alias: tb
            Statistics: Num rows: 84513319 Data size: 99556689782 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (((CAST( order_id AS decimal(38,19)) is not null and concat_ws('-', year, month, day) BETWEEN '2020-09-12' AND '2020-09-14') and id is not null) and (sid) IN (257)) (type: boolean)
              Statistics: Num rows: 5282082 Data size: 6222292596 Basic stats: COMPLETE Column stats: NONE

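In (((CAST( order_id AS decimal(38,19)) is not null, the order_id column is being cast. Checking the table schemas shows that this column has different types in the two joined tables: string in one and bigint in the other. Note that Hive applies an is not null filter to the join key columns.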
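Spotting such implicit casts by eye is easy to miss in a 500-line plan. The following Python sketch is a heuristic rather than an official Hive facility: it scans EXPLAIN text for `CAST( col AS type)` expressions and lists which columns are cast to which type. The function name is hypothetical, and the pattern assumes the plan formatting shown in this article, which may differ in other Hive versions.

```python
import re

# Matches e.g. "CAST( order_id AS decimal(38,19))" or "CAST( id AS bigint)".
CAST = re.compile(r"CAST\(\s*(\w+)\s+AS\s+(\w+(?:\(\d+(?:,\d+)?\))?)\s*\)", re.IGNORECASE)


def implicit_casts(plan_text):
    """Return sorted, de-duplicated (column, target_type) pairs found in the plan."""
    found = set()
    for line in plan_text.splitlines():
        for column, target_type in CAST.findall(line):
            found.add((column, target_type))
    return sorted(found)
```

A cast that shows up on both sides of a join key, as here, is a strong hint that the two columns have different declared types and that every row pays for a conversion.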

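How Do We Fix It?

We now know where it is slow and why. Next: how do we fix it, that is, how do we tune it?

We found above that the mismatched join key types are the culprit: every row has to be cast before the is not null check, which adds a large performance overhead. Would making the types consistent be the key to the problem? We rewrite the SQL as follows: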

select default.order_decode(trim(oid)) as order_id
       ,id
from (
    select oid,
            id,
            cast(order_id as bigint) order_id,  -- align the join key type with base.base_id (bigint)
            sid,
            year, 
            month, 
            day 
    from dwd.dwd_order) tb
join (
    select order_id as base_id
    from dwd.dwd_order_base
    where 
      dt between date_sub('2020-09-13', 1)
      and date_add('2020-09-13', 1)
      and sub_product_line = 11 
      and status = 1
) base on tb.order_id = base.base_id
where 
  concat_ws('-', year, month, day) 
  between date_sub('2020-09-13', 1)
  and date_add('2020-09-13', 1)
  and id is not null
  and sid in (257)

Run it to verify the effect:

hive> select default.order_decode(trim(oid)) as order_id
    >        ,id
    > from (
    >     select oid,
    >             id,
    >             cast(order_id as bigint) order_id,
    >             sid,
    >             year, 
    >             month, 
    >             day 
    >     from dwd.dwd_order) tb
    > join (
    >     select order_id as base_id
    >     from dwd.dwd_order_base
    >     where 
    >       dt between date_sub('2020-09-13', 1)
    >       and date_add('2020-09-13', 1)
    >       and sub_product_line = 11 
    >       and status = 1
    > ) base on tb.order_id = base.base_id
    > where 
    >   concat_ws('-', year, month, day) 
    >   between date_sub('2020-09-13', 1)
    >   and date_add('2020-09-13', 1)
    >   and id is not null
    >   and sid in (257);
INFO  : Number of reduce tasks not specified. Estimated from input data size: 332
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:330
INFO  : Submitting tokens for job: job_1599894556197_2250181
INFO  : The url to track the job: http://bigdata-XXXX:8088/proxy/application_1599894556197_2250181/
INFO  : Starting Job = job_1599894556197_2250181, Tracking URL = http://bigdata-XXXX:8088/proxy/application_1599894556197_2250181/
INFO  : Kill Command = /usr/local/hadoop-current/bin/hadoop job  -kill job_1599894556197_2250181
INFO  : Hadoop job information for Stage-1: number of mappers: 330; number of reducers: 332
INFO  : 2020-09-15 17:39:51,680 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-09-15 17:40:22,666 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 30.58 sec
INFO  : 2020-09-15 17:40:23,764 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 80.41 sec
INFO  : 2020-09-15 17:40:28,020 Stage-1 map = 6%,  reduce = 0%, Cumulative CPU 182.56 sec
INFO  : 2020-09-15 17:40:29,086 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 239.49 sec
INFO  : 2020-09-15 17:40:30,140 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 368.75 sec
INFO  : 2020-09-15 17:40:31,199 Stage-1 map = 19%,  reduce = 0%, Cumulative CPU 555.23 sec
INFO  : 2020-09-15 17:40:32,251 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 749.31 sec
INFO  : 2020-09-15 17:40:33,414 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 1028.07 sec
INFO  : 2020-09-15 17:40:34,486 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU 1354.82 sec
INFO  : 2020-09-15 17:40:35,541 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 1637.86 sec
INFO  : 2020-09-15 17:40:36,606 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 1816.13 sec
INFO  : 2020-09-15 17:40:37,670 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2088.07 sec
INFO  : 2020-09-15 17:40:38,724 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 2400.33 sec
INFO  : 2020-09-15 17:40:39,773 Stage-1 map = 85%,  reduce = 0%, Cumulative CPU 2711.45 sec
INFO  : 2020-09-15 17:40:40,828 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU 2839.74 sec
INFO  : 2020-09-15 17:40:41,880 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 2870.36 sec
INFO  : 2020-09-15 17:40:42,926 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 2928.23 sec
INFO  : 2020-09-15 17:40:43,986 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU 2975.41 sec
INFO  : 2020-09-15 17:40:45,033 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 3031.73 sec
INFO  : 2020-09-15 17:40:46,083 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 3057.46 sec
INFO  : 2020-09-15 17:40:47,159 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 3143.27 sec
INFO  : 2020-09-15 17:40:49,300 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 3204.6 sec
INFO  : 2020-09-15 17:40:51,419 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 3230.04 sec
INFO  : 2020-09-15 17:40:57,975 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 3275.26 sec
INFO  : 2020-09-15 17:41:58,994 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3308.32 sec
INFO  : 2020-09-15 17:42:26,061 Stage-1 map = 100%,  reduce = 2%, Cumulative CPU 3345.54 sec
INFO  : 2020-09-15 17:42:27,455 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 3447.09 sec
INFO  : 2020-09-15 17:42:28,701 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 3683.33 sec
INFO  : 2020-09-15 17:42:29,877 Stage-1 map = 100%,  reduce = 31%, Cumulative CPU 3954.8 sec
INFO  : 2020-09-15 17:42:30,941 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 4164.82 sec
INFO  : 2020-09-15 17:42:32,025 Stage-1 map = 100%,  reduce = 59%, Cumulative CPU 4571.03 sec
INFO  : 2020-09-15 17:42:33,094 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 4928.07 sec
INFO  : 2020-09-15 17:42:34,149 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 5177.77 sec
INFO  : 2020-09-15 17:42:35,207 Stage-1 map = 100%,  reduce = 91%, Cumulative CPU 5328.07 sec
INFO  : 2020-09-15 17:42:36,342 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 5408.56 sec
INFO  : 2020-09-15 17:42:37,437 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 5476.0 sec
INFO  : 2020-09-15 17:42:39,582 Stage-1 map = 100%,  reduce = 98%, Cumulative CPU 5506.86 sec
INFO  : 2020-09-15 17:42:40,666 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5533.36 sec
INFO  : 2020-09-15 17:42:42,752 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5559.5 sec
INFO  : MapReduce Total cumulative CPU time: 0 days 1 hours 32 minutes 39 seconds 500 msec
INFO  : Ended Job = job_1599894556197_2250181
OK
order_id    id
Time taken: 231.903 seconds

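It finishes in just 231 seconds, close to a 10x improvement over the run before optimization.

Why Is Spark Fast?

As mentioned earlier, we found this case while migrating Hive jobs to Spark as part of the offline computation speedup project. That raises the question: why is Spark fast? Let's look at Spark's execution plan: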

[Figure: spark-explain]

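As the red box in the figure above shows, the isnotnull filter in the Spark plan does not perform a cast; the conversion is instead placed after the data has been read.

Some Thoughts

Starting from a real case of a slow Hive SQL job, we walked through locating, analyzing, and tuning it. Overall, tuning Hive SQL is tedious and often discourages data warehouse developers who are focused on business logic. Spark SQL, with its built-in optimizations, lowers the bar for tuning. This article described a Hive SQL tuning process at length in the hope of starting a discussion, and recommends that users lean toward the Spark engine for greater performance gains.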

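Author: Jeff.R
Permalink: https://stefanxiepj.github.io/archives/83bf2dd4.html
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reposting!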