With only one day of weekend left after the May Day holiday, I killed some boredom by playing with Iceberg, and suddenly got curious about how exactly Iceberg integrates with Spark. So I dug into the source code for a while.
For a first taste of Iceberg, I recommend the post "Apache Iceberg快速入门" (Apache Iceberg Quick Start) by Li Xiang of Tencent.
Let's start with the simplest possible usage example:
```scala
val df = spark.read.format("iceberg").load("db.table")
```
You read that right: using it is really that simple, just pass "iceberg" as the format in read. Lying in bed at noon, I wondered: how does Spark map the string "iceberg" to the actual Iceberg code?
First, a quick introduction to the classes involved here:
- spark: a SparkSession instance
- read: returns a DataFrameReader
- format: returns this (the same DataFrameReader, so calls can be chained)
- load: returns a DataFrame (this is where the magic happens)
Let's go straight to the source code. First, the format method:
```scala
def format(source: String): DataFrameReader = {
  this.source = source
  this
}
```
Simple enough: it stores the argument "iceberg" in the source field. Next, let's see where source gets used.
That place is the load method. Excerpted from the Spark 3.x source, with the less relevant parts elided:

```scala
def load(paths: String*): DataFrame = {
  // ...
  DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
    // ... build the DataFrame from the resolved TableProvider
  }.getOrElse(loadV1Source(paths: _*))
}
```
Next, let's follow the string into DataSource.lookupDataSourceV2:
```scala
def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {
  val useV1Sources = conf.getConf(SQLConf.USE_V1_SOURCE_LIST).toLowerCase(Locale.ROOT)
    .split(",").map(_.trim)
  val cls = lookupDataSource(provider, conf)
  cls.newInstance() match {
    case d: DataSourceRegister if useV1Sources.contains(d.shortName()) => None
    case t: TableProvider
        if !useV1Sources.contains(cls.getCanonicalName.toLowerCase(Locale.ROOT)) =>
      Some(t)
    case _ => None
  }
}
```
Here comes the key part: lookupDataSource resolves the provider name to a class via ServiceLoader (excerpted, with the match branches elided):
```scala
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
  val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
  val provider2 = s"$provider1.DefaultSource"
  val loader = Utils.getContextOrSparkClassLoader
  val serviceLoader = ServiceLoader.load(classOf[DataSourceRegister], loader)

  try {
    // Keep only the registered sources whose shortName matches, e.g. "iceberg"
    serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
      // ... zero matches: fall back to loading provider1/provider2 as a class name
      // ... exactly one match: return its class
      // ... multiple matches: report the ambiguity
    }
  }
  // ...
}
```
Now switch to the Iceberg source code and search for DataSourceRegister. Sure enough, there it is: under the resource directory resources/META-INF/services sits a file named org.apache.spark.sql.sources.DataSourceRegister:
```
# ... (Apache license header elided)
org.apache.iceberg.spark.source.IcebergSource
```
Open that implementation class, and its shortName turns out to be exactly iceberg:
```java
public class IcebergSource implements DataSourceRegister, SupportsCatalogOptions {
  // ...

  @Override
  public String shortName() {
    return "iceberg";
  }

  // ...
}
```
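To see the whole discovery mechanism end to end, here is a self-contained Java sketch. DataSourceRegister and IcebergLikeSource below are local stand-ins, not the real Spark/Iceberg classes: the demo writes its own META-INF/services provider-configuration file into a temp directory, points a class loader at it, and then resolves a provider by short name, much like Spark's lookupDataSource does.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.ServiceLoader;

public class ServiceLoaderDemo {
    // Stand-in for Spark's DataSourceRegister interface.
    public interface DataSourceRegister {
        String shortName();
    }

    // Stand-in for Iceberg's IcebergSource provider.
    public static class IcebergLikeSource implements DataSourceRegister {
        @Override
        public String shortName() { return "iceberg"; }
    }

    // Resolve a provider's class name by short name, mimicking lookupDataSource.
    public static String lookup(String shortName) throws Exception {
        // Write META-INF/services/<interface FQN>, exactly as a jar would ship it.
        Path dir = Files.createTempDirectory("spi-demo");
        Path services = dir.resolve("META-INF").resolve("services");
        Files.createDirectories(services);
        Files.write(services.resolve(DataSourceRegister.class.getName()),
                Arrays.asList("# comments and blank lines are ignored",
                        "",
                        IcebergLikeSource.class.getName()));

        // This loader sees the temp dir (for the services file) and delegates
        // to the application loader (for the provider class itself).
        try (URLClassLoader loader = new URLClassLoader(
                new URL[]{dir.toUri().toURL()},
                ServiceLoaderDemo.class.getClassLoader())) {
            for (DataSourceRegister r : ServiceLoader.load(DataSourceRegister.class, loader)) {
                if (r.shortName().equalsIgnoreCase(shortName)) {
                    return r.getClass().getName();
                }
            }
        }
        return null; // no registered source matched this short name
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lookup("iceberg"));
    }
}
```

Running it prints the provider's class name; ask for an unregistered short name and lookup returns null, which is why Spark fails with a "Failed to find data source" error when a format's jar isn't on the classpath.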
Summary

Spark uses Java's ServiceLoader to discover data sources at runtime, an elegant way to integrate engines. The same mechanism extends to Delta Lake, Hudi, and the other data lake technologies that plug into Spark.
Further reading
The ServiceLoader class
Java API docs: https://docs.oracle.com/javase/7/docs/api/java/util/ServiceLoader.html
Quoting the docs; in short, ServiceLoader loads implementation classes listed under the META-INF/services resource directory:
> A service provider is identified by placing a provider-configuration file in the resource directory META-INF/services. The file's name is the fully-qualified binary name of the service's type. The file contains a list of fully-qualified binary names of concrete provider classes, one per line. Space and tab characters surrounding each name, as well as blank lines, are ignored. The comment character is '#' ('\u0023', NUMBER SIGN); on each line all characters following the first comment character are ignored. The file must be encoded in UTF-8.
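Those parsing rules are simple enough to sketch by hand. The class below is a hypothetical illustration, not the JDK's internal parser: it strips everything after the first '#', trims surrounding whitespace, and drops blank lines, leaving only the provider class names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProviderFileParser {
    // Parse the lines of a META-INF/services provider-configuration file
    // following the rules in the ServiceLoader javadoc quoted above.
    public static List<String> parse(List<String> lines) {
        List<String> providers = new ArrayList<>();
        for (String line : lines) {
            int comment = line.indexOf('#');
            if (comment >= 0) {
                line = line.substring(0, comment); // drop the comment tail
            }
            line = line.trim();                    // surrounding spaces/tabs ignored
            if (!line.isEmpty()) {                 // blank lines ignored
                providers.add(line);
            }
        }
        return providers;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# Licensed to the ASF ...",
                "",
                "  org.apache.iceberg.spark.source.IcebergSource   # Spark source");
        System.out.println(parse(lines));
        // prints [org.apache.iceberg.spark.source.IcebergSource]
    }
}
```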
Permalink: https://stefanxiepj.github.io/archives/fd226ea6.html
Copyright notice: this work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reposting!