With only one day of weekend left after the May Day holiday, I killed some boredom by playing with Iceberg, and suddenly got curious about how exactly Iceberg integrates with Spark. So I dug into the source code for a while.
For a first taste of Iceberg, I recommend the post "Apache Iceberg快速入门" (Apache Iceberg Quick Start) by Li Xiang of Tencent.
Let's start with the simplest possible usage example:
```scala
val df = spark.read.format("iceberg").load("db.table")
```
You read that right: using it is really that simple, just pass "iceberg" as the format in read. Lying in bed at noon, I wondered: how does Spark map the string "iceberg" to the actual Iceberg code?
First, a quick introduction to the classes involved here:
- spark: a SparkSession instance
- read: returns a DataFrameReader
- format: returns this (the same DataFrameReader, so calls can be chained)
- load: returns a DataFrame (this is where the magic happens)
Let's go straight to the source code. First, the format method:
```scala
def format(source: String): DataFrameReader = {
  this.source = source
  this
}
```
Simple enough: it stores the argument "iceberg" in the source field. Next, let's see where source gets used.
That place is the load method. Excerpted from the Spark 3.x source, with the less relevant parts elided:

```scala
def load(paths: String*): DataFrame = {
  // ...
  DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
    // ... build the DataFrame from the resolved TableProvider
  }.getOrElse(loadV1Source(paths: _*))
}
```
Next, let's follow the string into DataSource.lookupDataSourceV2:
```scala
def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {
  val useV1Sources = conf.getConf(SQLConf.USE_V1_SOURCE_LIST).toLowerCase(Locale.ROOT)
    .split(",").map(_.trim)
  val cls = lookupDataSource(provider, conf)
  cls.newInstance() match {
    case d: DataSourceRegister if useV1Sources.contains(d.shortName()) => None
    case t: TableProvider
        if !useV1Sources.contains(cls.getCanonicalName.toLowerCase(Locale.ROOT)) =>
      Some(t)
    case _ => None
  }
}
```
Here comes the key part: lookupDataSource resolves the provider name to a class via ServiceLoader (excerpted, with the match branches elided):
```scala
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
  val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
  val provider2 = s"$provider1.DefaultSource"
  val loader = Utils.getContextOrSparkClassLoader
  val serviceLoader = ServiceLoader.load(classOf[DataSourceRegister], loader)

  try {
    // Keep only the registered sources whose shortName matches, e.g. "iceberg"
    serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
      // ... zero matches: fall back to loading provider1/provider2 as a class name
      // ... exactly one match: return its class
      // ... multiple matches: report the ambiguity
    }
  }
  // ...
}
```
Now switch to the Iceberg source code and search for DataSourceRegister. Sure enough, there it is: under the resource directory resources/META-INF/services sits a file named org.apache.spark.sql.sources.DataSourceRegister:
```
# ... (Apache license header elided)
org.apache.iceberg.spark.source.IcebergSource
```
Open that implementation class, and its shortName turns out to be exactly iceberg:
```java
public class IcebergSource implements DataSourceRegister, SupportsCatalogOptions {
  // ...

  @Override
  public String shortName() {
    return "iceberg";
  }

  // ...
}
```
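To see the whole discovery mechanism end to end, here is a self-contained Java sketch. DataSourceRegister and IcebergLikeSource below are local stand-ins, not the real Spark/Iceberg classes: the demo writes its own META-INF/services provider-configuration file into a temp directory, points a class loader at it, and then resolves a provider by short name, much like Spark's lookupDataSource does.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.ServiceLoader;

public class ServiceLoaderDemo {
    // Stand-in for Spark's DataSourceRegister interface.
    public interface DataSourceRegister {
        String shortName();
    }

    // Stand-in for Iceberg's IcebergSource provider.
    public static class IcebergLikeSource implements DataSourceRegister {
        @Override
        public String shortName() { return "iceberg"; }
    }

    // Resolve a provider's class name by short name, mimicking lookupDataSource.
    public static String lookup(String shortName) throws Exception {
        // Write META-INF/services/<interface FQN>, exactly as a jar would ship it.
        Path dir = Files.createTempDirectory("spi-demo");
        Path services = dir.resolve("META-INF").resolve("services");
        Files.createDirectories(services);
        Files.write(services.resolve(DataSourceRegister.class.getName()),
                Arrays.asList("# comments and blank lines are ignored",
                        "",
                        IcebergLikeSource.class.getName()));

        // This loader sees the temp dir (for the services file) and delegates
        // to the application loader (for the provider class itself).
        try (URLClassLoader loader = new URLClassLoader(
                new URL[]{dir.toUri().toURL()},
                ServiceLoaderDemo.class.getClassLoader())) {
            for (DataSourceRegister r : ServiceLoader.load(DataSourceRegister.class, loader)) {
                if (r.shortName().equalsIgnoreCase(shortName)) {
                    return r.getClass().getName();
                }
            }
        }
        return null; // no registered source matched this short name
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lookup("iceberg"));
    }
}
```

Running it prints the provider's class name; ask for an unregistered short name and lookup returns null, which is why Spark fails with a "Failed to find data source" error when a format's jar isn't on the classpath.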
Summary

Spark uses Java's ServiceLoader to discover data sources at runtime, an elegant way to integrate engines. The same mechanism extends to Delta Lake, Hudi, and the other data lake technologies that plug into Spark.
Further reading
The ServiceLoader class
Java API docs: https://docs.oracle.com/javase/7/docs/api/java/util/ServiceLoader.html
Quoting the docs; in short, ServiceLoader loads implementation classes listed under the META-INF/services resource directory:
> A service provider is identified by placing a provider-configuration file in the resource directory META-INF/services. The file's name is the fully-qualified binary name of the service's type. The file contains a list of fully-qualified binary names of concrete provider classes, one per line. Space and tab characters surrounding each name, as well as blank lines, are ignored. The comment character is '#' ('\u0023', NUMBER SIGN); on each line all characters following the first comment character are ignored. The file must be encoded in UTF-8.
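Those parsing rules are simple enough to sketch by hand. The class below is a hypothetical illustration, not the JDK's internal parser: it strips everything after the first '#', trims surrounding whitespace, and drops blank lines, leaving only the provider class names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProviderFileParser {
    // Parse the lines of a META-INF/services provider-configuration file
    // following the rules in the ServiceLoader javadoc quoted above.
    public static List<String> parse(List<String> lines) {
        List<String> providers = new ArrayList<>();
        for (String line : lines) {
            int comment = line.indexOf('#');
            if (comment >= 0) {
                line = line.substring(0, comment); // drop the comment tail
            }
            line = line.trim();                    // surrounding spaces/tabs ignored
            if (!line.isEmpty()) {                 // blank lines ignored
                providers.add(line);
            }
        }
        return providers;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# Licensed to the ASF ...",
                "",
                "  org.apache.iceberg.spark.source.IcebergSource   # Spark source");
        System.out.println(parse(lines));
        // prints [org.apache.iceberg.spark.source.IcebergSource]
    }
}
```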
Permalink: https://stefanxiepj.github.io/archives/fd226ea6.html
Copyright notice: this work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reposting!