Apache Atlas Documentation

2021-07-28

Study notes on Atlas, a framework for metadata management and data lineage

Reference links

https://blog.csdn.net/Zsigner/article/details/115306506

https://atlas.apache.org/#/HighAvailability

https://blog.csdn.net/Milkcoffeezhu/article/details/107049699

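Data Warehouse Metadata Management

In the narrow sense, metadata is data that describes data. In the broad sense, apart from the business data that business logic directly reads, writes, and processes, all other information needed to keep the whole system running can be called metadata: the schema of database tables, the lineage of tasks, the mapping between users and script/task permissions, and so on.

The purpose of managing metadata is to let users work with data more efficiently and to let platform administrators maintain and manage data more effectively.

In practice, this metadata is usually scattered across the platform's systems and workflows, and each subsystem may already manage part of it through its own tools, solutions, or process logic.

A key function of a metadata management platform is collecting information; which information to collect depends on business requirements and on the problems to be solved.

The platform also has to present the metadata in a suitable form and, beyond that, expose it as a service to upstream and downstream systems, so that it actually helps the big data platform close the loop on data quality management.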

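There is no absolute standard for what to collect, but for a big data development platform the common kinds of metadata include:

Table schema information
Storage footprint, read/write records, permission ownership, and other statistics about the data
Data lineage information
Business attributes of the data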

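Data Lineage

Lineage describes the upstream and downstream relationships between data: where data comes from and where it goes. When a dataset has a problem, you can follow the lineage upstream to find the step where it went wrong. Lineage between datasets also lets you derive dependencies between the tasks that produce them, which helps the scheduler plan work and helps determine which downstream data a failed or faulty task may affect.

Analyzing lineage looks simple but is hard in practice: data comes from many sources, the processing methods and compute frameworks differ, and not every system can expose the needed information. The achievable granularity also varies by system: some support table-level lineage, others go down to the column level.

Take Hive tables as an example: analyzing the execution plan of a Hive script can locate column-level lineage fairly precisely. For data produced by a MapReduce job, by contrast, from the outside you may only be able to roughly infer directory-level reads and writes from the job's log output, and derive lineage dependencies indirectly from that.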
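To make the "trace upstream" idea concrete, here is a minimal sketch of a table-level lineage graph with a breadth-first upstream lookup. The class `LineageSketch`, its methods, and the table names are all hypothetical and exist only for illustration; this is not how Atlas stores or queries lineage.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy table-level lineage graph; all names here are hypothetical. */
final class LineageSketch {
    // Maps each dataset to the datasets it is directly derived from.
    private final Map<String, Set<String>> upstream = new LinkedHashMap<>();

    /** Records that `target` is produced from `source`. */
    void addEdge(String source, String target) {
        upstream.computeIfAbsent(target, k -> new LinkedHashSet<>()).add(source);
    }

    /** Returns every direct or indirect upstream dataset of `dataset` (breadth-first). */
    Set<String> upstreamOf(String dataset) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(dataset));
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String parent : upstream.getOrDefault(current, Set.of())) {
                if (seen.add(parent)) {
                    queue.add(parent);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageSketch g = new LineageSketch();
        g.addEdge("ods.orders", "dwd.orders_clean");
        g.addEdge("dwd.orders_clean", "ads.daily_revenue");
        g.addEdge("dim.currency", "ads.daily_revenue");
        // Lists the datasets to inspect when ads.daily_revenue looks wrong.
        System.out.println(g.upstreamOf("ads.daily_revenue"));
    }
}
```

Reversing the edge direction in the same structure would give the downstream impact of a failed task.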

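Business Attributes of Data

What counts as business attribute information? Examples include the statistical definition of a table, what the table is for, how each column is calculated, business descriptions, business tags, the change history of the script logic and the reasons for each change, who develops the table, and which business department owns the data. Business attribute information exists first of all to serve the business, so its collection and presentation should blend into the business environment as much as possible; only then is this part of the metadata really useful.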

What is Atlas

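For a long time there was no mature metadata management solution for big data. In 2015 Hortonworks brought together a group of partner companies to start a data governance initiative, and Atlas emerged from it, covering data classification, a centralized policy engine, data lineage, security, and lifecycle management. (A similar product is WhereHows, which LinkedIn open-sourced in 2016.)

Atlas is a metadata framework for the Hadoop platform.

Atlas is a set of extensible core foundational governance services that lets enterprises effectively and efficiently meet compliance requirements within Hadoop and integrate with the whole enterprise data ecosystem.

Apache Atlas gives organizations open metadata management and governance capabilities to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around them for IT teams and data analysis teams.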

Atlas High Level Architecture - Overview

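Atlas consists of three groups of core components: metadata collection, storage, and query/presentation. In addition, an admin backend manages configuration such as the overall collection workflow, metadata format definitions, and service deployment.

Atlas includes the following components:

Core. The functional core of Atlas, which provides metadata ingest and export (Ingest/Export), the type system (Type System), and metadata storage, indexing, and query.
Graph Engine. Atlas uses a graph model internally to persist the metadata objects it manages. This gives great flexibility and handles the rich relationships between metadata objects efficiently. The graph engine translates between the types and entities of the Atlas type system and the underlying graph persistence model. Besides managing graph objects, it also creates appropriate indexes for metadata objects so that they can be searched efficiently. Before Atlas 1.0 the graph store was Titan; from 1.0 onward it is JanusGraph. Under JanusGraph there are two stores:
Metadata Store. HBase stores the metadata managed by Atlas.
Index Store. Solr stores the metadata indexes for efficient search.
Integration. The module through which Atlas integrates with the outside world; external components hand their metadata to Atlas through it.
Metadata source. The metadata sources Atlas supports, provided as plugins. Atlas currently supports extracting and managing metadata from:
Hive
HBase
Sqoop
Kafka
Storm
Applications. Upper-layer applications of Atlas, which can be used to query the metadata types and objects managed by Atlas.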

Type System

Type

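Atlas lets users define a model for the metadata objects they want to manage. The model is made up of definitions called types. Instances of types, called entities, represent the actual metadata objects being managed. The Type System is the component that lets users define and manage types and entities. All metadata objects managed by Atlas out of the box (for example, Hive tables) are modeled with types and represented as entities. To store new kinds of metadata in Atlas, you need to understand the concepts of the type system.

Each type has a metatype. Atlas has the following metatypes:
- Primitive metatypes: boolean, byte, short, int, long, float, double, biginteger, bigdecimal, string, date
- Enum metatypes
- Collection metatypes: array, map
- Composite metatypes: Entity, Struct, Classification, Relationship
For example:

// A type in Atlas is uniquely identified by its Name
// Entity and Classification types can inherit from other types, called supertypes, and then include the attributes defined in the supertype. This lets modelers define common attributes across a set of related types, much as object-oriented languages define a superclass for a class
// Types with the metatype Entity, Struct, Classification, or Relationship can have a collection of attributes. Each attribute has a name (for example, name)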
Name:         hive_table
TypeCategory: Entity
SuperTypes:   DataSet
Attributes:
    name:             string
    db:               hive_db
    owner:            string
    createTime:       date
    lastAccessTime:   date
    comment:          string
    retention:        int
    sd:               hive_storagedesc
    partitionKeys:    array<hive_column>
    aliases:          array<string>
    columns:          array<hive_column>
    parameters:       map<string,string>
    viewOriginalText: string
    viewExpandedText: string
    tableType:        string
    temporary:        boolean
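
The supertype rule above (a type includes the attributes of its supertypes) can be sketched as a tiny registry that resolves inherited attributes. `TypeSketch` and `TypeDef` are hypothetical stand-ins written for this note, not the Atlas type system API, and the attribute sets in the usage example are deliberately reduced.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a type registry; not the Atlas API. */
final class TypeSketch {
    /** A type has a unique name, optional supertypes, and its own attributes (name -> type name). */
    static final class TypeDef {
        final String name;
        final List<String> superTypes;
        final Map<String, String> attributes;

        TypeDef(String name, List<String> superTypes, Map<String, String> attributes) {
            this.name = name;
            this.superTypes = superTypes;
            this.attributes = attributes;
        }
    }

    private final Map<String, TypeDef> registry = new LinkedHashMap<>();

    /** Registers a type; the name must be unique across the registry. */
    void register(TypeDef type) {
        if (registry.putIfAbsent(type.name, type) != null) {
            throw new IllegalArgumentException("type name already registered: " + type.name);
        }
    }

    /** Collects inherited attributes first, then the type's own attributes. */
    Map<String, String> allAttributes(String typeName) {
        TypeDef type = registry.get(typeName);
        if (type == null) {
            throw new IllegalArgumentException("unknown type: " + typeName);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String superType : type.superTypes) {
            result.putAll(allAttributes(superType));
        }
        result.putAll(type.attributes);
        return result;
    }

    public static void main(String[] args) {
        TypeSketch types = new TypeSketch();
        // Reduced, illustrative attribute sets; the real definitions carry more attributes.
        types.register(new TypeDef("DataSet", List.of(), Map.of("name", "string")));
        types.register(new TypeDef("hive_table", List.of("DataSet"), Map.of("db", "hive_db")));
        System.out.println(types.allAttributes("hive_table").keySet());
    }
}
```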

Entities

An entity in Atlas is a specific value or instance of a type, and so represents a specific metadata object in the real world. In the object-oriented analogy, an entity is an object (instance) of a class.

One example of an entity is a Hive table. Suppose Hive has a table named 'customers' in the 'default' database. In Atlas this table is an entity of type hive_table. Because it is an instance of an entity type, it has a value for every attribute that is part of the hive_table type, for example:

// Every instance of an entity type is identified by a unique identifier, the GUID. The Atlas server generates the GUID when the object is defined, and it stays the same for the entity's whole lifetime. The entity can be accessed by its GUID at any point in time
// In this example, the 'customers' table is of type 'hive_table'
// Instances of entity types have an identity (a GUID value) and can be referenced from other entities (for example, a hive_table entity references a hive_db entity)
guid:     "9ba387dd-fa76-429c-b791-ffc338d3c91f"
typeName: "hive_table"
status:   "ACTIVE"
values:
    name:             "customers"
    db:               { "guid": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc", "typeName": "hive_db" }
    owner:            "admin"
    createTime:       1490761686029
    updateTime:       1516298102877
    comment:          null
    retention:        0
    sd:               { "guid": "ff58025f-6854-4195-9f75-3a3058dd8dcf", "typeName": "hive_storagedesc" }
    partitionKeys:    null
    aliases:          null
    columns:          [ { "guid": "65e2204f-6a23-4130-934a-9679af6a211f", "typeName": "hive_column" }, { "guid": "d726de70-faca-46fb-9c99-cf04f6b579a6", "typeName": "hive_column" }, ...]
    parameters:       { "transient_lastDdlTime": "1466403208"}
    viewOriginalText: null
    viewExpandedText: null
    tableType:        "MANAGED_TABLE"
    temporary:        false

Attributes

An attribute has the following properties:

    name:        string,
    typeName:    string,
    isOptional:  boolean,
    isIndexable: boolean,
    isUnique:    boolean,
    cardinality: enum

Glossary

A Glossary provides appropriate vocabularies for business users and it allows the terms (words) to be related to each other and categorized so that they can be understood in different contexts. These terms can be then mapped to assets like a Database, tables, columns etc. This helps abstract the technical jargon associated with the repositories and allows the user to discover/work with data in the vocabulary that is more familiar to them.


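1. Features

Define a rich vocabulary of terms in natural language (technical and/or business terms).
Relate terms to each other semantically.
Map assets to glossary terms.
Group terms into categories, which gives the terms more context.
Arrange categories in a hierarchy to express broader and narrower scope.
Manage glossary terms independently of the metadata.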

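2. Term

Terms are very useful to an enterprise. For a term to be useful and meaningful, it needs to be grouped around its use and context. A term in Apache Atlas must have a unique qualifiedName. Terms may share the same name, but only if they belong to different glossaries; two terms in the same glossary cannot have the same name.

A term name may contain spaces, underscores, and dashes (as natural ways of writing words) but not "." or "@", because the qualifiedName has the form <term name>@<glossary qualifiedName>. The qualified name makes it easier to refer to a specific term.

A term belongs to exactly one glossary and shares its lifecycle: deleting the glossary deletes its terms. A term can belong to zero or more categories, which scopes it to a narrower or broader context.

A term can be assigned (linked) to one or more entities in Apache Atlas. A term can itself be tagged with classifications (which work like labels), and the same classifications are then applied to the entities the term is assigned to.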

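3. Category

A category is a way of organizing terms so that their context is enriched.

A category may or may not have a hierarchy, that is, child categories. A category's qualifiedName is derived from its hierarchical position in the glossary, for example <category name>.<parent category qualifiedName>. The qualified name is updated whenever the hierarchy changes, for example when a parent category is added, removed, or changed.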
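
The naming rules for terms and categories can be sketched as two small helpers. This is only an illustration of the formats described in this section, assuming that the part after "@" in a term's qualifiedName is the qualifiedName of its glossary; `GlossaryNames` and its methods are hypothetical and are not part of the Atlas API.

```java
/** Illustrative helpers for the glossary naming rules; not the Atlas implementation. */
final class GlossaryNames {
    private GlossaryNames() {}

    /** Term: name + "@" + glossary qualifiedName (assumed); the name may not contain '.' or '@'. */
    static String termQualifiedName(String termName, String glossaryQualifiedName) {
        if (termName.contains(".") || termName.contains("@")) {
            throw new IllegalArgumentException("term name must not contain '.' or '@': " + termName);
        }
        return termName + "@" + glossaryQualifiedName;
    }

    /** Category: name + "." + parent category qualifiedName. */
    static String categoryQualifiedName(String categoryName, String parentQualifiedName) {
        return categoryName + "." + parentQualifiedName;
    }

    public static void main(String[] args) {
        // Spaces, underscores, and dashes are allowed in the term name.
        System.out.println(termQualifiedName("net income", "finance"));
        System.out.println(categoryQualifiedName("retail", "customer"));
    }
}
```

Because a category's qualified name embeds its parent's, moving a category means recomputing the names of everything below it, which matches the note that the name is updated on every hierarchy change.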

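4. Atlas Web UI

The Apache Atlas UI provides a friendly interface for the glossary features, including:

Create glossaries, terms, and categories
Create relationships between terms: synonyms, antonyms, seeAlso
Adjust the hierarchy of categories
Assign terms to entities
Search for entities by associated term

All glossary-related UI is under the GLOSSARY tab.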

Classification Propagation

Classification propagation enables classifications associated with an entity to be automatically associated with other entities related to it. This is very useful in scenarios where a dataset derives its data from other datasets, such as a table loaded with data from a file or a report generated from a table or view.
For example, when a table is classified as PII, tables or views that derive data from this table (via CTAS or ‘create view’ operation) will be automatically classified as PII.

Consider the following lineage where data from a ‘hdfs_path’ entity is loaded into a table, which is further made available through views. We will go through various scenarios to understand the classification propagation feature.


Add classification to an entity

When the classification ‘PII’ is added to the ‘hdfs_path’ entity, it is propagated to all impacted entities in the lineage path, including the ‘employees’ table and the views ‘us_employees’ and ‘uk_employees’.

Update classification associated with an entity

Any updates to classifications associated with an entity will be seen in all entities the classification is propagated to as well.

Remove classification associated with an entity

When a classification is deleted from an entity, the classification will be removed from all entities the classification is propagated to as well.

Add lineage between entities

When lineage is added between entities, for example to capture loading of data in a file to a table, the classifications associated with the source entity are propagated to all impacted entities as well. For example, when a view is created from a table, classifications associated with the table are propagated to the newly created view as well.

Delete an entity

Case 1: When an entity is deleted, classifications associated with this entity will be removed from all entities the classifications are propagated to. For example, when the ‘employees’ table is deleted, classifications associated with this table are removed from the ‘employees_view’ view.

Case 2: When an entity is deleted in the middle of a lineage path, the propagation link is broken and previously propagated classifications will be removed from all derived entities of the deleted entity. For example, when the ‘us_employees’ table is deleted, classifications propagating through this table (PII) are removed from the ‘ca_employees’ table, since the only path of propagation is broken by the entity deletion.

Case 3: When an entity is deleted in the middle of a lineage path and an alternate path for propagation exists, previously propagated classifications will be retained. For example, when the ‘us_employees’ table is deleted, classifications propagating (PII) through this table are retained in the ‘ca_employees’ table, since there are two propagation paths available and only one of them is broken by the entity deletion.
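
The deletion cases can be reasoned about with a small model: an entity's effective classifications are its own plus those of every upstream entity that is still reachable over a path containing no deleted entity. The sketch below recomputes this on demand. `PropagationSketch` is a hypothetical illustration of the rule, not the algorithm Atlas uses internally.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of classification propagation along lineage; not the Atlas implementation. */
final class PropagationSketch {
    private final Map<String, Set<String>> parents = new HashMap<>(); // entity -> direct upstream entities
    private final Map<String, Set<String>> direct = new HashMap<>();  // entity -> classifications set on it
    private final Set<String> deleted = new HashSet<>();

    void addLineage(String source, String target) {
        parents.computeIfAbsent(target, k -> new HashSet<>()).add(source);
    }

    void classify(String entity, String classification) {
        direct.computeIfAbsent(entity, k -> new HashSet<>()).add(classification);
    }

    void delete(String entity) {
        deleted.add(entity);
    }

    /** Own classifications plus those reaching `entity` over paths with no deleted entity. */
    Set<String> effective(String entity) {
        Set<String> result = new TreeSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(entity);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            // A deleted entity neither contributes classifications nor passes them on.
            if (!visited.add(current) || deleted.contains(current)) {
                continue;
            }
            result.addAll(direct.getOrDefault(current, Set.of()));
            for (String parent : parents.getOrDefault(current, Set.of())) {
                stack.push(parent);
            }
        }
        return result;
    }
}
```

With a single path hdfs_path → employees → us_employees → ca_employees, deleting ‘us_employees’ empties the result for ‘ca_employees’ (Case 2); adding a second path into ‘ca_employees’ keeps PII in place (Case 3).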

Notifications from Apache Atlas


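1. Notifications sent by Atlas

Apache Atlas sends notifications about metadata changes to a Kafka topic named ATLAS_ENTITIES. Applications interested in metadata changes can monitor these notifications. For example, Apache Ranger processes them to authorize data access based on classifications.

1.1 Notifications - V2: Apache Atlas version 1.0

Apache Atlas 1.0 sends notifications for the following operations on metadata.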

ENTITY_CREATE:         sent when an entity instance is created
ENTITY_UPDATE:         sent when an entity instance is updated
ENTITY_DELETE:         sent when an entity instance is deleted
CLASSIFICATION_ADD:    sent when classifications are added to an entity instance
CLASSIFICATION_UPDATE: sent when classifications of an entity instance are updated
CLASSIFICATION_DELETE: sent when classifications are removed from an entity instance

A notification contains the following data:

AtlasEntity               entity;
OperationType             operationType;
List<AtlasClassification> classifications;
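
A consumer of these notifications typically branches on the operation type. The sketch below mirrors the three fields above with simplified, hypothetical types (`Notification` holds a GUID string and classification names instead of `AtlasEntity` and `AtlasClassification`) and shows how a downstream application might pick out the classification changes it cares about. It does not read from Kafka and is not the Atlas client API.

```java
import java.util.List;

/** Simplified, hypothetical view of an entity notification and a dispatch on its operation type. */
final class EntityNotificationSketch {
    enum OperationType {
        ENTITY_CREATE, ENTITY_UPDATE, ENTITY_DELETE,
        CLASSIFICATION_ADD, CLASSIFICATION_UPDATE, CLASSIFICATION_DELETE
    }

    /** Flattened stand-in for the entity, operationType, and classifications fields. */
    static final class Notification {
        final String entityGuid;
        final OperationType operationType;
        final List<String> classifications;

        Notification(String entityGuid, OperationType operationType, List<String> classifications) {
            this.entityGuid = entityGuid;
            this.operationType = operationType;
            this.classifications = classifications;
        }
    }

    /** True for the three CLASSIFICATION_* operations, false for the ENTITY_* operations. */
    static boolean isClassificationChange(Notification n) {
        switch (n.operationType) {
            case CLASSIFICATION_ADD:
            case CLASSIFICATION_UPDATE:
            case CLASSIFICATION_DELETE:
                return true;
            default:
                return false;
        }
    }
}
```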

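2. Notifications sent to Atlas

Apache Atlas can be notified of metadata and lineage changes by sending notifications to the Kafka topic ATLAS_HOOK. The Atlas hooks for Apache Hive, Apache HBase, Apache Storm, and Apache Sqoop use this mechanism to notify Apache Atlas of events of interest.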

ENTITY_CREATE            : create an entity. For more details, refer to Java class HookNotificationV1.EntityCreateRequest
ENTITY_FULL_UPDATE       : update an entity. For more details, refer to Java class HookNotificationV1.EntityUpdateRequest
ENTITY_PARTIAL_UPDATE    : update specific attributes of an entity. For more details, refer to HookNotificationV1.EntityPartialUpdateRequest
ENTITY_DELETE            : delete an entity. For more details, refer to Java class HookNotificationV1.EntityDeleteRequest
ENTITY_CREATE_V2         : create an entity. For more details, refer to Java class HookNotification.EntityCreateRequestV2
ENTITY_FULL_UPDATE_V2    : update an entity. For more details, refer to Java class HookNotification.EntityUpdateRequestV2
ENTITY_PARTIAL_UPDATE_V2 : update specific attributes of an entity. For more details, refer to HookNotification.EntityPartialUpdateRequestV2
ENTITY_DELETE_V2         : delete one or more entities. For more details, refer to Java class HookNotification.EntityDeleteRequestV2

Choice of installation components

apache-atlas-1.2.0-sources.tar.gz
solr-5.5.1.tgz
hbase-1.1.2.tar.gz

Build Atlas from source locally, then install and start it.

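Importing Hive lineage

The official site documents the procedure, but in practice it runs into some problems that need extra steps; these are not covered here for now.

A hook must be installed in Hive to capture the data. Add the following to hive-site.xml: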

<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

The Hive hook captures the following operations:

create database
create table/view, create table as select
load, import, export
DMLs (insert)
alter database
alter table
alter view

Source code analysis

Flow of HBase data changes

Loading the HBaseAtlasHook class

public class HBaseAtlasCoprocessor implements MasterObserver, RegionObserver, RegionServerObserver, BulkLoadObserver {
    public static final Log LOG = LogFactory.getLog(HBaseAtlasCoprocessor.class);

    private static final String ATLAS_PLUGIN_TYPE               = "hbase";
    private static final String ATLAS_HBASE_HOOK_IMPL_CLASSNAME = "org.apache.atlas.hbase.hook.HBaseAtlasCoprocessor";

    private AtlasPluginClassLoader atlasPluginClassLoader = null;
    private Object                 impl                     = null;
    // Implementing these four interfaces makes it possible to capture every data change in HBase
    private MasterObserver         implMasterObserver       = null;
    private RegionObserver         implRegionObserver       = null;
    private RegionServerObserver   implRegionServerObserver = null;
    private BulkLoadObserver       implBulkLoadObserver     = null;

    // Load HBaseAtlasHook when HBaseAtlasCoprocessor is constructed
    public HBaseAtlasCoprocessor() {
        if(LOG.isDebugEnabled()) {
            LOG.debug("==> HBaseAtlasCoprocessor.HBaseAtlasCoprocessor()");
        }

        // On initialization, HBaseAtlasCoprocessor loads HBaseAtlasHook into the memory of the HBase cluster
        this.init();

        if(LOG.isDebugEnabled()) {
            LOG.debug("<== HBaseAtlasCoprocessor.HBaseAtlasCoprocessor()");
        }
    }

    private void init(){
        if(LOG.isDebugEnabled()) {
            LOG.debug("==> HBaseAtlasCoprocessor.init()");
        }

        try {
            // Get the class loader for HBaseAtlasHook
            atlasPluginClassLoader = AtlasPluginClassLoader.getInstance(ATLAS_PLUGIN_TYPE, this.getClass());

            @SuppressWarnings("unchecked")
            Class<?> cls = Class.forName(ATLAS_HBASE_HOOK_IMPL_CLASSNAME, true, atlasPluginClassLoader);

Capturing HBase data changes

public class HBaseAtlasCoprocessor extends HBaseAtlasCoprocessorBase {
    private static final Logger LOG = LoggerFactory.getLogger(HBaseAtlasCoprocessor.class);

    // The hook instance
    final HBaseAtlasHook hbaseAtlasHook;

    public HBaseAtlasCoprocessor() {
        hbaseAtlasHook = HBaseAtlasHook.getInstance();
    }

    @Override
    public void postCreateTable(ObserverContext<MasterCoprocessorEnvironment> observerContext, HTableDescriptor hTableDescriptor, HRegionInfo[] hRegionInfos) throws IOException {
        if (LOG.isDebugEnabled()) {
            LOG.debug("==> HBaseAtlasCoprocessoror.postCreateTable()");
        }
        hbaseAtlasHook.sendHBaseTableOperation(hTableDescriptor, null, HBaseAtlasHook.OPERATION.CREATE_TABLE);
        if (LOG.isDebugEnabled()) {
            LOG.debug("<== HBaseAtlasCoprocessoror.postCreateTable()");
        }
    }

HBaseAtlasHook source

// This will register Hbase entities into Atlas
public class HBaseAtlasHook extends AtlasHook {
  ......
    /**
     * Uses the Atlas notification framework to send a message to the Atlas message server.
     * @param hTableDescriptor          HBase table descriptor
     * @param tableName                 table name
     * @param operation                 operation performed on the table
     */
    public void sendHBaseTableOperation(final HTableDescriptor hTableDescriptor, final TableName tableName, final OPERATION operation) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("==> HBaseAtlasHook.sendHBaseTableOperation()");
        }

        try {
            // Wrap the HBase operation in Atlas's message context.
            HBaseOperationContext hbaseOperationContext = handleHBaseTableOperation(hTableDescriptor, tableName, operation);

            // Send the HBase operation context built above to the Kafka server used by Atlas
            sendNotification(hbaseOperationContext);
        } catch (Throwable t) {
            LOG.error("<== HBaseAtlasHook.sendHBaseTableOperation(): failed to send notification", t);
        }

        if (LOG.isDebugEnabled()) {
            LOG.debug("<== HBaseAtlasHook.sendHBaseTableOperation()");
        }
    }
  ......
    private void sendNotification(HBaseOperationContext hbaseOperationContext) {
        UserGroupInformation ugi = hbaseOperationContext.getUgi();

        if (ugi != null && ugi.getRealUser() != null) {
            ugi = ugi.getRealUser();
        }

        // Ultimately the create-table message is delivered as List<HookNotification> messages; every message ends up wrapped as a HookNotification
        notifyEntities(hbaseOperationContext.getMessages(), ugi);
    }

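AtlasHook source

/**
     * Sends messages to the Kafka server.
     * @param messages                  messages to send
     * @param maxRetries                maximum number of retries after a failure
     * @param ugi                       UGI used for user authentication
     * @param notificationInterface     Atlas notification framework used to send the messages
     * @param shouldLogFailedMessages   whether to log failed messages
     * @param logger                    logger for failed messages
     */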
    @VisibleForTesting
    static void notifyEntitiesInternal(List<HookNotification> messages, int maxRetries, UserGroupInformation ugi,
                                       NotificationInterface notificationInterface,
                                       boolean shouldLogFailedMessages, FailedMessagesLogger logger) {
        if (messages == null || messages.isEmpty()) {
            return;
        }

        final int maxAttempts         = maxRetries < 1 ? 1 : maxRetries;
        Exception notificationFailure = null;

        for (int numAttempt = 1; numAttempt <= maxAttempts; numAttempt++) {
            if (numAttempt > 1) { // retry attempt
                try {
                    LOG.debug("Sleeping for {} ms before retry", notificationRetryInterval);

                    Thread.sleep(notificationRetryInterval);
                } catch (InterruptedException ie) {
                    LOG.error("Notification hook thread sleep interrupted");

                    break;
                }
            }

            try {
                if (ugi == null) {
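
The excerpt above is cut off, but the visible part follows a common bounded-retry pattern: at least one attempt, a fixed sleep before each retry, and an early exit when the thread is interrupted. The sketch below reproduces only that pattern with the standard library. `RetrySketch` and `callWithRetry` are hypothetical names, and what the excerpt does after the truncation point is not reconstructed here.

```java
import java.util.function.Supplier;

/** Generic bounded retry with a fixed pause; an illustration, not the Atlas code. */
final class RetrySketch {
    private RetrySketch() {}

    /**
     * Runs `action` up to max(1, maxRetries) times, sleeping `retryIntervalMs` before each retry.
     * Returns the first successful result or rethrows the last failure.
     */
    static <T> T callWithRetry(Supplier<T> action, int maxRetries, long retryIntervalMs) {
        final int maxAttempts = Math.max(1, maxRetries);
        RuntimeException lastFailure = new IllegalStateException("no attempt was made");

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attempt > 1) { // retry attempt: wait first
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag for callers
                    break;
                }
            }
            try {
                return action.get();
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

A call such as `RetrySketch.callWithRetry(() -> send(message), 3, 1000L)` would try a hypothetical `send` up to three times, one second apart.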

Notification source

/**
 * Kafka specific access point to the Atlas notification framework.
 */
@Component
@Order(3)
public class KafkaNotification extends AbstractNotification implements Service {
    public static final Logger LOG = LoggerFactory.getLogger(KafkaNotification.class);

    public    static final String PROPERTY_PREFIX            = "atlas.kafka";
    public    static final String ATLAS_HOOK_TOPIC           = AtlasConfiguration.NOTIFICATION_HOOK_TOPIC_NAME.getString();
    public    static final String ATLAS_ENTITIES_TOPIC       = AtlasConfiguration.NOTIFICATION_ENTITIES_TOPIC_NAME.getString();
    protected static final String CONSUMER_GROUP_ID_PROPERTY = "group.id";

    private static final String DEFAULT_CONSUMER_CLOSED_ERROR_MESSAGE = "This consumer has already been closed.";

    private static final Map<NotificationType, String> TOPIC_MAP = new HashMap<NotificationType, String>() {
        {
            put(NotificationType.HOOK, ATLAS_HOOK_TOPIC);
            put(NotificationType.ENTITIES, ATLAS_ENTITIES_TOPIC);
        }
    };

    private final Properties    properties;
    private final Long          pollTimeOutMs;
    private       KafkaConsumer consumer;
    private       KafkaProducer producer;
    private       String        consumerClosedErrorMsg;

    // ----- Constructors ----------------------------------------------------

    /**
     * Construct a KafkaNotification.
     *
     * @param applicationProperties  the application properties used to configure Kafka
     *
     * @throws AtlasException if the notification interface can not be created
     */
    @Inject
    public KafkaNotification(Configuration applicationProperties) throws AtlasException {
        super(applicationProperties);

        LOG.info("==> KafkaNotification()");

        Configuration kafkaConf = ApplicationProperties.getSubsetConfiguration(applicationProperties, PROPERTY_PREFIX);

        properties             = ConfigurationConverter.getProperties(kafkaConf);
        pollTimeOutMs          = kafkaConf.getLong("poll.timeout.ms", 1000);
        consumerClosedErrorMsg = kafkaConf.getString("error.message.consumer_closed", DEFAULT_CONSUMER_CLOSED_ERROR_MESSAGE);

        //Override default configs
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        boolean oldApiCommitEnableFlag = kafkaConf.getBoolean("auto.commit.enable", false);

        //set old autocommit value if new autoCommit property is not set.
        properties.put("enable.auto.commit", kafkaConf.getBoolean("enable.auto.commit", oldApiCommitEnableFlag));
        properties.put("session.timeout.ms", kafkaConf.getString("session.timeout.ms", "30000"));

        // if no value is specified for max.poll.records, set to 1
        properties.put("max.poll.records", kafkaConf.getInt("max.poll.records", 1));

        LOG.info("<== KafkaNotification()");
    }
  ......
    // ----- AbstractNotification --------------------------------------------
    @Override
    public void sendInternal(NotificationType type, List<String> messages) throws NotificationException {
        // Check whether the producer for the message server has been created
        if (producer == null) {
            // Create the producer for the message server
            createProducer();
        }

        // Send the messages
        sendInternalToProducer(producer, type, messages);
    }

    // Send the messages to the Kafka server
    @VisibleForTesting
    void sendInternalToProducer(Producer p, NotificationType type, List<String> messages) throws NotificationException {
        // For messages coming from a hook the type is HOOK, so they go to the ATLAS_HOOK topic
        String               topic           = TOPIC_MAP.get(type);
        List<MessageContext> messageContexts = new ArrayList<>();

        // Asynchronously send every message to the selected topic on the message server
        for (String message : messages) {
            ProducerRecord record = new ProducerRecord(topic, message);

            if (LOG.isDebugEnabled()) {
                LOG.debug("Sending message for topic {}: {}", topic, message);
            }

            Future future = p.send(record);

            messageContexts.add(new MessageContext(future, message));
        }
      ......

Consuming Kafka messages

start() launches a background job that consumes the data sent by the hooks.

/**
 * Consumer of notifications from hooks e.g., hive hook etc.
 */
@Component
@Order(4)
@DependsOn(value = {"atlasTypeDefStoreInitializer", "atlasTypeDefGraphStoreV2"})
public class NotificationHookConsumer implements Service, ActiveStateChangeHandler {
  ......
    @Override
    public void start() throws AtlasException {
        // Check the configuration to see whether the message consumer is enabled
        if (consumerDisabled) {
            LOG.info("Hook consumer stopped. No hook messages will be processed. " +
                    "Set property '{}' to false to start consuming hook messages.", CONSUMER_DISABLED);
            return;
        }

        startInternal(applicationProperties, null);
    }

    void startInternal(Configuration configuration, ExecutorService executorService) {
        if (consumers == null) {
            consumers = new ArrayList<>();
        }
        if (executorService != null) {
            executors = executorService;
        }
        // Read the configuration to check whether HA is enabled; when HA is disabled, start the consumers inline
        if (!HAConfiguration.isHAEnabled(configuration)) {
            LOG.info("HA is disabled, starting consumers inline.");

            startConsumers(executorService);
        }
    }

    /**
     * Starts background threads that consume Kafka messages.
     * @param executorService executor used to submit the consumer tasks asynchronously
     */
    private void startConsumers(ExecutorService executorService) {
        // Number of threads configured for consuming Kafka messages; defaults to 1 when not configured
        int                                          numThreads            = applicationProperties.getInt(CONSUMER_THREADS_PROPERTY, 1);
        // Create that many consumers
        List<NotificationConsumer<HookNotification>> notificationConsumers = notificationInterface.createConsumers(NotificationType.HOOK, numThreads);

        if (executorService == null) {
            executorService = Executors.newFixedThreadPool(notificationConsumers.size(), new ThreadFactoryBuilder().setNameFormat(THREADNAME_PREFIX + " thread-%d").build());
        }

        executors = executorService;

        for (final NotificationConsumer<HookNotification> consumer : notificationConsumers) {
            // Create the Kafka message consumer
            HookConsumer hookConsumer = new HookConsumer(consumer);

            consumers.add(hookConsumer);
            // Start a thread to begin consuming messages
            executors.submit(hookConsumer);
        }
    }
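
The shape of startConsumers (one worker per configured thread, all submitted to a fixed-size pool) can be tried out with the standard library alone. In this sketch each worker just bumps a counter where a real hook consumer would poll Kafka; `ConsumerPoolSketch` and `runWorkers` are hypothetical names, and the shutdown handling is an addition for the demo rather than something shown in the excerpt.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** One worker per configured thread on a fixed pool; an illustration only. */
final class ConsumerPoolSketch {
    private ConsumerPoolSketch() {}

    /** Submits `numThreads` workers, waits for them to finish, and returns how many ran. */
    static int runWorkers(int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        AtomicInteger started = new AtomicInteger();

        for (int i = 0; i < numThreads; i++) {
            // A real consumer would loop over incoming messages here.
            pool.submit(() -> {
                started.incrementAndGet();
            });
        }

        pool.shutdown(); // accept no new tasks; already submitted ones still run
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return started.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(3));
    }
}
```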
  ......

Message handling logic

@Override
        public void doWork() {
            LOG.info("==> HookConsumer doWork()");
            // Mark the consumer as runnable.
            shouldRun.set(true);
            // Check whether the Atlas service is healthy; while it has not started properly, the current thread sleeps, 1000 ms at a time
            if (!serverAvailable(new NotificationHookConsumer.Timer())) {
                return;
            }

            try {
                // Keep running as long as the runnable flag is set
                while (shouldRun.get()) {
                    try {
                        // Fetch data from Kafka. The consumer was created for ATLAS_HOOK, so it consumes messages from the ATLAS_HOOK topic.
                        List<AtlasKafkaMessage<HookNotification>> messages = consumer.receive();
                        // Iterate over each message fetched from Kafka
                        for (AtlasKafkaMessage<HookNotification> msg : messages) {
                            // Process each message fetched from Kafka and commit it
                            handleMessage(msg);
                        }
                    } catch (IllegalStateException ex) {
                        adaptiveWaiter.pause(ex);
                    } catch (Exception e) {
                        if (shouldRun.get()) {
                            LOG.warn("Exception in NotificationHookConsumer", e);

                            adaptiveWaiter.pause(e);
                        } else {
                            break;
                        }
                    }
                }
            } finally {
                if (consumer != null) {
                    LOG.info("closing NotificationConsumer");

                    consumer.close();
                }

                LOG.info("<== HookConsumer doWork()");
            }
        }
......
  @VisibleForTesting
        void handleMessage(AtlasKafkaMessage<HookNotification> kafkaMsg) throws AtlasServiceException, AtlasException {
            AtlasPerfTracer  perf        = null;
            HookNotification message     = kafkaMsg.getMessage();
            String           messageUser = message.getUser();
            long             startTime   = System.currentTimeMillis();
            boolean          isFailedMsg = false;
            AuditLog         auditLog = null;
            // Set up performance logging
            if (AtlasPerfTracer.isPerfTraceEnabled(PERF_LOG)) {
                perf = AtlasPerfTracer.getPerfTracer(PERF_LOG, message.getType().name());
            }

            try {
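                // If this message has already been consumed, commit it and move on to the next one
                if(failedCommitOffsetRecorder.isMessageReplayed(kafkaMsg.getOffset())) {
                    commit(kafkaMsg);
                    return;
                }
                // Preprocess the message fetched from Kafka, including building its processing context
                PreprocessorContext context = preProcessNotificationMessage(kafkaMsg);
                // If the message is empty, commit it and move on to the next one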
                if (isEmptyMessage(kafkaMsg)) {
                    commit(kafkaMsg);
                    return;
                }
              ......
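
The two early exits in handleMessage (a replayed message and an empty message are committed without being applied) can be sketched as follows. How `failedCommitOffsetRecorder` tracks offsets is not shown in the excerpt, so the set of "applied but not committed" offsets below is an assumption made for illustration, and `HookMessageSketch` is a hypothetical class rather than Atlas code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Skip-and-commit handling for replayed or empty messages; an illustration only. */
final class HookMessageSketch {
    // Assumption: offsets whose message was applied earlier but whose commit did not go through.
    private final Set<Long> offsetsWithFailedCommit = new HashSet<>();
    final List<String> applied = new ArrayList<>();
    long lastCommittedOffset = -1L;

    /** Remembers an offset that was processed but not committed. */
    void recordFailedCommit(long offset) {
        offsetsWithFailedCommit.add(offset);
    }

    /** Applies the payload at most once and always commits the offset. */
    void handle(long offset, String payload) {
        if (offsetsWithFailedCommit.remove(offset)) {
            commit(offset); // replayed: already applied, only the commit is redone
            return;
        }
        if (payload == null || payload.isEmpty()) {
            commit(offset); // nothing to apply
            return;
        }
        applied.add(payload);
        commit(offset);
    }

    private void commit(long offset) {
        lastCommittedOffset = offset;
    }
}
```

Committing in every branch keeps the consumer from receiving the same skipped message again on the next poll.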

Persisting the message

/**
         * Creates or updates entities in Atlas.
         * @param entities           the entities to create or update
         * @param isPartialUpdate    whether this is a partial update
         * @throws AtlasBaseException if an error occurs during processing
         */
        private void createOrUpdate(AtlasEntitiesWithExtInfo entities, boolean isPartialUpdate, PreprocessorContext context) throws AtlasBaseException {
            List<AtlasEntity> entitiesList = entities.getEntities();
            AtlasEntityStream entityStream = new AtlasEntityStream(entities);
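            // Commit in a single call when no batch size is configured or the entities fit in one batch.
            if (commitBatchSize <= 0 || entitiesList.size() <= commitBatchSize) {
                // Create the entities through the AtlasEntityStore interface; later persistence goes through the same interface.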
                EntityMutationResponse response = atlasEntityStore.createOrUpdate(entityStream, isPartialUpdate);

                recordProcessedEntities(response, context);
            } else {

Data validation and format conversion

public class EntityMutationResponse {

    private Map<EntityOperation, List<AtlasEntityHeader>> mutatedEntities;
    private Map<String, String>                           guidAssignments;
......

Persisting the data

@Component
public class EntityGraphMapper {
    private static final Logger LOG = LoggerFactory.getLogger(EntityGraphMapper.class);
  ......
    public AtlasVertex createVertex(AtlasEntity entity) {
        // Generate a globally unique GUID with java.util.UUID.
        final String guid = UUID.randomUUID().toString();
        return createVertexWithGuid(entity, guid);
    }

    /**
     * Creates a vertex from the given GUID and entity.
     * @param entity  the entity data the vertex is created from
     * @param guid    the GUID of the entity
     * @return        the newly created vertex
     */
    public AtlasVertex createVertexWithGuid(AtlasEntity entity, String guid) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("==> createVertex({})", entity.getTypeName());
        }

        AtlasEntityType entityType = typeRegistry.getEntityTypeByName(entity.getTypeName());
        // Create the graph vertex
        AtlasVertex ret = createStructVertex(entity);

        for (String superTypeName : entityType.getAllSuperTypes()) {
            AtlasGraphUtilsV2.addEncodedProperty(ret, SUPER_TYPES_PROPERTY_KEY, superTypeName);
        }

        AtlasGraphUtilsV2.setEncodedProperty(ret, GUID_PROPERTY_KEY, guid);
        AtlasGraphUtilsV2.setEncodedProperty(ret, VERSION_PROPERTY_KEY, getEntityVersion(entity));

        GraphTransactionInterceptor.addToVertexCache(guid, ret);

        return ret;
    }
  ......
    private AtlasVertex createStructVertex(AtlasStruct struct) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("==> createStructVertex({})", struct.getTypeName());
        }
        // Create the graph vertex
        final AtlasVertex ret = graph.addVertex();

        AtlasGraphUtilsV2.setEncodedProperty(ret, ENTITY_TYPE_PROPERTY_KEY, struct.getTypeName());
        AtlasGraphUtilsV2.setEncodedProperty(ret, STATE_PROPERTY_KEY, AtlasEntity.Status.ACTIVE.name());
        AtlasGraphUtilsV2.setEncodedProperty(ret, TIMESTAMP_PROPERTY_KEY, RequestContext.get().getRequestTime());
        AtlasGraphUtilsV2.setEncodedProperty(ret, MODIFICATION_TIMESTAMP_PROPERTY_KEY, RequestContext.get().getRequestTime());
        AtlasGraphUtilsV2.setEncodedProperty(ret, CREATED_BY_KEY, RequestContext.get().getUser());
        AtlasGraphUtilsV2.setEncodedProperty(ret, MODIFIED_BY_KEY, RequestContext.get().getUser());

        if (LOG.isDebugEnabled()) {
            LOG.debug("<== createStructVertex({})", struct.getTypeName());
        }

        return ret;
    }
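
What the two methods above write onto a new vertex can be summarized with a plain map: system properties first (type name, state, timestamps, user), then the GUID. In this sketch the property keys are made-up readable names rather than the actual Atlas constants, and a `Map` stands in for `AtlasVertex`, so read it as a picture of the data and not as graph code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Map-based picture of the properties set on a new entity vertex; keys are hypothetical. */
final class VertexSketch {
    private VertexSketch() {}

    static Map<String, Object> createVertex(String typeName, String user, long requestTime) {
        Map<String, Object> vertex = new LinkedHashMap<>();
        // Properties written when the struct vertex is created
        vertex.put("typeName", typeName);
        vertex.put("state", "ACTIVE");
        vertex.put("timestamp", requestTime);
        vertex.put("modificationTimestamp", requestTime);
        vertex.put("createdBy", user);
        vertex.put("modifiedBy", user);
        // Identity assigned afterwards: a random UUID rendered as a string
        vertex.put("guid", UUID.randomUUID().toString());
        return vertex;
    }

    public static void main(String[] args) {
        System.out.println(createVertex("hive_table", "admin", System.currentTimeMillis()));
    }
}
```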
  ......

Hive Hooks

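For data governance and metadata management there are many open-source systems in the industry, Apache Atlas among them, that can meet metadata management needs in complex scenarios. For Hive metadata, Apache Atlas relies on Hive hooks, which require the following configuration: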

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

通过Hook监听Hive的各种事件，比如创建表，修改表等，然后按照特定的格式把收集的数据推送到Kafka，最后消费元数据并存储。

Hooks 是一种事件和消息机制，可以将事件绑定在内部 Hive 的执行流程中，而无需重新编译 Hive。Hook 提供了扩展和继承外部组件的方式。根据不同的 Hook 类型，可以在不同的阶段运行。

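Pre-execution hooks are called before the execution engine runs the query, after Hive has optimized the query plan.
Post-execution hooks are called after the plan finishes executing and before the result is returned to the user.
Failure-execution hooks are called after execution of the plan fails.
Pre-driver-run and post-driver-run hooks run at the start and end of the driver's run of a query.
Pre-semantic-analyzer and post-semantic-analyzer hooks are called before and after Hive performs semantic analysis on the query statement.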
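
To make the ordering of these stages concrete, here is a toy driver that records which hooks would fire for a successful and a failed query. It uses no Hive classes; `HookLifecycleSketch`, `Stage`, and `runQuery` are made up for illustration. The early return on failure is an assumption about driver behaviour and may differ between Hive versions. In real Hive, the stages are wired through properties such as `hive.exec.pre.hooks`, `hive.exec.post.hooks`, `hive.exec.failure.hooks`, `hive.exec.driver.run.hooks`, and `hive.semantic.analyzer.hook`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model of the hook stages; not Hive code. All names are hypothetical.
class HookLifecycleSketch {

    enum Stage {
        PRE_DRIVER_RUN, PRE_SEMANTIC_ANALYZE, POST_SEMANTIC_ANALYZE,
        PRE_EXEC, POST_EXEC, ON_FAILURE, POST_DRIVER_RUN
    }

    // Runs a "query" and returns the stages whose hooks fired, in order.
    // The supplier returns true on success; false or an exception means failure.
    static List<Stage> runQuery(Supplier<Boolean> execution) {
        List<Stage> fired = new ArrayList<>();
        fired.add(Stage.PRE_DRIVER_RUN);
        fired.add(Stage.PRE_SEMANTIC_ANALYZE);
        fired.add(Stage.POST_SEMANTIC_ANALYZE);
        fired.add(Stage.PRE_EXEC); // the plan has already been optimized here

        boolean succeeded;
        try {
            succeeded = execution.get();
        } catch (RuntimeException e) {
            succeeded = false;
        }

        if (!succeeded) {
            // Assumption: the driver returns early after the failure hooks.
            fired.add(Stage.ON_FAILURE);
            return fired;
        }

        fired.add(Stage.POST_EXEC); // before results go back to the user
        fired.add(Stage.POST_DRIVER_RUN);
        return fired;
    }

    public static void main(String[] args) {
        System.out.println("success: " + runQuery(() -> true));
        System.out.println("failure: " + runQuery(() -> {
            throw new IllegalStateException("task failed");
        }));
    }
}
```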

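As an example of Hive hooks, the following uses hive.exec.post.hooks; this hook runs after the query executes and before the result is returned.

A Hive hook is a callback that Hive invokes during the execution of an HQL statement, at stages such as those listed above.

The implementation is as follows: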

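import java.util.HashSet;
import java.util.Set;

// Jackson 2; if only Jackson 1 is on the classpath, use org.codehaus.jackson.map.ObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.hive.metastore.api.Database;
import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.hooks.Entity;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext.HookType;
import org.apache.hadoop.hive.ql.hooks.ReadEntity;
import org.apache.hadoop.hive.ql.hooks.WriteEntity;
import org.apache.hadoop.hive.ql.plan.HiveOperation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CustomPostHook implements ExecuteWithHookContext {
    private static final Logger LOGGER = LoggerFactory.getLogger(CustomPostHook.class);
    // Names of the Hive SQL operations to monitor
    private static final HashSet<String> OPERATION_NAMES = new HashSet<>();

    // HiveOperation is an enum that covers Hive's SQL operation types.
    // Register the operations this hook monitors.
    static {
        // Create table
        OPERATION_NAMES.add(HiveOperation.CREATETABLE.getOperationName());
        // Alter database properties
        OPERATION_NAMES.add(HiveOperation.ALTERDATABASE.getOperationName());
        // Alter database owner
        OPERATION_NAMES.add(HiveOperation.ALTERDATABASE_OWNER.getOperationName());
        // Alter table: add columns
        OPERATION_NAMES.add(HiveOperation.ALTERTABLE_ADDCOLS.getOperationName());
        // Alter table: storage location
        OPERATION_NAMES.add(HiveOperation.ALTERTABLE_LOCATION.getOperationName());
        // Alter table properties
        OPERATION_NAMES.add(HiveOperation.ALTERTABLE_PROPERTIES.getOperationName());
        // Rename table
        OPERATION_NAMES.add(HiveOperation.ALTERTABLE_RENAME.getOperationName());
        // Rename column
        OPERATION_NAMES.add(HiveOperation.ALTERTABLE_RENAMECOL.getOperationName());
        // Replace columns: drop the current columns, then add the new ones
        OPERATION_NAMES.add(HiveOperation.ALTERTABLE_REPLACECOLS.getOperationName());
        // Create database
        OPERATION_NAMES.add(HiveOperation.CREATEDATABASE.getOperationName());
        // Drop database
        OPERATION_NAMES.add(HiveOperation.DROPDATABASE.getOperationName());
        // Drop table
        OPERATION_NAMES.add(HiveOperation.DROPTABLE.getOperationName());
    }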

    @Override
    public void run(HookContext hookContext) throws Exception {
        assert (hookContext.getHookType() == HookType.POST_EXEC_HOOK);
        // Query plan
        QueryPlan plan = hookContext.getQueryPlan();
        // Operation name
        String operationName = plan.getOperationName();
        logWithHeader("Executed SQL statement: " + plan.getQueryString());
        logWithHeader("Operation name: " + operationName);
        if (OPERATION_NAMES.contains(operationName) && !plan.isExplain()) {
            logWithHeader("Monitored SQL operation");

            Set<ReadEntity> inputs = hookContext.getInputs();
            Set<WriteEntity> outputs = hookContext.getOutputs();

            for (Entity entity : inputs) {
                logWithHeader("Hook metadata input: " + toJson(entity));
            }

            for (Entity entity : outputs) {
                logWithHeader("Hook metadata output: " + toJson(entity));
            }

        } else {
            logWithHeader("Operation is not monitored; ignoring this hook invocation.");
        }

    }

    private static String toJson(Entity entity) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Entity types include:
        // DATABASE, TABLE, PARTITION, DUMMYPARTITION, DFS_DIR, LOCAL_DIR, FUNCTION
        switch (entity.getType()) {
            case DATABASE:
                Database db = entity.getDatabase();
                return mapper.writeValueAsString(db);
            case TABLE:
                return mapper.writeValueAsString(entity.getTable().getTTable());
            default:
                // Other entity types are not serialized in this example
                return null;
        }
    }

    /**
     * Logs a message prefixed with the hook name and the current thread.
     *
     * @param obj the message to log
     */
    private void logWithHeader(Object obj) {
        LOGGER.info("[CustomPostHook][Thread: " + Thread.currentThread().getName() + "] | " + obj);
    }
    
}

demo

public class hook_test implements ExecuteWithHookContext {
    public void run(HookContext hookContext) throws Exception {
        System.out.println("A pre hook");
    }
}

Make the jar take effect temporarily (for the current session)

add jar /opt/lagou/servers/hive-2.3.7/lib/hive_hooks-1.0-SNAPSHOT.jar;
set hive.exec.pre.hooks=com.xiaoyuyu.hook_test;

Output after running a SQL statement

show tables;
A pre hook
OK
tab_name
t1
t2
......

Concrete examples in Atlas

BaseHiveEvent

protected String getQualifiedName(List<AtlasEntity> inputs, List<AtlasEntity> outputs) throws Exception {
        HiveOperation operation = context.getHiveOperation();

        if (operation == HiveOperation.CREATETABLE ||
            operation == HiveOperation.CREATETABLE_AS_SELECT ||
            operation == HiveOperation.CREATEVIEW ||
            operation == HiveOperation.ALTERVIEW_AS ||
            operation == HiveOperation.ALTERTABLE_LOCATION) {
            List<? extends Entity> sortedEntities = new ArrayList<>(getHiveContext().getOutputs());

            Collections.sort(sortedEntities, entityComparator);

            for (Entity entity : sortedEntities) {
                if (entity.getType() == Entity.Type.TABLE) {
                    Table table = entity.getTable();

                    table = getHive().getTable(table.getDbName(), table.getTableName());

                    long createTime = getTableCreateTime(table);

                    return getQualifiedName(table) + QNAME_SEP_PROCESS + createTime;
                }
            }
        }

        StringBuilder sb = new StringBuilder(getHiveContext().getOperationName());

        boolean ignoreHDFSPaths = ignoreHDFSPathsinProcessQualifiedName();

        addToProcessQualifiedName(sb, getHiveContext().getInputs(), ignoreHDFSPaths);
        sb.append("->");
        addToProcessQualifiedName(sb, getHiveContext().getOutputs(), ignoreHDFSPaths);

        return sb.toString();
    }
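
The two branches of `getQualifiedName` above can be illustrated without Hive or Atlas on the classpath. The sketch below is a simplification under stated assumptions: `:` as the value of `QNAME_SEP_PROCESS`, the `db.table@cluster` shape of a table's qualified name, and the way inputs and outputs are joined are all assumptions made for this example, and `ProcessNameSketch`, `TableRef`, and `processQualifiedName` are hypothetical names. The real `addToProcessQualifiedName` appends more detail than shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration of the naming logic; not the Atlas implementation.
class ProcessNameSketch {

    // Assumed separator; stands in for QNAME_SEP_PROCESS.
    static final char SEP = ':';

    // Operations whose process name is derived from the output table alone.
    static final Set<String> TABLE_DEFINING_OPS = new HashSet<>(Arrays.asList(
            "CREATETABLE", "CREATETABLE_AS_SELECT", "CREATEVIEW",
            "ALTERVIEW_AS", "ALTERTABLE_LOCATION"));

    // Hypothetical stand-in for a Hive table entity.
    static final class TableRef {
        final String qualifiedName;   // e.g. "default.t1@primary" (assumed format)
        final long createTimeMillis;

        TableRef(String qualifiedName, long createTimeMillis) {
            this.qualifiedName = qualifiedName;
            this.createTimeMillis = createTimeMillis;
        }
    }

    static String processQualifiedName(String operation,
                                       List<TableRef> inputs,
                                       List<TableRef> outputs) {
        if (TABLE_DEFINING_OPS.contains(operation) && !outputs.isEmpty()) {
            // Sort so the same statement always picks the same output table.
            TableRef first = sortedByName(outputs).get(0);
            return first.qualifiedName + SEP + first.createTimeMillis;
        }

        // Fallback: operation name, then inputs, "->", then outputs.
        StringBuilder sb = new StringBuilder(operation);
        appendAll(sb, inputs);
        sb.append("->");
        appendAll(sb, outputs);
        return sb.toString();
    }

    private static List<TableRef> sortedByName(List<TableRef> tables) {
        List<TableRef> copy = new ArrayList<>(tables);
        copy.sort(Comparator.comparing((TableRef t) -> t.qualifiedName));
        return copy;
    }

    private static void appendAll(StringBuilder sb, List<TableRef> tables) {
        for (TableRef t : sortedByName(tables)) {
            sb.append(SEP).append(t.qualifiedName);
        }
    }

    public static void main(String[] args) {
        List<TableRef> src = Arrays.asList(new TableRef("default.src@primary", 1000L));
        List<TableRef> dst = Arrays.asList(new TableRef("default.dst@primary", 2000L));
        System.out.println(processQualifiedName("CREATETABLE_AS_SELECT", src, dst));
        System.out.println(processQualifiedName("QUERY", src, dst));
    }
}
```

Including the creation time in the first branch means that dropping and recreating a table with the same name yields a different process name, so the two are not confused with each other.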

HiveHook

public HiveHook() {
    }

    @Override
    public void run(HookContext hookContext) throws Exception {
        if (LOG.isDebugEnabled()) {
            LOG.debug("==> HiveHook.run({})", hookContext.getOperationName());
        }

        if (knownObjects != null && knownObjects.isCacheExpired()) {
            LOG.info("HiveHook.run(): purging cached databaseNames ({}) and tableNames ({})", knownObjects.getCachedDbCount(), knownObjects.getCachedTableCount());

            knownObjects = new HiveHookObjectNamesCache(nameCacheDatabaseMaxCount, nameCacheTableMaxCount, nameCacheRebuildIntervalSeconds);
        }

        try {
            HiveOperation        oper    = OPERATION_MAP.get(hookContext.getOperationName());
            AtlasHiveHookContext context = new AtlasHiveHookContext(this, oper, hookContext, knownObjects);

            BaseHiveEvent event = null;

            switch (oper) {
                case CREATEDATABASE:
                    event = new CreateDatabase(context);
                break;

                case DROPDATABASE:
                    event = new DropDatabase(context);
                break;

                case ALTERDATABASE:
                case ALTERDATABASE_OWNER:
                    event = new AlterDatabase(context);
                break;

                case CREATETABLE:
                    event = new CreateTable(context, true);
                break;

                case DROPTABLE:
                case DROPVIEW:
                    event = new DropTable(context);
                break;

                case CREATETABLE_AS_SELECT:
                case CREATEVIEW:
                case ALTERVIEW_AS:
                case LOAD:
                case EXPORT:
                case IMPORT:
                case QUERY:
                case TRUNCATETABLE:
                    event = new CreateHiveProcess(context);
                break;

                case ALTERTABLE_FILEFORMAT:
                case ALTERTABLE_CLUSTER_SORT:
                case ALTERTABLE_BUCKETNUM:
                case ALTERTABLE_PROPERTIES:
                case ALTERVIEW_PROPERTIES:
                case ALTERTABLE_SERDEPROPERTIES:
                case ALTERTABLE_SERIALIZER:
                case ALTERTABLE_ADDCOLS:
                case ALTERTABLE_REPLACECOLS:
                case ALTERTABLE_PARTCOLTYPE:
                case ALTERTABLE_LOCATION:
                    event = new AlterTable(context);
                break;

                case ALTERTABLE_RENAME:
                case ALTERVIEW_RENAME:
                    event = new AlterTableRename(context);
                break;

                case ALTERTABLE_RENAMECOL:
                    event = new AlterTableRenameCol(context);
                break;

                default:
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("HiveHook.run({}): operation ignored", hookContext.getOperationName());
                    }
                break;
            }

            if (event != null) {
                final UserGroupInformation ugi = hookContext.getUgi() == null ? Utils.getUGI() : hookContext.getUgi();

                super.notifyEntities(event.getNotificationMessages(), ugi);
            }
        } catch (Throwable t) {
            LOG.error("HiveHook.run(): failed to process operation {}", hookContext.getOperationName(), t);
        }

        if (LOG.isDebugEnabled()) {
            LOG.debug("<== HiveHook.run({})", hookContext.getOperationName());
        }
    }
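
The `switch` in `HiveHook.run` above is essentially a dispatch table from operation to event type, with unrecognized operations ignored. A compact, self-contained way to picture that mapping is sketched below. It is not Atlas code: only a subset of the operations from the listing is registered, the enum constant names are used as lookup keys for simplicity, and `HookDispatchSketch`, `EventKind`, and `eventFor` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the operation-to-event dispatch; names are hypothetical, not Atlas classes.
class HookDispatchSketch {

    enum EventKind {
        CREATE_DATABASE, DROP_DATABASE, ALTER_DATABASE, CREATE_TABLE, DROP_TABLE,
        CREATE_PROCESS, ALTER_TABLE, RENAME_TABLE, RENAME_COLUMN
    }

    private static final Map<String, EventKind> DISPATCH = new HashMap<>();

    static {
        register(EventKind.CREATE_DATABASE, "CREATEDATABASE");
        register(EventKind.DROP_DATABASE, "DROPDATABASE");
        register(EventKind.ALTER_DATABASE, "ALTERDATABASE", "ALTERDATABASE_OWNER");
        register(EventKind.CREATE_TABLE, "CREATETABLE");
        register(EventKind.DROP_TABLE, "DROPTABLE", "DROPVIEW");
        register(EventKind.CREATE_PROCESS, "CREATETABLE_AS_SELECT", "CREATEVIEW",
                "ALTERVIEW_AS", "LOAD", "EXPORT", "IMPORT", "QUERY", "TRUNCATETABLE");
        register(EventKind.ALTER_TABLE, "ALTERTABLE_PROPERTIES", "ALTERTABLE_ADDCOLS",
                "ALTERTABLE_REPLACECOLS", "ALTERTABLE_LOCATION");
        register(EventKind.RENAME_TABLE, "ALTERTABLE_RENAME", "ALTERVIEW_RENAME");
        register(EventKind.RENAME_COLUMN, "ALTERTABLE_RENAMECOL");
    }

    // Maps several operations to the same event kind, like grouped case labels.
    private static void register(EventKind kind, String... operations) {
        for (String op : operations) {
            DISPATCH.put(op, kind);
        }
    }

    // Empty result corresponds to the default branch: the operation is ignored.
    static Optional<EventKind> eventFor(String operationName) {
        return Optional.ofNullable(DISPATCH.get(operationName));
    }

    public static void main(String[] args) {
        System.out.println(eventFor("QUERY"));
        System.out.println(eventFor("SHOWTABLES"));
    }
}
```

Returning an `Optional` also sidesteps a subtlety of the original listing: switching on a `null` operation throws, which the surrounding `catch (Throwable t)` then logs.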

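Author: xiaoyuyu
Link: http://woaixiaoyuyu.github.io/2021/07/28/Apache%20Atlas%20%E6%96%87%E6%A1%A3/
Copyright notice: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reposting.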