dw-metadata

name: dw-metadata
description: Data warehouse metadata queries, including table-level lineage, column information, and SQL code retrieval

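Data Warehouse Metadata Query Skill

This skill wraps data warehouse metadata queries. It connects directly to the Hologres metadata tables and provides table-level lineage, column information, and related lookups.

When to trigger

Use this skill when the user's request involves any of the following scenarios: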

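User intent	Trigger keywords	Capability
Find a table's upstream dependencies	"upstream tables", "which tables does it depend on", "data source", "from tables"	`get_table_upstream()`
Find a table's downstream consumers	"downstream tables", "who uses it", "impact scope", "to tables"	`get_table_downstream()`
View a table's lineage graph	"lineage diagram", "dependency chain", "lineage graph", "data flow"	`get_table_lineage_recursive()`
Get a table's SQL code	"table SQL", "CREATE TABLE statement", "task code", "ETL logic"	`get_table_sql()`
Query a table's column information	"field list", "column information", "column info", "column comments"	`fetch_table_metadata()`
Analyze column lineage	"column-level lineage", "column lineage", "column origin", "column transformation logic"	SQL parsing helpers
Get table comments / DDL	"table comment", "DDL", "table description", "getddl"	`DDLFetcher.get_ddl()`
Describe a table in business terms	"what is this table", "what is this table used for", "business name"	`TableDescriptionGenerator`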

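Typical example questions

✅ Should trigger this skill:
- "Which upstream tables does dwd_wdt_trd_ord_wide_cutout_fd depend on?"
- "What are the downstream tables of this table? How wide is the impact?"
- "Help me see how the pay_status column is derived"
- "Get the SQL code for table xxx"
- "Which tables does this column pass through from ODS to ADS?"

❌ No need to trigger:
- Pure code changes or bug fixes
- Front-end UI questions
- Queries unrelated to data warehouse metadata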

Data sources

1. Hologres connection (lineage, SQL code)

Host: redacted.internal.example
Port: 80
Database: solution

Credentials are stored in the .env file at the project root:

PG_HOST, PG_PORT, PG_DATABASE, PG_USER, PG_PASSWORD
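
As a minimal sketch of how these five variables might be turned into connection settings, the helper below reads them from a mapping and fails early when one is missing. The function name `load_pg_settings` is hypothetical, and the sketch assumes the .env file has already been loaded into the process environment by whatever loader the project uses. The returned keys follow the standard libpq keyword names.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("PG_HOST", "PG_PORT", "PG_DATABASE", "PG_USER", "PG_PASSWORD")


def load_pg_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the PG_* settings and raise if any of them is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {
        "host": env["PG_HOST"],
        "port": int(env["PG_PORT"]),  # stored as text in .env, so convert
        "dbname": env["PG_DATABASE"],
        "user": env["PG_USER"],
        "password": env["PG_PASSWORD"],
    }
```

Failing on a missing key at startup tends to give a clearer error than a connection failure later on.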

Core metadata tables

Table	Description	Key columns
`solution.dim_hive_table_rela_fd`	Table-level lineage relationships	`from_table`, `to_table`, `to_task_code_content`
`solution.dim_hive_column_fd`	Column metadata	`table_name`, `column_name`, `column_index`, `column_comment`
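
To illustrate how these two tables could be queried directly, here is a hedged sketch that builds parameterized SQL using only the key columns listed above. The function names are hypothetical, and the `%s` placeholder style assumes a psycopg-like PostgreSQL driver; the skill's own queries may differ.

```python
from typing import Tuple


def build_upstream_query(table_name: str) -> Tuple[str, tuple]:
    """SQL for the direct upstream tables of `table_name`."""
    sql = (
        "SELECT DISTINCT from_table "
        "FROM solution.dim_hive_table_rela_fd "
        "WHERE to_table = %s"
    )
    return sql, (table_name,)


def build_columns_query(table_name: str) -> Tuple[str, tuple]:
    """SQL for the columns of `table_name`, in their declared order."""
    sql = (
        "SELECT column_name, column_comment "
        "FROM solution.dim_hive_column_fd "
        "WHERE table_name = %s "
        "ORDER BY column_index"
    )
    return sql, (table_name,)
```

Passing the table name as a bound parameter rather than formatting it into the string avoids SQL injection through user-supplied table names.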

2. MCP service (table/column comments, lineage queries)

URL: https://redacted.internal.example/mcp
Protocol: MCP Streamable HTTP (2024-11-05)

Config file: mcp_config.json

{
  "mcpServers": {
    "dw-metadata": {
      "url": "https://redacted.internal.example/mcp"
    }
  }
}
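
A small sketch of reading the server URL out of a config with this shape. The helper name `read_mcp_url` is an assumption made for illustration; how the project's own client loads the file is not shown in this document.

```python
import json


def read_mcp_url(config_text: str, server: str = "dw-metadata") -> str:
    """Return the URL configured for `server` in an mcp_config.json document."""
    config = json.loads(config_text)
    return config["mcpServers"][server]["url"]


# Typical use (assumes mcp_config.json is in the working directory):
#   from pathlib import Path
#   url = read_mcp_url(Path("mcp_config.json").read_text(encoding="utf-8"))
```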

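Available tools (13)

Tool	Description
`get_table_code`	Get the task code of a warehouse table; supports exact matching
`get_table_ddl`	Get a table's DDL definition; supports batch queries
`find_upstream`	Find the upstream dependencies of a table; supports recursion
`find_downstream`	Find the downstream consumers of a table; supports recursion
`find_table_in_path`	Find the lineage path from a start table to a target table
`list_consumers`	List the direct downstream consumers of a table
`list_providers`	List the direct upstream providers of a table
`find_intersection`	Find the lineage intersection of two tables
`search_code_upstream`	Search the task code of upstream tables for a pattern
`search_code_downstream`	Search the task code of downstream tables for a pattern
`find_column_upstream`	Find upstream tables that contain a given column name
`find_column_downstream`	Find downstream tables that contain a given column name
`find_column_by_comment`	Search columns in upstream and downstream tables by column-comment pattern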
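
For readers calling the service without the bundled client, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below only builds the request body; the helper name `build_tool_call` is hypothetical, and transport details such as headers and session handling are left out.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need an id to match responses


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    payload = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload, ensure_ascii=False)
```

For example, `build_tool_call("find_upstream", {"table_name": "dws_trd_book_day_fd"})` produces the body for an upstream lookup. Argument names other than `table_name` are not documented here and would need to be read from the tool schemas the server reports.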

Using the MCP client

from dw_metadata.scripts.mcp_client import MCPClient

client = MCPClient()

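# Check whether the service is reachable
if client.is_available():
    print("MCP service is available")

# List all tools
tools = client.list_tools()

# Get a table's DDL
ddl = client.get_ddl("dws_trd_book_day_fd", "ic_dwsdb")

# Call any tool by name
upstream = client.call_tool("find_upstream", {"table_name": "dws_trd_book_day_fd"})

⚠️ This service requires intranet access. If the connection fails, the skill falls back to the Hologres data source.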
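
The fallback behavior can be pictured with a small wrapper like the one below. This is a sketch under assumptions: `fetch_with_fallback` is a hypothetical helper, and it assumes connection problems surface as `OSError` subclasses, which may not match the exceptions the real client raises.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    errors: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Try `primary`; on a connection-style error, use `fallback` instead."""
    try:
        return primary()
    except errors:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        return fallback()


# Possible use with the objects shown in this document:
#   upstream = fetch_with_fallback(
#       lambda: client.call_tool("find_upstream", {"table_name": name}),
#       lambda: service.get_table_upstream(name),
#   )
```

Note that the two sources may return differently shaped results, so a caller would likely normalize them before use.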

Capabilities

1. Table-level lineage queries

Get upstream tables

from app.services.lineage_bootstrap import LineageBootstrapService
service = LineageBootstrapService()
upstream = service.get_table_upstream('dwd_wdt_trd_ord_wide_cutout_fd')
# Returns: [{"from_table": "xxx", "from_db": "xxx", "from_warehouse_level": "dim/dwd/dws/ads", ...}]

Get downstream tables

downstream = service.get_table_downstream('dwd_wdt_trd_ord_wide_cutout_fd')
# Returns: [{"to_table": "xxx", "to_db": "xxx", "to_warehouse_level": "xxx", ...}]

Recursive lineage graph

graph = service.get_table_lineage_recursive(
    table_name='dwd_wdt_trd_ord_wide_cutout_fd',
    direction='downstream',  # or 'upstream'
    max_depth=3
)
# Returns: {"start_table": "xxx", "tables": {...}, "edges": [...]}
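
To make the `direction` and `max_depth` parameters concrete, here is a minimal breadth-first sketch over a list of `(from_table, to_table)` edges. It is not the service's implementation: `walk_lineage` is a hypothetical name, and storing each table's depth as the value in `tables` is an assumption, since the real payload is elided above.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (from_table, to_table)


def walk_lineage(start_table: str, edges: Iterable[Edge],
                 direction: str = "downstream", max_depth: int = 3) -> dict:
    """Breadth-first walk of table lineage, stopping at `max_depth` hops."""
    down = direction == "downstream"
    neighbors: Dict[str, List[str]] = {}
    for src, dst in edges:
        key, value = (src, dst) if down else (dst, src)
        neighbors.setdefault(key, []).append(value)

    tables = {start_table: 0}  # table -> depth at which it was first reached
    found: List[Edge] = []
    queue = deque([start_table])
    while queue:
        current = queue.popleft()
        depth = tables[current]
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbors.get(current, []):
            edge = (current, nxt) if down else (nxt, current)
            if edge not in found:
                found.append(edge)
            if nxt not in tables:  # guards against cycles
                tables[nxt] = depth + 1
                queue.append(nxt)
    return {"start_table": start_table, "tables": tables, "edges": found}
```

The visited check matters because lineage data can contain cycles, and the depth limit keeps wide graphs from exploding.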

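2. Column metadata queries

Get a table's column list (in order)

service = LineageBootstrapService()
meta_df = service.fetch_table_metadata()
# Filter to a specific table
table_cols = meta_df[meta_df['table_name'] == 'dwd_wdt_trd_ord_wide_cutout_fd']
# Sort columns by column_index (explicit, in case the rows are not already ordered)
table_cols = table_cols.sort_values('column_index')

⚠️ Important: metadata must be sorted by column_index; otherwise column-lineage parsing will be completely wrong!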
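
One likely reason the order matters is that lineage parsing pairs the n-th expression of a SELECT with the n-th column of the target table, so any other ordering silently mismatches columns. The sketch below uses made-up rows (the column names and indexes are illustrative only) to show how alphabetical order diverges from declared order.

```python
from typing import Dict, List

# Hypothetical metadata rows, deliberately out of order.
rows = [
    {"column_name": "buyer_id", "column_index": 2},
    {"column_name": "order_id", "column_index": 0},
    {"column_name": "pay_status", "column_index": 1},
]


def ordered_columns(metadata: List[Dict]) -> List[str]:
    """Column names in declared (column_index) order."""
    return [r["column_name"] for r in sorted(metadata, key=lambda r: r["column_index"])]


declared = ordered_columns(rows)
alphabetical = sorted(r["column_name"] for r in rows)
# Declared order starts with order_id; alphabetical order starts with buyer_id,
# so a positional match against SELECT expressions would pair the wrong columns.
```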

3. SQL code retrieval

Get the SQL that builds a table

sql = service.get_table_sql('dwd_wdt_trd_ord_wide_cutout_fd')

4. DDL and comment retrieval (MCP)

Fetch a table's DDL through the MCP service, including the table comment and column comments.

Get table DDL

from app.services.mcp_ddl_fetcher import DDLFetcher

fetcher = DDLFetcher()
ddl = fetcher.get_ddl('dws_trd_book_day_fd', 'ic_dwsdb')
# Returns: CREATE TABLE ... COMMENT 'table comment' ...

Get parsed table info

table_info = fetcher.get_table_info('dws_trd_book_day_fd')
# Returns:
# {
#   "table_name": "dws_trd_book_day_fd",
#   "comment": "Book intermediate table",
#   "columns": [
#     {"column_name": "gmv", "comment": "GMV"},
#     {"column_name": "pay_date", "comment": "Date"}
#   ]
# }

5. Table business-description generation

Automatically converts a technical table name into a business-readable description.

from backend.scripts.routing.table_description_generator import TableDescriptionGenerator

generator = TableDescriptionGenerator()
desc = generator.generate_description('dws_trd_book_day_fd')

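print(desc.display_name)   # "Book intermediate table"
print(desc.data_scope)     # ["Books category only"]
print(desc.key_metrics)    # ["GMV", "order count", ...]

Data source priority:

1. Table comment from MCP getddl (most authoritative)
2. The -- 任务描述：xxx (task description) comment in the SQL code
3. The Hologres column_comment field
4. Parsing by table-naming rules (fallback)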
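
A priority order like this is commonly implemented as a first-non-empty chain. The sketch below shows that pattern together with a regular expression for the task-description comment in SQL code. Both function names are hypothetical, and the generator's real logic may differ.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches a SQL comment such as "-- 任务描述：xxx" (full-width or ASCII colon).
TASK_DESC_RE = re.compile(r"--\s*任务描述\s*[：:]\s*(.+)")


def task_description_from_sql(sql: str) -> Optional[str]:
    """Extract the task-description comment from SQL code, if present."""
    match = TASK_DESC_RE.search(sql or "")
    return match.group(1).strip() if match else None


def pick_description(
    candidates: Iterable[Tuple[str, Optional[str]]]
) -> Optional[Tuple[str, str]]:
    """Return (source, text) for the first candidate with non-blank text."""
    for source, value in candidates:
        if value and value.strip():
            return source, value.strip()
    return None
```

Callers would pass the candidates in priority order, for example the DDL table comment first and the name-based guess last, so the most authoritative source that has a value wins.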

API endpoints

Once the backend service is running, the following APIs are available:

Endpoint	Description
`GET /lineage/hologres/table/{name}/upstream`	Get upstream tables
`GET /lineage/hologres/table/{name}/downstream`	Get downstream tables
`GET /lineage/hologres/table/{name}/graph?direction=downstream&max_depth=3`	Recursive lineage graph
`GET /lineage/hologres/table/{name}/sql`	Get table SQL
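
As a sketch of how a caller might build the graph endpoint's URL safely, the helper below escapes the table name and encodes the query parameters. The function name and the base URL used in the usage note are assumptions; the backend's actual host and port are not stated in this document.

```python
from urllib.parse import quote, urlencode


def lineage_graph_url(base_url: str, table_name: str,
                      direction: str = "downstream", max_depth: int = 3) -> str:
    """Build the URL for the recursive lineage graph endpoint."""
    path = f"/lineage/hologres/table/{quote(table_name, safe='')}/graph"
    query = urlencode({"direction": direction, "max_depth": max_depth})
    return f"{base_url.rstrip('/')}{path}?{query}"
```

For instance, with a hypothetical local backend, `lineage_graph_url("http://localhost:8000", "dwd_wdt_trd_ord_wide_cutout_fd", "upstream", 2)` yields a URL that can be passed to any HTTP client.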

Command-line tool

The project also provides a command-line tool, .agent/tools/db_query.py:

# Get a table's SQL by exact name
python .agent/tools/db_query.py dwd_wdt_trd_ord_wide_cutout_fd --mode exact

# Find downstream tables
python .agent/tools/db_query.py dwd_wdt_trd_ord_wide_cutout_fd --mode downstream --depth 3

# Find upstream tables
python .agent/tools/db_query.py dwd_wdt_trd_ord_wide_cutout_fd --mode upstream --depth 3

# Search downstream code for a column
python .agent/tools/db_query.py dwd_wdt_trd_ord_wide_cutout_fd --mode find-downstream --pattern 'pay_status'

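Notes

Column order: sort by column_index, not by column_name
Table name format: pass the short table name (without the database name), e.g. dwd_wdt_trd_ord_wide_cutout_fd
Connection pooling: each query currently opens a new connection; for high-frequency use, consider switching to a connection pool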
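
To illustrate the connection-pool suggestion, here is a minimal, driver-agnostic sketch built on the standard library. `SimplePool` is a hypothetical class, `connect` stands for whatever function the project uses to open a Hologres connection, and a production setup would more likely use the pool shipped with the database driver.

```python
import queue
from contextlib import contextmanager
from typing import Callable


class SimplePool:
    """Fixed-size pool that hands out pre-opened connections."""

    def __init__(self, connect: Callable[[], object], size: int = 4):
        self._idle: "queue.Queue" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # open all connections up front

    @contextmanager
    def connection(self, timeout: float = None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool
```

This sketch does not detect broken connections or reconnect, which a real pool would need to handle.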

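Reference documents

Metadata table schema reference - detailed structure of the dim_hive_table_rela_fd and dim_hive_column_fd tables
API response examples - sample requests and responses for each endpoint
Performance recommendations - Hologres query rate limits, timeout handling, connection management
FAQ - troubleshooting column-lineage errors, authentication failures, and similar issues
Changelog - feature iterations and change history
Test cases - automated test scripts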