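Target readers: technical leads / platform architects / SREs / data engineering leads
Design orientation: platform-agnostic (K8s, VMs, serverless, databases, and gateways can all be plugged in), built around the MAPE-K closed loop, with emphasis on security, auditability, rollback, and low-friction adoption.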
App -depends_on-> Asset
BusinessService -served_by-> App
App -runs_in-> Environment
Case -targets-> {App|Asset|BusinessService}
Plan -implements-> Case
Observation -belongs_to-> {App|Asset|BusinessService}
Incident -correlates-> Observation
Experiment -scopes-> Environment
Verification -verifies-> {Plan|Experiment}
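The relations above can be read as a small schema for allowed edges. The following is a minimal sketch, assuming the relations are checked in application code before being written to the graph; `ALLOWED_EDGES` and `is_valid_edge` are hypothetical names, not part of the original design.

```python
# Allowed (source type, relation) -> target types, copied from the list above.
ALLOWED_EDGES = {
    ("App", "depends_on"): {"Asset"},
    ("BusinessService", "served_by"): {"App"},
    ("App", "runs_in"): {"Environment"},
    ("Case", "targets"): {"App", "Asset", "BusinessService"},
    ("Plan", "implements"): {"Case"},
    ("Observation", "belongs_to"): {"App", "Asset", "BusinessService"},
    ("Incident", "correlates"): {"Observation"},
    ("Experiment", "scopes"): {"Environment"},
    ("Verification", "verifies"): {"Plan", "Experiment"},
}


def is_valid_edge(src_type: str, relation: str, dst_type: str) -> bool:
    """Return True if the ontology permits src_type -relation-> dst_type."""
    return dst_type in ALLOWED_EDGES.get((src_type, relation), set())
```

Rejecting unknown edges at write time keeps k-hop queries over the graph predictable.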
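θ (analysis confidence threshold):
Combines statistical significance, graph evidence, vector similarity, and the LLM's self-reported confidence.
Below θ → Case Parked (suspended, waiting for more signals or a time window)
Above θ → proceeds to the Planner → Gatekeeper flow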
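τ (risk threshold):
Formula: τ = f(impact scope × recoverability × complexity × time window × compliance)
High risk (≥ τ) → Gatekeeper approval is mandatory
Low risk (< τ) → Planner may go straight to the Executor (progressive autonomy L2/L3)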
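The two thresholds combine into a simple routing decision. Below is a minimal sketch, assuming confidence and risk are already normalized scores; `Route` and `route_case` are hypothetical names, and treating a confidence exactly equal to θ as passing is an assumption, since the text does not specify the boundary.

```python
from enum import Enum


class Route(Enum):
    PARKED = "Parked"          # wait for more signals or a time window
    GATEKEEPER = "Gatekeeper"  # approval is mandatory
    EXECUTOR = "Executor"      # low risk, may run without approval


def route_case(confidence: float, risk_score: float, theta: float, tau: float) -> Route:
    """Route a Case using the confidence threshold theta and risk threshold tau."""
    if confidence < theta:
        return Route.PARKED
    # Assumption: confidence == theta counts as confident enough to plan.
    if risk_score >= tau:
        return Route.GATEKEEPER
    return Route.EXECUTOR
```

For example, `route_case(0.9, 0.2, theta=0.7, tau=0.35)` yields `Route.EXECUTOR`, while the same case with a risk score of 0.5 is sent to the Gatekeeper.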
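The state machine adds a Parked stage:
The Case is suspended, waiting for either:
a new observation event to fire
a time window to open (change freeze lifted)
Once a condition is met → the Case is re-enqueued to the Orchestrator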
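When an Executor step fails:
Step timeout → automatically enter the Rollback branch
Gate not met → automatic rollback + degradation fallback
(FeatureFlag, circuit breaker, bypass)
Repeated rollback failures → enter Mitigate mode
(bypass, rate limiting, read-only switch)
Mitigate fails → forced notification for human takeover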
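The escalation ladder above can be sketched as a pure function from failure event to next action. This is only an illustration: the event names, `Action`, `next_action`, and the limit of two failed rollbacks before switching to Mitigate are all assumptions, since the text says only "repeated".

```python
from enum import Enum


class Action(Enum):
    ROLLBACK = "rollback"
    ROLLBACK_WITH_FALLBACK = "rollback+fallback"  # FeatureFlag, circuit breaker, bypass
    MITIGATE = "mitigate"                         # bypass, rate limiting, read-only switch
    HUMAN_TAKEOVER = "human_takeover"


MAX_ROLLBACK_FAILURES = 2  # assumption: the source does not give a number


def next_action(event: str, rollback_failures: int = 0) -> Action:
    """Map an Executor failure event to the next step on the escalation ladder."""
    if event == "step_timeout":
        return Action.ROLLBACK
    if event == "gate_failed":
        return Action.ROLLBACK_WITH_FALLBACK
    if event == "rollback_failed":
        # Keep retrying the rollback until the failure budget is used up.
        if rollback_failures >= MAX_ROLLBACK_FAILURES:
            return Action.MITIGATE
        return Action.ROLLBACK
    if event == "mitigate_failed":
        return Action.HUMAN_TAKEOVER
    raise ValueError(f"unknown failure event: {event!r}")
```

Keeping the ladder as a pure function makes each escalation decision easy to unit-test and to record in the audit trail.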
-- Assets and applications
CREATE TABLE assets (
id BIGSERIAL PRIMARY KEY,
kind TEXT NOT NULL,
name TEXT NOT NULL,
env TEXT NOT NULL,
owner TEXT,
labels JSONB DEFAULT '{}',
spec JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
UNIQUE(kind, name, env)
);
CREATE TABLE apps (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL,
version TEXT,
build_id TEXT,
deps JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
UNIQUE(name, version)
);
CREATE TABLE business_services (
id BIGSERIAL PRIMARY KEY,
domain TEXT,
name TEXT NOT NULL,
slo JSONB,
owner TEXT,
labels JSONB DEFAULT '{}'
);
-- Case / Plan / Execution / Verification
CREATE TABLE cases (
id BIGSERIAL PRIMARY KEY,
type TEXT NOT NULL, -- deploy/change/incident/experiment
subject_type TEXT NOT NULL, -- App/Asset/BusinessService
subject_id BIGINT NOT NULL,
severity INT, -- 0..5
priority INT, -- 0..5
status TEXT NOT NULL, -- state machine state
confidence NUMERIC, -- 0..1
risk_score NUMERIC, -- 0..1
dedupe_key TEXT,
created_by TEXT,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now(),
tags TEXT[]
);
CREATE TABLE plans (
id BIGSERIAL PRIMARY KEY,
case_id BIGINT REFERENCES cases(id) ON DELETE CASCADE,
dsl JSONB NOT NULL, -- compiled AST of the Runbook/Plan DSL
status TEXT NOT NULL,
risk JSONB, -- breakdown of the risk calculation
approvals JSONB,
created_by TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE executions (
id BIGSERIAL PRIMARY KEY,
plan_id BIGINT REFERENCES plans(id) ON DELETE CASCADE,
step INT NOT NULL,
adapter TEXT NOT NULL, -- gitops/k8s/ff/gateway/dbmigrator
command JSONB NOT NULL, -- concrete operation parameters
status TEXT NOT NULL, -- pending/running/success/fail
retries INT DEFAULT 0,
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
audit JSONB -- request/response summary, fingerprint, signature
);
CREATE TABLE verifications (
id BIGSERIAL PRIMARY KEY,
case_id BIGINT REFERENCES cases(id) ON DELETE CASCADE,
gate JSONB NOT NULL, -- KPI gate definition
"window" INTERVAL NOT NULL, -- quoted because WINDOW is a reserved word in PostgreSQL
result TEXT, -- pass/fail/timeout
details JSONB,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Raw metrics table as a hypertable
CREATE TABLE http_metrics (
ts TIMESTAMPTZ NOT NULL,
app_id BIGINT NOT NULL,
route TEXT,
status_code INT,
latency_ms DOUBLE PRECISION
);
SELECT create_hypertable('http_metrics', 'ts', if_not_exists => TRUE);
-- 1-minute P95 continuous aggregate
CREATE MATERIALIZED VIEW http_p95_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
app_id,
percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95
FROM http_metrics
GROUP BY bucket, app_id;
-- Top-K (last 5 minutes)
-- A plain view: a continuous aggregate needs a time_bucket GROUP BY and cannot filter on now()
CREATE VIEW app_p95_topk_5m AS
SELECT bucket, app_id, p95,
ROW_NUMBER() OVER (PARTITION BY bucket ORDER BY p95 DESC) AS rk
FROM http_p95_1m
WHERE bucket >= now() - INTERVAL '5 minutes';
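-- Create the graph (Apache AGE must be loaded in the session first)
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;
SELECT create_graph('ops');
-- Load nodes/edges (example)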
SELECT * FROM cypher('ops', $$
CREATE (a:App {id:1, name:'payments'})-[:DEPENDS_ON]->(b:Asset {id:42, kind:'postgres'})
$$) as (v agtype);
-- k-hop dependency expansion
SELECT * FROM cypher('ops', $$
MATCH (a:App {name:'payments'})-[:DEPENDS_ON*1..3]->(n)
RETURN n
$$) AS (n agtype);
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE kb_chunks (
id BIGSERIAL PRIMARY KEY,
title TEXT,
doc_url TEXT,
text TEXT,
embedding vector(1536),
tags TEXT[],
rating INT DEFAULT 0
);
CREATE INDEX ON kb_chunks USING hnsw (embedding vector_cosine_ops);
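-- Semantic similarity search ($1 is the query embedding)
SELECT id, title, doc_url
FROM kb_chunks
ORDER BY embedding <=> $1 -- cosine distance, matching the vector_cosine_ops index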
LIMIT 10;
obs.events.{kind}.{subject}
ops.alerts.{service}.{env}
ops.cases.{id}.{state}
ops.plans.{case_id}
ops.exec.{plan_id}.{step}
ops.verifications.{case_id}
ops.knowledge.events
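A small helper can keep topic names consistent with the templates above. This is a minimal sketch: `topic` is a hypothetical helper, and the rule that each field may contain only letters, digits, `_`, and `-` is an assumption made so that a value can never add an extra dot-separated segment.

```python
import re

# Assumption: field values must not contain dots or wildcards.
_TOKEN = re.compile(r"^[A-Za-z0-9_-]+$")


def topic(template: str, **fields: object) -> str:
    """Fill a topic template such as 'ops.cases.{id}.{state}' with validated fields."""
    values = {key: str(value) for key, value in fields.items()}
    for key, value in values.items():
        if not _TOKEN.match(value):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    # str.format raises KeyError if the template needs a field that was not given.
    return template.format(**values)
```

For example, `topic("ops.cases.{id}.{state}", id="C-202508-00123", state="Analyze")` produces `ops.cases.C-202508-00123.Analyze`.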
{
"case_id": "C-202508-00123",
"type": "incident",
"subject": {"type": "App", "id": 1, "name": "payments"},
"severity": 3,
"priority": 2,
"dedupe_key": "payments-5xx-spike",
"status": "Analyze",
"evidence": {
"topk": [{"app_id":1, "p95": 1260}],
"graph": {"k_hop": [42]},
"similar": [{"kb_id": 991, "score": 0.83}]
},
"ts": "2025-08-27T07:00:00Z"
}
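Goal: a unified execution model (auditable / rollback-capable / gated / simulatable) that supports GitOps as well as direct execution (for emergencies, under control).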
apiVersion: ops/v1
kind: Runbook
metadata:
  name: rollback-payments-to-1.2.3
  labels:
    case: C-202508-00123
spec:
  strategy:
    mode: gitops # gitops | direct
    autonomy: L2 # L0 advise only, L1 approval required, L2 conditionally automatic, L3 fully automatic
  risk:
    impact: medium
    recoverability: high
    complexity: low
    timeWindow: "22:00-06:00 Asia/Tokyo"
  steps:
    - name: bump-helm-values
      adapter: gitops
      with:
        repo: https://git/repo/payments-infra
        path: charts/payments/values.yaml
        changes:
          image.tag: 1.2.3
      rollback:
        changes:
          image.tag: 1.2.2
    - name: argo-sync-wait
      adapter: argocd
      with:
        app: payments
        timeout: 600s
  gates:
    - name: http-latency-p95
      type: timescale.query
      with:
        sql: |
          SELECT avg(p95) AS p95
          FROM http_p95_1m
          WHERE app_id = ${app_id}
          AND bucket >= now() - INTERVAL '5 minutes';
      assert: "p95 < 900" # ms
  approvals:
    required: [team-payments-oncall]
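A gate's `assert` string such as `"p95 < 900"` has to be evaluated against the row returned by the gate query. The sketch below shows one cautious way to do that without `eval`; the grammar (one column name, one comparison operator, one number) and the name `check_gate` are assumptions, as the runbook does not define the assertion language.

```python
import operator
import re

_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}
# Two-character operators are listed first so "<=" is not read as "<".
_ASSERT = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_gate(assertion: str, row: dict) -> bool:
    """Evaluate an assertion like 'p95 < 900' against one query result row."""
    match = _ASSERT.match(assertion)
    if match is None:
        raise ValueError(f"unsupported assertion: {assertion!r}")
    name, op, threshold = match.groups()
    value = row.get(name)
    if value is None:
        return False  # no data in the window: fail closed
    return _OPS[op](float(value), float(threshold))
```

Failing closed on missing data means an empty metrics window blocks the change instead of silently passing the gate.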
Normalized to 0..1:
risk = w1*impact + w2*(1-recoverability) + w3*complexity + w4*time_window_penalty + w5*compliance_level
Suggested defaults: w1=0.35, w2=0.25, w3=0.15, w4=0.15, w5=0.10
Threshold τ: ≤0.35 execute automatically; (0.35, 0.6] approval required; >0.6 automatic execution strictly forbidden.
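As a worked illustration of the weighted sum and the three bands, here is a minimal sketch using the suggested default weights. The function names and the band labels are hypothetical, and how qualitative levels such as `medium` or `high` map to numbers is left out because the source does not define it.

```python
# Suggested default weights from the text (they sum to 1.0).
WEIGHTS = {
    "impact": 0.35,
    "recoverability": 0.25,
    "complexity": 0.15,
    "time_window_penalty": 0.15,
    "compliance_level": 0.10,
}


def risk_score(impact: float, recoverability: float, complexity: float,
               time_window_penalty: float, compliance_level: float) -> float:
    """Weighted risk in 0..1; every input must already be normalized to 0..1."""
    inputs = {
        "impact": impact,
        "recoverability": recoverability,
        "complexity": complexity,
        "time_window_penalty": time_window_penalty,
        "compliance_level": compliance_level,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Recoverability lowers risk, so it enters as (1 - recoverability).
    inputs["recoverability"] = 1.0 - recoverability
    return sum(WEIGHTS[name] * value for name, value in inputs.items())


def decision(score: float) -> str:
    """Map a risk score to the three bands around the thresholds 0.35 and 0.6."""
    if score <= 0.35:
        return "auto"
    if score <= 0.6:
        return "approval"
    return "forbidden"
```

With impact 0.5, recoverability 0.9, complexity 0.2 and no time-window or compliance penalty, the score is 0.35·0.5 + 0.25·0.1 + 0.15·0.2 = 0.23, which falls in the automatic band.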
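package ops.autonomy

default allow = false

allow {
    input.plan.strategy.autonomy == "L2"
    input.risk_score <= 0.35
    in_change_window
    not forbidden_window
}

# The 22:00-06:00 window wraps past midnight, so its two halves are OR-ed by defining the rule twice.
# time.clock returns [hour, minute, second]; with a bare timestamp it is evaluated in UTC.
in_change_window {
    time.clock(time.now_ns())[0] >= 22
}

in_change_window {
    time.clock(time.now_ns())[0] < 6
}

forbidden_window {
    input.subject.env == "prod"
    time.weekday(time.now_ns()) == "Friday"
}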
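A change window of 22:00 to 06:00 wraps past midnight, so "hour ≥ 22 and hour < 6" can never hold; the two halves have to be combined with OR. The sketch below mirrors the policy's intent in plain Python so the logic can be unit-tested outside the policy engine; `in_window` and `autonomy_allowed` are hypothetical names, and the caller is assumed to pass `now` in the time zone the window is defined in.

```python
from datetime import datetime, time


def in_window(now: datetime, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows that wrap past midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # Wrapping window such as 22:00-06:00: late evening OR early morning.
    return current >= start or current < end


def autonomy_allowed(autonomy: str, risk_score: float, env: str, now: datetime) -> bool:
    """Allow automatic execution only for L2 plans, low risk, and inside the window."""
    if autonomy != "L2" or risk_score > 0.35:
        return False
    if not in_window(now, time(22, 0), time(6, 0)):
        return False
    # Friday freeze for prod; Python's weekday() counts Monday as 0, so Friday is 4.
    if env == "prod" and now.weekday() == 4:
        return False
    return True
```

Note that weekday numbering differs between runtimes, so the Friday check should be verified against whichever clock function the policy engine actually provides.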
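score = α*evidence + β*success_rate + γ*(-risk)
(The historical success rate comes from knowledge-base metadata.)
Reminder: every Phase delivers usable value and does not presuppose a major overhaul, which lowers organizational resistance.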
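This design uses the PostgreSQL family of extensions as the core data foundation, plus open-source observability and GitOps, to achieve transparent and controllable autonomy along the chain evidence → policy → action.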
WITH before AS (
SELECT avg(p95) AS p95
FROM http_p95_1m
WHERE app_id=$1 AND bucket BETWEEN now()-interval '10 min' AND now()-interval '5 min'
), after AS (
SELECT avg(p95) AS p95
FROM http_p95_1m
WHERE app_id=$1 AND bucket >= now()-interval '5 min'
)
SELECT (b.p95 - a.p95) / NULLIF(b.p95, 0) AS improvement -- NULL instead of a division-by-zero error
FROM before b, after a;
-- Require improvement >= 0.1, i.e. at least a 10% improvement
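The same before/after comparison can be expressed in application code, which makes the edge cases explicit. This is a minimal sketch with hypothetical function names; treating missing data or a non-positive baseline as "gate not passed" is an assumption.

```python
from typing import Optional


def improvement(before_p95: Optional[float], after_p95: Optional[float]) -> Optional[float]:
    """Relative P95 improvement, (before - after) / before, or None if undefined."""
    if before_p95 is None or after_p95 is None or before_p95 <= 0:
        return None
    return (before_p95 - after_p95) / before_p95


def gate_passes(before_p95: Optional[float], after_p95: Optional[float],
                min_improvement: float = 0.1) -> bool:
    """Pass only when the improvement is defined and at least min_improvement."""
    value = improvement(before_p95, after_p95)
    return value is not None and value >= min_improvement
```

For example, a drop from 1000 ms to 850 ms is a 15% improvement and passes, while a drop to 950 ms is only 5% and does not.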
System: You are the OPS Planner. Output only Plan DSL. Whitelisted actions: Scale/Flag/Traffic/Helm/DBMigrate.
The input contains: TopK, Graph, SimilarCases. It must not contain any secrets or credentials.
idempotency_key = sha256(adapter + serialized_command + target + window)
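The key is only stable if the command is serialized deterministically. Below is a minimal sketch, assuming the command is a JSON-serializable dict and that `target` and `window` are plain strings; the unit-separator character between the parts is an addition not present in the formula, used here so that different field combinations cannot concatenate to the same input.

```python
import hashlib
import json


def idempotency_key(adapter: str, command: dict, target: str, window: str) -> str:
    """sha256 over adapter, canonical command JSON, target, and window."""
    # Sorted keys and fixed separators make the serialization order-independent.
    serialized = json.dumps(command, sort_keys=True, separators=(",", ":"),
                            ensure_ascii=False)
    # Assumption: join with an ASCII unit separator instead of bare concatenation.
    material = "\x1f".join([adapter, serialized, target, window])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two submissions of the same step within the same window then hash to the same key, so an executor can detect and skip the duplicate.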
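Original-work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com for removal.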