In this design, the OpenTelemetry Gateway plays the role of a regional gateway: it provides unified ingestion over OTLP (gRPC/HTTP) and, optionally, Kafka; performs label normalization, event-ID generation, rate limiting/sampling, routing, and multi-way fan-out; and uses `sending_queue` + `file_storage` to build a persistent WAL, so nothing is lost and everything is replayable during short backend outages or upgrades. Near-line search goes to OpenObserve, the source of truth lands in Kafka, and ETL jobs feed it onward into PostgreSQL/ClickHouse and other warehouses. The architecture naturally supports multi-region autonomy with unified queries from the primary region, and lays a solid data foundation for AI-driven operations and replay analysis.
Positioning (regional gateway)
- Ingestion: `receivers.otlp` (gRPC/HTTP); optionally `receivers.kafka` (`otlp_proto`).
- Reliability: enable `sending_queue` and set `storage: file_storage` so the queue persists to a local WAL; it recovers after a process crash or restart, and queue/WAL sizes stay bounded.
Responsibility layering & SLO targets
```text
[Producers/Agents/Apps]
  └─ OTLP gRPC/HTTP / Filelog/Prom→OTLP
→ [OTel Gateway (Region-A, N replicas behind LB)]
  ├─ normalize / tag / generate event_id (OTTL/attributes)
  ├─ fan out → OpenObserve.A (near-line search/alerting)
  └─ fan out → Kafka.A (*_raw source of truth/replay)
       ↘ (ETL) → PostgreSQL.A / ClickHouse (detail/aggregates/vectors)
```
Common processing chain (recommended)
Normalize the key dimensions `service.name`/`host.name`/`cluster`/`region`/`env`/`tenant` at the gateway; everything downstream keys off them. Regions stay autonomous while the primary region provides unified queries; near-line search (OpenObserve) and the source of truth (Kafka) remain separate concerns.
About Loki ingestion
OpenObserve offers write compatibility with the Loki Push API (`/loki/api/v1/push`); querying still goes through O2's native API or the Grafana O2 plugin, and is not equivalent to the Loki query API.
The goal is to translate queries into O2 SQL / native search and return Loki-API-compatible JSON, without touching the existing dashboards (Grafana with a Loki data source + LogQL). Endpoints to cover: `/loki/api/v1/query`, `/query_range`, `/labels`, and so on.
Basic mapping rules
{app="web", env=~"prod|staging"}
→ O2 标签列等值/正则过滤;time range / order / limit
;|= / |~
→ O2 全文/正则;| json
/ | logfmt
→ O2 动态字段解析;count_over_time()/rate_over_time()
+ sum by(...)
→ O2 分桶聚合 + 维度分组。
status/data/resultType/result
形态(兼容 Grafana)。Grafana Labs
Note: this is a query facade, decoupled from collection/forwarding; do not implement it inside otelcol (otelcol exposes no user-facing query API). If a log→trace cross-signal converter is ever needed, build it as a custom otelcol connector. A minimal facade sketch follows.
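To make the mapping rules concrete, here is a minimal, hedged sketch of the facade in Go: it handles only a flat label selector on one endpoint, builds the would-be SQL, and returns the Loki-compatible envelope. The stream table `logs_norm`, the column names, the `REGEXP_MATCH` predicate, and the omitted O2 search call are all illustrative assumptions, not O2's actual API.

```go
// Hedged sketch of the Loki query facade: one handler for /loki/api/v1/query_range
// that turns a flat LogQL selector into a SQL statement and wraps rows in a
// Loki-compatible response envelope.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"regexp"
	"strings"
)

// Matches simple selector pairs such as app="web" or env=~"prod|staging".
var labelRe = regexp.MustCompile(`(\w+)\s*(=~|=)\s*"([^"]*)"`)

// selectorToSQL translates a flat LogQL selector plus a time range into SQL.
// Table and column names are hypothetical placeholders.
func selectorToSQL(selector, start, end string) string {
	conds := []string{fmt.Sprintf("_timestamp >= %s AND _timestamp < %s", start, end)}
	for _, m := range labelRe.FindAllStringSubmatch(selector, -1) {
		key, op, val := m[1], m[2], m[3]
		if op == "=" {
			conds = append(conds, fmt.Sprintf("%s = '%s'", key, val))
		} else {
			// Placeholder: substitute the regex predicate of the target SQL dialect.
			conds = append(conds, fmt.Sprintf("REGEXP_MATCH(%s, '%s')", key, val))
		}
	}
	return "SELECT _timestamp, body FROM logs_norm WHERE " +
		strings.Join(conds, " AND ") + " ORDER BY _timestamp DESC LIMIT 1000"
}

// queryRange implements just enough of /loki/api/v1/query_range to keep Grafana
// happy: it builds the SQL (execution against O2 is left out) and returns the
// Loki-style status/data/resultType/result envelope.
func queryRange(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	sql := selectorToSQL(q.Get("query"), q.Get("start"), q.Get("end"))
	_ = sql // real facade: POST to O2's search API and map rows into streams

	resp := map[string]any{
		"status": "success",
		"data": map[string]any{
			"resultType": "streams",
			"result":     []any{}, // fill with {stream: labels, values: [[ns-ts, line], ...]}
		},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}

func main() {
	http.HandleFunc("/loki/api/v1/query_range", queryRange)
	log.Fatal(http.ListenAndServe(":3100", nil))
}
```

In production the SQL must be built with proper escaping/parameterization; the string interpolation above only keeps the sketch short.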
Goal: aggregate DeepFlow's L4/L7 flow logs/metrics into OTLP traces (spans) that complement application-side traces (network-plane observation alongside business-plane calls). Suggested approach:
1. Ingestion: consume flow logs via `receivers.filelog` or an HTTP/OTLP conversion. Reuse any `trace_id`/`span_id` hints carried in the flow; if absent, generate them by a deterministic hash of the 5-tuple plus the start/end time window (see the sketch below).
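A hedged sketch of that deterministic ID rule, using only the standard library: IDs derive from the flow key plus the start time truncated to the aggregation window, so replays of the same flow produce the same span rather than duplicates. The `Flow` struct and its fields are illustrative, not DeepFlow's schema.

```go
// Deterministic trace/span IDs from the 5-tuple + time window, as described above.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

type Flow struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Protocol         string
	FirstSeen        time.Time
}

// deterministicIDs returns a 16-byte trace ID and an 8-byte span ID (hex), the
// sizes OTLP expects, derived from the flow key so retries and replays of the
// same flow always map to the same span.
func deterministicIDs(f Flow, window time.Duration) (traceID, spanID string) {
	bucket := f.FirstSeen.Truncate(window).Unix()
	key := fmt.Sprintf("%s:%d>%s:%d/%s@%d",
		f.SrcIP, f.SrcPort, f.DstIP, f.DstPort, f.Protocol, bucket)
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:16]), hex.EncodeToString(sum[16:24])
}

func main() {
	f := Flow{"10.0.0.1", "10.0.0.2", 40321, 8080, "tcp", time.Now()}
	tid, sid := deterministicIDs(f, time.Second) // 1 s window, as in the text
	fmt.Println(tid, sid)
}
```

Because the mapping is idempotent, re-consuming the same Kafka offsets after a failure is safe.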
2. Aggregation and span construction (see the pdata sketch after step 3):
- Aggregation key: `{src_ip, src_port, dst_ip, dst_port, protocol}` + direction + first-packet time window (e.g., a 1 s sliding window).
- L4 attributes: `net.transport`, `net.peer.ip/port`, `net.host.ip/port`, RTT, `bytes_{sent,received}`.
- HTTP: `http.request.method`, `http.response.status_code`, `url.path`, `user_agent`.
- Databases: `db.system`, `db.statement` (sanitized).
- RPC: `rpc.system`, `rpc.service`, `rpc.method`.
- Errors: set `status.code=ERROR` and attach `exception.message`.
3. Implementation form: a standalone `flow2trace` service (Go) that consumes from Kafka `*_raw` and sends the resulting OTLP traces back into otelcol (`receivers.otlp`).
Note: OTTL/processors cannot readily materialize new span objects and write them back into a traces pipeline, so the connector or external-microservice route is preferred (analogous to the official `spanmetrics` cross-signal component).
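To make step 2 concrete, here is a hedged sketch of span construction with the collector's pdata API, consuming the deterministic IDs from the previous sketch. The `Flow` type, its fields, and the attribute values are illustrative, not DeepFlow's schema.

```go
// Hedged sketch: mapping one aggregated flow onto an OTLP span with the
// collector's pdata API (go.opentelemetry.io/collector/pdata).
package flow2trace

import (
	"fmt"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

type Flow struct {
	SrcIP, DstIP         string
	SrcPort, DstPort     uint16
	Protocol             string
	FirstSeen, LastSeen  time.Time
	BytesSent, BytesRecv int64
	Failed               bool
}

// flowToTraces builds a single-span Traces payload; tid/sid come from the
// deterministic scheme sketched earlier.
func flowToTraces(f Flow, tid pcommon.TraceID, sid pcommon.SpanID) ptrace.Traces {
	td := ptrace.NewTraces()
	rs := td.ResourceSpans().AppendEmpty()
	rs.Resource().Attributes().PutStr("service.name", "flow2trace")

	sp := rs.ScopeSpans().AppendEmpty().Spans().AppendEmpty()
	sp.SetTraceID(tid)
	sp.SetSpanID(sid)
	sp.SetName(fmt.Sprintf("%s %s:%d -> %s:%d", f.Protocol, f.SrcIP, f.SrcPort, f.DstIP, f.DstPort))
	sp.SetKind(ptrace.SpanKindClient)
	sp.SetStartTimestamp(pcommon.NewTimestampFromTime(f.FirstSeen))
	sp.SetEndTimestamp(pcommon.NewTimestampFromTime(f.LastSeen))

	// L4 attributes per the list above.
	attrs := sp.Attributes()
	attrs.PutStr("net.transport", f.Protocol)
	attrs.PutStr("net.peer.ip", f.DstIP)
	attrs.PutInt("net.peer.port", int64(f.DstPort))
	attrs.PutStr("net.host.ip", f.SrcIP)
	attrs.PutInt("net.host.port", int64(f.SrcPort))
	attrs.PutInt("bytes_sent", f.BytesSent)
	attrs.PutInt("bytes_received", f.BytesRecv)

	// Error marking per the list above.
	if f.Failed {
		sp.Status().SetCode(ptrace.StatusCodeError)
		sp.Status().SetMessage("connection failed") // illustrative
	}
	return td
}
```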
A reference configuration for the Region-A gateway:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/wal
  health_check: {}
  pprof: { endpoint: 0.0.0.0:1777 }
  zpages: {}

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    send_batch_size: 10000
    timeout: 2s
  resource/env:
    attributes:
      - { key: region, action: upsert, value: region-a }
      - { key: environment, action: upsert, value: prod }
  attributes/normalize:
    actions:
      - { key: cluster, action: insert, value: default }
      - { key: tenant, action: insert, value: public }
      - { key: service.name, action: upsert, from_attribute: service }
  transform/logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(attributes["event_id"], SHA1(Concat([attributes["tenant"], resource.attributes["service.name"], body, String(TruncateTime(time, Duration("1s")))], "|")))
  tail_sampling:
    decision_wait: 5s
    num_traces: 50000
    expected_new_traces_per_sec: 10000
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: key-services
        type: string_attribute
        string_attribute: { key: service.name, values: ["api-gateway", "payment"] }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  # OpenObserve (note: for the Loki Push API, use the loki exporter/receiver; see the tips below)
  otlphttp/openobserve:
    endpoint: https://o2.region-a.example/api/default/
    headers: { Authorization: "Basic <REDACTED>" }
    retry_on_failure: { enabled: true }
    sending_queue:
      enabled: true
      num_consumers: 8
      queue_size: 20000
      storage: file_storage
  # Kafka source of truth (*_raw)
  kafka/logs:
    brokers: ["kafka-a-1:9092", "kafka-a-2:9092", "kafka-a-3:9092"]
    topic: logs_raw
    encoding: otlp_proto
    sending_queue: { enabled: true, storage: file_storage }
  kafka/metrics:
    brokers: ["kafka-a-1:9092", "kafka-a-2:9092", "kafka-a-3:9092"]
    topic: metrics_raw
    encoding: otlp_proto
    sending_queue: { enabled: true, storage: file_storage }
  kafka/traces:
    brokers: ["kafka-a-1:9092", "kafka-a-2:9092", "kafka-a-3:9092"]
    topic: traces_raw
    encoding: otlp_proto
    sending_queue: { enabled: true, storage: file_storage }

service:
  extensions: [file_storage, health_check, pprof, zpages]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource/env, attributes/normalize, transform/logs, batch]
      exporters: [otlphttp/openobserve, kafka/logs]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource/env, attributes/normalize, batch]
      exporters: [otlphttp/openobserve, kafka/metrics]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource/env, attributes/normalize, tail_sampling, batch]
      exporters: [otlphttp/openobserve, kafka/traces]
```
Tips (Loki Push and operations)
- Loki Push: when writing to OpenObserve over the Loki Push API, point the loki exporter directly at `/loki/api/v1/push`; avoid the OTLP HTTP exporter appending `/v1/logs` to the path and returning 404 (a previously reported issue). See the exporter sketch after this list.
- Kafka topic naming: `<region>.<tenant>.<kind>_<form>.v<schema>`, e.g. `a.public.logs_raw.v1`, `a.public.traces_raw.v1`.
- Partition key: `tenant + service.name + yyyyMMddHH` (controls hotspots while preserving ordering).
- Retention: keep `*_raw` for 7–30 days (hot/cold tiering); keep `*_norm` longer; enable compaction for audit-class topics.
- Every record carries `tenant`, which drives quotas, routing, and access control.
- All exporters use the `file_storage`-backed persistent queue.
- Rolling upgrades: set `maxUnavailable=1`; during canary, watch `exporter_send_failed_total`, queue depth, and WAL usage.
- Day-to-day monitoring: `receiver_accepted_*`, `exporter_queue_size`, CPU/memory.
- Known pitfall: `headers_setter` combined with the persistent queue can affect header passthrough (e.g. `X-Scope-OrgID`); track collector versions and the fixed issues.
- Backend outage: watch `exporter_send_failed_total` / `sending_queue_size` / WAL usage; size the WAL against the expected recovery window; if necessary, downsample temporarily while prioritizing `*_norm` and critical traces.
- Loki facade MVP: `|=` / `|~` + `| json/logfmt` + `count_over_time/rate_over_time` + `sum by()`; implement `/query` and `/query_range` first.
- Next step: `flow2trace` (Kafka *_raw → OTLP spans), or …
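A minimal sketch of that loki exporter, assuming the collector-contrib loki exporter is available in the version you run; the endpoint path and credentials are placeholders to adapt to your O2 deployment:

```yaml
exporters:
  loki/openobserve:
    # Target the Loki push path itself; otlphttp would append /v1/logs to its
    # endpoint, which is the 404 pitfall noted above.
    endpoint: https://o2.region-a.example/api/default/loki/api/v1/push  # placeholder path
    headers: { Authorization: "Basic <REDACTED>" }
    sending_queue: { enabled: true, storage: file_storage }
```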
Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, contact cloudcommunity@tencent.com for removal.