zeno's blog

Observability (1): The Three Pillars - Logs, Traces, Metrics

Series: Observability


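1. What each pillar solves

                      Something is wrong
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
         Metrics            Traces             Logs
    "there is a problem"  "where is it slow"  "why did it fail"

    Request volume dropped    Which services/functions      The exact error message
    Error rate rose to 5%     the request passed through    Stack trace, context
    P99 latency at 300ms      How long each step took       Event timeline

Pillar   | Question it answers                    | Data characteristics               | Typical tools
Metrics  | Is the system healthy overall?         | Aggregated numbers, time series    | Prometheus + Grafana
Traces   | Which step of this request is slow?    | Distributed call chain, span tree  | Jaeger / Tempo
Logs     | What exactly happened with this error? | Discrete text / structured events  | Loki / Elasticsearch

The three are linked: a Metrics alert fires → Traces narrow it down to a specific request → Logs show the detailed error.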

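2. The Rust toolchain

Your Rust code
      │ instrumentation API
      ▼
┌────────────┐
│  tracing   │  Rust's unified instrumentation framework
│  crate     │  Produces both logs and spans (the basis of traces)
└─────┬──────┘
      │ subscriber / layer mechanism
      ▼
┌──────────────────────────────────────────────────────┐
│                  Subscriber Layers                   │
│                                                      │
│  tracing-subscriber    → console/file log output     │
│  tracing-opentelemetry → converts to OTel spans      │
│  tracing-appender      → log file rotation           │
│  metrics crate         → metrics collection          │
└──────────┬───────────────────────────────────────────┘
           │ OTLP (OpenTelemetry Protocol)
           ▼
┌────────────────────────────┐
│  OpenTelemetry Collector   │  Collects in one place → forwards
│  (optional; apps can also  │
│   export directly)         │
└──────┬─────────┬─────────┬─┘
       ▼         ▼         ▼
   Prometheus   Jaeger    Loki        ← storage backends
       │         │         │
       └─────────┼─────────┘
                 ▼
              Grafana              ← unified dashboards

Note that the metrics crate is a separate facade with its own recorder. It sits next to tracing rather than plugging into its layer mechanism.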

3. Cargo.toml dependencies

[dependencies]
# --- Core ---
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
# Used for the rotating file output in section 4
tracing-appender = "0.2"

# --- OpenTelemetry (Traces + Metrics) ---
# Keep these versions in step: each tracing-opentelemetry release targets one opentelemetry release
opentelemetry = "0.29"
opentelemetry_sdk = { version = "0.29", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.29", features = ["grpc-tonic"] }
tracing-opentelemetry = "0.30"

# --- Metrics (exposed to Prometheus) ---
metrics = "0.24"
metrics-exporter-prometheus = "0.16"

# --- Runtime ---
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }

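4. Logs

Logging with tracing

tracing is more than a logging library: it is a structured diagnostics framework. Macros such as info!() emit events that carry structured fields.

use tracing::{info, warn, error, debug, trace};

// Basic usage (same experience as the log crate)
info!("server started on port {}", 50051);
warn!("client {} failed to authenticate", client_id);
error!("database connection lost");

// Structured fields (the key difference: fields can be searched and filtered)
info!(
    player_id = %player.id,
    action = "login",
    latency_ms = elapsed.as_millis() as u64,
    "player logged in"
);
// Output in JSON mode with .flatten_event(true); by default tracing-subscriber
// nests the event fields under a "fields" key instead of the top level:
// {"timestamp":"...","level":"INFO","player_id":"p_001","action":"login","latency_ms":23,"message":"player logged in"}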

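#[instrument]: generate a span for a function automatically

use tracing::{info, instrument};

// skip: keep an argument out of the span fields (large values, or types without Debug)
// err:  record an ERROR event when the function returns Err (needs AppError: Display)
#[instrument(skip(pool), err)]
async fn get_player(pool: &PgPool, player_id: i32) -> Result<Player, AppError> {
    info!("fetching player from database");
    let player = sqlx::query_as("SELECT * FROM players WHERE id = $1")
        .bind(player_id)
        .fetch_one(pool)
        .await?;
    Ok(player)
}

// What this produces:
// span: get_player{player_id=42}
//   event: INFO "fetching player from database"
//   on Err, an error event is recorded because of the `err` argument;
//   a bare #[instrument] does not record return errors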

Subscriber configuration

use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};

fn init_logging() {
    tracing_subscriber::registry()
        // Level filtering (reads the RUST_LOG environment variable, with a fallback)
        .with(EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("info,mini_tarkov_server=debug")))
        // Output to the console
        .with(fmt::layer()
            .json()             // JSON format (recommended in production)
            .with_target(true)  // include the module path
            .with_thread_ids(true)
            .with_span_events(fmt::format::FmtSpan::CLOSE)) // log elapsed time when a span closes
        .init();
}

# Control log levels at run time through the environment
RUST_LOG=debug cargo run
RUST_LOG=mini_tarkov_server::handlers=trace,sqlx=warn cargo run

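Logging to a file (with rotation)

use tracing_appender::{non_blocking::WorkerGuard, rolling};
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};

// Returns the guard so the caller (typically main) can keep it alive.
// If the guard were dropped at the end of this function, file writes would stop.
fn init_file_logging() -> WorkerGuard {
    // One log file per day
    let file_appender = rolling::daily("/var/log/mini_tarkov", "server.log");
    let (non_blocking, guard) = tracing_appender::non_blocking(file_appender);

    tracing_subscriber::registry()
        .with(EnvFilter::new("info"))
        // File output as JSON
        .with(fmt::layer()
            .json()
            .with_writer(non_blocking))
        // Console output, human readable
        .with(fmt::layer()
            .pretty())
        .init();

    guard
}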

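Shipping logs to Loki

App (JSON on stdout) → Docker logging driver → Promtail → Loki → Grafana

Or:
App → OpenTelemetry log bridge (for example the opentelemetry-appender-tracing crate) → OTel Collector → Loki

Note that tracing-opentelemetry exports spans only, so it does not carry log records on the second path. Logs need no extra SDK: in a containerized environment, JSON on stdout collected by Promtail is the simplest setup.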


5. Traces (distributed tracing)

Core concepts

The trace of a single gRPC request:

Trace (trace_id: abc123)

├─ Span: gRPC Login                         [0ms ─────────── 85ms]
│  │ service: mini_tarkov_server
│  │ rpc.method: Login
│  │
│  ├─ Span: validate_credentials             [5ms ──── 20ms]
│  │  │ db.system: postgresql
│  │  │ db.statement: SELECT * FROM players...
│  │  │
│  │  └─ Span: pg_query                      [8ms ── 18ms]
│  │     db.rows_affected: 1
│  │
│  ├─ Span: generate_token                   [22ms ── 25ms]
│  │
│  └─ Span: cache_session                    [26ms ── 30ms]
│     net.peer.name: redis
│     db.operation: SET

└─ 85ms total

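Trace = the complete call chain of one request (tied together by a trace_id). Span = one operation within that chain (it has a start time, an end time, a parent-child relationship, and attached fields).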
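To make the span tree concrete, here is a minimal std-only sketch that models the Login trace above as plain records and computes each span's self time (its duration minus its direct children). The `SpanRecord` type, the numeric ids, and both helper functions are hypothetical names for illustration; real OpenTelemetry spans carry more fields, and the sketch assumes sibling spans do not overlap.

```rust
/// Hypothetical, simplified span record (not the OpenTelemetry data model).
#[derive(Debug, Clone)]
struct SpanRecord {
    span_id: u32,
    parent_id: Option<u32>, // None marks the root span of the trace
    name: &'static str,
    start_ms: u64,
    end_ms: u64,
}

impl SpanRecord {
    fn duration_ms(&self) -> u64 {
        self.end_ms - self.start_ms
    }
}

/// Time spent in the span itself, excluding its direct children.
/// Assumes the children run one after another (no overlap).
fn self_time_ms(spans: &[SpanRecord], span_id: u32) -> Option<u64> {
    let span = spans.iter().find(|s| s.span_id == span_id)?;
    let children_ms: u64 = spans
        .iter()
        .filter(|s| s.parent_id == Some(span_id))
        .map(SpanRecord::duration_ms)
        .sum();
    Some(span.duration_ms().saturating_sub(children_ms))
}

/// Name and self time of the span that spent the most time in its own code.
fn slowest_self_time(spans: &[SpanRecord]) -> Option<(&'static str, u64)> {
    spans
        .iter()
        .filter_map(|s| self_time_ms(spans, s.span_id).map(|t| (s.name, t)))
        .max_by_key(|&(_, t)| t)
}

/// The Login trace from the diagram, with the same timings.
fn login_trace() -> Vec<SpanRecord> {
    vec![
        SpanRecord { span_id: 1, parent_id: None, name: "gRPC Login", start_ms: 0, end_ms: 85 },
        SpanRecord { span_id: 2, parent_id: Some(1), name: "validate_credentials", start_ms: 5, end_ms: 20 },
        SpanRecord { span_id: 3, parent_id: Some(2), name: "pg_query", start_ms: 8, end_ms: 18 },
        SpanRecord { span_id: 4, parent_id: Some(1), name: "generate_token", start_ms: 22, end_ms: 25 },
        SpanRecord { span_id: 5, parent_id: Some(1), name: "cache_session", start_ms: 26, end_ms: 30 },
    ]
}
```

With these timings, the three children of the root account for 22 ms of its 85 ms, so most of the request's time is spent outside any child span. That kind of gap usually means a step is missing instrumentation.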

Wiring up OpenTelemetry Traces

use opentelemetry::global;
use opentelemetry::trace::TracerProvider as _; // brings .tracer() into scope
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace::SdkTracerProvider, Resource};
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

fn init_tracing() {
    // OTLP exporter (sends spans to Jaeger / Tempo / OTel Collector)
    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint("http://localhost:4317")
        .build()
        .expect("failed to build span exporter");

    // Resource attributes identify this service in the tracing backend
    let resource = Resource::builder()
        .with_service_name("mini-tarkov-server")
        .build();

    // Batch spans in the background instead of exporting one by one
    let tracer_provider = SdkTracerProvider::builder()
        .with_batch_exporter(exporter)
        .with_resource(resource)
        .build();
    global::set_tracer_provider(tracer_provider.clone());

    let otel_layer = OpenTelemetryLayer::new(tracer_provider.tracer("mini-tarkov"));

    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::new("info"))
        .with(tracing_subscriber::fmt::layer())  // console logs
        .with(otel_layer)                         // OTel traces
        .init();
}

Creating spans in code

use tracing::{instrument, info, Instrument, Span};

// Option 1: the #[instrument] macro (most common)
// password is skipped so the credential never ends up in span fields
#[instrument(skip(pool, redis, password))]
async fn handle_login(
    pool: &PgPool,
    redis: &redis::Client,
    username: String,
    password: String,
) -> Result<LoginResponse, AppError> {
    let player = validate_credentials(pool, &username, &password).await?;
    let token = generate_token(&player);
    cache_session(redis, &token, player.id).await?;
    Ok(LoginResponse { token, player_id: player.id })
}

// Option 2: create a span by hand (more flexible)
async fn process_game_tick(tick: u32) {
    let span = tracing::info_span!("game_tick", tick, player_count = tracing::field::Empty);

    // In async code, attach the span with .instrument() rather than holding
    // span.enter() across an .await, which can attribute work to the wrong span.
    async {
        let count = update_all_players().await;
        Span::current().record("player_count", count);

        broadcast_snapshot().await;
    }
    .instrument(span)
    .await;
}

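// Propagating trace context across services:
// tonic does not do this on its own. Register a propagator (for example
// opentelemetry_sdk::propagation::TraceContextPropagator via
// global::set_text_map_propagator), inject the traceparent entry into outgoing
// gRPC metadata on the client, and extract it on the server.
// Once the downstream service has extracted it, its spans join the same trace.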
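What actually travels between services is a small header. The sketch below, std-only and deliberately simplified, parses and formats a W3C Trace Context `traceparent` value of the form `00-<trace-id>-<parent-id>-<flags>`. The `TraceParent` struct and both function names are hypothetical; in a real service the OpenTelemetry propagator does this work, and this sketch only accepts version `00`.

```rust
/// Hypothetical holder for the fields of a W3C `traceparent` header.
#[derive(Debug, PartialEq, Eq)]
struct TraceParent {
    trace_id: String,  // 32 lowercase hex characters, shared by the whole trace
    parent_id: String, // 16 lowercase hex characters, the caller's span id
    sampled: bool,     // lowest bit of the flags byte
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Parses a version-00 traceparent value; returns None for anything malformed.
fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // Version 00 has exactly four fields.
    if parts.next().is_some() || version != "00" {
        return None;
    }
    // All-zero ids are defined as invalid.
    if !is_lower_hex(trace_id, 32) || is_all_zero(trace_id) {
        return None;
    }
    if !is_lower_hex(parent_id, 16) || is_all_zero(parent_id) {
        return None;
    }
    if !is_lower_hex(flags, 2) {
        return None;
    }
    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: flags & 0x01 == 0x01,
    })
}

/// Formats the header a client would attach to an outgoing request.
fn format_traceparent(tp: &TraceParent) -> String {
    format!("00-{}-{}-{:02x}", tp.trace_id, tp.parent_id, u8::from(tp.sampled))
}
```

A downstream service that extracts this value starts its own spans with the same trace_id and uses parent_id as the parent of its first span, which is how spans from different processes end up in one tree.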

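6. Metrics

Four metric types

Counter:
  Only goes up. Examples: total requests, total errors
  requests_total = 15234

Gauge:
  Goes up and down. Examples: current connections, memory usage, online players
  connected_players = 42

Histogram:
  Records the distribution of values, for example request latency
  Exposed as cumulative buckets; P50/P90/P99 are estimated at query time
  (in PromQL with histogram_quantile())
  request_duration_seconds_bucket{le="0.25"} = 1510

Summary:
  Like a Histogram, but quantiles are computed in the client, for example
  request_duration_seconds{quantile="0.99"} = 0.25
  Client-side quantiles cannot be aggregated across instances, so prefer Histogram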
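To show what a histogram stores and how a percentile comes out of it, here is a minimal std-only sketch: fixed upper bounds, one count per bucket, and a quantile estimate by linear interpolation inside the bucket that contains the target rank. This is in the spirit of how Prometheus estimates quantiles from buckets, not a copy of its implementation; the `Histogram` type and its methods are hypothetical names, and the lowest bucket is assumed to start at 0.

```rust
/// Hypothetical fixed-bucket histogram.
struct Histogram {
    bounds: Vec<f64>, // ascending upper bounds; the +Inf bucket is implicit
    counts: Vec<u64>, // per-bucket counts; the last entry is the +Inf bucket
}

impl Histogram {
    fn new(bounds: &[f64]) -> Self {
        Self { bounds: bounds.to_vec(), counts: vec![0; bounds.len() + 1] }
    }

    /// Adds one observation to the first bucket whose bound is >= value.
    fn record(&mut self, value: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| value <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    /// Cumulative counts, the shape exposed as `_bucket{le="..."}` series.
    fn cumulative(&self) -> Vec<u64> {
        let mut running = 0;
        self.counts
            .iter()
            .map(|&c| {
                running += c;
                running
            })
            .collect()
    }

    /// Estimates the q-quantile (0.0..=1.0) by interpolating within a bucket.
    fn quantile(&self, q: f64) -> Option<f64> {
        if !(0.0..=1.0).contains(&q) {
            return None;
        }
        let cum = self.cumulative();
        let total = *cum.last()?;
        if total == 0 {
            return None;
        }
        let rank = q * total as f64;
        let idx = cum.iter().position(|&c| c as f64 >= rank)?;
        if idx == self.bounds.len() {
            // Rank falls in the +Inf bucket: report the highest finite bound.
            return self.bounds.last().copied();
        }
        let lower = if idx == 0 { 0.0 } else { self.bounds[idx - 1] };
        let upper = self.bounds[idx];
        let below = if idx == 0 { 0 } else { cum[idx - 1] };
        let in_bucket = cum[idx] - below;
        if in_bucket == 0 {
            return Some(lower);
        }
        Some(lower + (upper - lower) * (rank - below as f64) / in_bucket as f64)
    }
}
```

The estimate can never be more precise than the bucket layout: with bounds of 0.01, 0.05 and 0.1 seconds, any latency above 0.1 s is reported as 0.1 s. Choosing bounds around the latencies you care about matters more than the number of buckets.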

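Option A: metrics crate + Prometheus exporter

The simplest setup: expose a /metrics HTTP endpoint for Prometheus to scrape.

use metrics::{counter, gauge, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;

// Must be called from inside a Tokio runtime: install() spawns the HTTP listener.
fn init_metrics() {
    // Expose /metrics on 0.0.0.0:9000
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9000))
        // Without explicit buckets the exporter renders histograms as summaries
        // (quantiles) rather than _bucket series; these bounds are example values.
        .set_buckets(&[0.005, 0.01, 0.05, 0.1, 0.5, 1.0])
        .expect("bucket list must not be empty")
        .install()
        .expect("failed to install Prometheus exporter");
}

// Usage in business code
async fn handle_request(method: &str) {
    let start = std::time::Instant::now();

    // Counter: total requests
    counter!("grpc_requests_total", "method" => method.to_string()).increment(1);

    // Gauge: players currently online
    gauge!("online_players").set(get_online_count() as f64);

    // ... business logic ...

    // Histogram: request latency
    let duration = start.elapsed().as_secs_f64();
    histogram!("grpc_request_duration_seconds", "method" => method.to_string())
        .record(duration);
}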

# Check the metrics endpoint
curl http://localhost:9000/metrics

# Output:
# grpc_requests_total{method="Login"} 1523
# grpc_requests_total{method="GetInventory"} 8234
# online_players 42
# grpc_request_duration_seconds_bucket{method="Login",le="0.01"} 1200
# grpc_request_duration_seconds_bucket{method="Login",le="0.05"} 1480
# grpc_request_duration_seconds_bucket{method="Login",le="0.1"} 1520
# grpc_request_duration_seconds_bucket{method="Login",le="+Inf"} 1523

Option B: OpenTelemetry Metrics (pushed over OTLP)

use opentelemetry::{global, KeyValue};

// Assumes a meter provider with an OTLP exporter has already been registered
// through global::set_meter_provider; with the default no-op provider these
// calls record nothing. In real code, create instruments once and reuse them.
fn record_metrics() {
    let meter = global::meter("mini-tarkov");

    // Counter
    let request_counter = meter.u64_counter("grpc.requests.total")
        .with_description("Total gRPC requests")
        .build();
    request_counter.add(1, &[KeyValue::new("method", "Login")]);

    // UpDownCounter (goes up and down, plays the role of a Gauge here)
    let player_gauge = meter.i64_up_down_counter("online.players")
        .build();
    player_gauge.add(1, &[]);   // player comes online
    player_gauge.add(-1, &[]);  // player goes offline

    // Histogram
    let latency = meter.f64_histogram("grpc.request.duration")
        .with_unit("s")
        .build();
    latency.record(0.023, &[KeyValue::new("method", "Login")]);
}

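Which metrics to instrument (game server)

RED method (for request-driven services):
  Rate:     grpc_requests_total            requests per second (rate of the counter)
  Errors:   grpc_errors_total              errors per second (rate of the counter)
  Duration: grpc_request_duration_seconds  request latency distribution

USE method (for resources):
  Utilization: cpu_usage_percent, memory_usage_bytes
  Saturation:  db_pool_pending_connections, task_queue_length
  Errors:      db_errors_total, redis_errors_total

Game specific:
  online_players                   players currently online
  matches_active                   matches currently in progress
  game_tick_duration_seconds       tick duration (longer than the tick interval means the server is dropping frames)
  packet_loss_ratio                packet loss ratio
  player_rtt_seconds               distribution of player RTT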
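The Rate and Errors rows are derived values: the application only exports ever-growing counters, and the per-second figures come from comparing two scrapes. A minimal std-only sketch of that arithmetic follows, including the counter-reset case after a restart. `Sample`, `rate_per_sec` and `error_ratio` are hypothetical names; in practice a PromQL `rate()` query does this for you, with more careful extrapolation.

```rust
/// One scrape of a counter: when it was read and what it showed.
#[derive(Clone, Copy)]
struct Sample {
    t_secs: f64,
    value: f64,
}

/// Average per-second increase between two scrapes of the same counter.
/// A value that went down is treated as a reset (process restart),
/// so the increase is counted from zero.
fn rate_per_sec(prev: Sample, curr: Sample) -> Option<f64> {
    let dt = curr.t_secs - prev.t_secs;
    if dt <= 0.0 {
        return None;
    }
    let delta = if curr.value >= prev.value {
        curr.value - prev.value
    } else {
        curr.value
    };
    Some(delta / dt)
}

/// Share of requests that failed, given the two rates. None when idle.
fn error_ratio(request_rate: f64, error_rate: f64) -> Option<f64> {
    if request_rate <= 0.0 {
        None
    } else {
        Some(error_rate / request_rate)
    }
}
```

For example, if grpc_requests_total goes from 1000 to 1300 over a 15 s scrape interval while grpc_errors_total goes from 10 to 25, the service handles 20 requests per second with a 5% error ratio.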

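7. Tying the three together

① grpc_errors_total suddenly spikes on the Grafana dashboard (Metrics)

② Drill into that time window and find the trace_id of a failing request (Metrics → Traces)

③ Open the trace in Jaeger and see that the pg_query span took 5s (Traces)

④ Search Loki for logs with that trace_id (Traces → Logs)
   → "ERROR: deadlock detected" + full stack trace

⑤ Root cause: two transactions updating the same row concurrently caused a deadlock

The key to correlation is that trace_id runs through all three. When logs carry a trace_id field and the Grafana data sources are linked, you can jump from Metrics → Traces → Logs in one click.

With tracing + tracing-opentelemetry the trace_id lives in the OpenTelemetry span context. The stock JSON formatter of tracing-subscriber does not print it by itself, so it has to be added to the log output explicitly (for example with a custom event formatter, or by recording it as a span field). The goal is log records like this: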

{
  "timestamp": "2026-05-20T10:23:45Z",
  "level": "ERROR",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "target": "mini_tarkov_server::handlers",
  "message": "database query failed",
  "error": "deadlock detected"
}
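Step ④ boils down to filtering log records by one field. The sketch below shows that filter over single-line JSON logs using only the standard library. `json_str_field` and `logs_for_trace` are hypothetical helpers, and the field extraction is deliberately naive: it assumes compact `"key":"value"` pairs with no escaped quotes, so a real tool should use a JSON parser (Loki does this server-side with its query language).

```rust
/// Naive lookup of a string field in a single-line, compact JSON record.
/// Assumes the form "key":"value" with no escaped quotes inside the value.
fn json_str_field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{key}\":\"");
    let start = line.find(&needle)? + needle.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// Keeps only the log lines that belong to the given trace.
fn logs_for_trace<'a>(lines: &[&'a str], trace_id: &str) -> Vec<&'a str> {
    lines
        .iter()
        .copied()
        .filter(|line| json_str_field(line, "trace_id") == Some(trace_id))
        .collect()
}
```

Because every service in the request path logs the same trace_id, the same filter collects the records of one request across all of them, which is what makes the jump from a slow span to its error message possible.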

8. Complete initialization code

use opentelemetry::global;
use opentelemetry::trace::TracerProvider as _; // brings .tracer() into scope
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace::SdkTracerProvider, Resource};
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};
use metrics_exporter_prometheus::PrometheusBuilder;

fn init_observability() {
    // --- Metrics: Prometheus exporter (needs a running Tokio runtime) ---
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9000))
        .install()
        .expect("failed to install metrics exporter");

    // --- Traces: OTLP → Jaeger/Tempo ---
    // Read the endpoint from the environment so the same binary works both
    // locally and under Docker Compose (where localhost is the container itself).
    let endpoint = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:4317".to_string());

    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint(endpoint)
        .build()
        .expect("failed to build span exporter");

    let resource = Resource::builder()
        .with_service_name("mini-tarkov-server")
        .build();

    let tracer_provider = SdkTracerProvider::builder()
        .with_batch_exporter(exporter)
        .with_resource(resource)
        .build();
    global::set_tracer_provider(tracer_provider.clone());

    let otel_layer = OpenTelemetryLayer::new(tracer_provider.tracer("mini-tarkov"));

    // --- Logs: JSON on the console ---
    tracing_subscriber::registry()
        .with(EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("info")))
        .with(fmt::layer().json())
        .with(otel_layer)
        .init();
}

#[tokio::main]
async fn main() {
    init_observability();
    tracing::info!("observability initialized");
    // ... start the gRPC server ...
}

9. Docker Compose: backend infrastructure

services:
  # Your game server
  server:
    build: .
    ports:
      - "50051:50051" # gRPC
      - "9000:9000" # Prometheus metrics
    environment:
      RUST_LOG: info
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317

  # OpenTelemetry Collector (collects in one place and forwards)
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  # Trace storage
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"

  # Metrics storage
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Log storage
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  # Log collector
  # Mounts the host's /var/log; to pick up container stdout, Promtail also
  # needs access to wherever Docker keeps its container logs
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail.yml:/etc/promtail/config.yml

  # Unified dashboards
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin # local development only

# prometheus.yml
scrape_configs:
  - job_name: mini-tarkov-server
    scrape_interval: 15s
    static_configs:
      - targets: ["server:9000"]

Open http://localhost:3000 (Grafana) in a browser
  → Add data source: Prometheus (http://prometheus:9090)
  → Add data source: Tempo (http://tempo:3200)
  → Add data source: Loki (http://loki:3100)
  → Create dashboards / import community templates
