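1. What each of the three pillars answers
                          Something went wrong
                                   │
          ┌────────────────────────┼────────────────────────┐
          ▼                        ▼                        ▼
       Metrics                   Traces                    Logs
"Something is wrong"      "Where is it slow"        "Why did it fail"
Request volume dropped    Which services/functions  The exact error message
Error rate rose to 5%     the request went through  Stack trace, context
P99 latency at 300 ms     How long each step took   Event timeline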
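| Pillar | Question it answers | Data characteristics | Typical tools |
|---|---|---|---|
| Metrics | Is the system healthy overall? | Aggregated numeric values, time series | Prometheus + Grafana |
| Traces | Which step made this request slow? | Distributed call chain, a tree of spans | Jaeger / Tempo |
| Logs | What exactly happened with this error? | Discrete text or structured events | Loki / Elasticsearch |
The three are linked: a Metrics alert fires → Traces narrow it down to a specific request → Logs show the detailed error.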
2. The Rust toolchain
Your Rust code
      │
      │  instrumentation API
      ▼
┌────────────┐
│  tracing   │  Rust's unified instrumentation framework
│   crate    │  Produces both logs (events) and spans (the basis of traces)
└─────┬──────┘
      │  subscriber / layer mechanism
      ▼
┌──────────────────────────────────────────────────────────────┐
│ Subscriber layers                                            │
│                                                              │
│ tracing-subscriber     → console/file log output             │
│ tracing-opentelemetry  → converts spans to OTel spans        │
│ tracing-appender       → log file rotation                   │
│ metrics crate          → metric collection (a separate       │
│                          facade, not a tracing layer)        │
└──────────┬───────────────────────────────────────────────────┘
           │  OTLP (OpenTelemetry Protocol)
           ▼
┌────────────────────────────┐
│  OpenTelemetry Collector   │  Central collection → forwarding
│  (optional; direct export  │
│   is also possible)        │
└──────┬─────────┬─────────┬─┘
       ▼         ▼         ▼
  Prometheus   Jaeger     Loki   ← storage backends
       │         │         │
       └─────────┼─────────┘
                 │
              Grafana          ← unified dashboards
3. Cargo.toml dependencies
[dependencies]
# --- Core ---
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
# Needed for the rolling file output shown in the Logs section
tracing-appender = "0.2"
# --- OpenTelemetry (Traces + Metrics) ---
opentelemetry = "0.29"
opentelemetry_sdk = { version = "0.29", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.29", features = ["grpc-tonic"] }
tracing-opentelemetry = "0.30"
# --- Metrics (exposed to Prometheus) ---
metrics = "0.24"
metrics-exporter-prometheus = "0.16"
# --- Runtime ---
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
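4. Logs
Logging with tracing
tracing is more than a logging library: it is a structured diagnostics framework. Macros such as info!() emit events that carry structured fields.
use tracing::{info, warn, error, debug, trace};
// Basic usage (same experience as the log crate)
info!("server started on port {}", 50051);
warn!("client {} failed to authenticate", client_id);
error!("database connection lost");
// Structured fields (the key difference: fields can be searched and filtered)
info!(
    player_id = %player.id,
    action = "login",
    latency_ms = elapsed.as_millis() as u64,
    "player logged in"
);
// JSON output. The flat shape below is what .json().flatten_event(true) produces;
// by default the JSON formatter nests event fields under a "fields" key.
// {"timestamp":"...","level":"INFO","player_id":"p_001","action":"login","latency_ms":23,"message":"player logged in"}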
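#[instrument]: generate a span (and its log context) for a function automatically
use tracing::{info, instrument};
// skip: leave out arguments that do not implement Debug
// err:  emit an error event when the function returns Err (requires AppError: Display)
#[instrument(skip(pool), err)]
async fn get_player(pool: &PgPool, player_id: i32) -> Result<Player, AppError> {
    info!("fetching player from database");
    let player = sqlx::query_as("SELECT * FROM players WHERE id = $1")
        .bind(player_id)
        .fetch_one(pool)
        .await?;
    Ok(player)
}
// Generated automatically:
// span: get_player{player_id=42}
// event: INFO "fetching player from database"
// On Err, an error event is recorded only because `err` is set above;
// a plain #[instrument] does not log return values or errors.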
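To make the span idea concrete, here is a toy model of what entering and leaving a span amounts to: a guard pushes a name onto a per-thread stack on entry and pops it on drop. This is only a std-only sketch under that assumption; `SpanGuard`, `current_path` and `closed_spans` are hypothetical names, and the real tracing crate is considerably more involved.

```rust
use std::cell::RefCell;
use std::time::{Duration, Instant};

thread_local! {
    // Names of the spans currently entered on this thread, outermost first.
    static STACK: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
    // (span path, elapsed time) for every span that has closed on this thread.
    static CLOSED: RefCell<Vec<(String, Duration)>> = RefCell::new(Vec::new());
}

/// Hypothetical RAII guard: entering pushes the span, dropping pops it.
struct SpanGuard {
    name: &'static str,
    start: Instant,
}

impl SpanGuard {
    fn enter(name: &'static str) -> Self {
        STACK.with(|s| s.borrow_mut().push(name));
        SpanGuard { name, start: Instant::now() }
    }
}

impl Drop for SpanGuard {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed();
        let path = STACK.with(|s| {
            let mut stack = s.borrow_mut();
            // Guards must be dropped in reverse order of entry.
            debug_assert_eq!(stack.last(), Some(&self.name));
            let path = stack.join(":");
            stack.pop();
            path
        });
        CLOSED.with(|c| c.borrow_mut().push((path, elapsed)));
    }
}

/// The span context an event logged right now would carry.
fn current_path() -> String {
    STACK.with(|s| s.borrow().join(":"))
}

/// Paths of all closed spans, in closing order.
fn closed_spans() -> Vec<String> {
    CLOSED.with(|c| c.borrow().iter().map(|(p, _)| p.clone()).collect())
}
```

Entering `handle_login` and then `get_player` makes `current_path()` return `handle_login:get_player`, and the inner span closes first. Because the stack is per thread, the model also hints at why such a guard should not be held across an `.await` in a task that may move between threads.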
Subscriber configuration
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};
fn init_logging() {
    tracing_subscriber::registry()
        // Level filtering (honors the RUST_LOG environment variable)
        .with(EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("info,mini_tarkov_server=debug")))
        // Output to the console
        .with(fmt::layer()
            .json()                 // JSON format (recommended in production)
            .with_target(true)      // include the module path
            .with_thread_ids(true)
            .with_span_events(fmt::format::FmtSpan::CLOSE)) // print elapsed time when a span closes
        .init();
}
# Control the log level at run time through the environment
RUST_LOG=debug cargo run
RUST_LOG=mini_tarkov_server::handlers=trace,sqlx=warn cargo run
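The directive strings above can be read as "target prefix = most verbose level allowed", with a bare level acting as the default. The following is a deliberately simplified model of that matching rule (longest matching prefix wins), not EnvFilter's actual implementation; `Level`, `Filter` and `parse_filter` are hypothetical names, and span or field filters are ignored.

```rust
/// Verbosity levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level { Error, Warn, Info, Debug, Trace }

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Simplified filter: an optional default level plus per-target overrides.
struct Filter {
    default: Option<Level>,
    directives: Vec<(String, Level)>,
}

fn parse_filter(spec: &str) -> Filter {
    let mut filter = Filter { default: None, directives: Vec::new() };
    for part in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        match part.split_once('=') {
            // "target=level"
            Some((target, level)) => {
                if let Some(level) = parse_level(level) {
                    filter.directives.push((target.trim().to_string(), level));
                }
            }
            None => match parse_level(part) {
                // A bare level sets the default for all targets.
                Some(level) => filter.default = Some(level),
                // A bare target enables everything under it (assumption of this model).
                None => filter.directives.push((part.to_string(), Level::Trace)),
            },
        }
    }
    filter
}

impl Filter {
    /// An event passes if its level is no more verbose than the level of the
    /// longest directive whose target is a prefix of the event's target.
    fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self.directives.iter()
            .filter(|(prefix, _)| target.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, level)| *level)
            .or(self.default);
        max.map_or(false, |max| level <= max)
    }
}
```

With `info,mini_tarkov_server=debug`, a debug event from `mini_tarkov_server::handlers` passes while a debug event from `sqlx::query` does not. In this model, targets that match no directive and have no default are disabled.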
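Writing logs to a file (with rotation)
use tracing_appender::{non_blocking::WorkerGuard, rolling};
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};
// Returns the guard. The caller (typically main) must keep it alive for as long
// as logs should be written: dropping it flushes the buffer and stops the writer.
// Binding it to a local inside this function would drop it as soon as the function returns.
fn init_file_logging() -> WorkerGuard {
    // One log file per day
    let file_appender = rolling::daily("/var/log/mini_tarkov", "server.log");
    let (non_blocking, guard) = tracing_appender::non_blocking(file_appender);
    tracing_subscriber::registry()
        .with(EnvFilter::new("info"))
        // JSON to the file
        .with(fmt::layer()
            .json()
            .with_writer(non_blocking))
        // Human-readable output on the console
        .with(fmt::layer()
            .pretty())
        .init();
    guard
}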
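Why the guard matters is easier to see with a small model of a non-blocking writer: log lines go over a channel to a background thread, and dropping the guard tells that thread to drain and stop. This is a std-only sketch of the idea, not tracing-appender's code; `NonBlockingWriter`, `FlushGuard` and the in-memory `sink` are hypothetical stand-ins.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

enum Msg {
    Line(String),
    Shutdown,
}

/// Cheap handle used by the logging call sites; sending never blocks on I/O.
#[derive(Clone)]
struct NonBlockingWriter {
    tx: mpsc::Sender<Msg>,
}

impl NonBlockingWriter {
    /// Returns false once the background worker has shut down.
    fn write_line(&self, line: &str) -> bool {
        self.tx.send(Msg::Line(line.to_string())).is_ok()
    }
}

/// Dropping the guard drains queued lines and stops the worker.
struct FlushGuard {
    tx: mpsc::Sender<Msg>,
    worker: Option<thread::JoinHandle<()>>,
}

impl Drop for FlushGuard {
    fn drop(&mut self) {
        // Everything queued before Shutdown is written first (FIFO channel).
        let _ = self.tx.send(Msg::Shutdown);
        if let Some(handle) = self.worker.take() {
            let _ = handle.join();
        }
    }
}

/// `sink` stands in for the log file.
fn non_blocking(sink: Arc<Mutex<Vec<String>>>) -> (NonBlockingWriter, FlushGuard) {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        for msg in rx {
            match msg {
                Msg::Line(line) => sink.lock().unwrap().push(line),
                Msg::Shutdown => break,
            }
        }
    });
    (NonBlockingWriter { tx: tx.clone() }, FlushGuard { tx, worker: Some(worker) })
}
```

In this model, lines written before the guard is dropped are guaranteed to reach the sink, and writes after the drop are rejected, which mirrors the "keep the guard alive" rule above.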
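Shipping logs to Loki
Application (JSON on stdout) → Docker logging driver → Promtail → Loki → Grafana
Or:
Application → OpenTelemetry log bridge (for example opentelemetry-appender-tracing) → OTel Collector → Loki
Note that tracing-opentelemetry exports spans, not log records, so the second path needs a log bridge. Logs do not require an extra SDK, though: in a containerized environment, JSON on stdout collected by Promtail is the simplest setup.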
5. Traces (distributed tracing)
Core concepts
The trace of a single gRPC request:
Trace (trace_id: abc123)
│
├─ Span: gRPC Login [0ms ─────────── 85ms]
│ │ service: mini_tarkov_server
│ │ rpc.method: Login
│ │
│ ├─ Span: validate_credentials [5ms ──── 20ms]
│ │ │ db.system: postgresql
│ │ │ db.statement: SELECT * FROM players...
│ │ │
│ │ └─ Span: pg_query [8ms ── 18ms]
│ │ db.rows_affected: 1
│ │
│ ├─ Span: generate_token [22ms ── 25ms]
│ │
│ └─ Span: cache_session [26ms ── 30ms]
│ net.peer.name: redis
│ db.operation: SET
│
└─ 85ms total
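Trace = the complete call chain of one request, tied together by a trace_id.
Span = one operation within that chain, with a start time, an end time, a parent/child relationship, and attached fields.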
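As a data structure, a trace is just a list of spans that point at their parents. The sketch below encodes the Login trace from the diagram and computes each span's self time (its duration minus the time spent in direct children). The `Span` struct and helper names are hypothetical, and the calculation assumes sibling spans do not overlap.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Span {
    span_id: u64,
    parent_id: Option<u64>, // None for the root span
    name: &'static str,
    start_ms: u64,
    end_ms: u64,
}

impl Span {
    fn duration_ms(&self) -> u64 {
        self.end_ms - self.start_ms
    }
}

/// Self time per span name: duration minus the summed duration of direct children.
fn self_time_ms(spans: &[Span]) -> HashMap<&'static str, u64> {
    let mut child_total: HashMap<u64, u64> = HashMap::new();
    for span in spans {
        if let Some(parent) = span.parent_id {
            *child_total.entry(parent).or_insert(0) += span.duration_ms();
        }
    }
    spans
        .iter()
        .map(|span| {
            let children = child_total.get(&span.span_id).copied().unwrap_or(0);
            (span.name, span.duration_ms().saturating_sub(children))
        })
        .collect()
}

/// The Login trace from the diagram above, with the same timings.
fn login_trace() -> Vec<Span> {
    vec![
        Span { span_id: 1, parent_id: None, name: "gRPC Login", start_ms: 0, end_ms: 85 },
        Span { span_id: 2, parent_id: Some(1), name: "validate_credentials", start_ms: 5, end_ms: 20 },
        Span { span_id: 3, parent_id: Some(2), name: "pg_query", start_ms: 8, end_ms: 18 },
        Span { span_id: 4, parent_id: Some(1), name: "generate_token", start_ms: 22, end_ms: 25 },
        Span { span_id: 5, parent_id: Some(1), name: "cache_session", start_ms: 26, end_ms: 30 },
    ]
}
```

For this trace the root span has 63 ms of self time: of its 85 ms, only 22 ms are covered by its three direct children, which is exactly the kind of gap a trace view makes visible.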
Wiring up OpenTelemetry traces
use opentelemetry::global;
use opentelemetry::trace::TracerProvider as _; // brings `.tracer()` into scope
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace::SdkTracerProvider, Resource};
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
fn init_tracing() {
    // OTLP exporter (sends spans to Jaeger / Tempo / an OTel Collector)
    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint("http://localhost:4317")
        .build()
        .expect("failed to build span exporter");
    let resource = Resource::builder()
        .with_service_name("mini-tarkov-server")
        .build();
    let tracer_provider = SdkTracerProvider::builder()
        .with_batch_exporter(exporter)
        .with_resource(resource)
        .build();
    global::set_tracer_provider(tracer_provider.clone());
    let otel_layer = OpenTelemetryLayer::new(tracer_provider.tracer("mini-tarkov"));
    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::new("info"))
        .with(tracing_subscriber::fmt::layer()) // console logs
        .with(otel_layer)                       // OTel traces
        .init();
}
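Creating spans in code
use tracing::{instrument, info, Instrument, Span};
// Option 1: the #[instrument] attribute (the most common choice)
#[instrument(skip(pool, redis, password))] // also skip the password so it never reaches the logs
async fn handle_login(
    pool: &PgPool,
    redis: &redis::Client,
    username: String,
    password: String,
) -> Result<LoginResponse, AppError> {
    let player = validate_credentials(pool, &username, &password).await?;
    let token = generate_token(&player);
    cache_session(redis, &token, player.id).await?;
    Ok(LoginResponse { token, player_id: player.id })
}
// Option 2: create the span manually (more flexible)
async fn process_game_tick(tick: u32) {
    let span = tracing::info_span!("game_tick", tick, player_count = tracing::field::Empty);
    // Do not hold `span.enter()` across an .await: the guard would stay entered
    // while the task is suspended. Attach the span to the future instead.
    async {
        let count = update_all_players().await;
        Span::current().record("player_count", count);
        broadcast_snapshot().await;
    }
    .instrument(span)
    .await;
}
// Propagating trace context across services:
// The context does not travel on its own. The caller has to inject the W3C
// `traceparent` header into the outgoing gRPC metadata (for example in a tonic
// interceptor, using a propagator such as TraceContextPropagator), and the
// downstream service has to extract it and set it as the parent of its request span.
// Once both sides do this, the downstream spans join the same trace.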
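The `traceparent` value that carries the context has a fixed text layout in the W3C Trace Context format: `version-traceid-parentid-flags`, all lowercase hex. The sketch below parses and formats version `00` headers with the standard library only; `TraceParent` and the two functions are hypothetical names, and in a real service the OpenTelemetry propagator does this work.

```rust
/// Parsed W3C `traceparent` header (version 00).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TraceParent {
    trace_id: u128,      // 16 bytes, 32 hex characters
    parent_span_id: u64, // 8 bytes, 16 hex characters
    sampled: bool,       // lowest bit of the flags byte
}

fn is_lower_hex(s: &str) -> bool {
    s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace, span, flags) = (parts[1], parts[2], parts[3]);
    if trace.len() != 32 || span.len() != 16 || flags.len() != 2 {
        return None;
    }
    if !(is_lower_hex(trace) && is_lower_hex(span) && is_lower_hex(flags)) {
        return None;
    }
    let trace_id = u128::from_str_radix(trace, 16).ok()?;
    let parent_span_id = u64::from_str_radix(span, 16).ok()?;
    let flags = u8::from_str_radix(flags, 16).ok()?;
    // All-zero IDs are invalid.
    if trace_id == 0 || parent_span_id == 0 {
        return None;
    }
    Some(TraceParent { trace_id, parent_span_id, sampled: flags & 0x01 == 0x01 })
}

fn format_traceparent(tp: &TraceParent) -> String {
    format!("00-{:032x}-{:016x}-{:02x}", tp.trace_id, tp.parent_span_id, tp.sampled as u8)
}
```

A downstream service that receives `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` would create its span with that trace_id and with `00f067aa0ba902b7` as the parent span.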
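6. Metrics
The four metric types
Counter:
Only ever increases. Examples: total requests, total errors
requests_total = 15234
Gauge:
Can go up and down. Examples: current connections, memory usage, online players
connected_players = 42
Histogram:
Records the distribution of values, such as request latency.
Observations are counted into buckets; P50/P90/P99 are estimated at query time,
e.g. histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
Summary:
Similar to a Histogram, but quantiles are computed on the client and exposed directly:
request_duration_seconds{quantile="0.99"} = 0.25
Client-side quantiles cannot be aggregated across instances, so prefer Histogram.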
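How a bucketed histogram turns into a percentile can be shown in a few lines. The sketch below keeps cumulative bucket counts in the Prometheus style and estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is the approach `histogram_quantile` takes. `Histogram` here is a hypothetical std-only type, not the one from the metrics or OpenTelemetry crates.

```rust
/// Cumulative-bucket histogram: bucket i counts observations <= bounds[i].
struct Histogram {
    bounds: Vec<f64>, // sorted finite upper bounds (the +Inf bucket is implicit)
    counts: Vec<u64>, // cumulative counts; the last slot is the +Inf bucket
    sum: f64,
}

impl Histogram {
    fn new(bounds: &[f64]) -> Self {
        Histogram { bounds: bounds.to_vec(), counts: vec![0; bounds.len() + 1], sum: 0.0 }
    }

    fn record(&mut self, value: f64) {
        self.sum += value;
        // Cumulative buckets: every bucket whose bound covers the value is incremented.
        for (i, bound) in self.bounds.iter().enumerate() {
            if value <= *bound {
                self.counts[i] += 1;
            }
        }
        let inf = self.counts.len() - 1;
        self.counts[inf] += 1;
    }

    fn total(&self) -> u64 {
        self.counts[self.counts.len() - 1]
    }

    /// Estimate the q-quantile (0.0..=1.0) by interpolating within a bucket.
    /// The lowest bucket is assumed to start at 0; a rank that falls into the
    /// +Inf bucket yields the highest finite bound.
    fn quantile(&self, q: f64) -> Option<f64> {
        let total = self.total();
        if total == 0 {
            return None;
        }
        let rank = q * total as f64;
        let mut prev_count = 0u64;
        let mut prev_bound = 0.0;
        for (i, bound) in self.bounds.iter().enumerate() {
            let count = self.counts[i];
            if count as f64 >= rank {
                let in_bucket = (count - prev_count) as f64;
                if in_bucket == 0.0 {
                    return Some(prev_bound);
                }
                let fraction = (rank - prev_count as f64) / in_bucket;
                return Some(prev_bound + (bound - prev_bound) * fraction);
            }
            prev_count = count;
            prev_bound = *bound;
        }
        self.bounds.last().copied()
    }
}
```

The estimate is only as precise as the bucket layout: a P99 that lands in the +Inf bucket is reported as the highest finite bound, which is one reason bucket boundaries should bracket the latencies you care about.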
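Option A: the metrics crate + Prometheus exporter
The simplest option: expose a /metrics HTTP endpoint for Prometheus to scrape.
use metrics::{counter, gauge, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;
fn init_metrics() {
    // Expose /metrics on 0.0.0.0:9000
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9000))
        // Without explicit buckets this exporter renders histograms as summaries;
        // setting buckets yields true Prometheus histograms (`_bucket` series).
        .set_buckets(&[0.01, 0.05, 0.1, 0.5, 1.0])
        .expect("bucket list must not be empty")
        .install()
        .expect("failed to install Prometheus exporter");
}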
// Usage in application code
async fn handle_request(method: &str) {
    let start = std::time::Instant::now();
    // Counter: total number of requests
    counter!("grpc_requests_total", "method" => method.to_string()).increment(1);
    // Gauge: number of online players
    gauge!("online_players").set(get_online_count() as f64);
    // ... business logic ...
    // Histogram: request latency
    let duration = start.elapsed().as_secs_f64();
    histogram!("grpc_request_duration_seconds", "method" => method.to_string())
        .record(duration);
}
# Check the metrics endpoint
curl http://localhost:9000/metrics
# Example output:
# grpc_requests_total{method="Login"} 1523
# grpc_requests_total{method="GetInventory"} 8234
# online_players 42
# grpc_request_duration_seconds_bucket{method="Login",le="0.01"} 1200
# grpc_request_duration_seconds_bucket{method="Login",le="0.05"} 1480
# grpc_request_duration_seconds_bucket{method="Login",le="0.1"} 1520
# grpc_request_duration_seconds_bucket{method="Login",le="+Inf"} 1523
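Option B: OpenTelemetry Metrics (pushed over OTLP)
use opentelemetry::{global, KeyValue};
// Assumes a meter provider with an OTLP exporter has been registered through
// global::set_meter_provider; without one, global::meter returns a no-op meter.
// In real code, create each instrument once and reuse it rather than rebuilding it per call.
fn record_metrics() {
    let meter = global::meter("mini-tarkov");
    // Counter
    let request_counter = meter.u64_counter("grpc.requests.total")
        .with_description("Total gRPC requests")
        .build();
    request_counter.add(1, &[KeyValue::new("method", "Login")]);
    // UpDownCounter (can go up and down; plays the role of a Gauge for counts)
    let player_gauge = meter.i64_up_down_counter("online.players")
        .build();
    player_gauge.add(1, &[]);  // a player comes online
    player_gauge.add(-1, &[]); // a player goes offline
    // Histogram
    let latency = meter.f64_histogram("grpc.request.duration")
        .with_unit("s")
        .build();
    latency.record(0.023, &[KeyValue::new("method", "Login")]);
}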
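Which metrics to instrument (game server)
RED method (for request-driven services):
Rate: grpc_requests_total requests per second (rate of the counter)
Errors: grpc_errors_total errors per second (rate of the counter)
Duration: grpc_request_duration_seconds request latency distribution
USE method (for resources):
Utilization: cpu_usage_percent, memory_usage_bytes
Saturation: db_pool_pending_connections, task_queue_length
Errors: db_errors_total, redis_errors_total
Game-specific:
online_players number of online players
matches_active number of matches currently in progress
game_tick_duration_seconds tick duration (exceeding the tick interval means the server is dropping frames)
packet_loss_ratio packet loss ratio
player_rtt_seconds distribution of player RTT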
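Counters such as `grpc_requests_total` only ever grow, so Rate and Errors are derived from the difference between two samples. A minimal sketch of that arithmetic follows, with hypothetical `Sample`, `per_second_rate` and `error_ratio` names; in practice Prometheus does this with `rate()` at query time.

```rust
/// Two readings of the request and error counters, taken at different times.
#[derive(Debug, Clone, Copy)]
struct Sample {
    at_secs: f64,
    requests_total: u64,
    errors_total: u64,
}

/// Per-second increase of a counter between two readings.
/// Returns None if no time has passed or the counter went backwards (a reset).
fn per_second_rate(prev: u64, curr: u64, elapsed_secs: f64) -> Option<f64> {
    if elapsed_secs <= 0.0 || curr < prev {
        return None;
    }
    Some((curr - prev) as f64 / elapsed_secs)
}

/// Request rate over the window between two samples.
fn request_rate(prev: Sample, curr: Sample) -> Option<f64> {
    per_second_rate(prev.requests_total, curr.requests_total, curr.at_secs - prev.at_secs)
}

/// Fraction of requests in the window that failed.
fn error_ratio(prev: Sample, curr: Sample) -> Option<f64> {
    let requests = curr.requests_total.checked_sub(prev.requests_total)?;
    let errors = curr.errors_total.checked_sub(prev.errors_total)?;
    if requests == 0 {
        return None;
    }
    Some(errors as f64 / requests as f64)
}
```

For example, going from 1000 requests and 10 errors to 1600 requests and 40 errors over 60 seconds gives 10 requests per second and an error ratio of 5%.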
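7. Tying the three together
① grpc_errors_total suddenly spikes on the Grafana dashboard (Metrics)
│
② Drill into that time window and find an offending trace_id (Metrics → Traces)
│
③ Open the trace in Jaeger and see that the pg_query span took 5 s (Traces)
│
④ Search the logs in Loki by trace_id (Traces → Logs)
→ "ERROR: deadlock detected" + full stack trace
│
⑤ Root cause: two transactions updating the same row concurrently caused a deadlock
The key to correlation is that trace_id runs through all three. When log records carry a trace_id field and the Grafana data sources are linked to each other, you can jump from Metrics → Traces → Logs in one click.
The fmt JSON layer does not add trace_id on its own, even with tracing-opentelemetry installed. To get log records like the one below, read the OpenTelemetry context of the current span and write its IDs as fields (for example in a custom event formatter), or export logs through an OpenTelemetry log bridge, which attaches the trace context to each record: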
{
"timestamp": "2026-05-20T10:23:45Z",
"level": "ERROR",
"trace_id": "abc123def456",
"span_id": "789xyz",
"target": "mini_tarkov_server::handlers",
"message": "database query failed",
"error": "deadlock detected"
}
8. Complete initialization code
use opentelemetry::global;
use opentelemetry::trace::TracerProvider as _; // brings `.tracer()` into scope
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace::SdkTracerProvider, Resource};
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};
use metrics_exporter_prometheus::PrometheusBuilder;
// Returns the tracer provider so main can shut it down and flush pending spans.
fn init_observability() -> SdkTracerProvider {
    // --- Metrics: Prometheus exporter ---
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9000))
        .install()
        .expect("failed to install metrics exporter");
    // --- Traces: OTLP → Jaeger/Tempo ---
    // Take the endpoint from the environment so the same binary works under
    // Docker Compose (http://otel-collector:4317) and on a local machine.
    let endpoint = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:4317".to_string());
    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint(endpoint)
        .build()
        .expect("failed to build span exporter");
    let resource = Resource::builder()
        .with_service_name("mini-tarkov-server")
        .build();
    let tracer_provider = SdkTracerProvider::builder()
        .with_batch_exporter(exporter)
        .with_resource(resource)
        .build();
    global::set_tracer_provider(tracer_provider.clone());
    let otel_layer = OpenTelemetryLayer::new(tracer_provider.tracer("mini-tarkov"));
    // --- Logs: JSON on the console ---
    tracing_subscriber::registry()
        .with(EnvFilter::try_from_default_env()
            .unwrap_or_else(|_| EnvFilter::new("info")))
        .with(fmt::layer().json())
        .with(otel_layer)
        .init();
    tracer_provider
}
#[tokio::main]
async fn main() {
    let tracer_provider = init_observability();
    tracing::info!("observability initialized");
    // ... start the gRPC server ...
    // Flush spans still buffered by the batch exporter before exiting.
    if let Err(err) = tracer_provider.shutdown() {
        eprintln!("failed to shut down tracer provider: {err:?}");
    }
}
9. Docker Compose: backend infrastructure
services:
  # Your game server
  server:
    build: .
    ports:
      - "50051:50051" # gRPC
      - "9000:9000"   # Prometheus metrics
    environment:
      RUST_LOG: info
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
  # OpenTelemetry Collector (central collection and forwarding)
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
  # Trace storage
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
  # Metrics storage
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  # Log storage
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  # Log collector
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail.yml:/etc/promtail/config.yml
  # Unified dashboards
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin # for local development only
# prometheus.yml
scrape_configs:
  - job_name: mini-tarkov-server
    scrape_interval: 15s
    static_configs:
      - targets: ["server:9000"]
Open http://localhost:3000 (Grafana) in a browser
→ Add data source: Prometheus (http://prometheus:9090)
→ Add data source: Tempo (http://tempo:3200)
→ Add data source: Loki (http://loki:3100)
→ Create dashboards or import community templates