Inference monitoring

Request usage logging

OpenGateLLM tracks inference activity by storing usage data for each API request. This monitoring helps you analyze model usage over time, identify consumption patterns, and support reporting needs. Usage monitoring is backed by PostgreSQL and can be enabled through the configuration file. Once activated, requests are recorded in the usage table and can be explored from the Playground Usage page or queried directly from the database.

The logs contain the following information:

user ID
router ID
provider ID
number of input tokens
number of output tokens
environmental footprint (see the dedicated documentation)
cost (see the dedicated documentation)
duration
timestamp

Sensitive information such as the prompt or response content is not included in the logs.
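
As an illustration of the kind of reporting these fields support, here is a minimal Python sketch that sums tokens per user over a set of usage records. The `UsageRecord` class and its field names are hypothetical and only mirror the list above; the actual column names and units of the usage table are not given on this page, so adapt them to your database.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageRecord:
    # Hypothetical field names mirroring the documented list; the real
    # column names of the usage table may differ.
    user_id: int
    router_id: int
    provider_id: int
    input_tokens: int
    output_tokens: int
    cost: float
    duration: float  # unit is an assumption; not specified on this page
    timestamp: datetime


def tokens_per_user(records):
    """Return a dict mapping each user ID to its total input + output tokens."""
    totals = defaultdict(int)
    for record in records:
        totals[record.user_id] += record.input_tokens + record.output_tokens
    return dict(totals)
```

The same aggregation could be written as a SQL `GROUP BY` against PostgreSQL once the real column names are known.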

Configuration

To log requests for usage monitoring, set monitoring_postgres_enabled to true in settings (enabled by default).

settings:
    [...]
    monitoring_postgres_enabled: true

Configuration file documentation

Model health monitoring

OpenGateLLM provides a health check endpoint at /health/models. It returns a JSON response listing the health status of each model.

{
  "data": [
    {
      "id": "model_name",
      "status": "green" | "yellow" | "red"
    }
  ]
}

The endpoint requires authentication. It only returns models (routers) the calling user is allowed to access.
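
A minimal Python sketch of calling the endpoint and reading the response is shown here. It assumes that authentication uses a bearer token in the `Authorization` header, which this page does not state; the base URL and API key are placeholders, and the helper names are illustrative only.

```python
import json
import urllib.request


def build_health_request(base_url, api_key):
    """Build a GET request for /health/models.

    Assumption: the API accepts a bearer token in the Authorization header.
    """
    url = base_url.rstrip("/") + "/health/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def parse_health(body):
    """Map each model ID to its status from the JSON response body."""
    payload = json.loads(body)
    return {entry["id"]: entry["status"] for entry in payload["data"]}


def fetch_model_health(base_url, api_key, timeout=10.0):
    """Call the endpoint and return a {model_id: status} dict."""
    request = build_health_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_health(response.read().decode("utf-8"))
```

For example, `fetch_model_health("http://localhost:8000", "my-api-key")` would return a dictionary keyed by model ID, limited to the models the key is allowed to access.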

Status values

Each model is assigned one of three statuses:

Status	Meaning
`green`	The provider responds and queue depth is within normal bounds.
`yellow`	The provider responds but is under moderate load.
`red`	The provider is unreachable, metrics are unavailable, or queue depth indicates severe degradation.

The default status is green. During a check, the status only escalates: once a worse condition is detected, it is never reset to green.

How status is computed

Health is evaluated per model (router). A model can have several providers; the model status is the worst status among its providers (red > yellow > green).
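
The worst-status rule can be sketched in a few lines of Python. This is an illustration of the ordering described above, not the actual implementation; the function name is hypothetical, and returning green for a model with no providers is an assumption based on green being the default status.

```python
# Severity ordering: red > yellow > green.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def model_status(provider_statuses):
    """Return the worst status among a model's providers.

    Assumption: a model with no providers keeps the default status, green.
    """
    return max(provider_statuses, key=SEVERITY.__getitem__, default="green")
```

With this rule, a model whose providers report `["green", "yellow", "green"]` is reported as yellow, and a single red provider makes the whole model red.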

For each provider attached to the model, OpenGateLLM probes the inference backend:

Metrics-capable providers (vLLM and on-prem Mistral): query the provider /metrics endpoint (Prometheus format) and read vllm:num_requests_waiting and vllm:num_requests_running for the configured model name.
Other providers: /metrics is not supported yet. OpenGateLLM falls back to /v1/models instead. If that endpoint returns a successful response, the provider is considered green. If it fails, the provider is red.

In all cases, a failed or unparseable metrics response sets the provider to red.
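
To make the metrics probe concrete, the following Python sketch extracts the two gauges from a Prometheus text response. It is a simplified parser written for illustration, not the parser OpenGateLLM uses. It assumes the model is identified by a `model_name` label on each sample, and it returns `None` whenever a value is missing or unparseable, which corresponds to the red case described above.

```python
import re

# One Prometheus sample line: name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def read_gauge(metrics_text, metric, model_name):
    """Return the value of `metric` for `model_name`, or None if not found."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        # Assumption: the model is identified by a "model_name" label.
        if labels.get("model_name") == model_name:
            try:
                return float(match.group("value"))
            except ValueError:
                return None
    return None


def queue_depth(metrics_text, model_name):
    """Return (waiting, running), or None if either gauge is unavailable."""
    waiting = read_gauge(metrics_text, "vllm:num_requests_waiting", model_name)
    running = read_gauge(metrics_text, "vllm:num_requests_running", model_name)
    if waiting is None or running is None:
        return None
    return waiting, running
```

A caller would treat a `None` result from `queue_depth` as red and otherwise pass the two counts to the provider-specific thresholds.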

Load-test thresholds

Queue-depth thresholds were calibrated against load tests targeting the following service conditions:

Time to first token (TTFT) < 5 seconds
Throughput > 30 tokens/second

Under load, these targets correlate with the yellow and red boundaries below:

Status	Observed degradation (load tests)
`yellow`	p95 TTFT reaches 5 seconds, or throughput drops below 31 tokens/second
`red`	TTFT reaches 40 seconds, or throughput drops below 27 tokens/second

The health check does not measure TTFT or throughput directly. It uses waiting and running request counts from provider metrics as proxies for these conditions.

Provider-specific rules

Condition	Status
`/metrics` unavailable or invalid	`red`
`num_requests_waiting` > 0	`yellow`
`num_requests_running` > 20	`red`

Condition	Status
`/metrics` unavailable or invalid	`red`
`num_requests_running` > 58	`yellow`
`num_requests_running` > 63	`red`

Condition	Status
`/v1/models` unavailable	`red`
`/v1/models` responds successfully	`green`
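
The threshold rules in the metrics-based tables above can be expressed as a single parameterized function. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and the thresholds are passed in as arguments rather than tied to a specific provider.

```python
def classify(waiting, running, *, red_running, yellow_running=None, yellow_waiting=None):
    """Classify a provider from its waiting and running request counts.

    Pass None for `waiting` or `running` when /metrics is unavailable or
    invalid; the provider is then red. Red is checked first so that a
    worse condition always wins over yellow.
    """
    if waiting is None or running is None:
        return "red"
    if running > red_running:
        return "red"
    if yellow_running is not None and running > yellow_running:
        return "yellow"
    if yellow_waiting is not None and waiting > yellow_waiting:
        return "yellow"
    return "green"
```

For the table with a waiting-queue rule, this would be called as `classify(waiting, running, red_running=20, yellow_waiting=0)`; for the table with two running-request thresholds, as `classify(waiting, running, red_running=63, yellow_running=58)`.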