Logging and OpenTelemetry in the Grafbase Gateway

The Grafbase Gateway provides logs, traces, and metrics for monitoring gateway operations and errors. By default, it outputs logs to standard output. Additionally, the gateway can send monitoring data to an endpoint that implements the OpenTelemetry protocols.

You can define the level of information by setting the log level command line argument:

--log <LOG_LEVEL>
    Set the logging level, this applies to all spans, logs and trace events.

    Beware that *only* 'off', 'error', 'warn' and 'info' can be used safely in production. More verbose levels, such as 'debug', will include sensitive information like request variables, responses, etc.

    Possible values are: 'off', 'error', 'warn', 'info', 'debug', 'trace' or a custom string. In the last case, the string is passed on to [`tracing_subscriber::EnvFilter`] as is and is only meant for debugging purposes. No stability guarantee is made on the format.

    [env: GRAFBASE_LOG=]
    [default: info]

This setting affects both traces and logs. The default level is info. The debug and trace levels include sensitive details and should not be used in production.

If you want to silence all logs but still export them along with traces and metrics to an OpenTelemetry endpoint, direct standard output and standard error to /dev/null.

The gateway can return the query plan and trace id in the GraphQL response extensions under grafbase for Pathfinder, our GraphQL query tool. This extension is enabled by default and is returned whenever the x-grafbase-telemetry request header is present:

# Default configuration
[telemetry.exporters.response_extension]
# Whether the traceId is exposed
trace_id = true
# Whether the queryPlan is exposed
query_plan = true

# Defines who can access the grafbase response extension.
[[telemetry.exporters.response_extension.access_control]]
rule = "header"
name = "x-grafbase-telemetry"

Access can be denied for everyone with:

[[telemetry.exporters.response_extension.access_control]]
rule = "deny"

It is also possible to require a specific value for the header. Only requests with the right header value receive the grafbase extension; all others do not.

[[telemetry.exporters.response_extension.access_control]]
rule = "header"
name = "x-grafbase-telemetry"
value = "must-be-this-value"

Environment variables can be used to parameterize the configuration:

[[telemetry.exporters.response_extension.access_control]]
rule = "header"
name = "{{ env.HEADER_NAME }}"
value = "{{ env.SECRET }}"
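
As a combined sketch of the options above, the following configuration exposes only the trace id and restricts access to requests that carry a secret header value. The TELEMETRY_SECRET environment variable name is a hypothetical placeholder; all keys are taken from the examples above.

```toml
[telemetry.exporters.response_extension]
# Expose the trace id so a response can be matched with its trace
trace_id = true
# Keep the query plan out of responses
query_plan = false

# Only requests carrying the expected header value receive the extension
[[telemetry.exporters.response_extension.access_control]]
rule = "header"
name = "x-grafbase-telemetry"
# TELEMETRY_SECRET is a placeholder name, not a variable the gateway defines
value = "{{ env.TELEMETRY_SECRET }}"
```

Reading the value from the environment keeps the secret out of the configuration file itself.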

By default, the gateway outputs logs to standard output. Logs can appear in three different formats:

--log-style <LOG_STYLE>
    Set the style of log output

    [env: GRAFBASE_LOG_STYLE=]
    [default: pretty]

    Possible values:
    - pretty: Pretty printed logs, used as the default in the terminal
    - text:   Standard text, used as the default when piping stdout to a file
    - json:   JSON objects

The default style inside a terminal is pretty, which provides ANSI-colored, human-friendly formatting. When piping to a file, text is used instead. The json style delivers logs as JSON objects, which is useful if your logging platform supports structured data.

Logs can also be sent to an OpenTelemetry endpoint by enabling the OpenTelemetry exporter in the configuration:

[telemetry.exporters.otlp]
enabled = true
endpoint = "http://localhost:1234"

You can send logs to a different endpoint than the global OpenTelemetry settings:

[telemetry.logs.exporters.otlp]
enabled = true
endpoint = "http://localhost:1235"

Read more about OpenTelemetry options in the configuration section.

Grafbase Gateway monitors the request lifecycle by providing traces. When you supply a valid access token in the GRAFBASE_ACCESS_TOKEN environment variable, the system automatically sends traces to the Grafbase Dashboard or Grafbase Enterprise platform. The dashboard only displays traces from the Grafbase Gateway. To send traces to a different OpenTelemetry endpoint, configure it in the configuration file. A third-party telemetry platform allows you to combine traces from the gateway with other services in your platform. Traces provide information on the request lifecycle and send data to the OpenTelemetry endpoint from the info level.

You can change settings for tracing in the gateway configuration:

[telemetry.tracing]
sampling = 1
parent_based_sampler = false

  • sampling: Defines the fraction of requests to trace (a floating-point number from 0 to 1). Set this to 1 for testing purposes or with low traffic. For high traffic, sampling every request can be expensive for network, CPU, and storage. (default: 0.15)
  • parent_based_sampler: Enables the parent based sampler mechanism. When enabled, the gateway looks at the request headers to make trace sampling decisions. It falls back to its default sampling strategy when the request doesn't specify a sampling strategy. This option is disabled by default. Only enable it if you control all the clients, because malicious actors could create more load by manipulating sampling. (default: false)

[telemetry.tracing.collect]
max_events_per_span = 128
max_attributes_per_span = 128
max_links_per_span = 128
max_attributes_per_event = 128
max_attributes_per_link = 128

  • max_events_per_span: Maximum number of events recorded per span (default: 128)
  • max_attributes_per_span: Maximum number of attributes recorded per span (default: 128)
  • max_links_per_span: Maximum number of links recorded per span (default: 128)
  • max_attributes_per_event: Maximum number of attributes one event can have (default: 128)
  • max_attributes_per_link: Maximum number of attributes one link can have (default: 128)
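
For a high-traffic deployment, the tracing settings above might be combined as in the following sketch. The specific numbers (a 5% sampling rate and limits of 64) are illustrative assumptions, not recommendations from Grafbase. Keys that are left out are assumed to keep their documented defaults.

```toml
[telemetry.tracing]
# Trace roughly 5% of requests to limit network, CPU, and storage cost
sampling = 0.05
# Ignore client-supplied sampling decisions unless you control all clients
parent_based_sampler = false

[telemetry.tracing.collect]
# Lower than the default of 128 to cap the size of each exported span
max_events_per_span = 64
max_attributes_per_span = 64
```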

The propagation options determine how the gateway propagates tracing context (trace id, parent span id, and extra context) when it receives requests and passes them to subgraphs. The gateway supports multiple common standards. Contact us if you need support for additional formats.

Propagation enables you to link spans created in the gateway with spans created in other services. This helps you debug and monitor the entire request lifecycle. The Grafbase dashboard doesn't propagate traces. To propagate traces, define an additional OpenTelemetry endpoint in the gateway configuration.

[telemetry.tracing.propagation]
trace_context = true
baggage = true
aws_xray = false

  • trace_context: Enable TraceContext propagation through the traceparent header. This is the standard trace parent propagation mechanism in OpenTelemetry. Default: false.
  • baggage: Enable Baggage context propagation through the baggage header. This is the standard context propagation mechanism in OpenTelemetry. Default: false.
  • aws_xray: Enable AWS X-Ray propagation through the x-amzn-trace-id header. This is the built-in trace propagation mechanism in AWS X-Ray.

Enable the OpenTelemetry exporter in the configuration to send traces to an additional OpenTelemetry endpoint:

[telemetry.exporters.otlp]
enabled = true
endpoint = "http://localhost:1234"

You can also send traces to a different endpoint than the global value:

[telemetry.tracing.exporters.otlp]
enabled = true
endpoint = "http://localhost:1235"

Read more about OpenTelemetry options in the OpenTelemetry configuration.

To write spans directly to standard output, turn on the global stdout exporter. This helps during evaluation and debugging:

[telemetry.exporters.stdout]
enabled = true

To enable the stdout exporter only for tracing, use:

[telemetry.tracing.exporters.stdout]
enabled = true
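
When evaluating the gateway locally, it can help to see a span for every request. A minimal sketch, assuming a low-traffic test setup, combines full sampling with the tracing-only stdout exporter:

```toml
[telemetry.tracing]
# Trace every request; suitable for testing or low traffic only
sampling = 1

# Print spans to standard output without enabling metrics output
[telemetry.tracing.exporters.stdout]
enabled = true
```

Both settings are meant for debugging. Full sampling in particular can be expensive under production traffic.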

The Grafbase Gateway emits spans at certain points during request execution.

All the spans will have the following default attributes:

  • busy_ns: Time the span remained active in nanoseconds.
  • code.filepath: Code file path.
  • code.lineno: Line number in the code where this span originated.
  • code.namespace: Module name.
  • idle_ns: Time the span remained idle in nanoseconds.
  • thread.id: Runtime thread ID.
  • thread.name: Runtime thread name.

Span name: <VERB> <PATH>.

The root span monitors the complete request lifecycle. Additional spans descend from this root span.

Attributes:

  • grafbase.kind: The span kind, which is always http-request for the root span.
  • graphql.operations.name: The name or names of the executed operation(s).
  • graphql.operations.type: The type or types of the executed operation(s).
  • graphql.response.errors.count: The number of errors in the response.
  • graphql.response.errors.count_by_code.codes: Distinct error codes in the response.
  • graphql.response.errors.count_by_code.counts: The number of errors for each distinct error code.
  • http.request.body.size: The size of the request body.
  • http.request.header.x-forwarded-for: The client IP address.
  • http.request.header.x-grafbase-client-name: The name of the client.
  • http.request.header.x-grafbase-client-version: The version of the client.
  • http.request.method: The HTTP method.
  • http.response.body.size: The size of the response body.
  • server.address: The server address.
  • url.path: The URL path.
  • user_agent.original: The user agent.

Span name: hook: on-gateway-request.

You'll only see this span if you define the on-gateway-request hook.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: authenticate.

You'll only see this span if you enable authentication.

Span name: rate limit.

You'll only see this span if you enable global rate limiting with Redis storage.

Span name: <OPERATION_NAME>.

Operation execution spans are created for each operation in the request.

Attributes:

  • grafbase.kind: The span kind, which is always graphql-operation for operation execution spans.
  • grafbase.operation.computed_name: The name of the operation. For named operations, this shows the operation name. For unnamed operations, this shows a name derived from the query.
  • graphql.operation.document: The normalized query, with all potentially sensitive data hidden.
  • graphql.operation.type: The type of the operation: query, mutation or subscription.
  • graphql.response_data.is_present: Whether the response data is present.
  • graphql.response.errors.count_by_code: Distinct error codes in the response with their counts.

Span name: prepare operation.

This span plans the operation execution and operates as a child of the operation execution span.

Span name: hook: authorize-edge-pre-execution.

When edges use an @authorized directive with an arguments argument, the gateway shows this span under the operation execution span and repeats it on each accessed edge that has a matching directive.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: hook: authorize-node-pre-execution.

When nodes use an @authorized directive with an arguments argument, the gateway shows this span under the operation execution span and repeats it on each accessed node that has a matching directive.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: <SUBGRAPH_NAME>.

The gateway creates subgraph execution spans for each subgraph in the request. These spans become children of the operation execution span.

Attributes:

  • subgraph.name: The name of the subgraph.
  • grafbase.kind: The span kind, which is always subgraph-graphql-request for subgraph execution spans.
  • graphql.operation.document: The subgraph receives this query. The system replaces all data with variables to prevent exposure of sensitive data.
  • graphql.operation.type: The type of the operation: query, mutation or subscription.
  • graphql.response_data.is_present: Whether the response data is present.

Span name: hook: on-subgraph-request.

This span appears when you define the on-subgraph-request hook and operates as a child of the subgraph execution span.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: rate limit.

This span appears when you enable rate limiting to the subgraph with Redis storage.

Attributes:

  • subgraph.name: The name of the subgraph.

Span name: POST <PATH>.

This span tracks HTTP requests to the subgraph. When you enable retries and requests fail, the gateway creates multiple spans. All spans act as children of the subgraph execution span.

Attributes:

  • http.request.method: The HTTP method.
  • http.response.status_code: The HTTP status code.
  • server.address: The server address.
  • server.port: The server port.

Span name: hook: on-subgraph-response.

The gateway creates this span when you define the on-subgraph-response hook. This span acts as a child of the subgraph execution span.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: hook: on-operation-response.

The gateway creates this span when you define the on-operation-response hook. This span acts as a child of the operation execution span.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: hook: authorize-parent-edge-post-execution.

When edges use an @authorized directive with a fields argument, the gateway shows this span under the root span and repeats it on each accessed edge that has a matching directive.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: hook: authorize-edge-node-post-execution.

When nodes use an @authorized directive with a node argument, the gateway shows this span under the root span and repeats it on each accessed node that has a matching directive.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

Span name: hook: on-http-response.

The gateway creates this span when you define the on-http-response hook. This span acts as a child of the root span.

The span has two child spans:

  • get instance from the pool: Taking the instance from a pool.
  • call instance: Execution of the actual instance code.

The Grafbase Gateway delivers metrics for requests and operations to an OpenTelemetry endpoint. Metrics include counters, histograms, and gauges at various points in the system.

Enable the OpenTelemetry exporter in the configuration to send metrics to an OpenTelemetry endpoint:

[telemetry.exporters.otlp]
enabled = true
endpoint = "http://localhost:1234"

You can send metrics to a separate endpoint as well:

[telemetry.metrics.exporters.otlp]
enabled = true
endpoint = "http://localhost:1235"

Read more about OpenTelemetry options in the configuration section.

You can also write metrics directly to standard output by enabling the global stdout exporter for evaluation and debugging:

[telemetry.exporters.stdout]
enabled = true

Enable it only for metrics if needed:

[telemetry.metrics.exporters.stdout]
enabled = true

Exponential histograms include a Count field, so any histogram can double as a counter metric. If you can't find a specific counter, check whether one of the histograms can serve that purpose.

Metric Name: http.server.request.duration

This exponential histogram measures the time in milliseconds for each HTTP request and helps you track the final response time for those requests. It includes the following attributes:

  • http.response.status_code: The HTTP status code.
  • http.request.method: The HTTP request method.
  • http.route: The request path.
  • network.protocol.version: The HTTP version of the request.
  • server.address: The server's listen address.
  • server.port: The server's listen port.
  • url.scheme: Either http or https, depending on whether TLS is enabled in the gateway.
  • http.headers.x-grafbase-client-name: The name of the client that triggered this request, if available.
  • http.headers.x-grafbase-client-version: The version of the client that triggered this request, if available.
  • graphql.response.status: Indicates whether the underlying GraphQL operation succeeded, if available.

Metric Name: http.server.connected.clients

This up/down counter tracks currently connected clients, incrementing on an incoming request and decrementing upon any response.

Metric Name: http.server.request.body.size

This exponential histogram measures request body sizes.

Metric Name: http.server.response.body.size

This exponential histogram measures response body sizes.

Metric Name: graphql.operation.duration

This exponential histogram measures the time in milliseconds for every valid operation in the GraphQL engine. The metric includes the following attributes:

  • graphql.document: The normalized query of this operation, stripped of all variables. This value cannot contain any private data.
  • graphql.operation.type: The type of the operation (either query, mutation, or subscription).
  • graphql.operation.name: The name of the operation, if provided.
  • graphql.response.status: Indicates if the response succeeded.
  • http.headers.x-grafbase-client-name: The name of the client that triggered this request, if available.
  • http.headers.x-grafbase-client-version: The version of the client that triggered this request, if available.

Metric Name: graphql.operation.errors

This counter tracks distinct GraphQL errors per request. The metric contains the following attributes:

  • graphql.response.error.code: The error code returned to the user.
  • graphql.operation.name: The name of the operation, if present.
  • http.headers.x-grafbase-client-name: The name of the client, if present.
  • http.headers.x-grafbase-client-version: The version of the client, if present.

Metric Name: graphql.operation.batch.size

This exponential histogram measures the size of request batches sent to the engine. Its count gives the total number of batched requests, and its recorded values give the number of requests in each batch.

Metric Name: graphql.subgraph.request.duration

This exponential histogram measures the time in milliseconds for every subgraph request. It helps track execution time and includes the following attributes:

  • graphql.subgraph.name: The requested subgraph's name.
  • graphql.subgraph.response.status: Indicates if the response succeeded.
  • http.response.status_code: The HTTP status code.

Metric Name: graphql.subgraph.request.retries

This counter tracks retried subgraph requests. To enable this counter, you must enable retries. The counter increments when a subgraph request fails and the engine retries it. The metric includes the following attributes:

  • graphql.subgraph.name: The requested subgraph's name.
  • graphql.subgraph.aborted: Indicates if the retries stopped and if the request became an error.

Metric Name: graphql.subgraph.request.body.size

This exponential histogram measures subgraph request body sizes in bytes. The metric includes the following attribute:

  • graphql.subgraph.name: The requested subgraph's name.

Metric Name: graphql.subgraph.response.body.size

This exponential histogram measures successful subgraph response body sizes in bytes. The metric includes the following attribute:

  • graphql.subgraph.name: The requested subgraph's name.

Metric Name: graphql.subgraph.request.inflight

This up/down counter tracks in-flight subgraph requests. It increments when requesting a subgraph and decrements upon any response. The metric includes the following attribute:

  • graphql.subgraph.name: The requested subgraph's name.

Metric Name: graphql.subgraph.request.cache.hit

This counter tracks hits of subgraph entity caches. Enable this counter by activating entity caching. The metric includes the following attribute:

  • graphql.subgraph.name: The requested subgraph's name.

Metric Name: graphql.subgraph.request.cache.miss

This counter tracks misses of subgraph entity caches. Enable this counter by activating entity caching. The metric includes the following attribute:

  • graphql.subgraph.name: The requested subgraph's name.

Metric Name: graphql.operation.cache.hit

This counter tracks hits for operation plan caches.

Metric Name: graphql.operation.cache.miss

This counter tracks misses for operation plan caches.

Metric Name: graphql.operation.prepare.duration

This exponential histogram measures the time in milliseconds taken to prepare an operation. This includes:

  • Fetching a trusted document, if enabled and available.
  • Fetching a query plan from the in-memory cache.
  • If the plan is not cached, parsing the query into an AST and then determining the plan.

The metric includes the following attributes:

  • graphql.operation.name: The name of the operation, if present.
  • graphql.document: The normalized operation if parsing succeeds.
  • graphql.operation.success: Indicates if the preparation finished successfully.

Metric Name: grafbase.hook.duration

This exponential histogram measures the time in milliseconds taken to execute a hook. The metric includes the following attributes:

  • grafbase.name.hook: The name of the hook function.
  • grafbase.hook.status: Indicates if the hook call succeeded (SUCCESS), or if it failed due to errors from Grafbase code (HOST_ERROR), or from user code (GUEST_ERROR).

Metric Name: grafbase.hook.pool.instances.busy

This counter tracks the number of busy instances in the hook instance pool. Each instance processes one request at a time, which is why the gateway uses a pool to handle multiple requests concurrently. The metric includes the following attribute:

  • grafbase.hook.interface: The instantiated hook interface.

Metric Name: grafbase.gateway.access_log.pending

This counter measures the number of access log events not yet written to the access log file. Read more on access logs.

Metric Name: grafbase.gateway.rate_limit.duration

This exponential histogram measures the time in milliseconds taken to query the current request rate from Redis. This metric requires Redis-based rate limiting to be enabled.

Metric Name: gdn.request.duration

This exponential histogram measures the time in milliseconds to fetch a graph from the Graph Delivery Network. This metric only activates in hybrid mode. The metric includes the following attributes:

  • server.address: The Graph Delivery Network endpoint URL.
  • gdn.response.kind: The response status kind, either new, unchanged, http_error, or gdn_error.
  • http.response.status_code: The status code of the request.

Define OpenTelemetry settings in the telemetry block of your Gateway configuration:

[telemetry]
service_name = "grafbase-gateway"

The service_name appears in all traces, metrics, and logs and should be unique in your system.

Grafbase sends a standard set of resource attributes for every user. You can also define your own attributes, available in all logs, traces, and metrics:

[telemetry.resource_attributes]
custom_key = "custom_value"
other_key = "other_value"

You can define exporter settings globally for traces, logs, and metrics. If you need different settings for logs, tracing, or metrics, prefix the exporter settings with the appropriate word. For instance, custom settings for tracing use the key telemetry.tracing.exporters.otlp.
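
The prefixing rule can be sketched as follows. This example assumes three collectors on placeholder local ports: a global exporter, plus overrides for traces and metrics. Port 1236 is a hypothetical value chosen for illustration.

```toml
# Global exporter: used by any signal that has no override below
[telemetry.exporters.otlp]
enabled = true
endpoint = "http://localhost:1234"

# Traces go to a separate collector
[telemetry.tracing.exporters.otlp]
enabled = true
endpoint = "http://localhost:1235"

# Metrics go to a third collector (placeholder port)
[telemetry.metrics.exporters.otlp]
enabled = true
endpoint = "http://localhost:1236"
```

Logs have no override in this sketch, so they would follow the global exporter settings.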

Traces and metrics can also be sent to standard output (logs are always written there):

[telemetry.exporters.stdout]
enabled = true
timeout = 60

  • enabled: Enables the stdout exporter (default: false).
  • timeout: Time in seconds data remains in memory if the collector does not collect it promptly (default: 60).

Send traces, metrics, and logs to an external OpenTelemetry collector:

[telemetry.exporters.otlp]
enabled = true
endpoint = "http://localhost:1234"
protocol = "grpc"
timeout = 60

  • enabled: Enables the OpenTelemetry exporter (default: false).
  • endpoint: Defines the URL for the OpenTelemetry collector.
  • protocol: Either grpc or http (default: grpc).
  • timeout: Time in seconds data remains in memory if the collector does not collect it promptly (default: 60).

Sending a request for every single span, trace, and metric event would be wasteful, so the exporter batches events and sends them at regular intervals. Configure the OpenTelemetry batch settings:

[telemetry.exporters.otlp.batch_export]
scheduled_delay = 5
max_queue_size = 2048
max_export_batch_size = 512
max_concurrent_exports = 1

  • scheduled_delay: Time in seconds between consecutive requests (default: 5).
  • max_queue_size: Maximum queued items for delayed processing. If the queue fills, the system drops events (default: 2048).
  • max_export_batch_size: Maximum number of events in a single batch. If more events are collected before the scheduled delay, it queues them (default: 512).
  • max_concurrent_exports: Number of concurrent senders processing batches (default: 1).
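
If the queue fills up during traffic bursts and events are dropped, the batch settings can be adjusted. The values below are illustrative assumptions rather than tuned recommendations. The right numbers depend on your traffic and on what your collector can absorb.

```toml
[telemetry.exporters.otlp.batch_export]
# Send more often than the default of 5 seconds
scheduled_delay = 2
# Hold more events than the default of 2048 before dropping any
max_queue_size = 8192
# Keep each batch at the default size
max_export_batch_size = 512
# Allow two senders instead of the default of 1
max_concurrent_exports = 2
```

A larger queue trades memory for fewer dropped events, and more concurrent exports put more load on the collector.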

If using grpc as the protocol, the Gateway will use the following settings.

For collectors using TLS with a custom certificate, specify the TLS settings:

[telemetry.exporters.otlp.grpc.tls]
domain_name = "custom_name"
key = "/path/to/key.pem"
cert = "/path/to/cert.pem"
ca = "/path/to/ca.crt"

  • domain_name: The domain name against which to verify the server's TLS certificate.
  • key: Path to the secret key.
  • cert: Path to the X509 certificate file in PEM format.
  • ca: Path to the X509 CA certificate file in PEM format.

If needed, define custom headers for gRPC collectors:

[[telemetry.exporters.otlp.grpc.headers]]
authorization = "Bearer {{ env.GRPC_TOKEN }}"

[[telemetry.exporters.otlp.grpc.headers]]
custom = "static value"
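
Put together, a gRPC exporter with TLS and an authorization header might look like the sketch below. The collector hostname, the https scheme, and the port are placeholders assumed for illustration. The file paths are the same placeholders used above.

```toml
[telemetry.exporters.otlp]
enabled = true
# Placeholder collector address; replace with your own
endpoint = "https://otel-collector.example.com:4317"
protocol = "grpc"
timeout = 60

[telemetry.exporters.otlp.grpc.tls]
# Name the server certificate is verified against (placeholder)
domain_name = "otel-collector.example.com"
key = "/path/to/key.pem"
cert = "/path/to/cert.pem"
ca = "/path/to/ca.crt"

# Token is read from the environment rather than stored in the file
[[telemetry.exporters.otlp.grpc.headers]]
authorization = "Bearer {{ env.GRPC_TOKEN }}"
```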

If you set the protocol to http, the Gateway will use the following settings. Define custom headers to send with every request:

[[telemetry.exporters.otlp.http.headers]]
authorization = "Bearer {{ env.GRPC_TOKEN }}"

[[telemetry.exporters.otlp.http.headers]]
custom = "static value"

Currently, the http exporter does not support TLS. If you need TLS, use the grpc exporter.