janaka.dev

(Note 10 June 2020) Tracing in .NET and AWS App Mesh with OpenTelemetry

10 June 2020

As of this writing OpenTelemetry is still in beta. There isn’t a lot of writing about it yet. Official docs are minimal, especially if you are in .NET Core land. Now throw AWS App Mesh and Fargate into the mix and we are in for a fun ride. Here are notes on what I learnt by messing around with it in a sandbox.

  • Ideally we would want to use Otlp (OpenTelemetry Protocol) as the export format in the app. But Envoy doesn’t support that yet.
  • Envoy tracing config changes in newer Envoy versions. App Mesh Envoy is about 2 versions behind right now (1.12.3).
  • Managing the Envoy tracing config like this isn’t ideal. Envoy version upgrades will involve changes to the tracing config. Appreciate the App Mesh team providing this escape hatch but I’d hope they add first class support. Not having to manually manage Envoy config is one of the main benefits of App Mesh.
  • The Otel .NET SDK is very immature and APIs are very unstable. They are upfront about this. It’s still in Alpha. It’s also behind the Golang and maybe Java SDK. The differences in Zipkin vs Otlp are shown below in an example. It sounds like there’ll be a significant refactor soon and some bits getting pushed into .NET BCL. Aaron from Petabridge wrote more about all this here and concluded that it’s about a year out still.
  • The pipeline architecture of the collector/agent is really nice. Allows one to config multiple pipelines of receivers, processors, and exporters. This makes it easy to send traces to multiple backends and options for switching out. Makes vendor evaluation easier and cheaper. Without cost should also be low.
  • Agent vs Collector - the naming is poor, they are both collectors by function. The name describes the mode it runs in.

    • Agent: runs next to the app e.g. sidecar. This will be the core collector. See code repo. It supports only vendor agnostic protocols like Zipkin or OTLP (OpenTelemetry Protocol).
    • Collector: Runs stand-alone and aggregates traces from several agents then sends to a backend like Zipkin (self-hosted) or LightStep (cloud). This will be the collector with vendor protocol contributions. See code repo.
    • Official docs say “Agent is capable of adding metadata” and “can offer advanced capabilities over the Agent including tail-based sampling”. This implies that there are functional differences between the two binaries.
    • See the official explanation here.

High-level view

image

The agent + collector setup is recommended by OTel.

Important Config Snippets

I’ll link to the working example source code in the future.

startup.cs for Zipkin protocol.

This code propagates trace IDs on outbound calls

using OpenTelemetry.Trace.Configuration; 

public void ConfigureServices(IServiceCollection services)
{
    services.AddControllers();

    services.AddOpenTelemetry((sp, builder) =>
    {
        builder
            //.SetSampler(Samplers.AlwaysSample)
            
            .UseZipkin(options =>
            {
                options.ServiceName = "hello-world-api";
                options.Endpoint = new Uri(zipkinEndpoint);
            })
            .AddRequestAdapter()
            .AddDependencyAdapter(); 
    });
}

At present the API for Otlp is a little different.

  // ref https://github.com/open-telemetry/opentelemetry-dotnet/blob/master/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/TracerBuilderExtensions.cs
  services.AddOpenTelemetry((sp, builder) =>
   {
       builder
           .SetResource(Resources.CreateServiceResource("hello-world-api"))
           .UseOpenTelemetryProtocolExporter (options =>
           {
               options.Credentials = ChannelCredentials.Insecure;
               options.Endpoint = otlpEndpoint;
           })
           .AddRequestAdapter()
           .AddDependencyAdapter(); 
  });

Create a custom App Mesh Envoy image with the override config because it’s the easiest way on Fargate. Not ideal for production.

Envoy (v1.12.3) override config envoy-otel-tracer-config.yaml

tracing:
  http:
    name: envoy.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v2.ZipkinConfig
      collector_cluster: otel-agent
      collector_endpoint: "/api/v1/spans"
      #collector_endpoint: "/api/v2/spans"
      #collector_endpoint_version: HTTP_JSON

static_resources:
  clusters:
  - name: otel-agent
    connect_timeout: 1s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: otel-agent
      endpoints:
      - lb_endpoints:
        - endpoint:
           address:
            socket_address:
             address: 127.0.0.1
             port_value: 9411
FROM 840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.12.3.0-prod
COPY envoy-otel-tracer-config.yaml /tmp/envoy-otel-tracer-config.yaml

Env var that tells the App Mesh Envoy container to use the tracing override config file. The complete task def has 3 containers defs: 1) the app 2) App Mesh Envoy 3) OTel agent.

  {
    "name": "ENVOY_TRACING_CFG_FILE",
    "value": "/tmp/envoy-otel-tracer-config.yaml"
  }

Personal notes and thoughts on web technology, software development, and technical product management by Janaka Abeywardhana. On Github, Twitter, and Instagram

© 2020-2022, Built with Gatsby