Logdriver Deployment

The Logdriver is deployed alongside the TSP. It exposes health check endpoints and supports optional environment variables for tuning performance.

Configuration

Beyond the configuration described in the startup section of the overview, several optional environment variables allow for tuning. In general, we recommend that you don’t set these (so the container uses the default values) unless you are instructed to adjust them to resolve an issue.
  • LOGDRIVER_CHANNEL_CAPACITY. Default: 1000. Controls the number of messages that can be held in buffers between logdriver pipeline stages. Increasing this will have a memory impact.
  • LOGDRIVER_SINK_BATCH_SIZE. Default: 1000. Maximum number of events that can be bundled into a single batch call to a tenant’s logging system. Increasing this may slow down network calls to cloud logging sinks, but may allow for faster draining of high volume tenants' buffers.
  • LOGDRIVER_BUFFER_POLL_INTERVAL. Default: 2000. Interval (in milliseconds) between each reaping pass of tenant buffers. Decreasing this will increase data rate but may result in a buildup of uncompleted network calls that can eventually use up the container’s network resources.
  • LOGDRIVER_CONFIG_REFRESH_INTERVAL. Default: 600. Interval (in seconds) between each logdriver configuration cache refresh.
  • LOGDRIVER_CHANNEL_TIMEOUT. Default: 250. Time (in milliseconds) that pipeline channel sends are allowed before they are abandoned.
  • LOGDRIVER_EVENT_PRODUCER_URL. Default: "tcp://localhost:5555". Target from which to pull events.
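As a quick sketch, any of these variables can be overridden when starting the container; the values shown here are illustrative, not recommendations:

```sh
# Illustrative only: start Logdriver with two tuning variables overridden.
# Anything left unset falls back to its default.
docker run \
  -e LOGDRIVER_CHANNEL_CAPACITY=2000 \
  -e LOGDRIVER_SINK_BATCH_SIZE=500 \
  -p 9001:9001 \
  tenant-security-logdriver
```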

Health and Liveness Checks

The Docker container also exposes endpoints for checking the liveness and health of the container. The checks are implemented based on the Kubernetes lifecycle concepts. The exposed URLs and their meanings are:
  • /health: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is shutting down or is still initializing.
  • /live: Returns a 200 status code when the container is not shutting down. Returns a 500 status code when the server is shutting down.
  • /ready: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is not ready to accept requests.
The container will not report as being “ready” until it has retrieved and decrypted the initial set of tenant logging configurations from the Configuration Broker. If the Logdriver is overloaded it will also report 500 NOT_READY until it is able to work through some of its logging backlog.
Each of these health endpoints is served on port 9001 within the Docker image.
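As an example, the checks can be exercised with curl against a running container (assuming port 9001 is published on localhost):

```sh
# curl -f exits non-zero on a 500 response, so these work in scripts.
curl -sf http://localhost:9001/health && echo "initialized"
curl -sf http://localhost:9001/live   && echo "not shutting down"
curl -sf http://localhost:9001/ready  && echo "ready for traffic"
```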

Performance

Logging performance can be measured across two related dimensions: per-tenant and global.

Per-Tenant Performance

Logdriver enforces a maximum rate that a single tenant can sustain. By default, this is 500 log events per second per tenant, though the limit is tunable via configuration. It exists primarily to provide fairness between tenants and to avoid overwhelming any particular log sink.

Global Performance

Expected sustained throughput is 5000-7000 events per second, depending on computing and network resources. Burst performance can be higher. Keep per-tenant fairness in mind when testing the Logdriver with a small number of tenants: each tenant is capped at 500 log events per second, so total throughput only approaches the global limit when the test load is spread across enough tenants.
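The interaction between the two limits can be sketched with simple arithmetic (figures are the defaults and guidelines quoted above):

```python
# The per-tenant cap and global capacity jointly bound sustained throughput.
PER_TENANT_CAP = 500   # log events per second per tenant (default limit)
GLOBAL_CAP = 5000      # low end of expected sustained events per second

def max_sustained_throughput(num_tenants: int) -> int:
    """Upper bound on total events/sec for a test with num_tenants tenants."""
    return min(GLOBAL_CAP, num_tenants * PER_TENANT_CAP)

print(max_sustained_throughput(4))   # a 4-tenant test tops out at 2000 events/sec
print(max_sustained_throughput(20))  # with 20+ tenants the global limit governs: 5000
```

This is why a load test with only a handful of tenants will report numbers well below the global capacity.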

Resource Usage

Memory

There are two major factors in the container’s memory usage.
There is a flat amount of memory, around 500 bytes per tenant, used to store the information needed to make logging calls on that tenant’s behalf.
Logdriver will also buffer a maximum of 350,000 events across all tenants in memory before it goes into a NOT_READY state and drops events that it receives. Events take roughly 2KB of memory each, so this portion of memory use is capped at around 700MB.
We recommend 1256MiB as a default that will cover most use cases.
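That recommendation can be sanity-checked with the approximate figures from this section (a back-of-the-envelope sketch, not a measurement):

```python
# Rough worst-case memory budget using the approximate figures above:
# ~2KB per buffered event, a 350,000-event global buffer cap, and ~500 bytes
# of per-tenant bookkeeping.
EVENT_BYTES = 2 * 1024
MAX_BUFFERED_EVENTS = 350_000
PER_TENANT_BYTES = 500

def worst_case_buffer_mib(num_tenants: int) -> float:
    """Approximate worst-case MiB used by event buffers plus tenant bookkeeping."""
    total_bytes = MAX_BUFFERED_EVENTS * EVENT_BYTES + num_tenants * PER_TENANT_BYTES
    return total_bytes / (1024 * 1024)

print(round(worst_case_buffer_mib(10_000)))  # ~688 MiB, comfortably under 1256MiB
```

The remainder of the 1256MiB allowance covers the process itself and transient allocations.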

CPU

After startup, CPU is primarily used to marshal events through to their log sinks, mostly batching and making asynchronous HTTP calls. Logdriver also needs CPU to decrypt logging configurations, which it pulls on a 10 minute schedule. To allow events to be processed and sent out while decryption is taking place, a minimum of 2 CPUs should be given to the container.
We recommend 2 CPUs as a default for most use cases.

Failure Modes

Failure to Deliver Log Messages

Sustained load in excess of the per-tenant or global guidelines will eventually cause the Logdriver service to start rejecting messages. Warnings will appear in the TSP and Logdriver’s logs. The TSP will continue to function, but the log messages will not be delivered to the tenant’s logging system.
This behavior can also be triggered by very large bursts of activity.
In this case, the Logdriver container’s readiness check (/ready) will start to return 500. Traffic should be directed to another TSP+Logdriver pair until /ready returns 200 again. Ensure that whatever orchestration you’re using doesn’t remove all TSPs from rotation this way, since a TSP in this state still functions, just with reduced logging reliability and slower response times.
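A deployment script might gate on the readiness endpoint before returning an instance to rotation; a minimal sketch, assuming the endpoint is reachable on localhost:9001:

```sh
# Wait until the Logdriver reports ready; curl -f fails on the 500 response.
until curl -sf http://localhost:9001/ready > /dev/null; do
  echo "logdriver not ready yet; waiting..."
  sleep 5
done
echo "logdriver ready; safe to restore traffic"
```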

Troubleshooting

File Descriptor Limits

File descriptor limit errors may show up if too much concurrent traffic is flowing through a single Logdriver instance. If you notice these errors in the logs, either increase the file descriptor limit on that Logdriver or scale horizontally to provide more Logdrivers to the TSP the overloaded one is serving.
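With Docker Compose, for example, the limit can be raised on the service itself; the values here are illustrative:

```yaml
  tenant-security-logdriver:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```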

Example Deployments

Example Docker Compose

We don’t recommend running a simple Docker Compose setup like this in production, but it is useful for seeing the basics of what is needed to run the Tenant Security Proxy (TSP) and Logdriver (LD) together. If you need a more robust production example, see the Kubernetes example.
services:
  tenant-security-proxy:
    env_file:
      - ./config-broker-config.conf
    ports:
      - "7777:7777"
      - "9000:9000"
    image: tenant-security-proxy
    links:
      - tenant-security-logdriver
  tenant-security-logdriver:
    environment:
      - LOGDRIVER_EVENT_PRODUCER_URL=tcp://tenant-security-proxy:5555
    env_file:
      - ./config-broker-config.conf
    ports:
      - "9001:9001"
    image: tenant-security-logdriver
    volumes:
      - type: bind
        source: /tmp
        target: /logdriver

Example Kubernetes Deployment

When running the TSP and LD together, LD needs a persistent disk to store log messages, to ensure nothing gets lost. To give each LD its own disk, we use a StatefulSet with a volumeClaimTemplate. The StatefulSet is a bit more complicated than the Deployment we used for TSP alone.
Here’s the StatefulSet, with both its required headless service (tenant-security-proxy-sts) and the service that will be used by TSP clients (tenant-security-proxy).
# This is the client-facing service.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy
  labels:
    app: tenant-security-proxy
spec:
  type: NodePort
  ports:
  - port: 7777
    targetPort: 7777
    name: http
  selector:
    app: tenant-security-proxy

---

# This is the headless service used by the StatefulSet to keep track of its replicas.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy-sts
spec:
  ports:
    - port: 7777
      name: http
  clusterIP: None
  selector:
    app: tenant-security-proxy

---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant-security-proxy
spec:
  # We're not setting replicas here because that's controlled by a HorizontalPodAutoscaler.
  selector:
    matchLabels:
      app: tenant-security-proxy
  serviceName: tenant-security-proxy-sts
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: tenant-security-proxy
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '7777'
    spec:
      securityContext:
        runAsUser: 2 # Any non-root user will do.
        runAsGroup: 2
        fsGroup: 2
        runAsNonRoot: true
      containers:
      - name: tenant-security-proxy
        image: gcr.io/ironcore-images/tenant-security-proxy:{CHOSEN_TAG}
        resources:
          requests:
            cpu: 2
            memory: 1256Mi
          limits:
            cpu: 2
            memory: 1256Mi
        envFrom:
        - secretRef:
            # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
            name: tsp-secrets
        env:
        - name: RUST_LOG
          value: info # Values are trace, debug, info, warn, error
        ports:
        - containerPort: 9000
          name: health
        - containerPort: 7777
          name: http
        livenessProbe:
          httpGet:
            path: /live
            port: health
        readinessProbe:
          httpGet:
            path: /ready
            port: health
        securityContext:
          allowPrivilegeEscalation: false
      - name: logdriver
        image: gcr.io/ironcore-images/tenant-security-logdriver:{CHOSEN_TAG}
        resources:
          requests:
            cpu: 2
            memory: 1256Mi
          limits:
            cpu: 2
            memory: 1256Mi
        envFrom:
        - secretRef:
            # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
            name: tsp-secrets
        env:
        - name: RUST_LOG
          value: info
        ports:
        - containerPort: 9001
          name: health
        livenessProbe:
          httpGet:
            path: /live
            port: health
        readinessProbe:
          httpGet:
            path: /ready
            port: health
        securityContext:
          allowPrivilegeEscalation: false
        volumeMounts:
          - mountPath: /logdriver
            name: logdriver
  volumeClaimTemplates:
    - metadata:
        name: logdriver
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

Autoscaling

The autoscaling configuration for this combined StatefulSet is nearly identical to the one for the simpler TSP Deployment.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: tenant-security-proxy
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: tenant-security-proxy
  metrics:
    - pods:
        metric:
          name: tsp_requests_in_flight
        target:
          averageValue: 4.5
          type: AverageValue
      type: Pods
    # If you're using Kubernetes 1.20 or later, change this to a ContainerResource.
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
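For reference, on Kubernetes 1.20 or later the CPU metric above can be scoped to a single container using a ContainerResource metric; a sketch targeting the TSP container from the StatefulSet:

```yaml
    - containerResource:
        name: cpu
        container: tenant-security-proxy
        target:
          averageUtilization: 80
          type: Utilization
      type: ContainerResource
```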