Logdriver Deployment

The Logdriver is deployed alongside the TSP. It exposes health check endpoints and accepts optional environment variables that can be used to tune its performance.

Configuration

Outside of the configuration mentioned in the startup section of the overview, there are several optional environment variables that allow for tuning. In general, we recommend leaving these unset (which causes the container to use the default values) unless you are instructed to adjust them to resolve an issue. An example of overriding these variables follows the list below.

  • LOGDRIVER_CHANNEL_CAPACITY. Default: 1000. Controls the number of messages that can be held in buffers between logdriver pipeline stages. Increasing this will have a memory impact.
  • LOGDRIVER_SINK_BATCH_SIZE. Default: 1000. Maximum number of events that can be bundled into a single batch call to a tenant’s logging system. Increasing this may slow down network calls to cloud logging sinks, but may allow for faster draining of high volume tenants’ buffers.
  • LOGDRIVER_BUFFER_POLL_INTERVAL. Default: 2000. Interval (in milliseconds) between each reaping pass of tenant buffers. Decreasing this will increase data rate but may result in a buildup of uncompleted network calls that can eventually use up the container’s network resources.
  • LOGDRIVER_CONFIG_REFRESH_INTERVAL. Default: 600. Interval (in seconds) between each logdriver configuration cache refresh.
  • LOGDRIVER_CHANNEL_TIMEOUT. Default: 250. Time (in milliseconds) that pipeline channel sends are allowed before they are abandoned.
  • LOGDRIVER_EVENT_PRODUCER_URL. Default: "tcp://localhost:5555". The address from which the Logdriver pulls events.
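
As a sketch (not a recommendation of specific values), the following hypothetical Docker Compose fragment shows how a couple of these variables could be overridden on the Logdriver service:

YAML
services:
  tenant-security-logdriver:
    image: tenant-security-logdriver
    environment:
      # Illustrative values only; omit these entries to use the defaults.
      - LOGDRIVER_CHANNEL_CAPACITY=2000
      - LOGDRIVER_SINK_BATCH_SIZE=500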

Health and Liveness Checks

The Docker container also exposes endpoints for checking its liveness and health. The checks are modeled on Kubernetes lifecycle concepts. The exposed URLs and their meanings are:

  • /health: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is shutting down or is still initializing.
  • /live: Returns a 200 status code when the container is not shutting down. Returns a 500 status code when the server is shutting down.
  • /ready: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is not ready to accept requests.

The container will not report as being “ready” until it has retrieved and decrypted the initial set of tenant logging configurations from the Configuration Broker. If the Logdriver is overloaded, it will also report 500 NOT_READY until it is able to work through some of its logging backlog.

Each of these health endpoints is served on port 9001 within the Docker image.
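
If you orchestrate with Docker Compose rather than Kubernetes, the readiness endpoint can back a container healthcheck along these lines. This is a sketch that assumes a curl binary is available inside the image, which may not be the case; substitute whatever probe tooling your image provides.

YAML
services:
  tenant-security-logdriver:
    image: tenant-security-logdriver
    healthcheck:
      # Assumes curl exists in the image; swap in another probe command if it does not.
      test: ["CMD", "curl", "-f", "http://localhost:9001/ready"]
      interval: 10s
      timeout: 5s
      retries: 3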

Performance

Logging performance is measured across two related dimensions: per-tenant and global.

Per-Tenant Performance

Logdriver enforces a maximum rate that a single tenant can sustain. By default, this is 500 log events per second per tenant, although this is tunable via configuration. This limit is in place primarily to provide fairness between tenants and to avoid overwhelming any particular log sink.

Global Performance

Expected sustained throughput is 5000-7000 operations per second, depending on computing and network resources; burst performance can be higher. Keep per-tenant fairness in mind when testing the Logdriver with a small number of tenants: total throughput is capped by the 500 log events per second per tenant limit, so, for example, a test with only four tenants can drive at most 2000 events per second regardless of the global capacity.

Resource Usage

Memory

There are two major factors related to memory usage of the container.

There is a flat amount of memory, around 500B per tenant, used to store the information needed to make logging calls for each tenant.

Logdriver will also buffer a maximum of 350,000 events across all tenants in memory before it goes into a NOT_READY state and starts dropping incoming events. Events take roughly 2KB of memory each, so this portion of memory use is capped at around 700MB.

We recommend 1256MiB as a default that will cover most use cases.

CPU

After startup, CPU is primarily used to marshal events through to their log sinks, mostly batching and making asynchronous HTTP calls. Logdriver also needs CPU to decrypt logging configurations, which it refreshes on a 10 minute schedule. To allow events to be processed and sent out while decryption is taking place, a minimum of 2 CPUs should be given to the container.

We recommend 2 CPUs as a default for most use cases.
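
Expressed as Kubernetes container resources, these recommendations correspond to the requests and limits used for the Logdriver container in the StatefulSet example later on this page:

YAML
resources:
  requests:
    cpu: 2
    memory: 1256Mi
  limits:
    cpu: 2
    memory: 1256Mi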

Failure Modes

Failure to Deliver Log Messages

Sustained load in excess of the per-tenant or global guidelines will eventually cause the Logdriver service to start rejecting messages. Warnings will appear in the TSP and Logdriver’s logs. The TSP will continue to function, but the log messages will not be delivered to the tenant’s logging system.

This behavior can also be triggered by very large bursts of activity.

In this case, the Logdriver container’s readiness check (/ready) will start to return 500. Traffic should be directed to another TSP+Logdriver set until /ready returns a 200. Ensure that whatever orchestration you’re using doesn’t remove all TSPs from rotation this way, since the TSP does still function in this state with reduced logging reliability and slower response times.

Troubleshooting

File Descriptor Limits

File descriptor limit errors may show up if you have too much concurrent traffic flowing through a single Logdriver instance. If you notice these errors in the logs, you should either increase the file descriptor limit on that Logdriver or scale horizontally to provide more Logdrivers for the TSP that the overloaded one is serving.
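
If the Logdriver is running under Docker Compose, one way to raise the limit is a ulimits entry on its service definition; the values below are placeholders to adjust for your environment, and a Kubernetes deployment would instead raise the limit at the node or container runtime level.

YAML
services:
  tenant-security-logdriver:
    image: tenant-security-logdriver
    ulimits:
      nofile:
        # Placeholder values; tune for your expected concurrency.
        soft: 65536
        hard: 65536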

Example Deployments

Example Docker Compose

We don’t recommend running a simple Docker Compose setup like this in production, but it is useful for seeing the basics of what is needed to run the Tenant Security Proxy (TSP) and Logdriver (LD) together. If you need a more robust production example, see the Kubernetes example below.

YAML
version: "3.3" services: tenant-security-proxy: env_file: - ./config-broker-config.conf ports: - "7777:7777" - "9000:9000" image: tenant-security-proxy links: - tenant-security-logdriver tenant-security-logdriver: environment: - LOGDRIVER_EVENT_PRODUCER_URL=tcp://tenant-security-proxy:5555 env_file: - ./config-broker-config.conf ports: - "9001:9001" image: tenant-security-logdriver volumes: - type: bind source: /tmp target: /logdriver

Example Kubernetes Deployment

When running the TSP and LD together, LD needs a persistent disk to store log messages, to ensure nothing gets lost. To give each LD its own disk, we use a StatefulSet with a volumeClaimTemplate. The StatefulSet is a bit more complicated than the Deployment we used for TSP alone.

Here’s the StatefulSet, with both its required headless service (tenant-security-proxy-sts) and the service that will be used by TSP clients (tenant-security-proxy).

YAML
# This is the client-facing service.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy
  labels:
    app: tenant-security-proxy
spec:
  type: NodePort
  ports:
    - port: 7777
      targetPort: 7777
      name: http
  selector:
    app: tenant-security-proxy
---
# This is the headless service used by the StatefulSet to keep track of its replicas.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy-sts
spec:
  ports:
    - port: 7777
      name: http
  clusterIP: None
  selector:
    app: tenant-security-proxy
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant-security-proxy
spec:
  # We're not setting replicas here because that's controlled by a HorizontalPodAutoscaler.
  selector:
    matchLabels:
      app: tenant-security-proxy
  serviceName: tenant-security-proxy-sts
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: tenant-security-proxy
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '7777'
    spec:
      securityContext:
        runAsUser: 2 # Any non-root user will do.
        runAsGroup: 2
        fsGroup: 2
        runAsNonRoot: true
      containers:
        - name: tenant-security-proxy
          image: gcr.io/ironcore-images/tenant-security-proxy:{CHOSEN_TAG}
          resources:
            requests:
              cpu: 2
              memory: 1256Mi
            limits:
              cpu: 2
              memory: 1256Mi
          envFrom:
            - secretRef:
                # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
                name: tsp-secrets
          env:
            - name: RUST_LOG
              value: info # Values are trace, debug, info, warn, error
          ports:
            - containerPort: 9000
              name: health
            - containerPort: 7777
              name: http
          livenessProbe:
            httpGet:
              path: /live
              port: health
          readinessProbe:
            httpGet:
              path: /ready
              port: health
          securityContext:
            allowPrivilegeEscalation: false
        - name: logdriver
          image: gcr.io/ironcore-images/tenant-security-logdriver:{CHOSEN_TAG}
          resources:
            requests:
              cpu: 2
              memory: 1256Mi
            limits:
              cpu: 2
              memory: 1256Mi
          envFrom:
            - secretRef:
                # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
                name: tsp-secrets
          env:
            - name: RUST_LOG
              value: info
          ports:
            - containerPort: 9001
              name: health
          livenessProbe:
            httpGet:
              path: /live
              port: health
          readinessProbe:
            httpGet:
              path: /ready
              port: health
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - mountPath: /logdriver
              name: logdriver
  volumeClaimTemplates:
    - metadata:
        name: logdriver
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

Autoscaling

The autoscaling configuration for this combined StatefulSet is nearly identical to the one for the simpler TSP Deployment.

YAML
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: tenant-security-proxy
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: tenant-security-proxy
  metrics:
    - pods:
        metric:
          name: tsp_requests_in_flight
        target:
          averageValue: 4.5
          type: AverageValue
      type: Pods
    # If you're using Kubernetes 1.20 or later, change this to a ContainerResource.
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
