Logdriver Deployment

The Logdriver is deployed alongside the TSP. It exposes health check endpoints and supports optional environment variables for tuning performance.

Configuration

Beyond the configuration described in the startup section of the overview, several optional environment variables allow for tuning. In general, we recommend that you don’t set these (so the container uses the default values) unless you are instructed to adjust them to resolve an issue.
  • LOGDRIVER_CHANNEL_CAPACITY. Default: 1000. Controls the number of messages that can be held in buffers between logdriver pipeline stages. Increasing this will have a memory impact.
  • LOGDRIVER_SINK_BATCH_SIZE. Default: 1000. Maximum number of events that can be bundled into a single batch call to a tenant’s logging system. Increasing this may slow down network calls to cloud logging sinks, but may allow for faster draining of high volume tenants' buffers.
  • LOGDRIVER_BUFFER_POLL_INTERVAL. Default: 2000. Interval (in milliseconds) between each reaping pass of tenant buffers. Decreasing this will increase data rate but may result in a buildup of uncompleted network calls that can eventually use up the container’s network resources.
  • LOGDRIVER_CONFIG_REFRESH_INTERVAL. Default: 600. Interval (in seconds) between each logdriver configuration cache refresh.
  • LOGDRIVER_CHANNEL_TIMEOUT. Default: 250. Time (in milliseconds) that pipeline channel sends are allowed before they are abandoned.
  • LOGDRIVER_EVENT_PRODUCER_URL. Default: "tcp://localhost:5555". Target from which to pull events.
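As a quick sketch, any of these variables can be overridden when starting the container; the values shown here are illustrative, not recommendations:

```sh
# Illustrative only: start Logdriver with two tuning variables overridden.
# Anything left unset falls back to its default.
docker run \
  -e LOGDRIVER_CHANNEL_CAPACITY=2000 \
  -e LOGDRIVER_SINK_BATCH_SIZE=500 \
  -p 9001:9001 \
  tenant-security-logdriver
```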

Health and Liveness Checks

The Docker container also exposes endpoints for checking the liveness and health of the container. The checks are implemented based on the Kubernetes lifecycle concepts. The exposed URLs and their meanings are:
  • /health: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is shutting down or is still initializing.
  • /live: Returns a 200 status code when the container is not shutting down. Returns a 500 status code when the server is shutting down.
  • /ready: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is not ready to accept requests.
The container will not report as being “ready” until it has retrieved and decrypted the initial set of tenant logging configurations from the Configuration Broker. If the Logdriver is overloaded it will also report 500 NOT_READY until it is able to work through some of its logging backlog.
Each of these health endpoints is served on port 9001 within the Docker image.
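As an example, the checks can be exercised with curl against a running container (assuming port 9001 is published on localhost):

```sh
# curl -f exits non-zero on a 500 response, so these work in scripts.
curl -sf http://localhost:9001/health && echo "initialized"
curl -sf http://localhost:9001/live   && echo "not shutting down"
curl -sf http://localhost:9001/ready  && echo "ready for traffic"
```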

Performance

Logging performance can be measured across two related dimensions: per-tenant and global.

Per-Tenant Performance

Logdriver enforces a maximum rate that a single tenant can sustain. By default, this is 500 log events per second per tenant, though the limit is tunable via configuration. It exists primarily to provide fairness between tenants and to avoid overwhelming any particular log sink.

Global Performance

Expected sustained throughput is 5000-7000 events per second, depending on computing and network resources. Burst performance can be higher. Keep per-tenant fairness in mind when testing the Logdriver with a small number of tenants: each tenant is capped at 500 log events per second, so total throughput only approaches the global limit when the test load is spread across enough tenants.
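The interaction between the two limits can be sketched with simple arithmetic (figures are the defaults and guidelines quoted above):

```python
# The per-tenant cap and global capacity jointly bound sustained throughput.
PER_TENANT_CAP = 500   # log events per second per tenant (default limit)
GLOBAL_CAP = 5000      # low end of expected sustained events per second

def max_sustained_throughput(num_tenants: int) -> int:
    """Upper bound on total events/sec for a test with num_tenants tenants."""
    return min(GLOBAL_CAP, num_tenants * PER_TENANT_CAP)

print(max_sustained_throughput(4))   # a 4-tenant test tops out at 2000 events/sec
print(max_sustained_throughput(20))  # with 20+ tenants the global limit governs: 5000
```

This is why a load test with only a handful of tenants will report numbers well below the global capacity.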

Resource Usage

Memory

There are two major factors in the container’s memory usage.
There is a flat amount of memory, around 500 bytes per tenant, used to store the information needed to make logging calls on that tenant’s behalf.
Logdriver will also buffer a maximum of 350,000 events across all tenants in memory before it goes into a NOT_READY state and drops events that it receives. Events take roughly 2KB of memory each, so this portion of memory use is capped at around 700MB.
We recommend 1256MiB as a default that will cover most use cases.
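That recommendation can be sanity-checked with the approximate figures from this section (a back-of-the-envelope sketch, not a measurement):

```python
# Rough worst-case memory budget using the approximate figures above:
# ~2KB per buffered event, a 350,000-event global buffer cap, and ~500 bytes
# of per-tenant bookkeeping.
EVENT_BYTES = 2 * 1024
MAX_BUFFERED_EVENTS = 350_000
PER_TENANT_BYTES = 500

def worst_case_buffer_mib(num_tenants: int) -> float:
    """Approximate worst-case MiB used by event buffers plus tenant bookkeeping."""
    total_bytes = MAX_BUFFERED_EVENTS * EVENT_BYTES + num_tenants * PER_TENANT_BYTES
    return total_bytes / (1024 * 1024)

print(round(worst_case_buffer_mib(10_000)))  # ~688 MiB, comfortably under 1256MiB
```

The remainder of the 1256MiB allowance covers the process itself and transient allocations.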

CPU

After startup, CPU is primarily used to marshal events through to their log sinks, mostly batching and making asynchronous HTTP calls. Logdriver also needs CPU to decrypt logging configurations, which it pulls on a 10 minute schedule. To allow events to be processed and sent out while decryption is taking place, a minimum of 2 CPUs should be given to the container.
We recommend 2 CPUs as a default for most use cases.

Failure Modes

Failure to Deliver Log Messages

Sustained load in excess of the per-tenant or global guidelines will eventually cause the Logdriver service to start rejecting messages. Warnings will appear in the TSP and Logdriver’s logs. The TSP will continue to function, but the log messages will not be delivered to the tenant’s logging system.
This behavior can also be triggered by very large bursts of activity.
In this case, the Logdriver container’s readiness check (/ready) will start to return 500. Traffic should be directed to another TSP+Logdriver pair until /ready returns 200 again. Ensure that whatever orchestration you’re using doesn’t remove all TSPs from rotation this way, since a TSP in this state still functions, just with reduced logging reliability and slower response times.
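A deployment script might gate on the readiness endpoint before returning an instance to rotation; a minimal sketch, assuming the endpoint is reachable on localhost:9001:

```sh
# Wait until the Logdriver reports ready; curl -f fails on the 500 response.
until curl -sf http://localhost:9001/ready > /dev/null; do
  echo "logdriver not ready yet; waiting..."
  sleep 5
done
echo "logdriver ready; safe to restore traffic"
```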

Troubleshooting

File Descriptor Limits

File descriptor limit errors may show up if too much concurrent traffic is flowing through a single Logdriver instance. If you notice these errors in the logs, either increase the file descriptor limit on that Logdriver or scale horizontally to provide more Logdrivers to the TSP the overloaded one is serving.
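With Docker Compose, for example, the limit can be raised on the service itself; the values here are illustrative:

```yaml
  tenant-security-logdriver:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```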

Example Deployments

Example Docker Compose

We don’t recommend running a simple Docker Compose setup like this in production, but it is useful for seeing the basics of what is needed to run the Tenant Security Proxy (TSP) and Logdriver (LD) together. If you need a more robust production example, see the Kubernetes example.
services:
  tenant-security-proxy:
    env_file:
      - ./config-broker-config.conf
    ports:
      - "7777:7777"
      - "9000:9000"
    image: tenant-security-proxy
    links:
      - tenant-security-logdriver
  tenant-security-logdriver:
    environment:
      - LOGDRIVER_EVENT_PRODUCER_URL=tcp://tenant-security-proxy:5555
    env_file:
      - ./config-broker-config.conf
    ports:
      - "9001:9001"
    image: tenant-security-logdriver
    volumes:
      - type: bind
        source: /tmp
        target: /logdriver

Example Kubernetes Deployment

When running the TSP and LD together, LD needs a persistent disk to store log messages, to ensure nothing gets lost. To give each LD its own disk, we use a StatefulSet with a volumeClaimTemplate. The StatefulSet is a bit more complicated than the Deployment we used for TSP alone.
Here’s the StatefulSet, with both its required headless service (tenant-security-proxy-sts) and the service that will be used by TSP clients (tenant-security-proxy).
# This is the client-facing service.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy
  labels:
    app: tenant-security-proxy
spec:
  type: NodePort
  ports:
  - port: 7777
    targetPort: 7777
    name: http
  selector:
    app: tenant-security-proxy

---

# This is the headless service used by the StatefulSet to keep track of its replicas.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy-sts
spec:
  ports:
    - port: 7777
      name: http
  clusterIP: None
  selector:
    app: tenant-security-proxy

---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant-security-proxy
spec:
  # We're not setting replicas here because that's controlled by a HorizontalPodAutoscaler.
  selector:
    matchLabels:
      app: tenant-security-proxy
  serviceName: tenant-security-proxy-sts
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: tenant-security-proxy
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '7777'
    spec:
      securityContext:
        runAsUser: 2 # Any non-root user will do.
        runAsGroup: 2
        fsGroup: 2
        runAsNonRoot: true
      containers:
      - name: tenant-security-proxy
        image: gcr.io/ironcore-images/tenant-security-proxy:{CHOSEN_TAG}
        resources:
          requests:
            cpu: 2
            memory: 1256Mi
          limits:
            cpu: 2
            memory: 1256Mi
        envFrom:
        - secretRef:
            # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
            name: tsp-secrets
        env:
        - name: RUST_LOG
          value: info # Values are trace, debug, info, warn, error
        ports:
        - containerPort: 9000
          name: health
        - containerPort: 7777
          name: http
        livenessProbe:
          httpGet:
            path: /live
            port: health
        readinessProbe:
          httpGet:
            path: /ready
            port: health
        securityContext:
          allowPrivilegeEscalation: false
      - name: logdriver
        image: gcr.io/ironcore-images/tenant-security-logdriver:{CHOSEN_TAG}
        resources:
          requests:
            cpu: 2
            memory: 1256Mi
          limits:
            cpu: 2
            memory: 1256Mi
        envFrom:
        - secretRef:
            # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
            name: tsp-secrets
        env:
        - name: RUST_LOG
          value: info
        ports:
        - containerPort: 9001
          name: health
        livenessProbe:
          httpGet:
            path: /live
            port: health
        readinessProbe:
          httpGet:
            path: /ready
            port: health
        securityContext:
          allowPrivilegeEscalation: false
        volumeMounts:
          - mountPath: /logdriver
            name: logdriver
  volumeClaimTemplates:
    - metadata:
        name: logdriver
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

Autoscaling

The autoscaling configuration for this combined StatefulSet is nearly identical to the one for the simpler TSP Deployment.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: tenant-security-proxy
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: tenant-security-proxy
  metrics:
    - pods:
        metric:
          name: tsp_requests_in_flight
        target:
          averageValue: 4.5
          type: AverageValue
      type: Pods
    # If you're using Kubernetes 1.20 or later, change this to a ContainerResource.
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
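For reference, on Kubernetes 1.20 or later the CPU metric above can be scoped to a single container using a ContainerResource metric; a sketch targeting the TSP container from the StatefulSet:

```yaml
    - containerResource:
        name: cpu
        container: tenant-security-proxy
        target:
          averageUtilization: 80
          type: Utilization
      type: ContainerResource
```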