Tenant Security Proxy Deployment

The Tenant Security Proxy (TSP) Docker container includes health check endpoints and some configuration options. We also have recommended starting points for computing resources and examples of deployment files that may be helpful in creating your own deployments.

Configuration

Beyond the configuration described in the overview, several optional environment variables allow for tuning. In general, we recommend leaving these unset (so the container uses the default values) unless you are instructed to adjust them to resolve an issue.
  • TSP_SEND_LOGGING_TIMEOUT_MS. Default: 25. Maximum time (in milliseconds) to wait when attempting to send a log event. If the time is exceeded, a warning will be emitted and the event will be dropped.
  • TSP_SEND_HIGH_WATER_MARK. Default: 25000. The high water mark for the PUSH socket, or the number of events the socket can hold before blocking.
  • TSP_EVENT_LOG_CHANNEL_SIZE. Default: 25000. The number of logging events that can be buffered by the TSP without blocking.
  • TSP_ENABLE_LOGDRIVER_INTEGRATION. Default: true. Flag to enable or disable integration with the Logdriver. If this TSP will never be connected to a Logdriver, set this to false to avoid slowing the TSP down as it times out on every logging request.
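For example, if a TSP will never be connected to a Logdriver, the integration can be disabled in a Kubernetes container spec (a sketch using the variable from the list above; the same value can be passed with docker run -e):

```yaml
env:
- name: TSP_ENABLE_LOGDRIVER_INTEGRATION
  value: "false"
```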

Health and Liveness Checks

The Docker container also exposes endpoints for checking the liveness and health of the container. The checks are implemented based on the Kubernetes lifecycle concepts. The exposed URLs and their meanings are:
  • /health: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is shutting down or is still initializing.
  • /live: Returns a 200 status code when the container is not shutting down. Returns a 500 status code when the server is shutting down.
  • /ready: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is not ready to accept requests.
The container will not report READY until it has retrieved and decrypted the initial set of tenant KMS configurations from the Configuration Broker. Each of these health endpoints is served on port 9000 within the Docker image.
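These endpoints can also be probed manually, which is useful when debugging a container that never reports READY (tsp-host below is a placeholder for your TSP's hostname or IP):

```shell
# Print only the HTTP status code from each health endpoint on port 9000.
# "tsp-host" is a placeholder; substitute your TSP's address.
curl -s -o /dev/null -w "%{http_code}\n" http://tsp-host:9000/live
curl -s -o /dev/null -w "%{http_code}\n" http://tsp-host:9000/ready
```

A 200 from /live with a 500 from /ready typically means the container is still waiting on its initial KMS configurations from the Configuration Broker.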

Metrics

Each TSP container provides the following Prometheus metrics on a /metrics endpoint.
tsp_request_duration_seconds (histogram)
  • Processing time (in seconds) for a TSP request
  • Labels:
    • endpoint - TSP endpoint that was called
    • kms_type - current primary KMS for the tenant
    • http_code - HTTP code being returned to the caller
    • tsp_status_code - TSP status code being returned to the caller
tsp_kms_request_duration_seconds (histogram)
  • Observed latency for KMS requests
  • Labels:
    • kms_operation - KMS operation type
    • kms_type - current primary KMS for the tenant
    • http_code - HTTP code being returned to the caller
tsp_requests_in_flight (gauge)
  • Number of requests that are currently in-flight (being processed)
  • Labels:
    • endpoint - TSP endpoint that was called
tsp_key_operations_total (counter)
  • Number of successful cryptographic operations performed. These operations are primarily the wrapping and unwrapping of document encryption keys; these are counted whether they required a request to an external KMS or were done locally using a leased key. Additional operations include the wrapping and unwrapping of leased keys.
  • Labels:
    • operation - Key operation type
tsp_kms_last_config_refresh_timestamp_seconds (gauge)
  • Time (seconds since Unix epoch, UTC) of last successful refresh of the KMS configuration
  • Labels: None
tsp_real_time_security_event_failures_total (counter)
  • Number of times the TSP has been unable to push a Real Time Security Event since startup.
  • Labels: None
tsp_config_refresh_failures_total (counter)
  • Number of times the TSP has been unable to refresh its KMS configuration since startup.
  • Labels: None
If Prometheus is available in your environment, it can collect these metrics from the TSPs. Once Prometheus has started gathering metrics, you can display them in a variety of ways using Grafana.
A sample Grafana dashboard can be imported from this JSON file: TSP - All Metrics. This dashboard contains a variety of panels relating to request latency, error rates, key operations, and other queries useful for troubleshooting the TSP. The import process for Grafana is described here.
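If Prometheus is scraping these metrics, queries along these lines can be a starting point for panels or alerts. These are illustrative sketches built from the metric names above, not queries taken from the sample dashboard:

```promql
# p99 TSP request latency per endpoint over the last 5 minutes
histogram_quantile(0.99,
  sum by (le, endpoint) (rate(tsp_request_duration_seconds_bucket[5m])))

# Rate of failed KMS configuration refreshes (should normally be 0)
rate(tsp_config_refresh_failures_total[5m])
```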

Performance

Expected sustained performance without Logdriver should be 5000-7000 operations per second, depending on computing and network resources. Burst performance can be higher.
See the Logdriver expected performance for more details on performance in combination with Logdriver.

Horizontal Scaling

TSP is horizontally scalable. We recommend at least two instances behind a load balancer for redundancy. Since a single TSP can sustain at most 5000-7000 operations/sec, you can use that figure to roughly calculate the number of TSPs to put behind your load balancer.
Good signals for auto-scaling are the /ready readiness check and the metrics rate(tsp_real_time_security_event_failures_total) > 1 and tsp_requests_in_flight > 1.
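The sizing arithmetic above can be sketched as a small helper, using the conservative end of the 5000-7000 ops/sec range and the recommended floor of two instances (the function name and default are our own, for illustration):

```python
import math

def recommended_replicas(peak_ops_per_sec: int, per_tsp_capacity: int = 5000) -> int:
    """Rough TSP replica count: divide expected peak load by a conservative
    per-instance capacity, with a floor of 2 instances for redundancy."""
    return max(2, math.ceil(peak_ops_per_sec / per_tsp_capacity))

print(recommended_replicas(18000))  # → 4
print(recommended_replicas(1000))   # → 2 (redundancy floor)
```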

Resource Usage

Memory

TSP memory usage is linear with the number of tenants. Baseline usage with no tenants is around 2.4 MiB. A tenant with a few KMS configs takes up around 200 KiB.
We recommend 1256Mi as a default that will cover most use cases.

CPU

Most of the TSP’s CPU usage is for serving requests or decrypting KMS configs. Setting a minimum of 2 CPUs ensures that progress can always be made on both decrypting configs and making async KMS calls at the same time. More CPUs will allow for more throughput but should be tuned based on what your deployment’s usage looks like.
We recommend 2 CPUs as a default for most use cases.

Failure Modes

Configuration Broker Down

The TSP retrieves KMS configurations for tenants from the Configuration Broker (CB). If the CB is inaccessible or down these things will be true:
  • new TSPs will not be able to start up until they can reach the CB to get tenants' KMS configs
  • each TSP container that had already started up successfully will continue to operate with the configs that it previously retrieved, for up to 24 hours
  • running TSPs will not receive updates from the CB, such as disabled configs or updated KMS credentials, which may cause some tenants' calls to their KMS to fail

Logdriver Queue Overloaded

If the TSP_ENABLE_LOGDRIVER_INTEGRATION option is true, it's possible for the queue that Logdrivers pull events from to fill up. When that happens, calls to log events (adding them to that queue) will begin to time out after TSP_SEND_LOGGING_TIMEOUT_MS (default of 25ms). The result of that timeout is that calls from the TSC (including encrypt/decrypt) may be delayed. See the Logdriver failure modes for more information.

Troubleshooting

File Descriptor Limits

When performing batch operations within the Proxy, use caution if the batch size is large enough to cause many requests to be made to a tenant's KMS. When a batch operation is performed, multiple parallel requests are made to the tenant's KMS, one for each key to wrap. If the batch size is large enough, the number of file descriptors requested by the container can exceed the available resources, causing errors. On Linux, the default file descriptor limit is 1024, so it is best to limit the number of items in a batch operation to no more than 1000 at a time.
File descriptor limit errors may also show up if you have enough concurrent traffic flowing through a single TSP deployment. If you notice these errors in the logs and are not submitting large batches, you should either increase the file descriptor limit on that TSP or horizontally scale and load balance the traffic.
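One way to stay under the limit on the client side is to split large batches before submitting them, as in this sketch (the chunking helper is ours; how each chunk is submitted depends on your Tenant Security Client):

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: List[T], size: int = 1000) -> Iterator[List[T]]:
    """Split a large batch into chunks of at most `size` items so each
    batch call to the TSP stays under the default 1024-descriptor limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical usage: submit each chunk as a separate batch wrap request.
batches = list(chunked(list(range(2500))))
print([len(b) for b in batches])  # → [1000, 1000, 500]
```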

Example Deployments

Example Docker Compose

See the Logdriver documentation for an example of running it and the TSP together using Docker compose.

Example Kubernetes Deployment

In general, we recommend running the Tenant Security Proxy and Logdriver together; for that configuration, see the Logdriver deployment documentation. If you really only need the TSP, this config is a bit simpler.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-security-proxy
spec:
  selector:
    matchLabels:
      app: tenant-security-proxy
  template:
    metadata:
      labels:
        app: tenant-security-proxy
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '7777'
    spec:
      securityContext:
        runAsUser: 2 # Any non-root user will do.
        runAsGroup: 2
        fsGroup: 2
        runAsNonRoot: true
      containers:
      - name: tenant-security-proxy
        image: gcr.io/ironcore-images/tenant-security-proxy:3.3.6
        resources:
          # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
          requests:
            cpu: 2
            memory: 1256Mi
          limits:
            cpu: 2
            memory: 1256Mi
        envFrom:
        - secretRef:
            # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
            name: tsp-secrets
        env:
        - name: RUST_LOG
          value: info # Values are trace, debug, info, warn, error
        ports:
        - containerPort: 9000
          name: health
        - containerPort: 7777
          name: http
        livenessProbe:
          httpGet:
            path: /live
            port: health
        readinessProbe:
          httpGet:
            path: /ready
            port: health
        securityContext:
          allowPrivilegeEscalation: false
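The Deployment above doesn't include a Service to load balance traffic across replicas. A minimal sketch, assuming the same names and labels as the Deployment, might look like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy
spec:
  selector:
    app: tenant-security-proxy
  ports:
  - name: http
    port: 7777
    targetPort: http
```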

Autoscaling

For High Availability, we recommend running at least 2 replicas of the TSP in different zones. For most workloads, 2 replicas is plenty; however, the TSP can easily be scaled up or down. The service is generally CPU bound, so it's possible to autoscale based on CPU alone, but we recommend using the sum(tsp_requests_in_flight) metric for better accuracy and responsiveness in scaling. A typical installation of the Prometheus Custom Metrics Adapter will implicitly apply sum(...) to the sampled metrics, so a config like this should work:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: tenant-security-proxy
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tenant-security-proxy
  metrics:
    - pods:
        metric:
          name: tsp_requests_in_flight
        target:
          averageValue: 4.5
          type: AverageValue
      type: Pods
    # If you're using Kubernetes 1.20 or later, change this to a ContainerResource.
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource