Tenant Security Logdriver Deployment
The Tenant Security Logdriver (TSL) is deployed alongside the Tenant Security Proxy (TSP). The service includes health check endpoints, and it can be further configured using optional environment variables to tune performance.
Configuration
Beyond the configuration described in the startup section of the overview, there are several optional environment variables that allow for tuning. In general, we recommend leaving these unset (so the container uses the default values) unless you are instructed to adjust them to resolve an issue.
- `LOGDRIVER_CHANNEL_CAPACITY`. Default: 1000. Controls the number of messages that can be held in buffers between logdriver pipeline stages. Increasing this will have a memory impact.
- `LOGDRIVER_SINK_BATCH_SIZE`. Default: 1000. Maximum number of events that can be bundled into a single batch call to a tenant’s logging system. Increasing this may slow down individual network calls to cloud logging sinks but will allow for faster draining of high volume tenants’ buffers. Increasing this should be your first go-to for improving TSL event throughput.
- `LOGDRIVER_BUFFER_POLL_INTERVAL`. Default: 2000. Maximum age (in milliseconds) of a less-than-full batch for a tenant before it is sent out. Lengthening this increases the resources available to high throughput tenants at the cost of tenants with less-than-full batches waiting longer for events to arrive at their log sink. Shortening it has the inverse effect.
- `LOGDRIVER_CONFIG_REFRESH_INTERVAL`. Default: 600. Interval (in seconds) between each logdriver configuration cache refresh.
- `LOGDRIVER_CHANNEL_TIMEOUT`. Default: 250. Time (in milliseconds) that pipeline channel sends are allowed before they are abandoned.
- `LOGDRIVER_EVENT_PRODUCER_URL`. Default: `tcp://localhost:5555`. Target from which to pull events.
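If you do need to override one of these values, set it as an environment variable on the TSL container. As a minimal sketch, the logdriver sidecar from the example Kubernetes deployment later on this page could add entries like the following; the values shown are illustrative, not recommendations:

```yaml
# Illustrative only: override TSL tuning variables on the logdriver container.
env:
  - name: LOGDRIVER_SINK_BATCH_SIZE
    value: "2000" # larger batches per call to each tenant's log sink
  - name: LOGDRIVER_BUFFER_POLL_INTERVAL
    value: "1000" # flush partial batches after 1 second instead of 2
```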
Health and Liveness Checks
The Docker container also exposes endpoints for checking the liveness and health of the container. The checks are implemented based on the Kubernetes lifecycle concepts. The exposed URLs and their meanings are:
- `/health`: Returns a `200` status code when the container is ready to accept requests. Returns a `500` status code when the server is shutting down or is still initializing.
- `/live`: Returns a `200` status code when the container is not shutting down. Returns a `500` status code when the server is shutting down.
- `/ready`: Returns a `200` status code when the container is ready to accept requests. Returns a `500` status code when the server is not ready to accept requests.
The container will not report as being “ready” until it has retrieved and decrypted the initial set of tenant logging configurations from the Configuration Broker. If the TSL is overloaded it will also report `500 NOT_READY` until it is able to work through some of its logging backlog.
Each of these health endpoints is exposed on port `9001` within the Docker image.
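These endpoints map directly onto Kubernetes probes. The full example deployment later on this page wires them up for both containers; as a minimal sketch for just the TSL container:

```yaml
# Probe the TSL health server on port 9001 (see the full example deployment below).
ports:
  - containerPort: 9001
    name: health
livenessProbe:
  httpGet:
    path: /live
    port: health
readinessProbe:
  httpGet:
    path: /ready
    port: health
```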
Performance
Logging performance can be measured across two related dimensions: per-tenant and global.
Per-Tenant Performance
The TSL attempts to provide fairness to all tenants by limiting the maximum age of a batch for any tenant to `LOGDRIVER_BUFFER_POLL_INTERVAL`. High volume tenants that fill their batches before that time period will receive resources any time they hit a full batch, but have the same priority as any low volume tenant whose less-than-full batch is past the max age. This results in efficient resource usage when one or more tenants are under heavy load, while still allowing low load tenants to send out their events in a timely manner.
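As a worked example using the defaults (`LOGDRIVER_SINK_BATCH_SIZE` of 1000 and `LOGDRIVER_BUFFER_POLL_INTERVAL` of 2000 ms): a tenant producing 2,000 events per second fills a batch roughly every 500 ms and is flushed each time it does, while a tenant producing 10 events per second accumulates only about 20 events before the 2 second max age flushes its partial batch.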
Global Performance
Expected sustained performance is 30-50k operations per second coming into the TSP for a 2 CPU TSL, depending on computing and network resources. Burst performance can be higher, but throughput is to some extent dependent on downstream log sink throughput. If the TSP’s `tsp_real_time_security_event_failures_total` metric is increasing, all upstream buffers are saturated and you should increase the number of TSL sidecars and/or increase `LOGDRIVER_SINK_BATCH_SIZE`.
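One way to watch for this condition is to alert on that TSP metric. The following is only a rough sketch of a Prometheus alerting rule; the group name, threshold, and durations are placeholders you would tune for your environment:

```yaml
# Sketch only: fire when the TSP starts failing to hand security events to the TSL.
groups:
  - name: tsp-logdriver
    rules:
      - alert: SecurityEventDeliveryFailures
        expr: rate(tsp_real_time_security_event_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "TSP security event failures are increasing; add TSL sidecars or raise LOGDRIVER_SINK_BATCH_SIZE."
```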
Resource Usage
Memory
There are two major factors related to memory usage of the container.
There is a flat amount of memory, around 500B per tenant, used to store the information needed to make logging calls.
The TSL will also buffer a maximum of 350,000 events across all tenants in memory before it goes into a `NOT_READY` state and drops events that it receives. Standard events take roughly 120B - 2KB of memory each, so this portion of memory use is capped at around 700MB. If you’re sending custom security events that are significantly larger, increase memory to compensate based on worst case usage.
We recommend 1256MB as a default that will cover most use cases.
CPU
After startup, CPU is primarily used to marshal events through to their log sinks: mostly batching, making asynchronous HTTP calls, and writing to stdout. The TSL also needs CPU to decrypt logging configurations, which it pulls on a 10 minute schedule, and for database compaction. To allow events to be processed and sent out while CPU bound background tasks are running, a minimum of 2 CPUs should be given to the container.
We recommend 2 CPUs as a default for most use cases.
Failure Modes
Failure to Deliver Log Messages
Sustained load in excess of global guidelines will eventually cause the TSL’s channels to back up to the TSP, which in turn will cause the TSP to start dropping events. Warnings will appear in the TSP and TSL logs. The TSP will continue to function, but the log messages will not be delivered to the tenant’s logging system.
This behavior can also be triggered by very large bursts of activity.
In this case, the TSL container’s readiness check (`/ready`) will start to return `500`. Traffic should be directed to another TSP+TSL set until `/ready` returns a `200`. Ensure that whatever orchestration you’re using doesn’t remove all TSPs from rotation this way, since the TSP does still function in this state.
Troubleshooting
File Descriptor Limits
File descriptor limit errors may show up if you have high concurrent traffic flowing through a single TSL instance. If you notice these errors in the logs you should either increase the file descriptor limit associated with TSL or increase the number of sidecar TSLs you have per TSP.
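How you raise the limit depends on your runtime. As one illustrative sketch (the values are placeholders, not recommendations), the Docker Compose example below could raise the TSL container’s open file limit with an `ulimits` entry:

```yaml
# Illustrative only: raise the open file descriptor limit for the TSL container.
services:
  tenant-security-logdriver:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```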
Example Deployments
Example Docker Compose
We don’t recommend running a simple Docker Compose setup like this in production, but it is useful to see the basics of what is needed to run the Tenant Security Proxy (TSP) and Tenant Security Logdriver (TSL) together. If you need a more robust production example, see the Kubernetes example.
```yaml
version: "3.3"
services:
  tenant-security-proxy:
    env_file:
      - ./config-broker-config.conf
    ports:
      - "7777:7777"
      - "9000:9000"
    image: tenant-security-proxy
    links:
      - tenant-security-logdriver
  tenant-security-logdriver:
    environment:
      - LOGDRIVER_EVENT_PRODUCER_URL=tcp://tenant-security-proxy:5555
    env_file:
      - ./config-broker-config.conf
    ports:
      - "9001:9001"
    image: tenant-security-logdriver
    volumes:
      - type: bind
        source: /tmp
        target: /logdriver
```
Example Kubernetes Deployment
When running the TSP and TSL together, the TSL needs a persistent disk to store log messages, to ensure nothing gets lost. To give each TSL its own disk, we use a `StatefulSet` with a `volumeClaimTemplate`. The `StatefulSet` is a bit more complicated than the `Deployment` we used for the TSP alone.
Here’s the `StatefulSet`, with both its required headless service (`tenant-security-proxy-sts`) and the service that will be used by Tenant Security and Alloy clients (`tenant-security-proxy`), with the TSL running as a sidecar.
```yaml
# This is the client-facing service.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy
  labels:
    app: tenant-security-proxy
spec:
  type: NodePort
  ports:
    - port: 7777
      targetPort: 7777
      name: http
  selector:
    app: tenant-security-proxy
---
# This is the headless service used by the StatefulSet to keep track of its replicas.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy-sts
spec:
  ports:
    - port: 7777
      name: http
  clusterIP: None
  selector:
    app: tenant-security-proxy
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant-security-proxy
spec:
  # We're not setting replicas here because that's controlled by a HorizontalPodAutoscaler.
  selector:
    matchLabels:
      app: tenant-security-proxy
  serviceName: tenant-security-proxy-sts
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: tenant-security-proxy
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '7777'
    spec:
      securityContext:
        runAsUser: 2 # Any non-root user will do.
        runAsGroup: 2
        fsGroup: 2
        runAsNonRoot: true
      containers:
        - name: tenant-security-proxy
          image: gcr.io/ironcore-images/tenant-security-proxy:{CHOSEN_TAG}
          resources:
            requests:
              cpu: 2
              memory: 500MB
            limits:
              cpu: 2
              memory: 500MB
          envFrom:
            - secretRef:
                # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
                name: tsp-secrets
          env:
            - name: RUST_LOG
              value: info # Values are trace, debug, info, warn, error
          ports:
            - containerPort: 9000
              name: health
            - containerPort: 7777
              name: http
          livenessProbe:
            httpGet:
              path: /live
              port: health
          readinessProbe:
            httpGet:
              path: /ready
              port: health
          securityContext:
            allowPrivilegeEscalation: false
        - name: logdriver
          image: gcr.io/ironcore-images/tenant-security-logdriver:{CHOSEN_TAG}
          resources:
            requests:
              cpu: 2
              memory: 1256MB
            limits:
              cpu: 2
              memory: 1256MB
          envFrom:
            - secretRef:
                # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
                name: tsp-secrets
          env:
            - name: RUST_LOG
              value: info
          ports:
            - containerPort: 9001
              name: health
          livenessProbe:
            httpGet:
              path: /live
              port: health
          readinessProbe:
            httpGet:
              path: /ready
              port: health
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - mountPath: /logdriver
              name: logdriver
  volumeClaimTemplates:
    - metadata:
        name: logdriver
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1GB
```
The TSL binds its own health check address (`9001`), so it’s not possible to have multiple TSLs in a sidecar configuration like this. If you want to be able to scale TSPs and have a corresponding number of TSLs scale up with them, you can instead:
- create a TSP `StatefulSet` defining just the TSP container
- create a TSL `StatefulSet` defining the TSL container and its `volumeClaimTemplate`. `LOGDRIVER_EVENT_PRODUCER_URL` will need to be set dynamically, which differs by containerization provider (see the sketch after this list)
- create a custom controller to watch for TSPs and scale up TSLs by a replication parameter
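How you compute that URL is provider-specific, but to illustrate the idea: because the TSP `StatefulSet` above is paired with the headless service `tenant-security-proxy-sts`, each TSP pod gets a stable DNS name of the form `<pod-name>.tenant-security-proxy-sts`. A TSL pod with the same ordinal could be pointed at its partner as sketched below; this shows ordinal 0 statically, whereas a real deployment would derive the ordinal from the TSL pod’s own name:

```yaml
env:
  # Hypothetical pairing for ordinal 0; derive the ordinal dynamically in a real deployment.
  - name: LOGDRIVER_EVENT_PRODUCER_URL
    value: "tcp://tenant-security-proxy-0.tenant-security-proxy-sts:5555"
```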
Autoscaling
The autoscaling configuration for this combined StatefulSet is nearly identical to the one for the simpler TSP Deployment.
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: tenant-security-proxy
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: tenant-security-proxy
  metrics:
    # If you're using Kubernetes 1.20 or later, change this to a ContainerResource.
    - resource:
        name: cpu
        target:
          averageUtilization: 75
          type: Utilization
      type: Resource
```
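If you are on Kubernetes 1.20 or later and want the autoscaler to track only the TSP container’s CPU, as the comment above suggests, the metric entry might instead look roughly like this (assuming the container name from the StatefulSet above):

```yaml
metrics:
  # Scale on the CPU of the TSP container alone, ignoring the logdriver sidecar.
  - type: ContainerResource
    containerResource:
      name: cpu
      container: tenant-security-proxy
      target:
        type: Utilization
        averageUtilization: 75
```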