Logdriver Deployment
The Logdriver is deployed alongside the TSP. It exposes health check endpoints and supports optional environment variables for tuning its performance.
Configuration
Outside of the configuration mentioned in the startup section of the overview, there are several optional environment variables that allow for tuning. In general we recommend that you don’t specify these (which causes the container to use the default values) unless you are instructed to adjust them to resolve an issue. An example of overriding a few of them in a compose file follows the list.
- LOGDRIVER_CHANNEL_CAPACITY. Default: 1000. Controls the number of messages that can be held in buffers between Logdriver pipeline stages. Increasing this will have a memory impact.
- LOGDRIVER_SINK_BATCH_SIZE. Default: 1000. Maximum number of events that can be bundled into a single batch call to a tenant’s logging system. Increasing this may slow down network calls to cloud logging sinks, but may allow for faster draining of high-volume tenants’ buffers.
- LOGDRIVER_BUFFER_POLL_INTERVAL. Default: 2000. Interval (in milliseconds) between each reaping pass of tenant buffers. Decreasing this will increase data rate but may result in a buildup of uncompleted network calls that can eventually use up the container’s network resources.
- LOGDRIVER_CONFIG_REFRESH_INTERVAL. Default: 600. Interval (in seconds) between each Logdriver configuration cache refresh.
- LOGDRIVER_CHANNEL_TIMEOUT. Default: 250. Time (in milliseconds) that pipeline channel sends are allowed before they are abandoned.
- LOGDRIVER_EVENT_PRODUCER_URL. Default: “tcp://localhost:5555”. Target from which to pull events.
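For example, if you were instructed to raise the buffer capacity and batch size, you could set them on the Logdriver service in a Docker Compose file like the one shown later on this page. This is only a sketch: the variable names and defaults come from the list above, but the override values are illustrative, not recommendations.

```yaml
# Hypothetical tuning overrides for the Logdriver service in a compose file.
# The variable names and defaults come from the list above; the override values are illustrative only.
services:
  tenant-security-logdriver:
    image: tenant-security-logdriver
    environment:
      - LOGDRIVER_CHANNEL_CAPACITY=2000      # default 1000; larger buffers increase memory use
      - LOGDRIVER_SINK_BATCH_SIZE=2000       # default 1000; larger batches may drain high-volume tenants faster
      - LOGDRIVER_BUFFER_POLL_INTERVAL=1000  # default 2000 ms; reap tenant buffers more often
      - LOGDRIVER_EVENT_PRODUCER_URL=tcp://tenant-security-proxy:5555
```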
Health and Liveness Checks
The Docker container also exposes endpoints for checking its liveness and health. The checks are implemented based on the Kubernetes lifecycle concepts. The exposed URLs and their meanings are:
- /health: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is shutting down or is still initializing.
- /live: Returns a 200 status code when the container is not shutting down. Returns a 500 status code when the server is shutting down.
- /ready: Returns a 200 status code when the container is ready to accept requests. Returns a 500 status code when the server is not ready to accept requests.
The container will not report as being “ready” until it has retrieved and decrypted the initial set of tenant logging configurations from the Configuration Broker. If the Logdriver is overloaded, it will also report 500 NOT_READY until it is able to work through some of its logging backlog.
Each of these health endpoints is served on port 9001 within the Docker image.
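If you run the Logdriver under Kubernetes, these endpoints map directly onto liveness and readiness probes against port 9001, as in the full example later on this page. The sketch below shows just the probe portion of a container spec; the probe timing values are assumptions, not values prescribed by the Logdriver.

```yaml
# Minimal probe sketch for the Logdriver container (health endpoints on port 9001).
# periodSeconds/failureThreshold are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /live
    port: 9001
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 9001
  periodSeconds: 5
  failureThreshold: 3
```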
Performance
Logging performance can be measured across two related dimensions: per-tenant and global.
Per-Tenant Performance
Logdriver enforces a maximum rate that a single tenant can sustain. By default, this is 500 log events per second per tenant, although it is tunable via configuration. This limit exists primarily to provide fairness between tenants and to avoid overwhelming any particular log sink.
Global Performance
Expected sustained performance is 5000-7000 operations per second, depending on computing and network resources; burst performance can be higher. Keep per-tenant fairness in mind when testing the Logdriver with a small number of tenants: with the default limit of 500 log events per second per tenant, a test using only four tenants can sustain at most about 2000 events per second, so total throughput will be higher when the load is spread across enough tenants that no single tenant exceeds its limit.
Resource Usage
Memory
There are two major factors related to memory usage of the container.
There is a flat amount of memory, around 500B per tenant, used to store the information about each tenant needed to make logging calls.
Logdriver will also buffer a maximum of 350,000 events across all tenants in memory before it goes into a NOT_READY state and drops events that it receives. Events take roughly 2KB of memory each, so this section of the memory use is capped at around 700MB.
We recommend 1256MiB as a default that will cover most use cases.
CPU
After startup, CPU will primarily be used to marshal events through to their log sinks, mostly batching and making asynchronous HTTP calls. Logdriver also needs CPU to decrypt logging configurations, which it pulls on a 10 minute schedule. In order to allow events to be processed and sent out while decryption is taking place, a minimum of 2 CPUs should be given to the container.
We recommend 2 CPUs as a default for most use cases.
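In a Kubernetes pod spec, those defaults correspond to a resources block like the following; these are the same values used in the full StatefulSet example later on this page.

```yaml
# Resource requests and limits matching the recommended defaults: 2 CPUs and 1256MiB of memory.
resources:
  requests:
    cpu: 2
    memory: 1256Mi
  limits:
    cpu: 2
    memory: 1256Mi
```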
Failure Modes
Failure to Deliver Log Messages
Sustained load in excess of the per-tenant or global guidelines will eventually cause the Logdriver service to start rejecting messages. Warnings will appear in the TSP and Logdriver’s logs. The TSP will continue to function, but the log messages will not be delivered to the tenant’s logging system.
This behavior can also be triggered by very large bursts of activity.
In this case, the Logdriver container’s readiness check (/ready) will start to return 500. Traffic should be directed to another TSP+Logdriver set until /ready returns a 200. Ensure that whatever orchestration you’re using doesn’t remove all TSPs from rotation this way, since the TSP does still function in this state with reduced logging reliability and slower response times.
Troubleshooting
File Descriptor Limits
File descriptor limit errors may show up if you have too much concurrent traffic flowing through a single Logdriver instance. If you notice these errors in the logs, either increase the file descriptor limit on that Logdriver or scale horizontally so that more Logdrivers are available to the TSP that the overloaded one is serving.
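If you opt to raise the limit and you are running under Docker Compose, one place to do it is the ulimits key on the Logdriver service. This is a sketch only; the soft and hard values below are illustrative assumptions, not tested recommendations.

```yaml
# Hypothetical example: raising the open-file (nofile) limit for the Logdriver service.
# The soft/hard values are illustrative; choose limits appropriate to your environment.
services:
  tenant-security-logdriver:
    image: tenant-security-logdriver
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```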
Example Deployments
Example Docker Compose
We don’t recommend running a simple Docker Compose setup like this in production, but it is useful to see the basics of what is needed to run the Tenant Security Proxy (TSP) and Logdriver (LD) together. If you need a more robust production example, see the Kubernetes example below.
```yaml
version: "3.3"
services:
  tenant-security-proxy:
    env_file:
      - ./config-broker-config.conf
    ports:
      - "7777:7777"
      - "9000:9000"
    image: tenant-security-proxy
    links:
      - tenant-security-logdriver
  tenant-security-logdriver:
    environment:
      - LOGDRIVER_EVENT_PRODUCER_URL=tcp://tenant-security-proxy:5555
    env_file:
      - ./config-broker-config.conf
    ports:
      - "9001:9001"
    image: tenant-security-logdriver
    volumes:
      - type: bind
        source: /tmp
        target: /logdriver
```
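Assuming this file is saved as docker-compose.yml alongside config-broker-config.conf, running docker compose up (or docker-compose up with the older CLI) should start both containers, with the TSP reachable on port 7777 and the Logdriver health endpoints on port 9001.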
Example Kubernetes Deployment
When running the TSP and LD together, LD needs a persistent disk to store log messages, to ensure nothing gets lost. To give each LD its own disk, we use a StatefulSet with a volumeClaimTemplate. The StatefulSet is a bit more complicated than the Deployment we used for TSP alone.
Here’s the StatefulSet, with both its required headless service (tenant-security-proxy-sts) and the service that will be used by TSP clients (tenant-security-proxy).
```yaml
# This is the client-facing service.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy
  labels:
    app: tenant-security-proxy
spec:
  type: NodePort
  ports:
    - port: 7777
      targetPort: 7777
      name: http
  selector:
    app: tenant-security-proxy
---
# This is the headless service used by the StatefulSet to keep track of its replicas.
apiVersion: v1
kind: Service
metadata:
  name: tenant-security-proxy-sts
spec:
  ports:
    - port: 7777
      name: http
  clusterIP: None
  selector:
    app: tenant-security-proxy
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tenant-security-proxy
spec:
  # We're not setting replicas here because that's controlled by a HorizontalPodAutoscaler.
  selector:
    matchLabels:
      app: tenant-security-proxy
  serviceName: tenant-security-proxy-sts
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: tenant-security-proxy
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '7777'
    spec:
      securityContext:
        runAsUser: 2 # Any non-root user will do.
        runAsGroup: 2
        fsGroup: 2
        runAsNonRoot: true
      containers:
        - name: tenant-security-proxy
          image: gcr.io/ironcore-images/tenant-security-proxy:{CHOSEN_TAG}
          resources:
            requests:
              cpu: 2
              memory: 1256Mi
            limits:
              cpu: 2
              memory: 1256Mi
          envFrom:
            - secretRef:
                # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
                name: tsp-secrets
          env:
            - name: RUST_LOG
              value: info # Values are trace, debug, info, warn, error
          ports:
            - containerPort: 9000
              name: health
            - containerPort: 7777
              name: http
          livenessProbe:
            httpGet:
              path: /live
              port: health
          readinessProbe:
            httpGet:
              path: /ready
              port: health
          securityContext:
            allowPrivilegeEscalation: false
        - name: logdriver
          image: gcr.io/ironcore-images/tenant-security-logdriver:{CHOSEN_TAG}
          resources:
            requests:
              cpu: 2
              memory: 1256Mi
            limits:
              cpu: 2
              memory: 1256Mi
          envFrom:
            - secretRef:
                # See https://ironcorelabs.com/docs/saas-shield/tenant-security-proxy/overview/#startup
                name: tsp-secrets
          env:
            - name: RUST_LOG
              value: info
          ports:
            - containerPort: 9001
              name: health
          livenessProbe:
            httpGet:
              path: /live
              port: health
          readinessProbe:
            httpGet:
              path: /ready
              port: health
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - mountPath: /logdriver
              name: logdriver
  volumeClaimTemplates:
    - metadata:
        name: logdriver
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
```
Autoscaling
The autoscaling configuration for this combined StatefulSet is nearly identical to the one for the simpler TSP Deployment.
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: tenant-security-proxy
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: tenant-security-proxy
  metrics:
    - pods:
        metric:
          name: tsp_requests_in_flight
        target:
          averageValue: 4.5
          type: AverageValue
      type: Pods
    # If you're using Kubernetes 1.20 or later, change this to a ContainerResource.
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
```
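The comment in the manifest refers to the ContainerResource metric type available in Kubernetes 1.20 and later, which applies the CPU target to a single container rather than the whole pod. A sketch of that substitution is below; targeting the tenant-security-proxy container's CPU is an assumption here, so point it at whichever container actually drives your load.

```yaml
# Sketch of the ContainerResource variant mentioned in the comment above (Kubernetes 1.20+).
# Targeting the tenant-security-proxy container is an assumption; adjust to fit your workload.
- containerResource:
    name: cpu
    container: tenant-security-proxy
    target:
      averageUtilization: 80
      type: Utilization
  type: ContainerResource
```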