Troubleshooting
Common issues and their solutions. Each entry describes the problem, what you will observe, and the steps to resolve it.
Node Not Appearing in Dashboard
Symptom
You installed the Odysseus agent on a node, but the node does not appear in the dashboard under Nodes.
Solution
-
Verify Docker is running:
docker info
If Docker is not running, start it with sudo systemctl start docker. The agent requires a running Docker daemon.
-
Check WireGuard connectivity:
sudo wg show
You should see an active WireGuard interface with a recent handshake timestamp. If there is no handshake, check that your node can make outbound UDP connections to the control plane endpoint.
-
Verify the enrollment token:
Enrollment tokens expire after a configurable period (default: 24 hours). If your token has expired, generate a new one from Settings > Enrollment Tokens in the dashboard and re-run the agent installation.
-
Check agent logs:
sudo journalctl -u odysseus-agent --since "10 minutes ago"
Look for connection errors, authentication failures, or Docker API errors.
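The WireGuard check above can be scripted when you want a yes/no answer instead of reading timestamps by eye. The following is a minimal sketch, not part of the Odysseus tooling: the interface name wg0 and the 180-second threshold are assumptions, so substitute the interface that sudo wg show lists and a threshold that suits your environment.

```shell
#!/bin/sh
# Sketch: classify a WireGuard handshake timestamp.
# MAX_AGE is an assumed threshold, not an Odysseus setting.
MAX_AGE=180

# handshake_status LAST_HANDSHAKE_EPOCH NOW_EPOCH
# Prints "never", "stale", or "recent".
handshake_status() {
    last=$1
    now=$2
    if [ "$last" -eq 0 ]; then
        echo "never"      # wg reports 0 when no handshake has completed
    elif [ $((now - last)) -gt "$MAX_AGE" ]; then
        echo "stale"
    else
        echo "recent"
    fi
}

# On a node (the interface name wg0 is an assumption):
#   sudo wg show wg0 latest-handshakes | while read -r peer ts; do
#       echo "$peer: $(handshake_status "$ts" "$(date +%s)")"
#   done
```

A result of never or stale points at the outbound UDP path to the control plane rather than at Docker or the enrollment token.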
Deployment Stuck in Scheduling
Symptom
A deployment remains in Scheduling state and never transitions to Running.
Solution
-
Check available nodes:
Open the Nodes page and verify that at least one node is in Healthy state. If all nodes are Offline or Draining, the scheduler has nowhere to place containers.
-
Check resource availability:
Your deployment's CPU and memory requests may exceed what is available on any single node. Reduce the resource requests or add a node with more capacity.
-
Check resource quotas:
Your tenant may have a resource quota that has been reached. View your quota usage under Settings > Quotas.
-
Check node selectors:
If your deployment specifies node selectors or affinity rules, ensure at least one healthy node matches those constraints.
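The resource availability check is about single nodes, not cluster totals: a request can fit within the combined free capacity and still fit on no individual node. The sketch below illustrates that comparison. The input format (node name, free CPU in millicores, free memory in MiB) is hypothetical; copy the figures from the Nodes page by hand.

```shell
#!/bin/sh
# nodes_that_fit CPU_MILLICORES MEM_MIB
# Reads "name free_cpu_millicores free_mem_mib" lines on stdin and prints
# the names of nodes that can hold the whole request on their own.
nodes_that_fit() {
    awk -v cpu="$1" -v mem="$2" \
        '$2 + 0 >= cpu + 0 && $3 + 0 >= mem + 0 { print $1 }'
}

# Example: 2000 millicores and 1536 MiB are free in total, but a request
# for 1000 millicores and 1024 MiB fits on neither node, so nothing prints.
#   printf '%s\n' 'node-a 500 1024' 'node-b 1500 512' | nodes_that_fit 1000 1024
```

If the function prints no node names, reduce the request or add a larger node, as described above.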
Container Keeps Restarting
Symptom
A container starts, runs briefly, then stops and restarts repeatedly. The deployment shows a high restart count.
Solution
-
Check container logs:
In the dashboard, navigate to the deployment and open the Logs tab. Look for application errors, missing environment variables, or failed database connections.
-
Verify the health check:
If your deployment defines a health check, make sure the endpoint exists and returns a success response. A failing health check causes the platform to restart the container.
# Example: verify your health endpoint works
curl http://localhost:8080/health
-
Check for OOM kills:
If the container is being killed for exceeding its memory limit, you will see OOMKilled in the container events. Increase the memory limit in your deployment manifest or investigate your application's memory usage.
-
Check image tag:
Verify you are deploying the correct image tag. A misconfigured or broken image will crash immediately on startup.
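If you have shell access to the node, the health check and OOM checks can also be run from the command line. This is a sketch under two assumptions: that the platform treats any 2xx status as a passing health check (the exact rule is not stated here), and that the container name you pass is the one Docker reports on the node.

```shell
#!/bin/sh
# probe_status URL
# Prints the HTTP status code, or 000 if the connection failed.
probe_status() {
    curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# is_success CODE: succeeds for 2xx status codes (assumed success rule).
is_success() {
    case "$1" in
        2[0-9][0-9]) return 0 ;;
        *)           return 1 ;;
    esac
}

# was_oom_killed CONTAINER: asks Docker whether the container was stopped
# by the kernel OOM killer.
was_oom_killed() {
    [ "$(docker inspect --format '{{.State.OOMKilled}}' "$1")" = "true" ]
}

# Usage on the node:
#   code=$(probe_status http://localhost:8080/health)
#   is_success "$code" || echo "health check would fail (status $code)"
#   was_oom_killed my-container && echo "raise the memory limit"
```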
Canary Deployment Not Receiving Traffic
Symptom
You created a canary deployment, but the new version is not receiving any traffic. All requests go to the stable version.
Solution
-
Verify canary state:
The canary must be in Running state before it receives traffic. Check the deployment status in the dashboard.
-
Check traffic weight:
Navigate to the deployment's canary settings and verify the traffic weight is greater than 0%. A weight of 0% means the canary exists but receives no traffic.
-
Verify health checks pass:
Canary replicas must pass health checks before they are added to the load balancer rotation. Check that the canary containers are healthy.
-
Wait for propagation:
Traffic routing changes may take up to 30 seconds to propagate. If you just updated the weight, wait briefly and test again.
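A low traffic weight can look like no traffic at all if you only send a handful of test requests. The sketch below estimates how many requests to send before zero canary responses becomes a meaningful signal. It assumes each request is routed independently according to the weight, which may not hold if the load balancer uses sticky sessions.

```shell
#!/bin/sh
# requests_needed WEIGHT_PERCENT [MISS_PROBABILITY]
# Prints the smallest n such that (1 - w)^n <= MISS_PROBABILITY, that is,
# the number of requests after which seeing zero canary responses is
# unlikely to be chance. MISS_PROBABILITY defaults to 0.01.
requests_needed() {
    awk -v w="$1" -v p="${2:-0.01}" 'BEGIN {
        if (w <= 0)   { print "weight must be greater than 0" > "/dev/stderr"; exit 1 }
        if (w >= 100) { print 1; exit 0 }
        x = log(p) / log(1 - w / 100)
        n = int(x)
        if (n < x) n++        # round up to a whole request
        print n
    }'
}
```

For example, requests_needed 5 prints 90: with a 5% weight you need about 90 requests before the chance of missing the canary entirely drops to 1%. With a 50% weight, 7 requests are enough.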
Autoscaling Not Working
Symptom
Autoscaling is configured but the deployment does not scale up under load, or does not scale down when idle.
Solution
-
Verify metrics are available:
Autoscaling requires Prometheus metrics. Check the Monitoring tab for your deployment. If no metrics appear, the metrics endpoint may be unreachable.
-
Check scaling bounds:
Ensure your minimum and maximum replica counts are set correctly. If min equals max, autoscaling is effectively disabled.
-
Check the target metric:
If you are using a custom metric, verify the metric name is correct and the metric is being emitted by your application.
-
Review cooldown period:
After a scaling event, there is a cooldown period (default: 5 minutes) before the next scaling decision. This prevents thrashing. If load changed recently, wait for the cooldown to expire.
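The scaling bounds and cooldown checks come down to simple arithmetic. A minimal sketch, assuming you know the time of the last scaling event as a Unix timestamp (where to read that timestamp is not covered here) and using the 5-minute default cooldown:

```shell
#!/bin/sh
# cooldown_remaining LAST_SCALE_EPOCH NOW_EPOCH [COOLDOWN_SECONDS]
# Prints the seconds left before the next scaling decision (0 if none).
# COOLDOWN_SECONDS defaults to 300, the 5-minute default.
cooldown_remaining() {
    left=$(( $1 + ${3:-300} - $2 ))
    [ "$left" -gt 0 ] || left=0
    echo "$left"
}

# scaling_range_ok MIN MAX: fails when min is not below max, which leaves
# the autoscaler no room to act.
scaling_range_ok() {
    [ "$1" -lt "$2" ]
}

# Usage:
#   scaling_range_ok 3 3 || echo "min equals max: autoscaling cannot act"
#   cooldown_remaining 1700000000 "$(date +%s)"
```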
CVE Scan Failing
Symptom
A vulnerability scan returns an error instead of results.
Solution
-
Verify the image exists:
The image must be accessible from the control plane. If using a private registry, ensure registry credentials are configured under Settings > Registries.
-
Check image size:
Very large images (over 5 GB) may cause scanner timeouts. Consider optimizing your image size with multi-stage builds.
-
Retry the scan:
Transient network errors can cause scan failures. Wait a moment, then click Scan again in the dashboard to retry.
-
Check scanner status:
View the platform status page to confirm the scanning service is operational.
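To check an image against the size guideline before scanning, you can ask Docker for its size on any machine that has the image locally. This is a rough sketch: it reads 5 GB as 5 * 10^9 bytes (an assumption), and the size Docker reports locally is the unpacked size, which may differ from what the scanner measures. The image reference in the usage comment is a placeholder.

```shell
#!/bin/sh
LIMIT_BYTES=5000000000   # 5 GB read as 5 * 10^9 bytes (an assumption)

# image_too_large SIZE_BYTES: succeeds when the size is over the limit.
image_too_large() {
    [ "$1" -gt "$LIMIT_BYTES" ]
}

# Usage (placeholder image reference):
#   size=$(docker image inspect --format '{{.Size}}' registry.example.com/app:1.2.3)
#   image_too_large "$size" && echo "consider a multi-stage build"
```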
Athena Not Responding
Symptom
Messages sent to Athena in the dashboard chat panel receive no response, or Athena returns an error message.
Solution
-
Check Athena status:
Navigate to Settings > Athena and verify the service is enabled and shows a Connected status.
-
Rate limiting:
Athena has per-tenant rate limits. If you have sent many requests in a short period, wait a few minutes before retrying.
-
Verify API configuration:
If your tenant uses a custom AI API key, verify it is valid and has not expired under Settings > Athena.
-
Try a simpler query:
If complex queries fail, try a simple one like "Show my deployments" to determine if the issue is with Athena connectivity or with a specific tool integration.
Authentication Errors (401)
Symptom
Dashboard actions return 401 Unauthorized or you are redirected to the login page.
Solution
-
Re-authenticate:
Sign out and sign back in to the dashboard. Tokens expire after a configurable period. Re-authenticating issues a fresh token.
-
Check token in API requests:
If using the API directly, ensure the Authorization header includes a valid Bearer token:
Authorization: Bearer <your-token>
-
Verify your account is active:
Contact your tenant administrator to confirm your account has not been deactivated.
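When calling the API directly, a quick status check separates a token problem (401) from a permission problem (403). The sketch below uses curl. The API URL in the usage comment is a placeholder, not a documented Odysseus endpoint, and the helper names are illustrative.

```shell
#!/bin/sh
# bearer_header TOKEN: prints the header in the form shown above.
bearer_header() {
    printf 'Authorization: Bearer %s' "$1"
}

# api_status URL TOKEN: prints the HTTP status code of an authenticated GET.
api_status() {
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
        -H "$(bearer_header "$2")" "$1"
}

# explain_status CODE: maps the status codes covered on this page to a next step.
explain_status() {
    case "$1" in
        401) echo "token missing, invalid, or expired: re-authenticate" ;;
        403) echo "token accepted but role lacks permission: check your role" ;;
        *)   echo "status $1" ;;
    esac
}

# Usage (placeholder URL):
#   explain_status "$(api_status https://odysseus.example.com/api/deployments "$TOKEN")"
```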
Permission Denied (403)
Symptom
You can authenticate successfully, but certain operations return 403 Forbidden.
Solution
-
Check your role:
Your RBAC role determines which actions you can perform. View your current role in the dashboard under your profile menu.
-
Role capabilities:
Role        Capabilities
Read-only   View deployments, nodes, metrics, and logs
Developer   All Read-only permissions, plus create/update deployments and manage secrets
Operator    All Developer permissions, plus manage nodes, configure scaling, and run scans
Admin       Full access, including user management, RBAC, and tenant settings
-
Request a role change:
Contact your tenant administrator to adjust your role assignment if you need additional permissions.
Agent Upgrade Failed
Symptom
An agent upgrade was initiated but the node shows a Degraded or Rollback state.
Solution
-
Check agent logs:
sudo journalctl -u odysseus-agent --since "30 minutes ago"
Look for image pull errors, permission issues, or startup failures.
-
Verify image accessibility:
The new agent image must be pullable from the node. Check that the node has network access to the container registry.
-
Automatic rollback:
Failed upgrades automatically roll back to the previous agent version. The node should return to Healthy state after rollback. If it does not, restart the agent:
sudo systemctl restart odysseus-agent
-
Retry the upgrade:
After resolving the underlying issue, trigger the upgrade again from the dashboard under Nodes > [node] > Upgrade.
Secrets Not Injecting
Symptom
Your container starts but the expected secret files are missing from the mount path, or the files are empty.
Solution
-
Verify the secret path:
Check that the Vault path in your deployment manifest matches an existing secret. You can list available secrets from the dashboard under Secrets.
-
Check the key name:
The key field must match a key within the secret. If the secret contains {"username": "admin", "password": "s3cret"}, use key: "password" to inject just the password.
-
Check permissions:
Your deployment's service identity must have a Vault policy that allows reading the specified secret path. Contact your administrator if you receive permission errors.
-
Inspect the container:
Use the container shell feature in the dashboard to inspect the mount path inside the running container.
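Inside the container shell, the two symptoms (file missing, file empty) can be told apart with a small function. A minimal sketch, assuming the container image includes a POSIX shell; the mount path in the usage comment is hypothetical, so use the path from your deployment manifest.

```shell
#!/bin/sh
# secret_file_state PATH: prints "missing", "empty", or "ok".
secret_file_state() {
    if [ ! -e "$1" ]; then
        echo "missing"    # nothing was mounted: check the Vault path and permissions
    elif [ ! -s "$1" ]; then
        echo "empty"      # file exists with no content: check the key name
    else
        echo "ok"
    fi
}

# Usage (hypothetical mount path):
#   secret_file_state /run/secrets/db-password
```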
High Memory Usage on Node
Symptom
A node shows high memory utilization in the dashboard, and containers may be getting OOM-killed.
Solution
-
Review deployment resource limits:
Check each deployment running on the node. Containers without memory limits can consume unbounded memory. Set explicit limits:
resources:
  limits:
    memory: "512Mi"
  requests:
    memory: "256Mi"
-
Identify the offending container:
In the dashboard, navigate to the node and sort containers by memory usage to find which deployment is consuming the most memory.
-
Check for memory leaks:
If a container's memory usage grows continuously over time, your application may have a memory leak. Review application-level profiling.
-
Redistribute workloads:
If the node is overcommitted, add another node or adjust placement constraints to spread deployments across more nodes.
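If you prefer the command line to the dashboard view, docker stats on the node gives the same ranking. The sketch below converts the memory usage column to MiB and sorts it, largest first. It assumes the binary units (B, KiB, MiB, GiB, TiB) that current Docker releases print; other unit spellings would pass through unconverted.

```shell
#!/bin/sh
# rank_by_memory: reads "name usage / limit" lines on stdin (the shape
# produced by the docker stats format below) and prints "MiB name",
# largest first.
rank_by_memory() {
    awk '{
        value = $2 + 0                  # numeric prefix, e.g. 1.5 from "1.5GiB"
        unit = $2
        sub(/^[0-9.]+/, "", unit)       # what remains is the unit
        if (unit == "B")        value /= 1024 * 1024
        else if (unit == "KiB") value /= 1024
        else if (unit == "GiB") value *= 1024
        else if (unit == "TiB") value *= 1024 * 1024
        printf "%.1f %s\n", value, $1   # MiB needs no conversion
    }' | sort -rn
}

# Usage on the node:
#   docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' | rank_by_memory | head -n 5
```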
Getting Help
If the troubleshooting steps above do not resolve your issue, reach out through these support channels:
Athena (In-Dashboard AI Assistant)
For quick diagnostic help, ask Athena in the dashboard chat panel. Athena can check your deployment state, inspect logs, and suggest fixes in real time.
Documentation
Browse the full Odysseus documentation at docs.delta-telematics.ca/odysseus for detailed guides and tutorials.
Email Support
Contact the Delta Telematics support team at support@delta-telematics.ca. Include the following in your support request:
- Your tenant name
- The affected deployment or node name
- Timestamps of when the issue occurred
- Any error messages or codes received
- Steps you have already taken to troubleshoot
Status Page
Check the platform status page at status.delta-telematics.ca for ongoing incidents or scheduled maintenance that may affect your service.
The request ID is returned in the X-Request-ID response header and allows the support team to trace your specific request through the system logs.