Troubleshooting

Common issues and their solutions. Each entry describes the problem, what you will observe, and the steps to resolve it.

Node Not Appearing in Dashboard

Symptom

You installed the Odysseus agent on a node, but the node does not appear in the dashboard under Nodes.

Solution

  1. Verify Docker is running:
    docker info

    If Docker is not running, start it with sudo systemctl start docker. The agent requires a running Docker daemon.

  2. Check WireGuard connectivity:
    sudo wg show

    You should see an active WireGuard interface with a recent handshake timestamp. If there is no handshake, check that your node can make outbound UDP connections to the control plane endpoint.

  3. Verify the enrollment token:

    Enrollment tokens expire after a configurable period (default: 24 hours). If your token has expired, generate a new one from Settings > Enrollment Tokens in the dashboard and re-run the agent installation.

  4. Check agent logs:
    sudo journalctl -u odysseus-agent --since "10 minutes ago"

    Look for connection errors, authentication failures, or Docker API errors.
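
The scriptable checks above (steps 1, 2, and 4) can be run in one pass on the node. The following is a minimal sketch rather than an official tool: the function names, the 180-second staleness threshold, and the grep pattern for log lines are assumptions for illustration. The docker, wg, and journalctl invocations are the standard ones.

```shell
# Sketch only: function names, the 180 s threshold, and the log pattern are
# illustrative assumptions, not part of the Odysseus agent.

# Succeeds (exit 0) when a handshake is missing or older than max_age seconds.
# wg reports 0 for a peer that has never completed a handshake.
handshake_is_stale() {
  local last="$1" now="$2" max_age="${3:-180}"
  [ "$last" -eq 0 ] || [ $(( now - last )) -gt "$max_age" ]
}

check_node() {
  # Step 1: the agent needs a running Docker daemon.
  docker info >/dev/null 2>&1 || { echo "Docker is not running"; return 1; }

  # Step 2: each output line is "<interface> <peer-public-key> <unix-timestamp>".
  local now iface peer last
  now=$(date +%s)
  sudo wg show all latest-handshakes | while read -r iface peer last; do
    if handshake_is_stale "$last" "$now"; then
      echo "No recent WireGuard handshake on $iface"
    fi
  done

  # Step 4: surface likely failures from the last 10 minutes of agent logs.
  sudo journalctl -u odysseus-agent --since "10 minutes ago" --no-pager \
    | grep -iE 'error|fail|denied|unauthorized' \
    || echo "No obvious errors in agent logs"
}
```

Call check_node on the affected node. Step 3 (the enrollment token) still has to be checked in the dashboard, because token expiry is not visible from the node in this sketch.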

Deployment Stuck in Scheduling

Symptom

A deployment remains in Scheduling state and never transitions to Running.

Solution

  1. Check available nodes:

    Open the Nodes page and verify that at least one node is in Healthy state. If all nodes are Offline or Draining, the scheduler has nowhere to place containers.

  2. Check resource availability:

    Your deployment's CPU and memory requests may exceed what is available on any single node. Reduce the resource requests or add a node with more capacity.

  3. Check resource quotas:

    Your tenant may have a resource quota that has been reached. View your quota usage under Settings > Quotas.

  4. Check node selectors:

    If your deployment specifies node selectors or affinity rules, ensure at least one healthy node matches those constraints.
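
Step 2 is easy to misjudge, because free capacity summed across nodes does not help if no single node can hold the request. The sketch below shows that per-node comparison. It is an illustration only: the input format (node name, free CPU in millicores, free memory in MiB) is a hypothetical stand-in for figures you read off the Nodes page, and the real scheduler may apply additional rules.

```shell
# Sketch only: the "name free_cpu_millicores free_memory_mib" input format is a
# hypothetical stand-in for figures read off the Nodes page.

# Prints the names of nodes with room for the request.
# Exits non-zero if no single node fits both the CPU and the memory request.
nodes_that_fit() {
  local cpu_request="$1" mem_request="$2"
  awk -v cpu="$cpu_request" -v mem="$mem_request" '
    $2 >= cpu && $3 >= mem { print $1; found = 1 }
    END { exit(found ? 0 : 1) }
  '
}
```

For example, piping two lines such as "node-a 500 1024" and "node-b 2000 4096" into nodes_that_fit 1000 2048 prints node-b, since node-a is short on both CPU and memory.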

Container Keeps Restarting

Symptom

A container starts, runs briefly, then stops and restarts repeatedly. The deployment shows a high restart count.

Solution

  1. Check container logs:

    In the dashboard, navigate to the deployment and open the Logs tab. Look for application errors, missing environment variables, or failed database connections.

  2. Verify the health check:

    If your deployment defines a health check, make sure the endpoint exists and returns a success response. A failing health check causes the platform to restart the container.

    # Example: verify your health endpoint works
    curl http://localhost:8080/health

  3. Check for OOM kills:

    If the container is being killed for exceeding its memory limit, you will see OOMKilled in the container events. Increase the memory limit in your deployment manifest or investigate your application's memory usage.

  4. Check image tag:

    Verify you are deploying the correct image tag. A misconfigured or broken image will crash immediately on startup.
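
If you have shell access to the node, steps 2 and 3 can also be checked from the command line. This is a minimal sketch under stated assumptions: the workload runs as a plain Docker container on the node, any 2xx status counts as a successful health check (the platform's exact rule is not documented here), and the function names are illustrative.

```shell
# Sketch only: assumes shell access to the node and a plain Docker container.
# Treating any 2xx code as "healthy" is an assumption about the platform.

# Succeeds only for 2xx status codes.
is_success_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the HTTP status of a health endpoint ("000" if the connection failed).
health_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Prints the restart count and whether the container was OOM-killed.
restart_summary() {
  docker inspect \
    --format 'restarts={{.RestartCount}} oom_killed={{.State.OOMKilled}}' "$1"
}
```

Pass the output of health_status http://localhost:8080/health to is_success_status to see whether the endpoint would pass, and run restart_summary with the container name or ID to check for OOM kills.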

Canary Deployment Not Receiving Traffic

Symptom

You created a canary deployment, but the new version is not receiving any traffic. All requests go to the stable version.

Solution

  1. Verify canary state:

    The canary must be in Running state before it receives traffic. Check the deployment status in the dashboard.

  2. Check traffic weight:

    Navigate to the deployment's canary settings and verify the traffic weight is greater than 0%. A weight of 0% means the canary exists but receives no traffic.

  3. Verify health checks pass:

    Canary replicas must pass health checks before they are added to the load balancer rotation. Check that the canary containers are healthy.

  4. Wait for propagation:

    Traffic routing changes may take up to 30 seconds to propagate. If you just updated the weight, wait briefly and test again.

Autoscaling Not Working

Symptom

Autoscaling is configured but the deployment does not scale up under load, or does not scale down when idle.

Solution

  1. Verify metrics are available:

    Autoscaling requires Prometheus metrics. Check the Monitoring tab for your deployment. If no metrics appear, the metrics endpoint may be unreachable.

  2. Check scaling bounds:

    Ensure your minimum and maximum replica counts are set correctly. If min equals max, autoscaling is effectively disabled.

  3. Check the target metric:

    If you are using a custom metric, verify the metric name is correct and the metric is being emitted by your application.

  4. Review cooldown period:

    After a scaling event, there is a cooldown period (default: 5 minutes) before the next scaling decision. This prevents thrashing. If load changed recently, wait for the cooldown to expire.
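
To judge whether the cooldown in step 4 could explain a missing scaling event, compare the time of the last scaling event with the current time. The helper below is a sketch of that arithmetic only: the function name is illustrative, the 300-second default mirrors the documented 5 minutes, and you supply both timestamps as Unix epoch seconds.

```shell
# Sketch only: 300 s mirrors the documented 5-minute default; both timestamps
# are Unix epoch seconds that you supply.

# Prints the seconds left before the next scaling decision (0 if expired).
cooldown_remaining() {
  local last_scale="$1" now="$2" cooldown="${3:-300}"
  local left=$(( last_scale + cooldown - now ))
  [ "$left" -gt 0 ] || left=0
  echo "$left"
}
```

A result of 0 means the cooldown has expired, so a continued lack of scaling points to one of the other steps.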

CVE Scan Failing

Symptom

A vulnerability scan returns an error instead of results.

Solution

  1. Verify the image exists:

    The image must be accessible from the control plane. If using a private registry, ensure registry credentials are configured under Settings > Registries.

  2. Check image size:

    Very large images (over 5 GB) may cause scanner timeouts. Consider optimizing your image size with multi-stage builds.

  3. Retry the scan:

    Transient network errors can cause scan failures. Wait a moment, then click Scan again in the dashboard to retry.

  4. Check scanner status:

    View the platform status page to confirm the scanning service is operational.
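
For step 2, you can check an image's size before scanning if the image is available to a local Docker daemon. This is a sketch with assumptions labeled: it reads 5 GB as 5 × 1000³ bytes, the function names are illustrative, and Docker reports the unpacked size on disk, which may differ from the size the scanner measures.

```shell
# Sketch only: assumes the image is present on a machine with the Docker CLI,
# and reads "5 GB" as 5 * 1000^3 bytes.

SCAN_SIZE_LIMIT_BYTES=$(( 5 * 1000 * 1000 * 1000 ))

# Succeeds when the given size in bytes is over the limit.
exceeds_scan_limit() {
  [ "$1" -gt "$SCAN_SIZE_LIMIT_BYTES" ]
}

# Prints the image size in bytes as reported by the local Docker daemon.
image_size_bytes() {
  docker image inspect --format '{{.Size}}' "$1"
}
```

Pass the output of image_size_bytes for your image to exceeds_scan_limit. If it succeeds, the image is large enough that a scanner timeout is plausible.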

Athena Not Responding

Symptom

Messages sent to Athena in the dashboard chat panel receive no response, or Athena returns an error message.

Solution

  1. Check Athena status:

    Navigate to Settings > Athena and verify the service is enabled and shows a Connected status.

  2. Check rate limits:

    Athena has per-tenant rate limits. If you have sent many requests in a short period, wait a few minutes before retrying.

  3. Verify API configuration:

    If your tenant uses a custom AI API key, go to Settings > Athena and verify that the key is valid and has not expired.

  4. Try a simpler query:

    If complex queries fail, try a simple one like "Show my deployments" to determine if the issue is with Athena connectivity or with a specific tool integration.

Authentication Errors (401)

Symptom

Dashboard actions return 401 Unauthorized or you are redirected to the login page.

Solution

  1. Re-authenticate:

    Sign out and sign back in to the dashboard. Tokens expire after a configurable period. Re-authenticating issues a fresh token.

  2. Check token in API requests:

    If using the API directly, ensure the Authorization header includes a valid Bearer token:

    Authorization: Bearer <your-token>

  3. Verify your account is active:

    Contact your tenant administrator to confirm your account has not been deactivated.
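
For step 2, a quick way to test a token outside your own client code is to send one request with curl and look only at the status code. This is a hedged sketch: the URL you pass is whatever API endpoint you already call (no endpoint path is documented here), and the function names are illustrative.

```shell
# Sketch only: the URL is supplied by you; no Odysseus endpoint path is assumed.

# Builds the header line in the form shown in step 2.
bearer_header() {
  printf 'Authorization: Bearer %s' "$1"
}

# Prints only the HTTP status code, so a 401 is easy to spot.
api_status() {
  local url="$1" token="$2"
  curl -s -o /dev/null -w '%{http_code}' -H "$(bearer_header "$token")" "$url"
}
```

If api_status prints 401 for a token you believe is current, common causes are a missing Bearer prefix in your own client, stray whitespace or a newline in the token, or an expired token.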

Permission Denied (403)

Symptom

You can authenticate successfully, but certain operations return 403 Forbidden.

Solution

  1. Check your role:

    Your RBAC role determines which actions you can perform. View your current role in the dashboard under your profile menu.

  2. Role capabilities:

    Role       Capabilities
    Read-only  View deployments, nodes, metrics, and logs
    Developer  All Read-only permissions plus create/update deployments, manage secrets
    Operator   All Developer permissions plus manage nodes, configure scaling, run scans
    Admin      Full access including user management, RBAC, tenant settings

  3. Request a role change:

    Contact your tenant administrator to adjust your role assignment if you need additional permissions.

Agent Upgrade Failed

Symptom

An agent upgrade was initiated but the node shows a Degraded or Rollback state.

Solution

  1. Check agent logs:
    sudo journalctl -u odysseus-agent --since "30 minutes ago"

    Look for image pull errors, permission issues, or startup failures.

  2. Verify image accessibility:

    The new agent image must be pullable from the node. Check that the node has network access to the container registry.

  3. Automatic rollback:

    Failed upgrades automatically roll back to the previous agent version. The node should return to Healthy state after rollback. If it does not, restart the agent:

    sudo systemctl restart odysseus-agent

  4. Retry the upgrade:

    After resolving the underlying issue, trigger the upgrade again from the dashboard under Nodes > [node] > Upgrade.
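
When you restart the agent in step 3, it can take a few seconds for the service to come back. The sketch below restarts the agent and then polls systemd until the unit is active. The helper names and the timing (12 attempts, 5 seconds apart) are assumptions. An active systemd unit also does not guarantee that the dashboard shows the node as Healthy, so confirm the state there afterward.

```shell
# Sketch only: helper names and retry timing are illustrative assumptions.

# Retries a command until it succeeds or the attempts run out.
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" && return 0
    attempts=$(( attempts - 1 ))
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# Restarts the agent, then waits up to about a minute for systemd to report it active.
restart_agent_and_wait() {
  sudo systemctl restart odysseus-agent
  wait_for 12 5 systemctl is-active --quiet odysseus-agent
}
```

If restart_agent_and_wait returns non-zero, go back to the agent logs from step 1 before retrying the upgrade.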

Secrets Not Injecting

Symptom

Your container starts but the expected secret files are missing from the mount path, or the files are empty.

Solution

  1. Verify the secret path:

    Check that the Vault path in your deployment manifest matches an existing secret. You can list available secrets from the dashboard under Secrets.

  2. Check the key name:

    The key field must match a key within the secret. If the secret contains {"username": "admin", "password": "s3cret"}, use key: "password" to inject just the password.

  3. Check permissions:

    Your deployment's service identity must have a Vault policy that allows reading the specified secret path. Contact your administrator if you receive permission errors.

  4. Inspect the container:

    Use the container shell feature in the dashboard to inspect the mount path inside the running container. Confirm that the expected files exist and are not empty.
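
Once you have a shell in the container (step 4), a small helper can report which secret files are present and which are empty without printing their contents. This is a sketch with assumptions: the function name is illustrative, the mount path is whatever your manifest specifies, and the container image must include find and sort.

```shell
# Sketch only: run inside the container; pass the mount path from your manifest.

# Lists each file under the mount path as "ok" or "empty" without printing
# any secret contents. Fails if the mount path itself is missing.
secret_file_report() {
  local dir="$1" f
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  # -L follows symlinks, in case the files are mounted as links.
  find -L "$dir" -type f | sort | while read -r f; do
    if [ -s "$f" ]; then
      echo "ok: $f"
    else
      echo "empty: $f"
    fi
  done
}
```

A missing directory usually points back to the manifest (steps 1 and 2), while files that exist but are empty are more often a key name or permissions problem (steps 2 and 3).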

High Memory Usage on Node

Symptom

A node shows high memory utilization in the dashboard, and containers may be getting OOM-killed.

Solution

  1. Review deployment resource limits:

    Check each deployment running on the node. Containers without memory limits can consume unbounded memory. Set explicit limits:

    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"

  2. Identify the offending container:

    In the dashboard, navigate to the node and sort containers by memory usage to find which deployment is consuming the most memory.

  3. Check for memory leaks:

    If a container's memory usage grows continuously over time, your application may have a memory leak. Profile the application to find where memory is being retained.

  4. Redistribute workloads:

    If the node is overcommitted, add another node or adjust placement constraints to spread deployments across more nodes.
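
If you prefer the command line to the dashboard for steps 1 and 2, the Docker CLI on the node exposes the same information. The following is a minimal sketch, assuming shell access to the node and that workloads run as plain Docker containers. The function names are illustrative.

```shell
# Sketch only: assumes shell access to the node and the Docker CLI.

# Sorts "name<TAB>percent..." lines by memory percentage, highest first.
sort_by_mem_percent() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2 -rn
}

# One snapshot of per-container memory use, heaviest first.
top_memory_containers() {
  docker stats --no-stream --format '{{.Name}}\t{{.MemPerc}}\t{{.MemUsage}}' \
    | sort_by_mem_percent
}

# Prints a container's memory limit in bytes; 0 means no limit is set.
memory_limit_bytes() {
  docker inspect --format '{{.HostConfig.Memory}}' "$1"
}
```

Run top_memory_containers to find the heaviest container, then memory_limit_bytes with its name. A result of 0 means the container can consume unbounded memory and needs an explicit limit.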

Getting Help

If the troubleshooting steps above do not resolve your issue, reach out through these support channels:

Athena (In-Dashboard AI Assistant)

For quick diagnostic help, ask Athena in the dashboard chat panel. Athena can check your deployment state, inspect logs, and suggest fixes in real time.

Documentation

Browse the full Odysseus documentation at docs.delta-telematics.ca/odysseus for detailed guides and tutorials.

Email Support

Contact the Delta Telematics support team at support@delta-telematics.ca. Include the following in your support request:

  • Your tenant name
  • The affected deployment or node name
  • Timestamps of when the issue occurred
  • Any error messages or codes received
  • Steps you have already taken to troubleshoot

Status Page

Check the platform status page at status.delta-telematics.ca for ongoing incidents or scheduled maintenance that may affect your service.

Tip: When contacting support, include the request ID from API error responses. This ID is returned in the X-Request-ID response header and allows the support team to trace your specific request through the system logs.
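
If you call the API with curl, the request ID can be pulled out of the response headers directly. This is a small sketch: the function name is illustrative, and the only detail taken from the platform is the X-Request-ID header name mentioned in the tip.

```shell
# Sketch only: the function name is illustrative.

# Reads raw HTTP response headers on stdin and prints the X-Request-ID value.
# Header names are case-insensitive and header lines end in CRLF, so the
# carriage returns are stripped and the name is compared in lowercase.
request_id_from_headers() {
  tr -d '\r' | awk '
    tolower($0) ~ /^x-request-id:/ {
      sub(/^[^:]*:[ \t]*/, "")
      print
      exit
    }
  '
}
```

To use it, have curl write the response headers to standard output with -D - and discard the body with -o /dev/null, then pipe the result into request_id_from_headers.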