Improve UX when spawning the agent pod fails #3578

@Razz4780

Part of the OSS mirrord flow is creating a k8s Job running a Pod with the mirrord-agent container (here).

Starting that pod might fail for multiple reasons. Our logic there is minimal: we only wait until status.phase == "Running". When the agent pod cannot be spawned, this usually surfaces as a generic timeout error (the timeout is enforced somewhere up the call stack), which tells the user nothing about the actual cause.

We should:

  1. Fail early if the agent pod moves to the Failed phase. This can happen due to cluster conditions. We should extract status.reason and status.message and include them in the error message presented to the user.
  2. Fail early if the agent pod moves to the Succeeded phase. This should never happen, and should be reported to the user as a bug.
  3. Fail early if the agent pod is deleted while in the Pending phase. This usually means that the user does not have sufficient permissions to spawn the agent pod in the cluster. The error message presented to the user should mention Pod Security Admission as a probable cause of the failure, and suggest trying out mirrord for Teams (similar to this).
  4. Every 10s while the agent pod is stuck in the Pending phase, issue a Progress::warning. The warning should state that the agent pod startup is taking longer than expected, and include info about status.containerStatuses.[].state of the agent container. See container states for reference.
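
The four rules above can be sketched as a pure decision function, kept separate from the actual pod watching. This is only an illustration, not the existing mirrord code: PodSnapshot, PodEvent, Outcome, evaluate and WARN_INTERVAL are hypothetical names, the snapshot is a simplified stand-in for the real pod status type, and the handling of a pod deleted outside the Pending phase is an assumption (the list does not cover it).

```rust
use std::time::Duration;

/// Interval between warnings while the pod is stuck in Pending (rule 4).
const WARN_INTERVAL: Duration = Duration::from_secs(10);

/// Hypothetical, simplified snapshot of the pod status fields we care about.
#[derive(Debug, Clone, Default)]
struct PodSnapshot {
    phase: String,                         // status.phase
    reason: Option<String>,                // status.reason
    message: Option<String>,               // status.message
    agent_container_state: Option<String>, // summary of the agent container state
}

/// Hypothetical event produced by whatever watches the agent pod.
#[derive(Debug, Clone)]
enum PodEvent {
    Updated(PodSnapshot),
    Deleted { last_phase: String },
}

/// What the caller should do after seeing an event.
#[derive(Debug, PartialEq)]
enum Outcome {
    Ready,
    KeepWaiting,
    Warn(String), // would be forwarded to Progress::warning
    Fail(String), // would become the error shown to the user
}

fn evaluate(event: &PodEvent, since_last_warning: Duration) -> Outcome {
    match event {
        PodEvent::Updated(pod) => match pod.phase.as_str() {
            "Running" => Outcome::Ready,
            // Rule 1: surface status.reason and status.message.
            "Failed" => Outcome::Fail(format!(
                "agent pod failed: reason={}, message={}",
                pod.reason.as_deref().unwrap_or("<none>"),
                pod.message.as_deref().unwrap_or("<none>"),
            )),
            // Rule 2: the agent should never complete on its own during startup.
            "Succeeded" => Outcome::Fail(
                "agent pod unexpectedly reached the Succeeded phase, this is a bug, please report it"
                    .to_string(),
            ),
            // Rule 4: periodic warning with the agent container state.
            "Pending" if since_last_warning >= WARN_INTERVAL => Outcome::Warn(format!(
                "agent pod startup is taking longer than expected, agent container state: {}",
                pod.agent_container_state.as_deref().unwrap_or("<unknown>"),
            )),
            _ => Outcome::KeepWaiting,
        },
        // Rule 3: deletion while Pending usually points at insufficient permissions.
        PodEvent::Deleted { last_phase } if last_phase.as_str() == "Pending" => Outcome::Fail(
            "agent pod was deleted while Pending, probably due to Pod Security Admission; \
             consider trying mirrord for Teams"
                .to_string(),
        ),
        // Assumption: any other deletion before readiness is also a failure.
        PodEvent::Deleted { .. } => {
            Outcome::Fail("agent pod was deleted before it became ready".to_string())
        }
    }
}
```

Keeping the decision free of any cluster calls means each rule can be unit-tested without a cluster; the caller would own the watch loop, track the time since the last warning, and reset it whenever evaluate returns Warn.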
