Part of the OSS mirrord flow is creating a k8s Job running a Pod with the mirrord-agent container (here).
Starting that pod might fail for multiple reasons. Our current logic is minimal: we only wait until status.phase == "Running". When the agent pod cannot be spawned, this usually surfaces as a generic timeout error (the timeout is enforced somewhere up the call stack), which tells the user nothing about the actual cause.
We should:
- Fail early if the agent pod moves to the Failed phase. This can happen due to cluster conditions. We should extract status.reason and status.message and include them in the error message presented to the user.
- Fail early if the agent pod moves to the Succeeded phase. This should never happen, and should be reported to the user as a bug.
- Fail early if the agent pod is deleted while in the Pending phase. This usually means that the user does not have sufficient permissions to spawn the agent pod in the cluster. The error message presented to the user should mention Pod Security Admission as a probable cause of the failure, and suggest trying out mirrord for Teams (similar to this).
- Every 10s that the agent pod remains stuck in the Pending phase, issue a Progress::warning. The warning should state that the agent pod startup is taking longer than expected, and include info about status.containerStatuses.[].state of the agent container. See container states for reference.
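The rules above could be sketched as a pure decision function, kept separate from the actual pod watching so it is easy to unit test. This is only an illustration, not the existing mirrord code: `Phase`, `PodSnapshot`, `WatchEvent`, `Outcome`, `classify` and `WARN_INTERVAL` are hypothetical names, the message wording is a placeholder, and treating a deletion outside the Pending phase as a generic failure is an assumption the list above does not cover.

```rust
use std::time::Duration;

/// Hypothetical, simplified mirror of `status.phase`.
#[derive(Debug, Clone, PartialEq)]
pub enum Phase {
    Pending,
    Running,
    Succeeded,
    Failed,
}

/// Hypothetical snapshot of the pod fields this issue cares about.
#[derive(Debug, Clone, PartialEq)]
pub struct PodSnapshot {
    pub phase: Phase,
    /// `status.reason`
    pub reason: Option<String>,
    /// `status.message`
    pub message: Option<String>,
    /// Rendered state of the agent container from `status.containerStatuses`.
    pub agent_container_state: Option<String>,
}

/// Hypothetical watch event: the pod was created/updated, or deleted.
#[derive(Debug, Clone, PartialEq)]
pub enum WatchEvent {
    Applied(PodSnapshot),
    Deleted(PodSnapshot),
}

/// What the caller should do after seeing an event.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Ready,
    KeepWaiting,
    /// Text to pass to something like `Progress::warning`.
    Warn(String),
    /// Text for the error shown to the user.
    Fail(String),
}

/// Interval between "still pending" warnings, as proposed in the issue.
pub const WARN_INTERVAL: Duration = Duration::from_secs(10);

/// Maps one watch event to an outcome. `since_last_warning` is the time since
/// the last warning was issued (or since the wait started).
pub fn classify(event: &WatchEvent, since_last_warning: Duration) -> Outcome {
    match event {
        WatchEvent::Deleted(pod) if pod.phase == Phase::Pending => Outcome::Fail(
            "agent pod was deleted while Pending; a probable cause is Pod Security \
             Admission rejecting the pod, consider trying mirrord for Teams"
                .to_string(),
        ),
        // Assumption: any other deletion is reported as a generic failure.
        WatchEvent::Deleted(_) => {
            Outcome::Fail("agent pod was deleted before it became ready".to_string())
        }
        WatchEvent::Applied(pod) => match pod.phase {
            Phase::Running => Outcome::Ready,
            Phase::Failed => Outcome::Fail(format!(
                "agent pod failed: reason={}, message={}",
                pod.reason.as_deref().unwrap_or("unknown"),
                pod.message.as_deref().unwrap_or("unknown"),
            )),
            Phase::Succeeded => Outcome::Fail(
                "agent pod exited unexpectedly; this is a bug, please report it".to_string(),
            ),
            Phase::Pending if since_last_warning >= WARN_INTERVAL => Outcome::Warn(format!(
                "agent pod startup is taking longer than expected, container state: {}",
                pod.agent_container_state.as_deref().unwrap_or("unknown"),
            )),
            Phase::Pending => Outcome::KeepWaiting,
        },
    }
}
```

A caller would feed each watch event into `classify`, reset its warning timer whenever it receives `Outcome::Warn`, and stop waiting on `Outcome::Ready` or `Outcome::Fail`. Since a pod stuck in Pending may not produce new events, the caller would also need to re-run `classify` on the last seen snapshot when the 10s timer fires.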