Describe the bug
When default-image-check-pull-policy is set to IfNotPresent and no explicit imagePullPolicy is set on a container using a tagged image, the imagecheck init container and the app container resolve to different pull policies. The imagecheck passes using a stale cached image on the node, but the app container uses Always (the default for tagged images) and fails with ImagePullBackOff when the image is no longer available in the registry.
This defeats the purpose of the imagecheck preflight: the check passes, all init containers succeed, the pod transitions to Running, and the agent acquires the job. But container-0 never starts because its image pull fails. The job then hangs silently with no logs until manual cancellation or a pipeline timeout.
The root cause is in selectImagesToCheck (internal/controller/scheduler/imagecheck.go), where two separate pull policies are computed for the same image:
// Imagecheck init container policy ("nominalPolicy")
nominalPolicy := cmp.Or(
	c.ImagePullPolicy,                 // user-set policy on the container
	w.cfg.DefaultImageCheckPullPolicy, // controller config (IfNotPresent in this report)
	w.cfg.DefaultImagePullPolicy,      // controller config
	defaultPullPolicyForImage(ref),    // Always for tagged, IfNotPresent for digest
)
// App container policy
c.ImagePullPolicy = cmp.Or(
	c.ImagePullPolicy,              // user-set policy on the container
	w.cfg.DefaultImagePullPolicy,   // controller config
	defaultPullPolicyForImage(ref), // Always for tagged, IfNotPresent for digest
)
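nominalPolicy includes DefaultImageCheckPullPolicy in its cmp.Or chain, but the app container policy does not. cmp.Or returns its first non-zero argument, so when the container sets no imagePullPolicy and DefaultImageCheckPullPolicy is set, the imagecheck stops at that value while the app container falls through to DefaultImagePullPolicy or the per-image default. The two policies can then diverge.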
There is a safeguard that downgrades the app container from Always to IfNotPresent when both policies are Always:
if nominalPolicy == corev1.PullAlways && c.ImagePullPolicy == corev1.PullAlways {
	c.ImagePullPolicy = corev1.PullIfNotPresent
}
This safeguard does not activate when the policies differ (e.g. nominalPolicy is IfNotPresent and c.ImagePullPolicy is Always), which is the exact configuration that triggers this bug.
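The divergence can be reproduced outside the controller. The following is a minimal sketch, not the controller's code: PullPolicy is a local stand-in for corev1.PullPolicy, and resolve is a hypothetical helper that applies the two cmp.Or chains and the safeguard quoted above to plain values.

```go
package main

import (
	"cmp"
	"fmt"
)

// PullPolicy is a local stand-in for corev1.PullPolicy so the sketch
// runs without Kubernetes dependencies.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
)

// resolve is a hypothetical helper mirroring the two cmp.Or chains and
// the safeguard from this report. An empty string means "not set";
// imageDefault is the value defaultPullPolicyForImage would return.
func resolve(userPolicy, checkDefault, pullDefault, imageDefault PullPolicy) (check, app PullPolicy) {
	check = cmp.Or(userPolicy, checkDefault, pullDefault, imageDefault)
	app = cmp.Or(userPolicy, pullDefault, imageDefault)
	// Safeguard: only fires when both policies are Always.
	if check == PullAlways && app == PullAlways {
		app = PullIfNotPresent
	}
	return check, app
}

func main() {
	// Configuration from this report: tagged image, no container policy,
	// only default-image-check-pull-policy set to IfNotPresent.
	check, app := resolve("", PullIfNotPresent, "", PullAlways)
	fmt.Printf("imagecheck=%s app=%s\n", check, app)

	// Nothing configured: both resolve to Always and the safeguard applies.
	check, app = resolve("", "", "", PullAlways)
	fmt.Printf("imagecheck=%s app=%s\n", check, app)
}
```

The first call prints imagecheck=IfNotPresent app=Always, which is the mismatch described here. The second prints imagecheck=Always app=IfNotPresent, the case the safeguard was written for.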
To Reproduce
- Deploy the controller with default-image-check-pull-policy: IfNotPresent and no default-image-pull-policy set
- Run a job that references a tagged container image (e.g. my-app:v1.2.3) with no explicit imagePullPolicy on the container
- Ensure the image is cached on the node from a previous run
- Remove the image from the container registry
- Trigger the job
The imagecheck init container passes using the cached image (IfNotPresent). The app container (container-0) attempts a fresh pull (Always), fails with ImagePullBackOff, and the job hangs with no output.
Expected behavior
The imagecheck init container should use a pull policy at least as strong as the app container's policy. If the app container will pull with Always, the imagecheck should also pull with Always, so that a missing or unavailable image is caught during the preflight check rather than after the agent has acquired the job.
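One way to express "at least as strong" is to compare the two resolved policies by rank and raise the imagecheck policy when the app container's is stronger. This is only a sketch of the idea under stated assumptions: PullPolicy is a local stand-in for corev1.PullPolicy, and rank and atLeastAsStrong are hypothetical names, not the controller's existing pullPolicyPreference.

```go
package main

import "fmt"

// PullPolicy is a local stand-in for corev1.PullPolicy.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
	PullNever        PullPolicy = "Never"
)

// rank is a hypothetical ordering (Always > IfNotPresent > Never).
// The numeric values are illustrative only.
var rank = map[PullPolicy]int{
	PullNever:        0,
	PullIfNotPresent: 1,
	PullAlways:       2,
}

// atLeastAsStrong returns the imagecheck policy, raised to the app
// container's policy when the app container's policy ranks higher.
func atLeastAsStrong(check, app PullPolicy) PullPolicy {
	if rank[app] > rank[check] {
		return app
	}
	return check
}

func main() {
	// Reported configuration: the check would be raised to Always.
	fmt.Println(atLeastAsStrong(PullIfNotPresent, PullAlways))
	// The check is already the stronger policy: unchanged.
	fmt.Println(atLeastAsStrong(PullAlways, PullIfNotPresent))
}
```

If such a step ran before the existing safeguard, a raised imagecheck policy of Always would let the safeguard downgrade the app container to IfNotPresent as it does today. Whether that placement is the right fix is a question for the maintainers.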
Environment
- agent-stack-k8s version: v0.32.2 (confirmed), also present on main at HEAD
- Kubernetes version: any
- Deployment method: any
Logs
No job logs are produced. The agent acquires the job but all containers block on the startup check (kubernetes/runner.go:startupCheck), waiting for container-0 to register via the socket. Since container-0 never starts, no bootstrap phases execute and no logs are written.
Pod events show repeated ImagePullBackOff on container-0:
Warning Failed <timestamp> kubelet Failed to pull image "my-app:v1.2.3": ...
Warning Failed <timestamp> kubelet Error: ImagePullBackOff
Additional context
The default pull policy for tagged images is Always, matching Kubernetes' own default (tags are mutable). This is computed by defaultPullPolicyForImage in internal/controller/scheduler/scheduler.go:
func defaultPullPolicyForImage(ref reference.Reference) corev1.PullPolicy {
	if _, hasDigest := ref.(reference.Digested); hasDigest {
		return corev1.PullIfNotPresent
	}
	return corev1.PullAlways
}
The existing pullPolicyPreference ranking (Always > IfNotPresent > Never) is only applied when multiple containers share the same image, not when comparing the imagecheck against its own app container.
This bug interacts with a separate pod watcher gap where ImagePullBackOff on a running pod is not acted upon (the pod watcher defers to the agent for jobs in Running state, but the agent cannot observe sibling container image pull failures). Together, these two issues cause the job to hang indefinitely.