Gap with imagePullPolicy coverage for container-0 misses ImagePullBackOff errors #846

## Describe the bug

When `default-image-check-pull-policy` is set to `IfNotPresent` and no explicit `imagePullPolicy` is set on a container using a tagged image, the imagecheck init container and the app container resolve to different pull policies. The imagecheck passes using a stale cached image on the node, but the app container uses `Always` (the default for tagged images) and fails with `ImagePullBackOff` when the image is no longer available in the registry.

This defeats the purpose of the imagecheck preflight: the check passes, all init containers succeed, the pod transitions to `Running`, and the agent acquires the job. But `container-0` never starts because its image pull fails. The job then hangs silently with no logs until manual cancellation or a pipeline timeout.

The root cause is in `selectImagesToCheck` (`internal/controller/scheduler/imagecheck.go`), where two separate pull policies are computed for the same image:

```go
// Imagecheck init container policy ("nominalPolicy")
nominalPolicy := cmp.Or(
    c.ImagePullPolicy,                  // user-set policy on the container
    w.cfg.DefaultImageCheckPullPolicy,  // controller config (IfNotPresent in this scenario)
    w.cfg.DefaultImagePullPolicy,       // controller config
    defaultPullPolicyForImage(ref),     // Always for tagged, IfNotPresent for digest
)

// App container policy
c.ImagePullPolicy = cmp.Or(
    c.ImagePullPolicy,                  // user-set policy on the container
    w.cfg.DefaultImagePullPolicy,       // controller config
    defaultPullPolicyForImage(ref),     // Always for tagged, IfNotPresent for digest
)
```

`nominalPolicy` includes `DefaultImageCheckPullPolicy` in its `cmp.Or` chain, but the app container policy does not. `cmp.Or` returns the first non-zero value, so when `DefaultImageCheckPullPolicy` is set and the container has no explicit `imagePullPolicy`, the two policies can diverge.

There is a safeguard that downgrades the app container's policy from `Always` to `IfNotPresent` when both policies are `Always`:

```go
if nominalPolicy == corev1.PullAlways && c.ImagePullPolicy == corev1.PullAlways {
    c.ImagePullPolicy = corev1.PullIfNotPresent
}
```

This safeguard does not activate when the policies differ (e.g. `nominalPolicy` is `IfNotPresent` and `c.ImagePullPolicy` is `Always`), which is the exact configuration that triggers this bug.

## To Reproduce

1. Deploy the controller with `default-image-check-pull-policy: IfNotPresent` and no `default-image-pull-policy` set
2. Run a job that references a tagged container image (e.g. `my-app:v1.2.3`) with no explicit `imagePullPolicy` on the container
3. Ensure the image is cached on the node from a previous run
4. Remove the image from the container registry
5. Trigger the job

The imagecheck init container passes using the cached image (`IfNotPresent`). The app container (`container-0`) attempts a fresh pull (`Always`), fails with `ImagePullBackOff`, and the job hangs with no output.

## Expected behavior

The imagecheck init container should use a pull policy at least as strong as the app container's policy. If the app container will pull with `Always`, the imagecheck should also pull with `Always`, so that a missing or unavailable image is caught during the preflight check rather than after the agent has acquired the job.

## Environment

- agent-stack-k8s version: v0.32.2 (confirmed), also present on `main` at HEAD
- Kubernetes version: any
- Deployment method: any

## Logs

No job logs are produced. The agent acquires the job but all containers block on the startup check (`kubernetes/runner.go:startupCheck`), waiting for `container-0` to register via the socket. Since `container-0` never starts, no bootstrap phases execute and no logs are written.

Pod events show repeated `ImagePullBackOff` on `container-0`:

```
Warning  Failed     <timestamp>  kubelet  Failed to pull image "my-app:v1.2.3": ...
Warning  Failed     <timestamp>  kubelet  Error: ImagePullBackOff
```

## Additional context

The default pull policy for tagged images is `Always`, matching Kubernetes' own default (tags are mutable). This is computed by `defaultPullPolicyForImage` in `internal/controller/scheduler/scheduler.go`:

```go
func defaultPullPolicyForImage(ref reference.Reference) corev1.PullPolicy {
    if _, hasDigest := ref.(reference.Digested); hasDigest {
        return corev1.PullIfNotPresent
    }
    return corev1.PullAlways
}
```

The existing `pullPolicyPreference` ranking (`Always > IfNotPresent > Never`) is only applied when multiple containers share the same image, not when comparing the imagecheck against its own app container.

This bug interacts with a separate pod watcher gap where `ImagePullBackOff` on a running pod is not acted upon (the pod watcher defers to the agent for jobs in `Running` state, but the agent cannot observe sibling container image pull failures). Together, these two issues cause the job to hang indefinitely.

