Kubernetes 1.33 promotes the Backoff Limit Per Index feature to general availability (GA). The feature lets users specify how many Pod failures to tolerate for each index of an Indexed Job, so that every index is retried against its own budget. This matters for workloads in which one failing index should not affect how the other indexes are retried.
Without this feature, the spec.backoffLimit field caps the total number of Pod failures across all indexes of a Job. A single index that fails quickly and repeatedly can exhaust that shared budget and fail the whole Job before the other indexes have had a chance to complete. The Backoff Limit Per Index feature solves this by letting users set the number of retries for each index independently via the spec.backoffLimitPerIndex field.
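The difference between the two budgets can be illustrated with a toy model. The sketch below is not the Job controller's algorithm: it assumes indexes run one at a time in order, and the function names run_global and run_per_index are hypothetical. Each entry of the input says how many times that index fails before it would succeed, with math.inf standing for an index that always fails.

```python
import math
from typing import Dict, List, Sequence


def run_global(failures_before_success: Sequence[float], backoff_limit: int) -> Dict[str, object]:
    """Toy model of a shared budget: the Job fails once total failures exceed backoff_limit."""
    succeeded: List[int] = []
    failures = 0
    for index, needed in enumerate(failures_before_success):
        remaining = backoff_limit - failures  # failures the Job can still absorb
        if needed > remaining:
            # The shared budget is exhausted, so the whole Job fails and
            # later indexes never get to finish.
            return {"job_failed": True, "succeeded": succeeded}
        failures += int(needed)
        succeeded.append(index)
    return {"job_failed": False, "succeeded": succeeded}


def run_per_index(failures_before_success: Sequence[float], backoff_limit_per_index: int) -> Dict[str, List[int]]:
    """Toy model of per-index budgets: each index is judged against its own retry limit."""
    succeeded: List[int] = []
    failed: List[int] = []
    for index, needed in enumerate(failures_before_success):
        if needed > backoff_limit_per_index:
            failed.append(index)  # only this index is marked as failed
        else:
            succeeded.append(index)
    return {"succeeded": succeeded, "failed": failed}


if __name__ == "__main__":
    # Index 1 always fails; index 2 fails once and then succeeds.
    workload = [0, math.inf, 1, 0]
    print(run_global(workload, backoff_limit=3))
    print(run_per_index(workload, backoff_limit_per_index=1))
```

In this model the always-failing index 1 takes the whole Job down under the shared budget, while under per-index budgets only index 1 is marked as failed and the remaining indexes complete.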
Additionally, the spec.maxFailedIndexes field caps the number of indexes that may be marked as failed. Once that cap is exceeded, Kubernetes terminates the entire Job instead of continuing to run the remaining indexes, which avoids spending resources on a Job that is already failing broadly. Furthermore, the FailIndex action in the Pod Failure Policy lets users define conditions, such as a particular container exit code, under which a specific index is marked as failed immediately, without using up its remaining retries.
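As a rough sketch of how these fields fit together, the Python snippet below builds a Job manifest as a plain dictionary and prints it as JSON, which kubectl also accepts. The Job name, container name, image, exit code 42, and the helper build_indexed_job are placeholders chosen for illustration; only the field names under spec come from the Job API.

```python
import json
from typing import Any, Dict


def build_indexed_job(name: str, image: str, completions: int,
                      backoff_limit_per_index: int, max_failed_indexes: int,
                      fail_index_exit_code: int) -> Dict[str, Any]:
    """Return a batch/v1 Job manifest that uses per-index retry budgets."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,
            "parallelism": completions,
            # Per-index backoff only applies to Indexed Jobs.
            "completionMode": "Indexed",
            # Retries allowed for each index before that index is marked failed.
            "backoffLimitPerIndex": backoff_limit_per_index,
            # The whole Job is terminated once more indexes than this have failed.
            "maxFailedIndexes": max_failed_indexes,
            "podFailurePolicy": {
                "rules": [
                    {
                        # Fail the index right away on this exit code,
                        # skipping any retries it has left.
                        "action": "FailIndex",
                        "onExitCodes": {"operator": "In", "values": [fail_index_exit_code]},
                    }
                ]
            },
            "template": {
                "spec": {
                    # A Pod failure policy requires restartPolicy: Never.
                    "restartPolicy": "Never",
                    "containers": [{"name": "main", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    job = build_indexed_job(
        name="example-indexed-job",                 # placeholder name
        image="registry.example.com/tests:latest",  # placeholder image
        completions=10,
        backoff_limit_per_index=1,
        max_failed_indexes=5,
        fail_index_exit_code=42,
    )
    print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f once the placeholder values are replaced with real ones.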
The feature is especially useful for parallel workloads such as integration tests, where the success or failure of each individual test needs to be tracked.
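One way to track per-test outcomes is to read the Job's status after it finishes. The article does not cover this, so treat the following as an assumption-laden sketch: it relies on the Job status fields status.completedIndexes and status.failedIndexes, which report indexes as a compressed string such as "1,3-5,7". The helper parse_indexes and the sample strings are hypothetical.

```python
from typing import List


def parse_indexes(compressed: str) -> List[int]:
    """Expand a compressed index list such as "1,3-5,7" into [1, 3, 4, 5, 7]."""
    indexes: List[int] = []
    for part in compressed.split(","):
        part = part.strip()
        if not part:
            continue  # an empty string means no indexes
        if "-" in part:
            # A range "first-last" is inclusive on both ends.
            first, last = part.split("-", 1)
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(part))
    return indexes


if __name__ == "__main__":
    # Sample values only; real ones would come from the Job's status.
    completed = parse_indexes("0,2-4")
    failed = parse_indexes("1")
    print("passed tests:", completed)
    print("failed tests:", failed)
```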
Developers who want to contribute to future enhancements can connect with the Kubernetes community through its Slack channel and take part in the regular meetings. For further information, see the official documentation on Backoff Limit Per Index and on how it interacts with the Pod Failure Policy.
For more details, visit the official blog post.