About Me

I am an MCSE in Data Management and Analytics, specializing in MS SQL Server, and an MCP in Azure. I have over 13 years of experience in the IT industry, with expertise in data management, Azure Cloud, data center migration, infrastructure architecture planning, virtualization, and automation. Contact me if you are looking for guidance on automating your infrastructure provisioning with Terraform. I write this blog mainly as a place to store my own experiences for future reference, but hopefully it can help others along the way. Thanks.

Understanding Pod Disruption Budgets and their impact on AKS Cluster Upgrades

An AKS cluster upgrade failed with the error described below.

Pod eviction operations have failed with Eviction Failure errors. Drain operations are performed when there is a need to safely evict pods from a specific node during operations such as:

  • Draining a node for repair or upgrade.
  • Cluster autoscaler removing a node from a node pool.
  • Deleting a node pool.

During cluster or agent pool upgrades, before a node can be upgraded, its workloads are scheduled onto another node to maintain workload availability.

Drain marks a node as going out of service and as such all of the pods on the node are evicted. The operation periodically retries all failed requests until all pods on the node are terminated, or until a timeout is reached. Drain failures generally indicate a timeout was reached and the pods were not able to be evicted from the node within the timeout period.
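The retry-until-timeout behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the actual AKS or kubectl drain implementation: the names `drain_node` and `try_evict`, and the retry interval, are assumptions made for the example.

```python
import time
from typing import Callable, List, Sequence


def drain_node(
    pods: Sequence[str],
    try_evict: Callable[[str], bool],
    timeout_s: float,
    retry_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Retry evictions until every pod is gone or the timeout is reached.

    try_evict is a hypothetical callback that returns True when the
    eviction of a pod was accepted. The return value is the list of pods
    that could not be evicted; an empty list means the drain succeeded.
    """
    pending = list(pods)
    deadline = clock() + timeout_s
    while pending:
        # Keep only the pods whose eviction was rejected this round.
        pending = [pod for pod in pending if not try_evict(pod)]
        if not pending:
            return []
        if clock() >= deadline:
            return pending  # drain failure: timeout reached
        sleep(retry_interval_s)
    return pending
```

A non-empty result corresponds to the drain failure described above: the timeout was reached while some pods were still on the node.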

The main reasons for a drain failure would be:

  • Drain simply took longer than expected.
  • Incorrectly configured pod disruption budgets.
  • Slow disk attach or detach.



Q:- What is a pod disruption budget and how does it impact an AKS cluster upgrade:-

Ans:-

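A Pod Disruption Budget (PDB) is a Kubernetes feature that lets you specify the minimum number of pods of an application that must stay available (or the maximum number that may be unavailable) during a voluntary disruption, such as a node drain for an upgrade, a repair, or a scale-down. PDBs are enforced through the eviction API, so they do not protect against involuntary disruptions such as a node crash or a network outage, and they do not govern a Deployment's own rolling update. By setting a PDB, you can keep your application available and responsive during these events, while minimizing the risk of data loss or downtime.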

In an AKS cluster upgrade scenario, PDBs can impact the upgrade process in a few ways. When you upgrade an AKS cluster, the nodes are upgraded one by one, and during the upgrade, the nodes are cordoned off, meaning no new pods can be scheduled on them, and existing pods are gracefully evicted and rescheduled on other nodes.

Here are some ways PDBs can impact AKS cluster upgrades:

Ensuring Availability: If you have a PDB set for a deployment or a StatefulSet, the AKS upgrade process will ensure that the minimum number of pods specified in the PDB is maintained during the upgrade. This ensures that the application remains available during the upgrade, even if some of the nodes are cordoned off or unavailable.

Upgrade Speed: If you have a strict PDB set, the upgrade process may take longer to complete, as the AKS cluster upgrade process will wait until the minimum number of pods specified in the PDB are available on the new nodes before moving on to the next node.

Risk Mitigation: If you do not have a PDB set, or if its minimum is set too low, nothing stops a drain from evicting most or all replicas of an application at the same time, which can result in data loss or downtime.

Overall, PDBs play an important role in ensuring the availability and stability of applications during AKS cluster upgrades. By setting a PDB, you can minimize the impact of node disruptions and ensure that your application remains available and responsive during the upgrade process.

Q:- How to check which pod disruption budgets are set in your AKS cluster before an upgrade:-

Ans:-

kubectl get pdb --all-namespaces

output:-

NAMESPACE       NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
calico-system   calico-typha         N/A             1                 1                     11d
ingress-basic   my-nginx-pdb         2               N/A               0                     6d1h
kube-system     coredns-pdb          1               N/A               1                     11d
kube-system     konnectivity-agent   1               N/A               1                     11d
kube-system     metrics-server-pdb   1               N/A               1                     11d
rak             my-app-pdb           3               N/A               0                     6d2h

Explanation:-

The output is showing the Pod Disruption Budgets (PDBs) for various namespaces in the AKS cluster. Each row in the output corresponds to a single PDB and contains the following information:

NAMESPACE: The namespace in which the PDB is defined.

NAME: The name of the PDB.

MIN AVAILABLE: The minimum number of pods that must stay available during a disruption. N/A means the PDB is defined with MAX UNAVAILABLE instead.

MAX UNAVAILABLE: The maximum number of pods that may be unavailable during a disruption. N/A means the PDB is defined with MIN AVAILABLE instead.

ALLOWED DISRUPTIONS: The number of pods that can be evicted right now without violating the PDB. It is calculated from the PDB definition and the current number of healthy pods; it is not a value you set directly.

AGE: The age of the PDB, i.e., the amount of time since it was created.

Let's take an example of the rak namespace and the my-app-pdb PDB. 

It has a minimum availability of 3, which means that at least 3 replicas of the application must stay available during any voluntary disruption. MAX UNAVAILABLE shows N/A because this PDB is defined with a minimum available count rather than a maximum unavailable count; N/A does not mean 0.

ALLOWED DISRUPTIONS is 0, which means that no pod covered by this PDB can be evicted right now. This typically happens when the number of healthy replicas equals the minimum available value, for example 3 running replicas with a minimum of 3.

This means that if the AKS cluster upgrade needs to drain a node that runs one of these pods, the eviction request is rejected, because evicting the pod would leave fewer than 3 replicas available.

The drain keeps retrying the eviction until the timeout is reached. If ALLOWED DISRUPTIONS is still 0 at that point, the node is not drained and the upgrade fails.
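As a worked example, the ALLOWED DISRUPTIONS value for my-app-pdb can be reproduced with a small Python sketch. It assumes integer values only (percentages are left out), and it assumes the application runs exactly 3 healthy replicas, which the `kubectl get pdb` output does not show. The function name `allowed_disruptions` is made up for this illustration.

```python
from typing import Optional


def allowed_disruptions(
    current_healthy: int,
    expected_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Return how many pods can be evicted right now without violating the PDB.

    Exactly one of min_available or max_unavailable must be given,
    mirroring the fact that a PDB uses one field or the other.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = max(expected_pods - max_unavailable, 0)
    # Never negative: a PDB that is already violated simply allows 0 evictions.
    return max(current_healthy - desired_healthy, 0)


# my-app-pdb: MIN AVAILABLE 3, assuming 3 healthy replicas (assumption).
print(allowed_disruptions(current_healthy=3, expected_pods=3, min_available=3))  # 0
# Same app with a minimum of 2: one pod may be evicted at a time.
print(allowed_disruptions(current_healthy=3, expected_pods=3, min_available=2))  # 1
```

With 3 healthy replicas and a minimum of 3 there is no room for an eviction, which is why the drain is blocked.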

For some PDBs, such as my-nginx-pdb and my-app-pdb, ALLOWED DISRUPTIONS is 0. This means that no pods can be safely evicted during maintenance or upgrade operations, because any eviction would violate the PDB constraints. The application stays protected, but the node it runs on cannot be drained.

During an AKS cluster upgrade, every pod eviction is checked against the PDBs. While ALLOWED DISRUPTIONS stays at 0, the drain cannot complete, and the upgrade will not proceed until the PDB is updated or removed, or until more healthy replicas become available.

Hence the upgrade fails for as long as ALLOWED DISRUPTIONS remains 0.
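To spot such PDBs before starting an upgrade, the JSON form of the same `kubectl get pdb` command can be filtered on `status.disruptionsAllowed`. The following is a minimal sketch; the helper names `blocking_pdbs` and `fetch_pdbs_json` are made up for this post, and the sample data is a trimmed-down stand-in for real cluster output.

```python
import json
import subprocess
from typing import List, Tuple


def blocking_pdbs(pdb_list_json: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) for every PDB that currently allows 0 disruptions."""
    items = json.loads(pdb_list_json).get("items", [])
    blocked = []
    for item in items:
        meta = item.get("metadata", {})
        status = item.get("status", {})
        # A missing status is treated as 0, i.e. as blocking, to stay on the safe side.
        if status.get("disruptionsAllowed", 0) == 0:
            blocked.append((meta.get("namespace", ""), meta.get("name", "")))
    return blocked


def fetch_pdbs_json() -> str:
    """Fetch all PDBs as JSON; requires kubectl and access to the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "pdb", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Trimmed sample resembling three of the PDBs listed earlier.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "kube-system", "name": "coredns-pdb"},
     "status": {"disruptionsAllowed": 1}},
    {"metadata": {"namespace": "ingress-basic", "name": "my-nginx-pdb"},
     "status": {"disruptionsAllowed": 0}},
    {"metadata": {"namespace": "rak", "name": "my-app-pdb"},
     "status": {"disruptionsAllowed": 0}},
]})
print(blocking_pdbs(sample))
```

Against a live cluster, `blocking_pdbs(fetch_pdbs_json())` would return the PDBs that need attention before the upgrade.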

Similarly, for the ingress-basic namespace and the my-nginx-pdb PDB, the minimum availability is set to 2. MAX UNAVAILABLE shows N/A because the PDB is defined with a minimum available count, not because 0 pods may be unavailable. ALLOWED DISRUPTIONS is 0, which means that no pod covered by this PDB can be evicted right now. This PDB ensures that during an upgrade, at least 2 replicas of the my-nginx application are running at all times.

If no additional healthy replicas become available, the drain times out and the upgrade fails.

In summary, PDBs are used to keep applications highly available during node maintenance, upgrades, and other voluntary disruptions. By specifying either the minimum number of replicas that must be available or the maximum number that may be unavailable, a PDB limits how many pods can be evicted at the same time.

Q:- What to do if ALLOWED DISRUPTIONS is 0 for PDBs in your AKS cluster and you are trying to upgrade the AKS cluster

Ans:- The upgrade of the AKS cluster will fail. The reason is:

The AKS cluster upgrade process honors the PDBs defined in the cluster. If draining a node would violate any PDB, the evictions are rejected, and the upgrade fails once the drain times out. The upgrade process maintains the minimum number of available replicas specified in the PDB while updating the AKS cluster.

In the output provided, most of the PDBs have a minimum available count of 1 or more, while calico-typha shows "N/A" for MIN AVAILABLE. "N/A" does not imply a default of one replica; it means the PDB is defined with MAX UNAVAILABLE instead. In the case of calico-typha, at most 1 pod may be unavailable at a time.

How to resolve this error:-

1. Before the upgrade, take a backup of each PDB whose ALLOWED DISRUPTIONS is 0

    kubectl get pdb my-nginx-pdb -n ingress-basic -o yaml > my-nginx-pdb.yaml
    kubectl get pdb my-app-pdb -n rak -o yaml > my-app-pdb.yaml

2. Delete the PDBs

    kubectl delete pdb my-nginx-pdb -n ingress-basic
    kubectl delete pdb my-app-pdb -n rak

3. Perform the version upgrade of AKS

    C:\Users\kusha>az aks upgrade --resource-group RGP-USE-AKS-DV --name AKS-USE-AKS-DEV --kubernetes-version 1.26.0

    Kubernetes may be unavailable during cluster upgrades.

    Are you sure you want to perform this operation? (y/N): y

    Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version 1.26.0. Continue? (y/N): y

4. Restore the PDBs

    kubectl apply -f my-nginx-pdb.yaml -n ingress-basic
    kubectl apply -f my-app-pdb.yaml -n rak

That's it.
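The four steps can also be scripted so that the PDBs are restored even when the upgrade itself fails. The sketch below is one possible way to do it, not an official procedure: the function names `run_cmd` and `upgrade_with_pdb_workaround` and the `backup_dir` parameter are assumptions for this example, and `--yes` is used to skip the confirmation prompts shown above. On Windows the Azure CLI may need to be invoked as `az.cmd`.

```python
import os
import subprocess
from typing import Callable, List, Sequence, Tuple


def run_cmd(argv: List[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def upgrade_with_pdb_workaround(
    pdbs: Sequence[Tuple[str, str]],  # (namespace, name) pairs
    resource_group: str,
    cluster: str,
    version: str,
    backup_dir: str = ".",
    run: Callable[[List[str]], str] = run_cmd,
) -> None:
    # Step 1: back up every PDB to <name>.yaml before touching anything.
    backups = []
    for namespace, name in pdbs:
        manifest = run(["kubectl", "get", "pdb", name, "-n", namespace, "-o", "yaml"])
        path = os.path.join(backup_dir, f"{name}.yaml")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(manifest)
        backups.append((namespace, path))
    try:
        # Step 2: delete the blocking PDBs.
        for namespace, name in pdbs:
            run(["kubectl", "delete", "pdb", name, "-n", namespace])
        # Step 3: upgrade control plane and node pools.
        run(["az", "aks", "upgrade", "--resource-group", resource_group,
             "--name", cluster, "--kubernetes-version", version, "--yes"])
    finally:
        # Step 4: restore the PDBs whether or not the upgrade succeeded.
        for namespace, path in backups:
            run(["kubectl", "apply", "-f", path, "-n", namespace])
```

For the cluster in this post the call would look like `upgrade_with_pdb_workaround([("ingress-basic", "my-nginx-pdb"), ("rak", "my-app-pdb")], "RGP-USE-AKS-DV", "AKS-USE-AKS-DEV", "1.26.0")`. The `try`/`finally` is the main design choice: it avoids leaving the applications without a PDB after a failed upgrade.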


Some objective questions for practice:-



**Pod Disruption Budget and AKS Cluster Upgrade Quiz**


1. What is the primary purpose of a Pod Disruption Budget (PDB) in Kubernetes?

   - A. Managing networking configurations during upgrades

   - B. Ensuring availability of pods during disruptions

   - C. Handling persistent storage for applications

   - D. Facilitating communication between nodes


2. How does a Pod Disruption Budget impact the AKS cluster upgrade process?

   - A. Accelerates the upgrade speed

   - B. Has no impact on the upgrade process

   - C. Ensures minimum pod availability during the upgrade

   - D. Manages storage configurations


3. During an AKS cluster upgrade, why might a strict Pod Disruption Budget cause the upgrade process to take longer?

   - A. Due to increased network latency

   - B. Ensuring availability of minimum specified pods

   - C. Facilitating external access to services

   - D. Managing persistent storage volumes


4. What is the role of a Pod Disruption Budget when it comes to risk mitigation during an AKS cluster upgrade?

   - A. Speeding up the upgrade process

   - B. Ensuring maximum pod unavailability

   - C. Minimizing the risk of data loss or downtime

   - D. Facilitating communication between nodes


5. How can you check the number of Pod Disruption Budgets set in your AKS cluster before an upgrade?

   - A. `kubectl get pdbs -A`

   - B. `az aks show-pdbs --resource-group <resource-group> --name <aks-cluster-name>`

   - C. `kubectl get pdb --all-namespaces`

   - D. `aks-pdb check`


6. What information does the output of `kubectl get pdb --all-namespaces` provide about Pod Disruption Budgets in an AKS cluster?

   - A. Network latency details

   - B. Pod unavailability statistics

   - C. PDB constraints and age

   - D. Persistent storage usage


7. In the context of Pod Disruption Budgets, what does "ALLOWED DISRUPTIONS" represent in the output of `kubectl get pdb`?

   - A. Maximum allowed pod disruptions

   - B. The total number of allowed disruptions for the PDB

   - C. Minimum allowed pod disruptions

   - D. Disruptions caused by network failures


8. If the "ALLOWED DISRUPTIONS" for a Pod Disruption Budget is set to 0 during an AKS cluster upgrade, what is the likely impact?

   - A. No impact on the upgrade process

   - B. The upgrade will proceed without considering the PDB

   - C. The upgrade will fail if any pod disruption is required

   - D. The upgrade speed will increase


9. How can you resolve an upgrade failure due to a Pod Disruption Budget with "ALLOWED DISRUPTIONS" set to 0?

   - A. Increase the "ALLOWED DISRUPTIONS" value

   - B. Delete the PDB and perform the upgrade

   - C. Ignore the error and continue with the upgrade

   - D. Reconfigure the PDB during the upgrade


10. What information does the command `kubectl get pdb <pdb-name> -n <namespace> -o yaml` provide, and how is it useful during an AKS upgrade?

   - A. Detailed pod resource utilization

   - B. PDB constraints and age

   - C. Configuration backup of the specified PDB

   - D. Network latency statistics


---


**Answers:**

1. B
2. C
3. B
4. C
5. C
6. C
7. B
8. C
9. B
10. C


Q:- Examine the scenario where a Pod Disruption Budget (PDB) has "ALLOWED DISRUPTIONS" set to a value greater than 0. What impact does this have on the AKS cluster upgrade process, and under what circumstances might it be beneficial?

Ans:- 

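In a scenario where a Pod Disruption Budget (PDB) shows "ALLOWED DISRUPTIONS" greater than 0 during an AKS cluster upgrade, the PDB currently leaves room for evictions. Note that "ALLOWED DISRUPTIONS" is not set directly; it is calculated from the PDB's minimum available or maximum unavailable value and the number of healthy pods. Let's examine the impact and potential benefits: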


**Impact on AKS Cluster Upgrade Process:**


1. **Gradual Pod Disruptions:**

   - With "ALLOWED DISRUPTIONS" greater than 0, the AKS upgrade process can gradually disrupt a specified number of pod replicas at a time.

   - Pods are evicted in a controlled manner, allowing the upgrade to progress without waiting for all replicas to be available simultaneously.


2. **Faster Upgrade Process:**

   - The flexibility provided by allowing a certain number of disruptions facilitates a potentially faster upgrade process compared to a PDB with "ALLOWED DISRUPTIONS" set strictly to 0.

   - This is beneficial when there's a need to complete the upgrade within a reasonable timeframe.


3. **Reduced Downtime Risk:**

   - Allowing some disruptions reduces the risk of prolonged downtime during the upgrade.

   - Applications may continue to function with a slightly reduced capacity, minimizing the impact on end-users.


4. **Optimized Resource Utilization:**

   - The AKS upgrade process can optimize resource utilization by evicting pods gradually, ensuring a balanced distribution of workload across available nodes.
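The difference between a blocked drain and a gradual one can be illustrated with a toy simulation. It is a deliberately simplified model, not how AKS schedules evictions: it assumes that every evicted pod is rescheduled and healthy again on another node before the next round, and the function name `rounds_to_drain` is made up for this example.

```python
from typing import Optional


def rounds_to_drain(
    pods_on_node: int, healthy_replicas: int, min_available: int
) -> Optional[int]:
    """Return how many eviction rounds a node drain needs, or None if it is blocked.

    Simplifying assumption: evicted pods become healthy elsewhere before
    the next round, so healthy_replicas is the same at the start of each round.
    """
    remaining = pods_on_node
    rounds = 0
    while remaining > 0:
        allowed = max(healthy_replicas - min_available, 0)
        if allowed == 0:
            return None  # the PDB blocks every eviction; the drain cannot finish
        remaining -= min(allowed, remaining)
        rounds += 1
    return rounds


# A node holding 2 of an application's 3 replicas:
print(rounds_to_drain(2, 3, min_available=3))  # None
print(rounds_to_drain(2, 3, min_available=2))  # 2
print(rounds_to_drain(2, 3, min_available=1))  # 1
```

The stricter the PDB, the more rounds the drain needs; with no allowed disruptions it never completes.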


**Circumstances in Which it Might be Beneficial:**


1. **Balancing Speed and Availability:**

   - When there's a need to balance the speed of the upgrade with maintaining a reasonable level of availability, configuring the PDB so that "ALLOWED DISRUPTIONS" is greater than 0 is beneficial.


2. **Resource Constraints:**

   - In scenarios where resource constraints or node capacities may limit the simultaneous rescheduling of pods, allowing controlled disruptions can help manage these limitations.


3. **Applications Tolerant to Disruptions:**

   - For non-critical applications that can tolerate temporary disruptions, a PDB that leaves a higher "ALLOWED DISRUPTIONS" value can expedite the upgrade without compromising the overall stability.


4. **Phased Rollouts:**

   - When dealing with a large number of replicas or diverse applications, allowing disruptions in phases can be strategically advantageous, preventing potential bottlenecks.


5. **Customized Upgrade Strategies:**

   - For specific applications or services with unique requirements, tuning the PDB per application provides the flexibility to tailor upgrade strategies according to their specific needs.


In summary, a PDB that leaves "ALLOWED DISRUPTIONS" greater than 0 during an AKS cluster upgrade introduces flexibility, allowing for a more balanced trade-off between upgrade speed and maintaining a certain level of availability for applications. This approach is particularly useful in scenarios where a strict constraint on disruptions is not critical, and a faster, more gradual upgrade is desired.

Thanks for reading... 
