About Me

My photo
I am an MCSE in Data Management and Analytics, specializing in MS SQL Server, and an MCP in Azure. With over 19+ years of experience in the IT industry, I bring expertise in data management, Azure Cloud, Data Center Migration, Infrastructure Architecture planning, as well as Virtualization and automation. I have a deep passion for driving innovation through infrastructure automation, particularly using Terraform for efficient provisioning. If you're looking for guidance on automating your infrastructure or have questions about Azure, SQL Server, or cloud migration, feel free to reach out. I often write to capture my own experiences and insights for future reference, but I hope that sharing these experiences through my blog will help others on their journey as well. Thank you for reading!

Troubleshooting AlwaysOn Availability Group Access Issues in SQL Server: A Comprehensive Guide

 

Introduction:

Managing SQL Server instances with AlwaysOn Availability Groups can sometimes lead to frustrating access issues. One such error is Error 983, where users encounter problems accessing a database because the database replica is not in the PRIMARY or SECONDARY role. This can block access to the database and prevent the expansion of the AlwaysOn High Availability folder.

In this blog post, we will walk through the steps to troubleshoot and resolve this issue, as well as review important concepts related to clustering and availability groups in SQL Server. By the end, you will have a deeper understanding of how to manage AlwaysOn Availability Groups effectively and how to troubleshoot issues such as Cluster Service failures and DCOM communication errors.


Table of Contents:

  1. Understanding AlwaysOn Availability Groups in SQL Server
  2. What is Error 983?
  3. Step-by-Step Troubleshooting Guide
    • Step 1: Check Cluster Core Group Resources
    • Step 2: Restart Cluster Services
    • Step 3: Resolve DCOM Error 10028
  4. Memory Techniques and Mnemonics
    • Mnemonic for Troubleshooting AlwaysOn Access Issues
    • Story-Based Memory Technique
  5. Use Case: Resolving Access Issues in a Production SQL Server Cluster
  6. Conclusion

1. Understanding AlwaysOn Availability Groups in SQL Server

AlwaysOn Availability Groups provide high availability and disaster recovery for SQL Server databases by replicating data between multiple servers (known as replicas). Availability Groups contain multiple databases that can failover as a group. SQL Server replicas can either be in PRIMARY (active) or SECONDARY (replica) roles.

When one replica fails or is unreachable, databases may not be accessible unless the replica is switched to a PRIMARY or SECONDARY role. This is where issues like Error 983 can arise.


2. What is Error 983?

Error 983 in SQL Server occurs when you try to access a database in an AlwaysOn Availability Group and the database replica is neither in the PRIMARY nor SECONDARY role. Essentially, this error blocks access to the database because SQL Server only allows connections to databases that are in one of these roles.

Example Error Message:

vbnet

Unable to access availability database 'AdventureWorks2014' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role.

This is typically seen when trying to expand the AlwaysOn High Availability folder in SQL Server Management Studio (SSMS).


3. Step-by-Step Troubleshooting Guide

Step 1: Check Cluster Core Group Resources

  1. Open Cluster Administrator on the node where the error is occurring.
  2. Ensure that you are connecting to the correct cluster name.
  3. Check the status of Cluster Core Group Resources.
    • If the Cluster Core Group Resources are down, attempt to bring them online.
Powershell

# Command to check cluster resources in PowerShell Get-ClusterResource

Step 2: Restart Cluster Services

If the cluster core resources don't start immediately, review the cluster event logs for any error messages related to Cluster Resource Hosting Subsystem (RHS), which is responsible for resource management.

  1. Open System Event Viewer to check for the following error:

    • Error No: 10028: "DCOM was unable to communicate with the computer Win2k12Cluster.Adven.com..."
  2. Restart the Cluster Service on all nodes:

    • Open PowerShell and use the following command:
bash

Restart-Service clussvc

If restarting the Cluster Service fails on certain nodes, you may need to investigate further.

Step 3: Resolve DCOM Error 10028

  1. Open Windows Firewall with Advanced Security (Run wf.msc).

  2. In Inbound Rules, enable the rule for COM+ Network Access (DCOM In).

    • Ensure the protocol is TCP and the port is 135.
  3. Kill the clussvc.exe process from the Task Manager on the affected node.

Repeat these steps on each node in the cluster to ensure that the Cluster Core Group Resources are brought back online.


4. Memory Techniques and Mnemonics

Mnemonic for Troubleshooting AlwaysOn Access Issues:

Use the mnemonic “CRASH” to remember the critical steps when troubleshooting AlwaysOn issues:

  • C: Check Cluster Core Resources.
  • R: Restart Cluster Services.
  • A: Address DCOM Errors (Error 10028).
  • S: System Event Logs (check for clues).
  • H: Handle clussvc process (kill and restart).

Story-Based Memory Technique:

Imagine you are managing a team of servers that need to work together like teammates in a relay race (cluster). Sometimes, one teammate doesn’t grab the baton (PRIMARY or SECONDARY role), which causes the whole race to stall. To fix this:

  1. Check the baton (check Cluster Core Resources).
  2. Reorganize the team by restarting the race (restart the cluster service).
  3. If one teammate can't run, check for miscommunication in the team (DCOM errors).
  4. Finally, remove the tired runner (kill clussvc.exe) and let them rest before the next race.

This story helps reinforce the sequence of troubleshooting steps and the relationship between cluster nodes and services.


5. Use Case: Resolving Access Issues in a Production SQL Server Cluster

Scenario:

In a production environment, the SQL Server’s AlwaysOn Availability Group managing the company’s primary database, AdventureWorks2014, becomes inaccessible due to Error 983. This error prevents the database from being accessed because the replica is not in the PRIMARY or SECONDARY role.

Solution:

  1. The administrator first checks the status of Cluster Core Group Resources and notices that they are offline.
  2. By restarting the Cluster Service on all nodes, they bring the Cluster Core Group back online.
  3. Next, they troubleshoot DCOM Error 10028 by adjusting the firewall settings and ensuring proper communication between nodes.
  4. Finally, by killing and restarting the clussvc.exe process on affected nodes, the administrator restores full access to the database and the AlwaysOn Availability Group.

By following these steps, access to AdventureWorks2014 is restored without any data loss or significant downtime.


6. Conclusion

Managing AlwaysOn Availability Groups in SQL Server requires a solid understanding of both SQL Server internals and Windows clustering. Error 983, indicating that a database replica is neither in the PRIMARY nor SECONDARY role, can be a critical issue that disrupts database access. However, with a systematic approach—checking Cluster Core Group Resources, restarting Cluster Services, and resolving DCOM errors—you can quickly resolve these issues.

This guide provides a clear step-by-step process for troubleshooting AlwaysOn Availability Groups, leveraging both SQL Server Management Studio and Azure tools. By using practical commands and insights, you can ensure your production SQL Server environments remain accessible and highly available.

No comments: