About Me

My photo
I am an MCSE in Data Management and Analytics, specializing in MS SQL Server, and an MCP in Azure. With over 19+ years of experience in the IT industry, I bring expertise in data management, Azure Cloud, Data Center Migration, Infrastructure Architecture planning, as well as Virtualization and automation. I have a deep passion for driving innovation through infrastructure automation, particularly using Terraform for efficient provisioning. If you're looking for guidance on automating your infrastructure or have questions about Azure, SQL Server, or cloud migration, feel free to reach out. I often write to capture my own experiences and insights for future reference, but I hope that sharing these experiences through my blog will help others on their journey as well. Thank you for reading!

# MCP Server Monitoring and Observability Guide



This guide covers monitoring, logging, and observability for the MCP Server deployment.

## Table of Contents

1. [Azure Monitor Integration](#azure-monitor-integration)
2. [Log Analytics](#log-analytics)
3. [Application Insights](#application-insights)
4. [Alerts and Notifications](#alerts-and-notifications)
5. [Dashboards](#dashboards)
6. [Metrics](#metrics)
7. [Troubleshooting](#troubleshooting)

## Azure Monitor Integration

The MCP Server is fully integrated with Azure Monitor for comprehensive observability.

### Key Components

- **Log Analytics Workspace**: Centralized log storage
- **Application Insights**: Application performance monitoring
- **Azure Monitor Metrics**: Resource-level metrics
- **Container App Logs**: Application and system logs

## Log Analytics

### Accessing Logs

1. Navigate to Azure Portal
2. Go to your Log Analytics Workspace
3. Select "Logs" from the left menu

### Common Queries

#### View All Application Logs
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
| take 100
```

#### Search for Errors
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "error" or Log_s contains "ERROR"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```

#### Authentication Failures
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "401" or Log_s contains "Unauthorized"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```

#### User Activity
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "User authenticated"
| extend UserId = extract("userId\":\"([^\"]+)", 1, Log_s)
| summarize Count = count() by UserId, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```

#### Performance Metrics
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "response time" or Log_s contains "duration"
| extend ResponseTime = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| summarize avg(ResponseTime), max(ResponseTime), min(ResponseTime) by bin(TimeGenerated, 5m)
```

#### Database Query Performance
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "database" and Log_s contains "query"
| extend QueryDuration = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| summarize avg(QueryDuration), count() by bin(TimeGenerated, 5m)
```

## Application Insights

### Key Metrics

1. **Request Rate**: Requests per second
2. **Response Time**: Average response time
3. **Failure Rate**: Failed requests percentage
4. **Dependencies**: External service calls (database, etc.)

### Viewing Metrics

Navigate to: **Application Insights > Investigate > Performance**

### Custom Metrics

The MCP Server emits custom metrics:

- `mcp.connections.active`: Active MCP connections
- `mcp.tools.calls`: Tool call count
- `mcp.auth.success`: Successful authentications
- `mcp.auth.failed`: Failed authentications

## Alerts and Notifications

### Recommended Alerts

#### High Error Rate
```json
{
  "name": "High Error Rate",
  "description": "Alert when error rate exceeds 5%",
  "condition": {
    "metric": "requests/failed",
    "threshold": 5,
    "timeAggregation": "Average",
    "windowSize": "PT5M"
  },
  "actions": [
    {
      "actionGroup": "ops-team",
      "emailSubject": "MCP Server High Error Rate"
    }
  ]
}
```

#### High Response Time
```json
{
  "name": "High Response Time",
  "description": "Alert when average response time exceeds 2 seconds",
  "condition": {
    "metric": "requests/duration",
    "threshold": 2000,
    "timeAggregation": "Average",
    "windowSize": "PT5M"
  }
}
```

#### Authentication Failures
```json
{
  "name": "Authentication Failures",
  "description": "Alert on repeated authentication failures",
  "condition": {
    "query": "ContainerAppConsoleLogs_CL | where Log_s contains 'Authentication failed' | summarize count()",
    "threshold": 10,
    "timeAggregation": "Total",
    "windowSize": "PT5M"
  }
}
```

#### Low Availability
```json
{
  "name": "Container App Unhealthy",
  "description": "Alert when health check fails",
  "condition": {
    "metric": "healthcheck/status",
    "threshold": 1,
    "operator": "LessThan",
    "windowSize": "PT5M"
  }
}
```

### Creating Alerts via Azure CLI

```bash
# Create action group
az monitor action-group create \
  --name ops-team \
  --resource-group rg-mcp-server-prod \
  --short-name ops \
  --email admin admin@yourcompany.com

# Create metric alert
az monitor metrics alert create \
  --name high-error-rate \
  --resource-group rg-mcp-server-prod \
  --scopes /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/containerApps/ca-mcpserver-prod \
  --condition "total requests/failed > 5" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action ops-team
```

## Dashboards

### Create Custom Dashboard

1. Navigate to Azure Portal
2. Select "Dashboard" > "New dashboard"
3. Add tiles for:
   - Request count
   - Response time
   - Error rate
   - Active connections
   - CPU/Memory usage

### Sample Dashboard JSON

```json
{
  "lenses": {
    "0": {
      "order": 0,
      "parts": {
        "0": {
          "position": {
            "x": 0,
            "y": 0,
            "colSpan": 6,
            "rowSpan": 4
          },
          "metadata": {
            "type": "Extension/HubsExtension/PartType/MonitorChartPart",
            "settings": {
              "title": "Request Rate",
              "visualization": {
                "chartType": "Line",
                "legendVisualization": {
                  "isVisible": true
                }
              }
            }
          }
        }
      }
    }
  }
}
```

## Metrics

### Container App Metrics

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Replica Count | Number of active replicas | Min: 2, Max: 10 |
| CPU Usage | CPU percentage | < 80% |
| Memory Usage | Memory percentage | < 80% |
| Request Count | Total requests | Monitor trends |
| Request Duration | Average response time | < 2 seconds |

### Database Metrics

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Connections | Active connections | < 80% of max |
| CPU Usage | Database CPU | < 80% |
| Storage | Used storage | < 80% of quota |
| Query Duration | Average query time | < 500ms |

### Application Gateway Metrics

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Throughput | Bytes/second | Monitor trends |
| Failed Requests | Count of 5xx errors | < 1% |
| Backend Response Time | Time to first byte | < 1 second |
| Healthy Host Count | Number of healthy backends | > 0 |

## Troubleshooting

### Common Issues

#### 1. High Response Time

**Symptoms**: Slow API responses

**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| extend Duration = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| where Duration > 2000
| project TimeGenerated, Log_s
```

**Solutions**:
- Scale up replicas
- Optimize database queries
- Check network latency
- Review application code

#### 2. Authentication Failures

**Symptoms**: 401 errors

**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where Log_s contains "Token verification failed"
| project TimeGenerated, Log_s
```

**Solutions**:
- Verify Entra ID configuration
- Check token expiration
- Validate audience/issuer settings
- Review user permissions

#### 3. Database Connection Issues

**Symptoms**: Database errors

**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where Log_s contains "PostgreSQL" and Log_s contains "error"
| project TimeGenerated, Log_s
```

**Solutions**:
- Check connection string
- Verify firewall rules
- Check connection pool size
- Review database health

#### 4. Memory Leaks

**Symptoms**: Increasing memory usage

**Investigation**:
- Check container app metrics
- Review memory usage trends
- Look for unclosed connections

**Solutions**:
- Restart container app
- Review application code
- Implement connection pooling
- Add memory limits

### Health Check Endpoints

#### Application Health
```bash
curl https://mcp.yourcompany.com/health
```

Expected Response:
```json
{
  "status": "healthy",
  "timestamp": "2025-12-09T10:00:00Z",
  "version": "1.0.0",
  "uptime": 86400
}
```

#### Readiness Check
```bash
curl https://mcp.yourcompany.com/ready
```

#### Metrics Endpoint
```bash
curl -H "Authorization: Bearer $TOKEN" https://mcp.yourcompany.com/metrics
```

## Log Retention

- **Container App Logs**: 30 days (configurable)
- **Log Analytics**: 30 days (configurable up to 730 days)
- **Application Insights**: 90 days default
- **Archived Logs**: Configure export to Storage Account for long-term retention

## Exporting Logs

### To Storage Account

```bash
az monitor diagnostic-settings create \
  --name export-to-storage \
  --resource /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/containerApps/ca-mcpserver-prod \
  --storage-account {storage-account-id} \
  --logs '[{"category":"ContainerAppConsoleLogs","enabled":true}]'
```

### To Event Hub

```bash
az monitor diagnostic-settings create \
  --name export-to-eventhub \
  --resource /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/containerApps/ca-mcpserver-prod \
  --event-hub {event-hub-name} \
  --event-hub-rule {auth-rule-id} \
  --logs '[{"category":"ContainerAppConsoleLogs","enabled":true}]'
```

## Best Practices

1. **Set up alerts early** - Don't wait for incidents
2. **Review logs regularly** - Weekly log reviews
3. **Monitor trends** - Look for patterns over time
4. **Document incidents** - Keep runbooks updated
5. **Test alerts** - Ensure notifications work
6. **Rotate credentials** - Regular security reviews
7. **Capacity planning** - Monitor growth trends
8. **Cost optimization** - Review unused resources

## Support

For monitoring issues:
- DevOps Team: devops@yourcompany.com
- Azure Support: https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade

No comments: