# MCP Server Monitoring and Observability Guide



This guide covers monitoring, logging, and observability for the MCP Server deployment.

## Table of Contents

1. [Azure Monitor Integration](#azure-monitor-integration)
2. [Log Analytics](#log-analytics)
3. [Application Insights](#application-insights)
4. [Alerts and Notifications](#alerts-and-notifications)
5. [Dashboards](#dashboards)
6. [Metrics](#metrics)
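7. [Troubleshooting](#troubleshooting)
8. [Log Retention](#log-retention)
9. [Exporting Logs](#exporting-logs)
10. [Best Practices](#best-practices)
11. [Support](#support)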

## Azure Monitor Integration

The MCP Server sends its logs, metrics, and application telemetry to Azure Monitor through the components listed below.

### Key Components

- **Log Analytics Workspace**: Centralized log storage
- **Application Insights**: Application performance monitoring
- **Azure Monitor Metrics**: Resource-level metrics
- **Container App Logs**: Application and system logs

## Log Analytics

### Accessing Logs

1. Navigate to Azure Portal
2. Go to your Log Analytics Workspace
3. Select "Logs" from the left menu

### Common Queries

#### View All Application Logs
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
| take 100
```

#### Search for Errors
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "error" // contains is case-insensitive, so this also matches "ERROR"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```

#### Authentication Failures
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "401" or Log_s contains "Unauthorized"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```

#### User Activity
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "User authenticated"
| extend UserId = extract("userId\":\"([^\"]+)", 1, Log_s)
| summarize Count = count() by UserId, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```
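
The `extract` call in the User Activity query assumes that each log line carries a JSON-style field of the form `"userId":"..."`. A minimal Python sketch of the same pattern, assuming that log shape (the sample line is hypothetical and the real format depends on the server's logger), can help check the regex before running it in Log Analytics:

```python
import re
from typing import Optional

# Same pattern as the KQL extract() call: capture the value after userId":"
USER_ID_PATTERN = re.compile(r'userId":"([^"]+)')


def extract_user_id(log_line: str) -> Optional[str]:
    """Return the captured userId, or None when the field is absent."""
    match = USER_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Hypothetical log line, used only for illustration.
sample = '{"level":"info","message":"User authenticated","userId":"u-123"}'
print(extract_user_id(sample))
```

The pattern does not tolerate whitespace after the colon, so a logger that writes `"userId": "u-123"` would produce no match in this sketch, and the KQL query would miss it in the same way.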

#### Performance Metrics
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "response time" or Log_s contains "duration"
| extend ResponseTime = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| summarize avg(ResponseTime), max(ResponseTime), min(ResponseTime) by bin(TimeGenerated, 5m)
```
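
To make the `bin(TimeGenerated, 5m)` aggregation concrete, the following Python sketch groups duration samples into 5-minute buckets and computes the same average, maximum, and minimum. The sample values are hypothetical, and the function names are illustrative rather than part of the MCP Server:

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

BIN_SECONDS = 300  # 5 minutes, matching bin(TimeGenerated, 5m)


def bin_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute bin."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BIN_SECONDS, tz=timezone.utc)


def summarize(samples):
    """Group (timestamp, duration_ms) pairs by bin and compute avg/max/min."""
    bins = defaultdict(list)
    for ts, duration in samples:
        bins[bin_start(ts)].append(duration)
    return {
        start: {"avg": mean(values), "max": max(values), "min": min(values)}
        for start, values in sorted(bins.items())
    }


# Hypothetical durations in milliseconds.
samples = [
    (datetime(2025, 12, 9, 10, 1, tzinfo=timezone.utc), 120.0),
    (datetime(2025, 12, 9, 10, 4, tzinfo=timezone.utc), 480.0),
    (datetime(2025, 12, 9, 10, 6, tzinfo=timezone.utc), 300.0),
]
for start, stats in summarize(samples).items():
    print(start.isoformat(), stats)
```

The first two samples fall into the 10:00 bin and the third into the 10:05 bin, which mirrors how the query emits one row per 5-minute interval.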

#### Database Query Performance
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "database" and Log_s contains "query"
| extend QueryDuration = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| summarize avg(QueryDuration), count() by bin(TimeGenerated, 5m)
```

## Application Insights

### Key Metrics

1. **Request Rate**: Requests per second
2. **Response Time**: Average response time
3. **Failure Rate**: Failed requests percentage
4. **Dependencies**: External service calls (database, etc.)

### Viewing Metrics

Navigate to: **Application Insights > Investigate > Performance**

### Custom Metrics

The MCP Server emits custom metrics:

- `mcp.connections.active`: Active MCP connections
- `mcp.tools.calls`: Tool call count
- `mcp.auth.success`: Successful authentications
- `mcp.auth.failed`: Failed authentications
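
The two authentication counters can be combined into a failure rate, which is often easier to alert on than a raw count. A small sketch of that arithmetic follows; the counter readings are hypothetical, and how the values are actually read from Application Insights is not covered here:

```python
def auth_failure_rate(success: int, failed: int) -> float:
    """Return failed / (success + failed) as a percentage; 0.0 with no attempts."""
    total = success + failed
    return 0.0 if total == 0 else 100.0 * failed / total


# Hypothetical counter readings for one interval.
readings = {"mcp.auth.success": 190, "mcp.auth.failed": 10}
rate = auth_failure_rate(readings["mcp.auth.success"], readings["mcp.auth.failed"])
print(f"auth failure rate: {rate:.1f}%")
```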

## Alerts and Notifications

### Recommended Alerts

#### High Error Rate
```json
{
  "name": "High Error Rate",
  "description": "Alert when error rate exceeds 5%",
  "condition": {
    "metric": "requests/failed",
    "threshold": 5,
    "timeAggregation": "Average",
    "windowSize": "PT5M"
  },
  "actions": [
    {
      "actionGroup": "ops-team",
      "emailSubject": "MCP Server High Error Rate"
    }
  ]
}
```

#### High Response Time
```json
{
  "name": "High Response Time",
  "description": "Alert when average response time exceeds 2 seconds",
  "condition": {
    "metric": "requests/duration",
    "threshold": 2000,
    "timeAggregation": "Average",
    "windowSize": "PT5M"
  }
}
```

#### Authentication Failures
```json
{
  "name": "Authentication Failures",
  "description": "Alert on repeated authentication failures",
  "condition": {
    "query": "ContainerAppConsoleLogs_CL | where Log_s contains 'Authentication failed' | summarize count()",
    "threshold": 10,
    "timeAggregation": "Total",
    "windowSize": "PT5M"
  }
}
```

#### Low Availability
```json
{
  "name": "Container App Unhealthy",
  "description": "Alert when health check fails",
  "condition": {
    "metric": "healthcheck/status",
    "threshold": 1,
    "operator": "LessThan",
    "windowSize": "PT5M"
  }
}
```
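
The rule definitions above are simplified summaries that share one condition shape: a threshold, an aggregation, an optional operator, and a window. The sketch below shows how such a condition could be evaluated against the samples collected in one window. It is an illustration only, not how Azure Monitor evaluates rules internally; treating a missing `operator` as `GreaterThan` and a missing `timeAggregation` as `Average` are assumptions made for this example.

```python
import operator
from statistics import mean

AGGREGATIONS = {"Average": mean, "Total": sum}
OPERATORS = {"GreaterThan": operator.gt, "LessThan": operator.lt}


def should_fire(condition: dict, window_values: list) -> bool:
    """Aggregate the window's samples and compare the result to the threshold."""
    if not window_values:
        return False  # assumption: an empty window does not fire
    aggregate = AGGREGATIONS[condition.get("timeAggregation", "Average")]
    compare = OPERATORS[condition.get("operator", "GreaterThan")]
    return compare(aggregate(window_values), condition["threshold"])


high_response_time = {
    "metric": "requests/duration",
    "threshold": 2000,
    "timeAggregation": "Average",
    "windowSize": "PT5M",
}
# Hypothetical per-minute averages (ms) within one 5-minute window.
print(should_fire(high_response_time, [1800, 2100, 2500, 2300, 1900]))
```

Here the window averages 2120 ms, which is above the 2000 ms threshold, so the rule would fire.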

### Creating Alerts via Azure CLI

```bash
# Create an action group that emails the ops team
az monitor action-group create \
  --name ops-team \
  --resource-group rg-mcp-server-prod \
  --short-name ops \
  --action email admin admin@yourcompany.com

# Create a metric alert that notifies the action group.
# The metric named in --condition must exist on the resource in --scopes; list the
# available names with: az monitor metrics list-definitions --resource <resource-id>
az monitor metrics alert create \
  --name high-error-rate \
  --resource-group rg-mcp-server-prod \
  --scopes /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/containerApps/ca-mcpserver-prod \
  --condition "total requests/failed > 5" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action ops-team
```

## Dashboards

### Create Custom Dashboard

1. Navigate to Azure Portal
2. Select "Dashboard" > "New dashboard"
3. Add tiles for:
   - Request count
   - Response time
   - Error rate
   - Active connections
   - CPU/Memory usage

### Sample Dashboard JSON

```json
{
  "lenses": {
    "0": {
      "order": 0,
      "parts": {
        "0": {
          "position": {
            "x": 0,
            "y": 0,
            "colSpan": 6,
            "rowSpan": 4
          },
          "metadata": {
            "type": "Extension/HubsExtension/PartType/MonitorChartPart",
            "settings": {
              "title": "Request Rate",
              "visualization": {
                "chartType": "Line",
                "legendVisualization": {
                  "isVisible": true
                }
              }
            }
          }
        }
      }
    }
  }
}
```

## Metrics

### Container App Metrics

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Replica Count | Number of active replicas | Min: 2, Max: 10 |
| CPU Usage | CPU percentage | < 80% |
| Memory Usage | Memory percentage | < 80% |
| Request Count | Total requests | Monitor trends |
| Request Duration | Average response time | < 2 seconds |

### Database Metrics

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Connections | Active connections | < 80% of max |
| CPU Usage | Database CPU | < 80% |
| Storage | Used storage | < 80% of quota |
| Query Duration | Average query time | < 500ms |
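
Several thresholds in these tables are expressed as a share of capacity, such as connections below 80% of the maximum. The sketch below shows that check as plain arithmetic; the readings and capacities are hypothetical values chosen for illustration:

```python
def utilization_percent(used: float, capacity: float) -> float:
    """Return used / capacity as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


def within_threshold(used: float, capacity: float, limit_percent: float = 80.0) -> bool:
    """Return True while utilization stays strictly below the limit."""
    return utilization_percent(used, capacity) < limit_percent


# Hypothetical readings: 70 of 100 connections, 45 of 50 GiB storage.
print(within_threshold(used=70, capacity=100))
print(within_threshold(used=45, capacity=50))
```

With these sample values, connections sit at 70% and pass the check, while storage sits at 90% and does not.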

### Application Gateway Metrics

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Throughput | Bytes/second | Monitor trends |
| Failed Requests | Share of requests returning 5xx errors | < 1% |
| Backend Response Time | Time to first byte | < 1 second |
| Healthy Host Count | Number of healthy backends | > 0 |

## Troubleshooting

### Common Issues

#### 1. High Response Time

**Symptoms**: Slow API responses

**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| extend Duration = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| where Duration > 2000
| project TimeGenerated, Log_s
```

**Solutions**:
- Scale out by adding replicas
- Optimize database queries
- Check network latency
- Review application code

#### 2. Authentication Failures

**Symptoms**: 401 errors

**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "Token verification failed"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```

**Solutions**:
- Verify Entra ID configuration
- Check token expiration
- Validate audience/issuer settings
- Review user permissions

#### 3. Database Connection Issues

**Symptoms**: Database errors

**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "PostgreSQL" and Log_s contains "error"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```

**Solutions**:
- Check connection string
- Verify firewall rules
- Check connection pool size
- Review database health

#### 4. Memory Leaks

**Symptoms**: Increasing memory usage

**Investigation**:
- Check container app metrics
- Review memory usage trends
- Look for unclosed connections

**Solutions**:
- Restart the container app (temporary mitigation only)
- Review application code
- Implement connection pooling
- Add memory limits
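
When reviewing memory usage trends, a fitted slope is a quick way to tell steady growth from normal fluctuation. The following sketch computes a least-squares slope over exported memory samples. The readings are hypothetical, and exporting them from the container app metrics is assumed to happen elsewhere:

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, memory_mib) samples, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


# Hypothetical working-set readings: (hours since start, MiB).
readings = [(0, 400.0), (1, 420.0), (2, 440.0), (3, 460.0)]
print(f"{slope_per_hour(readings):.1f} MiB/hour")
```

A slope that stays positive across restarts-free periods of several hours suggests a leak, while a slope near zero suggests the usage is stable.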

### Health Check Endpoints

#### Application Health
```bash
curl https://mcp.yourcompany.com/health
```

Expected Response:
```json
{
  "status": "healthy",
  "timestamp": "2025-12-09T10:00:00Z",
  "version": "1.0.0",
  "uptime": 86400
}
```
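
For automated checks, the response body can be validated rather than inspected by eye. This sketch parses a body and tests the `status` field, using the expected response above as input. The function name is illustrative, and fetching the body over HTTP is left out so the example stays self-contained:

```python
import json


def is_healthy(body: str) -> bool:
    """Return True when the body parses as JSON and reports status "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


# The expected response from the health endpoint, as documented above.
expected = (
    '{"status": "healthy", "timestamp": "2025-12-09T10:00:00Z", '
    '"version": "1.0.0", "uptime": 86400}'
)
print(is_healthy(expected))
```

Treating an unparseable body as unhealthy is a design choice for this sketch: an error page returned by a gateway then counts as a failed check instead of raising an exception.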

#### Readiness Check
```bash
curl https://mcp.yourcompany.com/ready
```

#### Metrics Endpoint
```bash
curl -H "Authorization: Bearer $TOKEN" https://mcp.yourcompany.com/metrics
```

## Log Retention

- **Container App Logs**: 30 days (configurable)
- **Log Analytics**: 30 days (configurable up to 730 days)
- **Application Insights**: 90 days default
- **Archived Logs**: Configure export to Storage Account for long-term retention

## Exporting Logs

### To Storage Account

```bash
az monitor diagnostic-settings create \
  --name export-to-storage \
  --resource /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/containerApps/ca-mcpserver-prod \
  --storage-account {storage-account-id} \
  --logs '[{"category":"ContainerAppConsoleLogs","enabled":true}]'
```

### To Event Hub

```bash
az monitor diagnostic-settings create \
  --name export-to-eventhub \
  --resource /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/containerApps/ca-mcpserver-prod \
  --event-hub {event-hub-name} \
  --event-hub-rule {auth-rule-id} \
  --logs '[{"category":"ContainerAppConsoleLogs","enabled":true}]'
```

## Best Practices

1. **Set up alerts early** - Configure them before the first incident, not after
2. **Review logs regularly** - Schedule a weekly log review
3. **Monitor trends** - Look for patterns over time
4. **Document incidents** - Keep runbooks updated
5. **Test alerts** - Confirm that notifications reach the right people
6. **Rotate credentials** - Include rotation in regular security reviews
7. **Capacity planning** - Monitor growth trends
8. **Cost optimization** - Review unused resources

## Support

For monitoring issues:
- DevOps Team: devops@yourcompany.com
- Azure Support: https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade
