This guide covers monitoring, logging, and observability for the MCP Server deployment.
## Table of Contents
1. [Azure Monitor Integration](#azure-monitor-integration)
2. [Log Analytics](#log-analytics)
3. [Application Insights](#application-insights)
4. [Alerts and Notifications](#alerts-and-notifications)
5. [Dashboards](#dashboards)
6. [Metrics](#metrics)
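7. [Troubleshooting](#troubleshooting)
8. [Log Retention](#log-retention)
9. [Exporting Logs](#exporting-logs)
10. [Best Practices](#best-practices)
11. [Support](#support)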
## Azure Monitor Integration
The MCP Server sends its logs and metrics to Azure Monitor, which provides the components listed below.
### Key Components
- **Log Analytics Workspace**: Centralized log storage
- **Application Insights**: Application performance monitoring
- **Azure Monitor Metrics**: Resource-level metrics
- **Container App Logs**: Application and system logs
## Log Analytics
### Accessing Logs
1. Navigate to Azure Portal
2. Go to your Log Analytics Workspace
3. Select "Logs" from the left menu
### Common Queries
#### View All Application Logs
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
| take 100
```
#### Search for Errors
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "error" // contains is case-insensitive, so this also matches "ERROR"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```
#### Authentication Failures
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "401" or Log_s contains "Unauthorized"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```
#### User Activity
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "User authenticated"
| extend UserId = extract("userId\":\"([^\"]+)", 1, Log_s)
| summarize Count = count() by UserId, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```
#### Performance Metrics
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "response time" or Log_s contains "duration"
| extend ResponseTime = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| summarize avg(ResponseTime), max(ResponseTime), min(ResponseTime) by bin(TimeGenerated, 5m)
```
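The `extract` call in the Performance Metrics query assumes each log line carries a numeric `duration` field written as `"duration":123`. That format is an assumption about the server's log output, which this guide does not confirm. A minimal Python sketch of the same extraction and 5-minute binning can help check the pattern against sample lines before running the query in Log Analytics:

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Same pattern as the KQL extract("duration\":([0-9]+)", 1, Log_s).
DURATION_RE = re.compile(r'duration":([0-9]+)')


def bin_5m(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute bin, like bin(TimeGenerated, 5m)."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def summarize(rows):
    """rows: iterable of (timestamp, log_line). Returns {bin_start: (avg, max, min)}."""
    bins = defaultdict(list)
    for ts, line in rows:
        match = DURATION_RE.search(line)
        if match:  # lines without a duration field are skipped
            bins[bin_5m(ts)].append(float(match.group(1)))
    return {b: (sum(v) / len(v), max(v), min(v)) for b, v in bins.items()}


# Hypothetical sample lines; the real log format may differ.
sample = [
    (datetime(2025, 12, 9, 10, 1, tzinfo=timezone.utc), '{"msg":"request","duration":120}'),
    (datetime(2025, 12, 9, 10, 4, tzinfo=timezone.utc), '{"msg":"request","duration":480}'),
    (datetime(2025, 12, 9, 10, 6, tzinfo=timezone.utc), '{"msg":"request","duration":300}'),
]
result = summarize(sample)
```

Note that the pattern does not tolerate whitespace after the colon (`"duration": 120` would not match), so a logger that pretty-prints JSON needs a looser expression in both the sketch and the query.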
#### Database Query Performance
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "database" and Log_s contains "query"
| extend QueryDuration = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| summarize avg(QueryDuration), count() by bin(TimeGenerated, 5m)
```
## Application Insights
### Key Metrics
1. **Request Rate**: Requests per second
2. **Response Time**: Average response time
3. **Failure Rate**: Failed requests percentage
4. **Dependencies**: External service calls (database, etc.)
### Viewing Metrics
Navigate to: **Application Insights > Investigate > Performance**
### Custom Metrics
The MCP Server emits custom metrics:
- `mcp.connections.active`: Active MCP connections
- `mcp.tools.calls`: Tool call count
- `mcp.auth.success`: Successful authentications
- `mcp.auth.failed`: Failed authentications
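One way to read the two authentication metrics together is as a failure ratio. The sketch below assumes both are cumulative counters sampled at two points in time; the guide does not state their type, so treat that as an assumption. `auth_failure_ratio` is a hypothetical helper, not part of the server.

```python
def auth_failure_ratio(success_start, success_end, failed_start, failed_end):
    """Share of failed authentications between two counter samples.

    Assumes mcp.auth.success and mcp.auth.failed are cumulative counters.
    Returns None when no authentication attempts happened in the interval.
    """
    success = success_end - success_start
    failed = failed_end - failed_start
    attempts = success + failed
    if attempts <= 0:
        return None
    return failed / attempts


# 90 successful and 10 failed authentications in the interval.
ratio = auth_failure_ratio(success_start=1000, success_end=1090,
                           failed_start=40, failed_end=50)
```

A ratio is usually easier to alert on than the raw failure count, because it stays meaningful as traffic grows.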
## Alerts and Notifications
### Recommended Alerts
#### High Error Rate
```json
{
  "name": "High Error Rate",
  "description": "Alert when more than 5 requests fail within 5 minutes",
  "condition": {
    "metric": "requests/failed",
    "threshold": 5,
    "timeAggregation": "Total",
    "windowSize": "PT5M"
  },
  "actions": [
    {
      "actionGroup": "ops-team",
      "emailSubject": "MCP Server High Error Rate"
    }
  ]
}
```
#### High Response Time
```json
{
  "name": "High Response Time",
  "description": "Alert when average response time exceeds 2 seconds",
  "condition": {
    "metric": "requests/duration",
    "threshold": 2000,
    "timeAggregation": "Average",
    "windowSize": "PT5M"
  }
}
```
#### Authentication Failures
```json
{
  "name": "Authentication Failures",
  "description": "Alert on repeated authentication failures",
  "condition": {
    "query": "ContainerAppConsoleLogs_CL | where Log_s contains 'Authentication failed' | summarize count()",
    "threshold": 10,
    "timeAggregation": "Total",
    "windowSize": "PT5M"
  }
}
```
#### Low Availability
```json
{
  "name": "Container App Unhealthy",
  "description": "Alert when health check fails",
  "condition": {
    "metric": "healthcheck/status",
    "threshold": 1,
    "operator": "LessThan",
    "windowSize": "PT5M"
  }
}
```
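The alert definitions above share a small vocabulary: a time aggregation over a window, an operator, and a threshold. As a rough illustration of how such a condition is evaluated, the sketch below applies one definition to the samples of a single window. It is a simplified model, not Azure Monitor's implementation; the `evaluate` helper and the default operator `GreaterThan` are assumptions.

```python
import operator
from statistics import mean

AGGREGATIONS = {"Average": mean, "Total": sum, "Maximum": max, "Minimum": min}
OPERATORS = {"GreaterThan": operator.gt, "LessThan": operator.lt}


def evaluate(condition, samples):
    """Return True if the aggregated samples in one window breach the threshold."""
    if not samples:
        return False  # no data in the window: treated here as "not firing"
    aggregate = AGGREGATIONS[condition.get("timeAggregation", "Average")]
    compare = OPERATORS[condition.get("operator", "GreaterThan")]
    return compare(aggregate(samples), condition["threshold"])


high_response_time = {
    "metric": "requests/duration",
    "threshold": 2000,
    "timeAggregation": "Average",
    "windowSize": "PT5M",
}
# Five one-minute samples (milliseconds) inside a single PT5M window.
fires = evaluate(high_response_time, [1800, 2100, 2500, 2300, 1900])
```

Here the window average is 2120 ms, so the condition fires even though two of the five samples are below the threshold.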
### Creating Alerts via Azure CLI
```bash
# Create an action group with one email receiver
az monitor action-group create \
  --name ops-team \
  --resource-group rg-mcp-server-prod \
  --short-name ops \
  --action email admin admin@yourcompany.com

# Create a metric alert. requests/failed is an Application Insights metric,
# so the scope is the Application Insights resource ({app-insights-name} is a placeholder).
az monitor metrics alert create \
  --name high-error-rate \
  --resource-group rg-mcp-server-prod \
  --scopes /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.Insights/components/{app-insights-name} \
  --condition "total requests/failed > 5" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action ops-team
```
## Dashboards
### Create Custom Dashboard
1. Navigate to Azure Portal
2. Select "Dashboard" > "New dashboard"
3. Add tiles for:
   - Request count
   - Response time
   - Error rate
   - Active connections
   - CPU/Memory usage
### Sample Dashboard JSON
```json
{
  "lenses": {
    "0": {
      "order": 0,
      "parts": {
        "0": {
          "position": {
            "x": 0,
            "y": 0,
            "colSpan": 6,
            "rowSpan": 4
          },
          "metadata": {
            "type": "Extension/HubsExtension/PartType/MonitorChartPart",
            "settings": {
              "title": "Request Rate",
              "visualization": {
                "chartType": "Line",
                "legendVisualization": {
                  "isVisible": true
                }
              }
            }
          }
        }
      }
    }
  }
}
```
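The sample holds a single tile. To cover all the tiles listed under Create Custom Dashboard, the same structure can be generated rather than written by hand. A sketch, assuming every tile reuses the `MonitorChartPart` type and a two-column grid of tiles 6 columns wide and 4 rows high. The layout and the `build_dashboard` helper are illustrative, and a real chart part also needs metric inputs that the sample omits.

```python
import json

TILES = ["Request count", "Response time", "Error rate",
         "Active connections", "CPU/Memory usage"]


def build_dashboard(titles, col_span=6, row_span=4, columns=2):
    """Lay tiles out left to right, wrapping to a new row every `columns` tiles."""
    parts = {}
    for i, title in enumerate(titles):
        parts[str(i)] = {
            "position": {
                "x": (i % columns) * col_span,
                "y": (i // columns) * row_span,
                "colSpan": col_span,
                "rowSpan": row_span,
            },
            "metadata": {
                "type": "Extension/HubsExtension/PartType/MonitorChartPart",
                "settings": {"title": title},
            },
        }
    return {"lenses": {"0": {"order": 0, "parts": parts}}}


dashboard = build_dashboard(TILES)
print(json.dumps(dashboard, indent=2))
```

Generating the layout keeps tile positions consistent when tiles are added or reordered, which is easy to get wrong when editing the JSON directly.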
## Metrics
### Container App Metrics
| Metric | Description | Threshold |
|--------|-------------|-----------|
| Replica Count | Number of active replicas | Min: 2, Max: 10 |
| CPU Usage | CPU percentage | < 80% |
| Memory Usage | Memory percentage | < 80% |
| Request Count | Total requests | Monitor trends |
| Request Duration | Average response time | < 2 seconds |
### Database Metrics
| Metric | Description | Threshold |
|--------|-------------|-----------|
| Connections | Active connections | < 80% of max |
| CPU Usage | Database CPU | < 80% |
| Storage | Used storage | < 80% of quota |
| Query Duration | Average query time | < 500ms |
### Application Gateway Metrics
| Metric | Description | Threshold |
|--------|-------------|-----------|
| Throughput | Bytes/second | Monitor trends |
| Failed Requests | 5xx responses as a share of total requests | < 1% |
| Backend Response Time | Time to first byte | < 1 second |
| Healthy Host Count | Number of healthy backends | > 0 |
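The upper limits in these tables can be encoded as a simple check, for example in a scheduled job that reads current values from Azure Monitor. The sketch below covers only the comparison; fetching the values is out of scope, and the metric keys are hypothetical names chosen for this example.

```python
# Upper limits taken from the tables above; keys are illustrative names.
LIMITS = {
    "container_cpu_percent": 80,
    "container_memory_percent": 80,
    "request_duration_ms": 2000,
    "db_cpu_percent": 80,
    "db_query_duration_ms": 500,
    "gateway_backend_response_ms": 1000,
}


def breaches(current):
    """Return the names of metrics whose current value is at or above its limit."""
    return sorted(name for name, value in current.items()
                  if name in LIMITS and value >= LIMITS[name])


over = breaches({
    "container_cpu_percent": 91.5,
    "container_memory_percent": 62.0,
    "db_query_duration_ms": 500,
    "unknown_metric": 1,  # ignored: no limit is defined for it
})
```

Treating a value equal to the limit as a breach matches the strict "<" thresholds in the tables.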
## Troubleshooting
### Common Issues
#### 1. High Response Time
**Symptoms**: Slow API responses
**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| extend Duration = todouble(extract("duration\":([0-9]+)", 1, Log_s))
| where Duration > 2000
| project TimeGenerated, Log_s
```
**Solutions**:
- Scale out by adding replicas
- Optimize database queries
- Check network latency
- Review application code
#### 2. Authentication Failures
**Symptoms**: 401 errors
**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "Token verification failed"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```
**Solutions**:
- Verify Entra ID configuration
- Check token expiration
- Validate audience/issuer settings
- Review user permissions
#### 3. Database Connection Issues
**Symptoms**: Database errors
**Investigation**:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-mcpserver-prod"
| where Log_s contains "PostgreSQL" and Log_s contains "error"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```
**Solutions**:
- Check connection string
- Verify firewall rules
- Check connection pool size
- Review database health
#### 4. Memory Leaks
**Symptoms**: Increasing memory usage
**Investigation**:
- Check container app metrics
- Review memory usage trends
- Look for unclosed connections
**Solutions**:
- Restart the container app as a temporary mitigation
- Review application code
- Implement connection pooling
- Add memory limits
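To tell a leak from normal fluctuation when reviewing memory trends, a least-squares slope over recent samples is a simple starting point. A sketch, assuming evenly spaced memory-percentage samples exported from the container app metrics; the cutoff of 0.5 percentage points per sample is an arbitrary example value.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def looks_like_leak(samples, min_slope=0.5):
    """Flag a sustained upward trend; min_slope is an example cutoff, tune per workload."""
    return slope(samples) >= min_slope


steady = [41, 43, 40, 42, 41, 42]    # fluctuates around a flat baseline
growing = [40, 42, 45, 47, 50, 52]   # climbs by roughly 2.5 points per sample
```

With these inputs `looks_like_leak(steady)` is false and `looks_like_leak(growing)` is true. A slope alone cannot distinguish a leak from a cache warming up, so the result is a prompt for investigation rather than a diagnosis.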
### Health Check Endpoints
#### Application Health
```bash
curl https://mcp.yourcompany.com/health
```
Expected Response:
```json
{
  "status": "healthy",
  "timestamp": "2025-12-09T10:00:00Z",
  "version": "1.0.0",
  "uptime": 86400
}
```
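For scripted checks, the response can be validated rather than read by eye. A minimal sketch that parses a body shaped like the expected response above. `check_health` is a hypothetical helper, and it takes the body as a string so that the HTTP call (curl, `urllib`, or similar) stays separate. It also assumes `uptime` is reported in seconds, which the guide does not state.

```python
import json
from datetime import timedelta


def check_health(body: str):
    """Return (ok, summary) for a /health response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    if not isinstance(data, dict):
        return False, "response is not a JSON object"
    status = data.get("status")
    if status != "healthy":
        return False, f"status is {status!r}"
    # Assumption: uptime is reported in seconds.
    uptime = timedelta(seconds=int(data.get("uptime", 0)))
    version = data.get("version", "unknown")
    return True, f"healthy, version {version}, up {uptime}"


ok, summary = check_health(
    '{"status": "healthy", "timestamp": "2025-12-09T10:00:00Z", '
    '"version": "1.0.0", "uptime": 86400}'
)
```

Returning a tuple instead of raising keeps the helper easy to call from a monitoring loop that should report failures, not crash on them.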
#### Readiness Check
```bash
curl https://mcp.yourcompany.com/ready
```
#### Metrics Endpoint
```bash
curl -H "Authorization: Bearer $TOKEN" https://mcp.yourcompany.com/metrics
```
## Log Retention
- **Container App Logs**: 30 days (configurable)
- **Log Analytics**: 30 days (configurable up to 730 days)
- **Application Insights**: 90 days default
- **Archived Logs**: Configure export to Storage Account for long-term retention
## Exporting Logs
### To Storage Account
```bash
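# ContainerAppConsoleLogs is a log category of the Container Apps environment,
# so the diagnostic setting targets the environment ({environment-name} is a placeholder).
az monitor diagnostic-settings create \
  --name export-to-storage \
  --resource /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/managedEnvironments/{environment-name} \
  --storage-account {storage-account-id} \
  --logs '[{"category":"ContainerAppConsoleLogs","enabled":true}]'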
```
### To Event Hub
```bash
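# ContainerAppConsoleLogs is a log category of the Container Apps environment,
# so the diagnostic setting targets the environment ({environment-name} is a placeholder).
az monitor diagnostic-settings create \
  --name export-to-eventhub \
  --resource /subscriptions/{sub-id}/resourceGroups/rg-mcp-server-prod/providers/Microsoft.App/managedEnvironments/{environment-name} \
  --event-hub {event-hub-name} \
  --event-hub-rule {auth-rule-id} \
  --logs '[{"category":"ContainerAppConsoleLogs","enabled":true}]'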
```
## Best Practices
1. **Set up alerts early** - Don't wait for incidents
2. **Review logs regularly** - Schedule a weekly log review
3. **Monitor trends** - Look for patterns over time
4. **Document incidents** - Keep runbooks updated
5. **Test alerts** - Ensure notifications work
6. **Rotate credentials** - Rotate secrets on a schedule as part of regular security reviews
7. **Capacity planning** - Monitor growth trends
8. **Cost optimization** - Review unused resources
## Support
For monitoring issues:
- DevOps Team: devops@yourcompany.com
- Azure Support: https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade