name: sca-monitoring
description: |
SCA track project #8. Learner stands up a Log Analytics workspace, onboards the VM from
project #5 via the Azure Monitor Agent and a Data Collection Rule, writes their first KQL
queries against the collected logs, then creates an alert rule that emails them when CPU
stays above 80% for 5 minutes. Auto-load when the learner is in server-cloud-admin/sca-monitoring
or asks to learn Azure Monitor, Log Analytics, KQL basics, Azure Monitor Agent, Data
Collection Rules, or how to alert on a VM metric.
Project: sca-monitoring
Track: Server & Cloud Administration · Project: 8 of 9 · Time: ~75 minutes
Visibility for the lab built in projects #5–#7. By the end the learner has a Log Analytics workspace receiving logs from the VM, can answer "what happened" with a KQL query, and gets an email when CPU spikes. The query language used here is the same KQL used everywhere in the Microsoft cloud — Sentinel, Defender, App Insights, Resource Graph. Investment compounds.
Project goal
When this project is done, the learner can:
- Create a Log Analytics workspace and explain the difference between workspace-based logs and metric-based signals.
- Onboard a VM using the modern Azure Monitor Agent (AMA) with a Data Collection Rule (DCR) — and explain why AMA replaces the older MMA/OMS agent.
- Write KQL queries with
where,project,summarize, andrender— and explain what each one does. - Create a metric alert rule on CPU that emails an action group when thresholds breach for 5 minutes — and trigger the alert deliberately so they've seen it fire.
Scope guardrail
This is single-workspace monitoring with metric + log alerts. We are not setting up Application Insights, not building cross-workspace queries, not configuring custom log ingestion (HTTP Data Collector or Custom Logs), not deploying Workbooks. One workspace, one VM, one CPU alert, ~5 KQL queries. The learner exits with their hands on the keyboard.
If the learner asks "what about distributed tracing?" — answer honestly: that's App Insights territory. Same workspace concept, different agent, different schema.
Prerequisites
| Prereq | Verify with |
|---|---|
Completed sca-azure-vnet-vm — VM vm-app01 running |
Get-AzVM -ResourceGroupName rg-mssa-azurevm -Status |
| Az.OperationalInsights and Az.Monitor modules | Get-Module Az.OperationalInsights, Az.Monitor -ListAvailable |
| Personal email address you can receive at | For the action group |
Phases
Phase 1 — Create the Log Analytics workspace (~10 min)
Goal: A new Log Analytics workspace exists in the same region as the VM.
Commands:
$rg = "rg-mssa-azurevm"
$location = "eastus"
$wsName = "law-mssa"
New-AzOperationalInsightsWorkspace `
-ResourceGroupName $rg `
-Name $wsName `
-Location $location `
-Sku "PerGB2018" `
-RetentionInDays 30
# Verify
Get-AzOperationalInsightsWorkspace -ResourceGroupName $rg -Name $wsName |
Select-Object Name, Sku, RetentionInDays, CustomerId
Concepts to name out loud:
- This is a Log Analytics workspace as the database — every log emitted by any Azure resource you point at this workspace lands in tables you can query with KQL. The workspace is billed per GB ingested and per GB-day retained.
- This is "PerGB2018" pricing — the modern pay-as-you-go tier. Roughly $2.30/GB ingested. A small lab VM emits ~50-150 MB/day. Realistically a few dollars a month if you're frugal. There's also a Commitment Tier (cheaper per GB if you commit to volume) and Free (now retired) and Standalone (legacy).
- This is retention as a separate cost — the first 31 days are included. Beyond that, you pay per GB-month. Default 30 keeps the lab cost minimal.
Common gotchas:
- Workspace name has to be unique within the subscription, not globally. Random suffix not required.
- Don't pick a region different from the VM unless you have a reason — cross-region ingestion has more failure modes.
After-action prompt: "You picked PerGB2018 and 30-day retention. If you collected 10 GB/month of logs, roughly what would you pay? What would push it up tenfold?"
Phase 2 — Onboard the VM with AMA + a Data Collection Rule (~25 min)
Goal: Azure Monitor Agent is installed on the VM, a DCR sends Windows perf counters and System event log entries to the workspace.
The mental model first: With AMA, the agent itself collects nothing by default. A Data Collection Rule says which data sources (perf counters, event logs, syslog, etc.) get collected and where they go (your workspace). One agent can serve many DCRs; one DCR can target many VMs. Decoupling is the point.
Install the AMA on the VM:
# Set the AMA extension on vm-app01 (idempotent — safe to re-run)
Set-AzVMExtension `
-ResourceGroupName $rg `
-VMName "vm-app01" `
-Name "AzureMonitorWindowsAgent" `
-Publisher "Microsoft.Azure.Monitor" `
-ExtensionType "AzureMonitorWindowsAgent" `
-TypeHandlerVersion "1.0" `
-EnableAutomaticUpgrade $true `
-Location $location
Create a Data Collection Rule — the JSON approach is verbose, the portal is easier the first time:
- portal.azure.com → search Monitor → Data Collection Rules → + Create.
- Basics: name
dcr-mssa-vm, regioneastus, platform type Windows. - Resources: + Add resources → select
vm-app01. - Collect and deliver:
- + Add data source → Data source type: Performance Counters → Sampling rate 60s → keep defaults (CPU, Memory, Disk, Network) → Destination: your
law-mssaworkspace → Add. - + Add data source → Data source type: Windows Event Logs → keep defaults (System, Application errors) → Destination:
law-mssa→ Add.
- + Add data source → Data source type: Performance Counters → Sampling rate 60s → keep defaults (CPU, Memory, Disk, Network) → Destination: your
- Review + create.
Verify ingestion (wait ~5-10 min for first data):
- portal.azure.com → Log Analytics workspaces →
law-mssa→ Logs. - Try this query (paste into the query editor):
Perf
| where TimeGenerated > ago(15m)
| where Computer == "vm-app01"
| summarize count() by ObjectName, CounterName
If you see rows, ingestion is working. If nothing returns, wait 5 more minutes (initial collection can be delayed).
Concepts to name out loud:
- This is AMA (Azure Monitor Agent) replacing MMA (Microsoft Monitoring Agent) and OMS Agent — Microsoft retired MMA in August 2024. AMA is the only future-proof option. Anything you read older that says "OMS workspace" or "MMA" is dated.
- This is a DCR as the "what gets collected" contract — same agent, different DCR = different data. Run a fleet of 1000 VMs against one DCR and your collection is consistent.
- This is the
PerfandEventtables — the most common starting points. Perf has performance counters; Event has Windows event log entries. Each table has a fixed schema you cangetschemato inspect.
Common gotchas:
- AMA installed but no data → DCR isn't associated, or wrong destination, or perf counters aren't selected. Open the DCR in portal and re-confirm.
- Data appearing in
Heartbeattable but notPerf→ AMA is alive but DCR isn't sending perf. Common after a DCR edit. - Confused old docs say install OMS agent → ignore; AMA is the answer.
After-action prompt: "You installed one agent and created one DCR. If management said 'now do the same for 50 VMs,' what do you do — install AMA on 50 VMs and create 50 DCRs, or something smarter?"
Phase 3 — KQL basics: ask real questions (~20 min)
Goal: The learner writes five KQL queries that answer questions about the VM, by hand, without copy-paste.
Open Logs in the workspace and run each in order:
Q1 — "Is the agent alive?"
Heartbeat
| where TimeGenerated > ago(1h)
| where Computer == "vm-app01"
| summarize last_seen=max(TimeGenerated)
Q2 — "What's the CPU been doing?"
Perf
| where TimeGenerated > ago(1h)
| where Computer == "vm-app01"
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize avg(CounterValue), max(CounterValue) by bin(TimeGenerated, 5m)
| render timechart
Q3 — "Top 5 disk-using counters right now"
Perf
| where TimeGenerated > ago(15m)
| where Computer == "vm-app01"
| where ObjectName == "LogicalDisk"
| summarize avg(CounterValue) by CounterName
| top 5 by avg_CounterValue
Q4 — "Any error or critical events from System log in the last 24h?"
Event
| where TimeGenerated > ago(24h)
| where Computer == "vm-app01"
| where EventLog == "System"
| where EventLevelName in ("Error", "Critical")
| project TimeGenerated, EventID, Source, RenderedDescription
| order by TimeGenerated desc
| take 20
Q5 — "Count of distinct event sources writing errors in the last day"
Event
| where TimeGenerated > ago(24h)
| where Computer == "vm-app01"
| where EventLevelName == "Error"
| summarize errors=count() by Source
| order by errors desc
Concepts to name out loud:
- This is KQL as pipe-and-table — every query starts with a table name (
Perf,Heartbeat,Event), then|chains operators. Each operator transforms the table. - This is the core operators:
where,project,summarize,order by,top,render—wherefilters rows,projectpicks columns,summarizeis GROUP BY,order bysorts,topis "top N",renderchooses a visualization. - This is
bin(TimeGenerated, 5m)as time bucketing — group your data into 5-minute buckets for charting. Replace 5m with 1h, 1d, etc. - This is
render timechart— KQL renders a time-series chart inline. Useful for spotting patterns the table view hides.
Common gotchas:
- "Could not parse query" → KQL is case-sensitive on operators (lowercase) and table names (PascalCase).
summarizerequires an aggregation function —count(),avg(x),max(x),min(x),percentile(x, 95), etc.renderonly works in the portal (and a few other clients), not via API. For programmatic queries, droprender.
After-action prompt: "You ran 5 KQL queries. If asked 'are there any unusual events on this VM in the last hour,' what query would you write — and which operators would you definitely use?"
Phase 4 — Alert on CPU spike, prove it fires (~15 min)
Goal: A metric alert rule emails the learner when CPU > 80% for 5 minutes. The learner triggers it deliberately to confirm the email arrives.
Step 1 — Create an Action Group (where alerts go):
$agName = "ag-mssa-email"
$emailReceiver = New-AzActionGroupReceiver -Name "me" -EmailReceiver -EmailAddress "your.email@example.com"
Set-AzActionGroup `
-ResourceGroupName $rg `
-Name $agName `
-ShortName "mssa" `
-Receiver $emailReceiver `
-Location "Global"
Step 2 — Create the metric alert rule:
$vm = Get-AzVM -ResourceGroupName $rg -Name "vm-app01"
$ag = Get-AzActionGroup -ResourceGroupName $rg -Name $agName
# The metric criterion: CPU > 80% averaged over the window
$criteria = New-AzMetricAlertRuleV2Criteria `
-MetricName "Percentage CPU" `
-MetricNamespace "Microsoft.Compute/virtualMachines" `
-TimeAggregation Average `
-Operator GreaterThan `
-Threshold 80
# Create the alert: evaluates every 1 min, window of 5 min, severity 2 (Warning)
Add-AzMetricAlertRuleV2 `
-Name "alert-cpu-high" `
-ResourceGroupName $rg `
-WindowSize (New-TimeSpan -Minutes 5) `
-Frequency (New-TimeSpan -Minutes 1) `
-TargetResourceId $vm.Id `
-Condition $criteria `
-ActionGroupId $ag.Id `
-Severity 2 `
-Description "vm-app01 CPU > 80% for 5 minutes"
# Verify
Get-AzMetricAlertRuleV2 -ResourceGroupName $rg -Name "alert-cpu-high"
Step 3 — Trigger it on purpose (on the VM, PowerShell as Admin):
# A 7-minute CPU burner — exceeds the 5-min window
1..8 | ForEach-Object -Parallel {
$end = (Get-Date).AddMinutes(7)
while ((Get-Date) -lt $end) {
1..1000000 | ForEach-Object { [math]::Sqrt($_) } | Out-Null
}
} -ThrottleLimit 8
Watch the metric in the portal: VM → Monitoring → Metrics → Percentage CPU. After ~5 minutes above 80% you should get an email.
Step 4 — Verify the fire:
# All alerts fired in the last hour
Get-AzMetricAlertStatus -ResourceGroupName $rg -Name "alert-cpu-high"
Concepts to name out loud:
- This is a metric alert vs a log alert — metric alerts evaluate numeric metrics (CPU %, disk IOPS, request count) at fast cadence (1 min) with low cost. Log alerts run a KQL query at slower cadence (5/15 min) and can express anything KQL can express. Use metric when possible (cheaper, faster).
- This is action groups as the reusable destination — write the action group once, attach it to many alerts. Channels: email, SMS, voice, webhook, Logic App, Azure Function, ITSM, runbook.
- This is severity as a soft contract with operators — Sev 0/1 = wake someone up. Sev 2 = handle today. Sev 3/4 = log and review. Use it consistently or it's noise.
- This is window vs frequency —
Frequency 1m= "check every 1 min."WindowSize 5m= "across the last 5 min of data." Both matter. Window must be ≥ frequency.
Common gotchas:
- Alert fires but no email → bad email address, or it went to spam. Check the action group's email receiver delivery status.
- Alert never fires → CPU burn wasn't intense enough. The PowerShell parallel-loop above usually hits 100% on a B2s. If not, increase ThrottleLimit.
- Email arrived 7+ minutes after threshold breach → expected. There's metric ingestion lag (~1-2 min) plus the 5-min window plus delivery time. Tune window down for tighter alerting if needed (with more false positives).
After-action prompt: "You got an alert email. Walk me through the timeline from 'CPU went above 80%' to 'email arrived in inbox' — and what you'd change if 7 minutes was too slow for production."
Phase 5 — Clean up the noisy stuff (~5 min)
Goal: The alert is disabled if you don't want a flood of emails, the workspace cost is bounded, and the learner knows how to keep the lab cheap.
Disable (don't delete) the alert:
$alert = Get-AzMetricAlertRuleV2 -ResourceGroupName $rg -Name "alert-cpu-high"
$alert.Enabled = $false
$alert | Set-AzMetricAlertRuleV2
Cap workspace daily ingestion (cost guard):
# Set a daily cap at 1 GB/day — ingestion stops after that
Set-AzOperationalInsightsWorkspace -ResourceGroupName $rg -Name $wsName -DailyQuotaGb 1
After-action prompt: "You've now seen how to monitor and alert. If you owned this VM for a year, what cost guardrails would you set up so an outage doesn't double as a billing event?"
When to break the method
- Learner already knows KQL from Sentinel or another product → skim phase 3, focus on phase 4 (the alerting mechanics most people fumble).
- Strict budget → use the free 5-GB/month allowance plus 31-day default retention. The whole lab here costs single-digit dollars/month at this volume.
- Time short → phases 1-2-4 are the must-do. Phase 3 (KQL) can be a follow-up exercise; phase 5 a hygiene reminder.
Definition of done
Observable, the learner can:
- Show the Log Analytics workspace receiving heartbeats with
Heartbeat | where Computer == "vm-app01" | take 5. - Run the CPU
Perf | render timechartquery and see the chart. - Show the DCR in the portal with vm-app01 as a resource and Performance Counters + System events as data sources.
- Show the alert rule, trigger it with the CPU loop, and produce the email it generated.
- Explain in one sentence each: workspace, DCR, action group, metric alert vs log alert.
Next project
→ sca-arm-bicep-iac — the capstone: take everything built across the track and express it as Bicep code. Tear down and redeploy in minutes.