name: problem-analysis
version: 1.0.0
description: Root cause analysis and problem management including known error documentation, workaround management, and permanent fix tracking
author: Happy Technologies LLC
tags: [itsm, problem, rca, root-cause, known-error, workaround, itil]
platforms: [claude-code, claude-desktop, chatgpt, cursor, any]
tools:
  mcp:
    - SN-List-Problems
    - SN-Add-Problem-Comment
    - SN-Close-Problem
    - SN-Query-Table
    - SN-Create-Record
    - SN-Update-Record
    - SN-Add-Work-Notes
  rest:
    - /api/now/table/problem
    - /api/now/table/problem_task
    - /api/now/table/incident
    - /api/now/table/known_error
    - /api/now/table/sys_journal_field
  native:
    - Bash
complexity: advanced
estimated_time: 30-90 minutes
Problem Analysis and Root Cause Investigation
Overview
This skill is a framework for Problem Management in ServiceNow: identifying root causes, documenting known errors and workarounds, and tracking permanent fixes through to closure.
Problem Management Goals:
- Identify and remove root causes of incidents
- Minimize the impact of incidents that cannot be prevented
- Proactively identify potential issues before they cause incidents
- Document known errors and workarounds
When to use this skill:
- Multiple incidents that appear to share a root cause
- Post major incident review
- Recurring service degradation
- Proactive trend analysis
Prerequisites
- Roles: problem_manager, problem_admin, or itil
- Access: Read/write to problem, incident, known_error tables
- Knowledge: Root cause analysis techniques (5 Whys, Fishbone)
- Related Skills: itsm/incident-lifecycle, itsm/major-incident
Procedure
Phase 1: Problem Identification
Step 1.1: Identify Problem Candidates from Incidents
Find incidents with similar patterns:
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: incident
query: stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)
fields: sys_id,number,short_description,category,cmdb_ci,resolution_notes,close_code
limit: 100
Using REST API:
GET /api/now/table/incident?sysparm_query=stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)&sysparm_fields=sys_id,number,short_description,category,cmdb_ci,resolution_notes,close_code&sysparm_limit=100
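When this request is issued from a script, the encoded query needs percent-encoding because it contains `^`, `>=`, and `:`. A minimal Python sketch follows; the helper name `build_table_url` and the instance URL are hypothetical, and only the endpoint path and the `sysparm_*` parameter names come from the request above.

```python
from urllib.parse import urlencode


def build_table_url(instance, table, query, fields, limit):
    """Return a Table API GET URL with percent-encoded sysparm_* parameters."""
    params = {
        "sysparm_query": query,
        "sysparm_fields": ",".join(fields),
        "sysparm_limit": str(limit),
    }
    return f"{instance}/api/now/table/{table}?{urlencode(params)}"


url = build_table_url(
    "https://example.service-now.com",  # hypothetical instance URL
    "incident",
    "stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)",
    ["sys_id", "number", "short_description", "category", "cmdb_ci"],
    100,
)
print(url)
```

Authentication is deliberately left out; how credentials are supplied depends on the instance configuration.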
Find repeat incidents by CI:
Tool: SN-Query-Table
Parameters:
table_name: incident
query: cmdb_ci=[ci_sys_id]^sys_created_on>=javascript:gs.daysAgoStart(90)
fields: sys_id,number,short_description,resolution_notes,close_code
order_by: sys_created_on
Step 1.2: Analyze Incident Patterns
Grouping Criteria:
- Same Configuration Item (CI)
- Same Category/Subcategory
- Similar short descriptions (keyword match)
- Same assignment group
- Same resolution type
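The grouping criteria above can be sketched in Python as a simple bucket-and-count pass over the query results. This is an illustration only: `group_incidents` is a hypothetical helper, the sample records are made up, and the threshold of two incidents per group is an assumption rather than a rule from this skill.

```python
from collections import defaultdict


def group_incidents(incidents, keys=("cmdb_ci", "category"), min_count=2):
    """Group incident dicts by the given fields; keep groups with >= min_count."""
    groups = defaultdict(list)
    for inc in incidents:
        group_key = tuple(inc.get(k, "") for k in keys)
        groups[group_key].append(inc["number"])
    return {k: v for k, v in groups.items() if len(v) >= min_count}


# Made-up sample records shaped like SN-Query-Table results.
sample = [
    {"number": "INC001234", "cmdb_ci": "mail01", "category": "software"},
    {"number": "INC001235", "cmdb_ci": "mail01", "category": "software"},
    {"number": "INC001236", "cmdb_ci": "web02", "category": "network"},
    {"number": "INC001240", "cmdb_ci": "mail01", "category": "software"},
]
candidates = group_incidents(sample)
print(candidates)
```

Passing a different `keys` tuple (for example `("assignment_group", "close_code")`) applies the same pass to the other criteria in the list.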
Pattern Analysis Work Notes:
Tool: SN-Add-Work-Notes
Parameters:
sys_id: [analysis_task_sys_id]
work_notes: |
=== INCIDENT PATTERN ANALYSIS ===
Analysis Period: [date range]
PATTERN IDENTIFIED:
- Total Related Incidents: [count]
- Affected CI: [CI name]
- Common Category: [category]
- Keyword Pattern: [keywords]
INCIDENT LIST:
- INC001234 - [date] - [short description]
- INC001235 - [date] - [short description]
- INC001240 - [date] - [short description]
FREQUENCY:
- First occurrence: [date]
- Last occurrence: [date]
- Average frequency: [X per week/month]
RECOMMENDATION:
Create problem record for root cause investigation.
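The FREQUENCY fields in the note above can be derived from the incident creation dates. The sketch below is one way to do it, assuming an inclusive date span and treating any span shorter than a week as one week; `occurrence_stats` is a hypothetical helper and the dates are illustrative.

```python
from datetime import date


def occurrence_stats(dates):
    """Return (first date, last date, average incidents per week)."""
    first, last = min(dates), max(dates)
    span_days = (last - first).days + 1  # inclusive span
    weeks = max(span_days / 7, 1)  # spans under one week count as one week
    return first, last, round(len(dates) / weeks, 2)


first, last, per_week = occurrence_stats(
    [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 28)]
)
print(first, last, per_week)
```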
Step 1.3: Create Problem Record
Using MCP:
Tool: SN-Create-Record
Parameters:
table_name: problem
data:
short_description: "[CI/Service] - [Brief description of recurring issue]"
description: |
PROBLEM STATEMENT:
[Clear description of the problem being investigated]
RELATED INCIDENTS:
- INC001234 - [date] - [description]
- INC001235 - [date] - [description]
- INC001240 - [date] - [description]
BUSINESS IMPACT:
- Number of incidents: [count]
- Total downtime: [hours]
- Users affected: [count]
- Business cost: [estimate]
INITIAL HYPOTHESIS:
[Initial theory about root cause]
priority: 2
impact: 2
urgency: 2
category: [category]
cmdb_ci: [ci_sys_id]
assignment_group: [team_sys_id]
Using REST API:
POST /api/now/table/problem
Content-Type: application/json
{
"short_description": "Email Server - Intermittent connection failures",
"description": "PROBLEM STATEMENT:\nUsers experiencing intermittent email connectivity...",
"priority": "2",
"impact": "2",
"urgency": "2",
"category": "software",
"cmdb_ci": "ci_sys_id",
"assignment_group": "group_sys_id"
}
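For scripted use, the POST above can be assembled with the Python standard library. The sketch below only builds the request object and does not send it; the instance URL and the helper name `build_create_problem_request` are hypothetical, and authentication is omitted.

```python
import json
from urllib.request import Request


def build_create_problem_request(instance, fields):
    """Build (but do not send) a POST request for the problem table."""
    return Request(
        url=f"{instance}/api/now/table/problem",
        data=json.dumps(fields).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_create_problem_request(
    "https://example.service-now.com",  # hypothetical instance URL
    {
        "short_description": "Email Server - Intermittent connection failures",
        "impact": "2",
        "urgency": "2",
    },
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would additionally need credentials appropriate to the instance.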
Phase 2: Root Cause Investigation
Step 2.1: Investigation Task Creation
Create investigation tasks for different areas:
Tool: SN-Create-Record
Parameters:
table_name: problem_task
data:
parent: [problem_sys_id]
short_description: "RCA - [Area] Investigation"
description: |
Investigate [specific area] as potential root cause.
INVESTIGATION SCOPE:
- [Item 1 to investigate]
- [Item 2 to investigate]
- [Item 3 to investigate]
EXPECTED DELIVERABLES:
- Findings documented in work notes
- Evidence collected (logs, screenshots)
- Recommendation for next steps
assignment_group: [specialist_team]
priority: 2
Step 2.2: 5-Whys Analysis
Document in problem work notes:
Tool: SN-Add-Work-Notes
Parameters:
sys_id: [problem_sys_id]
work_notes: |
=== 5-WHYS ROOT CAUSE ANALYSIS ===
Analyst: [name]
Date: [date]
PROBLEM STATEMENT:
Email service experiencing intermittent connection failures
WHY 1: Why are connection failures occurring?
Answer: The email server is running out of available connections.
WHY 2: Why is the server running out of connections?
Answer: Connection pool is exhausted due to connections not being released.
WHY 3: Why are connections not being released?
Answer: A memory leak in the email client integration is holding connections.
WHY 4: Why is there a memory leak?
Answer: The integration code doesn't properly handle error conditions.
WHY 5: Why doesn't the code handle error conditions?
Answer: Code review process didn't catch the missing error handling.
ROOT CAUSE:
Missing error handling in email client integration code, combined with
insufficient code review process for integration components.
CONTRIBUTING FACTORS:
- No connection timeout configured
- Monitoring didn't alert on connection pool
- Documentation gap on error handling standards
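If the why/answer chain is captured as data, the work note can be generated rather than typed by hand. The following Python sketch assumes the chain is a list of (question, answer) pairs and, unless a root cause is supplied, falls back to the last answer; `render_five_whys` is a hypothetical helper, not part of any MCP tool.

```python
def render_five_whys(problem, chain, root_cause=None):
    """Format (question, answer) pairs as a 5-Whys work note."""
    if not chain:
        raise ValueError("at least one why/answer pair is required")
    lines = ["=== 5-WHYS ROOT CAUSE ANALYSIS ===", f"PROBLEM STATEMENT: {problem}"]
    for i, (question, answer) in enumerate(chain, start=1):
        lines.append(f"WHY {i}: {question}")
        lines.append(f"Answer: {answer}")
    # Default to the final answer when no summary statement is given.
    lines.append(f"ROOT CAUSE: {root_cause or chain[-1][1]}")
    return "\n".join(lines)


note = render_five_whys(
    "Email service experiencing intermittent connection failures",
    [
        ("Why are connection failures occurring?",
         "The email server is running out of available connections."),
        ("Why is the server running out of connections?",
         "Connection pool is exhausted due to connections not being released."),
    ],
)
print(note)
```

The returned string can be passed as the `work_notes` value of SN-Add-Work-Notes.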
Step 2.3: Fishbone (Ishikawa) Analysis
Tool: SN-Add-Work-Notes
Parameters:
sys_id: [problem_sys_id]
work_notes: |
=== FISHBONE ANALYSIS ===
Problem: [Problem statement]
PEOPLE:
- [Factor 1]
- [Factor 2]
PROCESS:
- [Factor 1]
- [Factor 2]
TECHNOLOGY:
- [Factor 1]
- [Factor 2]
ENVIRONMENT:
- [Factor 1]
- [Factor 2]
DATA:
- [Factor 1]
- [Factor 2]
EXTERNAL:
- [Factor 1]
- [Factor 2]
PRIMARY ROOT CAUSES:
1. [Root cause from analysis]
2. [Contributing root cause]
Step 2.4: Update Problem with Root Cause
Using MCP:
Tool: SN-Update-Record
Parameters:
table_name: problem
sys_id: [problem_sys_id]
data:
state: 103 # Root Cause Analysis
root_cause: |
ROOT CAUSE IDENTIFIED:
Primary Root Cause:
[Detailed description of the root cause]
Contributing Factors:
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]
Evidence:
- [Log entry/data supporting conclusion]
- [Test result]
- [Other evidence]
Analysis Method: [5-Whys/Fishbone/Fault Tree/Other]
Analysis Date: [date]
Analyst: [name]
Phase 3: Known Error Documentation
Step 3.1: Create Known Error Record
Once the root cause is confirmed, document it as a known error:
Using MCP:
Tool: SN-Create-Record
Parameters:
table_name: known_error
data:
short_description: "[CI] - [Error description]"
description: |
KNOWN ERROR DESCRIPTION:
[Clear description of the error condition]
SYMPTOMS:
- [Symptom 1]
- [Symptom 2]
- [Symptom 3]
ROOT CAUSE:
[Root cause description]
AFFECTED SERVICES/CIS:
- [Service/CI 1]
- [Service/CI 2]
workaround: |
WORKAROUND INSTRUCTIONS:
When to use: [Condition when workaround applies]
Steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]
Expected Result: [What user should see]
Limitations:
- [Limitation 1]
- [Limitation 2]
Contact [team] if workaround does not resolve the issue.
problem: [problem_sys_id]
cmdb_ci: [ci_sys_id]
u_permanent_fix_planned: true
u_fix_date: [target date]
Using REST API:
POST /api/now/table/known_error
Content-Type: application/json
{
"short_description": "Email Server - Connection timeout during peak hours",
"description": "KNOWN ERROR DESCRIPTION:\nEmail connections may timeout...",
"workaround": "WORKAROUND INSTRUCTIONS:\n1. Close and reopen email client...",
"problem": "problem_sys_id",
"cmdb_ci": "ci_sys_id"
}
Step 3.2: Link Known Error to Incidents
Update related incidents:
Tool: SN-Update-Record
Parameters:
table_name: incident
sys_id: [incident_sys_id]
data:
problem_id: [problem_sys_id]
work_notes: "Linked to Known Error [KERR#] - Workaround available. See KB article [KB#]."
Batch update multiple incidents:
Tool: SN-Query-Table
Parameters:
table_name: incident
query: cmdb_ci=[ci_sys_id]^problem_idISEMPTY^stateIN1,2,3
fields: sys_id,number
limit: 50
Then for each:
Tool: SN-Update-Record
Parameters:
table_name: incident
sys_id: [each_incident_sys_id]
data:
problem_id: [problem_sys_id]
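The query-then-update loop above can be sketched as follows. The `update_record` callable stands in for whatever performs the actual update (an MCP tool call or a REST PATCH); it is an assumption of this sketch, as are the sample sys_ids. Failures are collected instead of aborting the batch.

```python
def link_incidents_to_problem(incidents, problem_sys_id, update_record):
    """Call update_record once per incident; return the numbers that failed."""
    failed = []
    for inc in incidents:
        try:
            update_record("incident", inc["sys_id"], {"problem_id": problem_sys_id})
        except Exception:  # keep going so one failure does not stop the batch
            failed.append(inc["number"])
    return failed


# Stand-in for the real update call, used here only to show the flow.
calls = []


def fake_update(table_name, sys_id, data):
    if sys_id == "bad":
        raise RuntimeError("simulated failure")
    calls.append((table_name, sys_id, data))


failed = link_incidents_to_problem(
    [
        {"sys_id": "a1", "number": "INC001234"},
        {"sys_id": "bad", "number": "INC001235"},
    ],
    "prb1",
    fake_update,
)
print(failed)
```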
Phase 4: Workaround Management
Step 4.1: Document Workaround
Detailed workaround documentation:
Tool: SN-Add-Work-Notes
Parameters:
sys_id: [problem_sys_id]
work_notes: |
=== WORKAROUND DOCUMENTED ===
WORKAROUND ID: WA-[number]
Effective Date: [date]
Author: [name]
APPLICABILITY:
- Applies to: [specific conditions]
- Does NOT apply to: [exclusions]
PREREQUISITES:
- [Prerequisite 1]
- [Prerequisite 2]
PROCEDURE:
1. [Detailed step 1]
Note: [Important note if applicable]
2. [Detailed step 2]
Expected Result: [What to expect]
3. [Detailed step 3]
VERIFICATION:
- [How to verify workaround worked]
KNOWN LIMITATIONS:
- [Limitation 1]
- [Limitation 2]
ROLLBACK PROCEDURE:
If workaround causes issues:
1. [Rollback step 1]
2. [Rollback step 2]
SUPPORT CONTACT:
If workaround fails, contact [team/person] at [contact info]
Step 4.2: Communicate Workaround to Service Desk
Tool: SN-Add-Problem-Comment
Parameters:
sys_id: [problem_sys_id]
comment: |
=== WORKAROUND AVAILABLE FOR SERVICE DESK ===
Problem: [PRB#]
Known Error: [KERR#]
QUICK REFERENCE FOR AGENTS:
Customer Reports: "[Common customer description]"
Solution: [Brief description of workaround]
Steps for Customer:
1. [Simple step 1]
2. [Simple step 2]
3. [Simple step 3]
Escalate if: [Condition for escalation]
Related KB: [KB article link]
Phase 5: Permanent Fix
Step 5.1: Plan Permanent Fix
Create change request for fix:
Tool: SN-Create-Record
Parameters:
table_name: change_request
data:
short_description: "Fix: [Problem description]"
description: |
CHANGE PURPOSE:
Implement permanent fix for problem [PRB#]
ROOT CAUSE ADDRESSED:
[Root cause from problem record]
PROPOSED SOLUTION:
[Technical description of fix]
EXPECTED OUTCOME:
- [Outcome 1]
- [Outcome 2]
TESTING PLAN:
- [Test 1]
- [Test 2]
ROLLBACK PLAN:
- [Rollback step 1]
- [Rollback step 2]
type: normal
priority: 2
assignment_group: [development_team]
u_related_problem: [problem_sys_id]
Step 5.2: Track Fix Progress
Tool: SN-Update-Record
Parameters:
table_name: problem
sys_id: [problem_sys_id]
data:
state: 104 # Fix in Progress
fix: |
PERMANENT FIX PLAN:
Solution: [Description of permanent fix]
Implementation Method:
- Change Request: [CHG#]
- Target Date: [date]
- Implementation Team: [team]
Technical Details:
[Detailed technical fix description]
Validation Criteria:
- [ ] [Criterion 1]
- [ ] [Criterion 2]
- [ ] [Criterion 3]
Phase 6: Problem Closure
Step 6.1: Verify Fix Effectiveness
Post-fix monitoring:
Tool: SN-Query-Table
Parameters:
table_name: incident
query: cmdb_ci=[ci_sys_id]^sys_created_on>=javascript:gs.daysAgoStart(14)^problem_id=[problem_sys_id]
fields: sys_id,number,short_description,sys_created_on
limit: 50
Document verification:
Tool: SN-Add-Work-Notes
Parameters:
sys_id: [problem_sys_id]
work_notes: |
=== FIX VERIFICATION ===
Verification Period: [date range]
METRICS:
- Incidents before fix: [count] per [period]
- Incidents after fix: [count] per [period]
- Reduction: [percentage]
MONITORING DATA:
- [Metric 1]: [before] → [after]
- [Metric 2]: [before] → [after]
USER FEEDBACK:
- [Feedback item 1]
- [Feedback item 2]
VERIFICATION STATUS: [Pass/Fail/Partial]
RECOMMENDATION: [Close problem/Continue monitoring/Additional action]
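Because the before and after periods usually differ in length, the Reduction figure is best computed on a per-day rate rather than on raw counts. A minimal Python sketch, with a hypothetical helper name and illustrative numbers:

```python
def incident_reduction(before_count, before_days, after_count, after_days):
    """Percentage drop in incidents per day between two periods."""
    if before_days <= 0 or after_days <= 0:
        raise ValueError("periods must be positive")
    before_rate = before_count / before_days
    after_rate = after_count / after_days
    if before_rate == 0:
        raise ValueError("no incidents before the fix; reduction is undefined")
    return round((before_rate - after_rate) / before_rate * 100, 1)


# 12 incidents in the 30 days before the fix, 1 incident in the 14 days after.
reduction = incident_reduction(before_count=12, before_days=30, after_count=1, after_days=14)
print(reduction)
```

A negative result means the incident rate went up after the fix, which would point to a Fail or Partial verification status.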
Step 6.2: Close Problem Record
Using MCP:
Tool: SN-Close-Problem
Parameters:
sys_id: [problem_sys_id]
close_code: Fix Applied
close_notes: |
PROBLEM CLOSURE SUMMARY:
Root Cause: [Summary]
Resolution: [What was done]
Implementation:
- Change: [CHG#]
- Date: [implementation date]
Effectiveness:
- Incident reduction: [percentage]
- Monitoring period: [dates]
- No recurrence confirmed
Documentation:
- Known Error: [KERR#]
- Knowledge Article: [KB#]
Lessons Learned:
- [Lesson 1]
- [Lesson 2]
Using REST API:
PATCH /api/now/table/problem/{sys_id}
Content-Type: application/json
{
"state": "107",
"close_code": "Fix Applied",
"close_notes": "PROBLEM CLOSURE SUMMARY:\n\nRoot Cause: Memory leak in email integration...",
"resolved_at": "2024-01-15 14:30:00",
"resolved_by": "admin"
}
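The `resolved_at` value above is a literal; in a script it would normally be generated. The sketch below formats a timezone-aware datetime in the `YYYY-MM-DD HH:MM:SS` shape used in the example. It assumes the instance interprets Table API date-time input as UTC, which is worth confirming for your instance; `servicenow_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def servicenow_timestamp(moment):
    """Format an aware datetime as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# 16:30 at UTC+2 is 14:30 UTC.
local = datetime(2024, 1, 15, 16, 30, tzinfo=timezone(timedelta(hours=2)))
payload = {
    "state": "107",
    "close_code": "Fix Applied",
    "resolved_at": servicenow_timestamp(local),
}
print(payload["resolved_at"])
```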
Problem States Reference
┌────────────┐     ┌────────────────┐     ┌─────────────┐
│    New     │────►│  Root Cause    │────►│   Fix in    │
│   (101)    │     │ Analysis (103) │     │  Progress   │
└────────────┘     └────────────────┘     │    (104)    │
                          │               └──────┬──────┘
                          ▼                      │
                  ┌───────────────┐              │
                  │  Known Error  │              │
                  │     (102)     │◄─────────────┘
                  └───────┬───────┘
                          │
                          ▼
          ┌───────────────────────────────┐
          │           Resolved            │
          │             (106)             │
          └───────────────┬───────────────┘
                          │
                          ▼
          ┌───────────────────────────────┐
          │            Closed             │
          │             (107)             │
          └───────────────────────────────┘
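Before sending a state update, a script can check that the move follows the diagram. The Python sketch below encodes the transitions as read from the diagram above; the numeric values and allowed moves may differ on a customized instance, so treat the table as an assumption to verify.

```python
STATES = {
    101: "New",
    102: "Known Error",
    103: "Root Cause Analysis",
    104: "Fix in Progress",
    106: "Resolved",
    107: "Closed",
}

# Allowed moves, as read from the diagram (an assumption of this sketch).
TRANSITIONS = {
    101: {103},
    103: {102, 104},
    104: {102},
    102: {106},
    106: {107},
    107: set(),
}


def can_transition(current, target):
    """Return True if the diagram allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


print(can_transition(101, 103), can_transition(101, 107))
```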
Tool Usage Summary
MCP Tools
| Tool | Purpose | Phase |
|---|---|---|
| SN-List-Problems | List existing problems | 1 |
| SN-Query-Table | Find incident patterns, verify fix | 1, 6 |
| SN-Create-Record | Create problem, known error, tasks | 1, 2, 3 |
| SN-Update-Record | Update problem status, root cause, fix | 2, 5 |
| SN-Add-Work-Notes | Document analysis and findings | All |
| SN-Add-Problem-Comment | Customer/Service Desk communication | 4 |
| SN-Close-Problem | Close resolved problem | 6 |
REST API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /api/now/table/problem | GET | List problems |
| /api/now/table/problem | POST | Create problem |
| /api/now/table/problem/{sys_id} | PATCH | Update problem |
| /api/now/table/problem_task | POST | Create investigation task |
| /api/now/table/known_error | POST | Create known error |
| /api/now/table/incident | GET | Query related incidents |
Best Practices
- Data-Driven: Use incident data to identify problems, not assumptions
- Structured Analysis: Always use formal RCA techniques (5-Whys, Fishbone)
- Document Everything: Future investigators benefit from detailed notes
- Workaround First: Get temporary relief while working on permanent fix
- Verify Effectiveness: Don't close problems without confirming fix works
- Knowledge Management: Convert findings into KB articles
- ITIL Alignment: Problem Management aims to reduce incident volume and impact
Troubleshooting
"No related incidents found"
Cause: Query criteria too restrictive or wrong CI
Solution: Broaden date range; verify CI sys_id; try keyword search
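One way to broaden the search is to drop the CI filter, widen the date window, and match on keywords in the short description. The Python sketch below assembles such an encoded query; `broaden_query` is a hypothetical helper, and keywords are combined with `^` (AND), so fewer keywords give wider results.

```python
def broaden_query(ci_sys_id=None, keywords=(), days=90):
    """Build an encoded query; omit the CI or add keywords to change the scope."""
    parts = []
    if ci_sys_id:
        parts.append(f"cmdb_ci={ci_sys_id}")
    parts.extend(f"short_descriptionLIKE{kw}" for kw in keywords)
    parts.append(f"sys_created_on>=javascript:gs.daysAgoStart({days})")
    return "^".join(parts)


query = broaden_query(keywords=("email", "timeout"), days=180)
print(query)
```

The result can be used as the `query` parameter of SN-Query-Table.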
"Root cause unclear after analysis"
Cause: Insufficient data or multiple contributing factors
Solution: Gather more data; involve additional SMEs; consider environmental factors
"Workaround not effective"
Cause: Workaround doesn't address all scenarios
Solution: Refine workaround; document limitations; create separate workaround for other scenarios
"Problem keeps reopening"
Cause: Root cause not fully addressed, or a new variation of the same issue
Solution: Review whether it is truly the same problem; a variation may need its own problem record
RCA Templates
5-Whys Template
Problem: [Statement]
Why 1: [Question]
Because: [Answer]
Why 2: [Question based on Why 1 answer]
Because: [Answer]
Why 3: [Question based on Why 2 answer]
Because: [Answer]
Why 4: [Question based on Why 3 answer]
Because: [Answer]
Why 5: [Question based on Why 4 answer]
Because: [Answer - typically root cause]
Root Cause: [Summary]
Fishbone Template
Problem: _________________________________
Categories:
PEOPLE          PROCESS         TECHNOLOGY      ENVIRONMENT
  |               |               |               |
  +-- [cause]     +-- [cause]     +-- [cause]     +-- [cause]
  +-- [cause]     +-- [cause]     +-- [cause]     +-- [cause]
Root Causes Identified:
1. [Primary root cause]
2. [Secondary root cause]
Related Skills
- itsm/incident-lifecycle - Incident management
- itsm/incident-triage - Incident triage
- itsm/major-incident - Major incident handling
- itsm/change-management - Change for permanent fixes
- admin/knowledge-management - Converting to KB articles