problem-analysis

star 30

Root cause analysis and problem management including known error documentation, workaround management, and permanent fix tracking

Happy-Technologies-LLC By Happy-Technologies-LLC schedule Updated 2/6/2026

name: problem-analysis version: 1.0.0 description: Root cause analysis and problem management including known error documentation, workaround management, and permanent fix tracking author: Happy Technologies LLC tags: [itsm, problem, rca, root-cause, known-error, workaround, itil] platforms: [claude-code, claude-desktop, chatgpt, cursor, any] tools: mcp: - SN-List-Problems - SN-Add-Problem-Comment - SN-Close-Problem - SN-Query-Table - SN-Create-Record - SN-Update-Record - SN-Add-Work-Notes rest: - /api/now/table/problem - /api/now/table/problem_task - /api/now/table/incident - /api/now/table/known_error - /api/now/table/sys_journal_field native: - Bash complexity: advanced

estimated_time: 30-90 minutes

Problem Analysis and Root Cause Investigation

Overview

This skill provides a comprehensive framework for Problem Management in ServiceNow, focusing on identifying root causes, documenting known errors, and implementing permanent solutions.

Problem Management Goals:

  • Identify and remove root causes of incidents
  • Minimize the impact of incidents that cannot be prevented
  • Proactively identify potential issues before they cause incidents
  • Document known errors and workarounds

When to use this skill:

  • Multiple incidents with same root cause
  • Post major incident review
  • Recurring service degradation
  • Proactive trend analysis

Prerequisites

  • Roles: problem_manager, problem_admin, or itil
  • Access: Read/write to problem, incident, known_error tables
  • Knowledge: Root cause analysis techniques (5 Whys, Fishbone)
  • Related Skills: itsm/incident-lifecycle, itsm/major-incident

Procedure

Phase 1: Problem Identification

Step 1.1: Identify Problem Candidates from Incidents

Find incidents with similar patterns:

Using MCP:

Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: active=false^stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)
  fields: sys_id,number,short_description,category,cmdb_ci,resolution_notes,close_code
  limit: 100

Using REST API:

GET /api/now/table/incident?sysparm_query=active=false^stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)&sysparm_fields=sys_id,number,short_description,category,cmdb_ci,resolution_notes,close_code&sysparm_limit=100

Find repeat incidents by CI:

Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: cmdb_ci=[ci_sys_id]^sys_created_on>=javascript:gs.daysAgoStart(90)
  fields: sys_id,number,short_description,resolution_notes,close_code
  order_by: sys_created_on

Step 1.2: Analyze Incident Patterns

Grouping Criteria:

  • Same Configuration Item (CI)
  • Same Category/Subcategory
  • Similar short descriptions (keyword match)
  • Same assignment group
  • Same resolution type

Pattern Analysis Work Notes:

Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [analysis_task_sys_id]
  work_notes: |
    === INCIDENT PATTERN ANALYSIS ===
    Analysis Period: [date range]

    PATTERN IDENTIFIED:
    - Total Related Incidents: [count]
    - Affected CI: [CI name]
    - Common Category: [category]
    - Keyword Pattern: [keywords]

    INCIDENT LIST:
    - INC001234 - [date] - [short description]
    - INC001235 - [date] - [short description]
    - INC001240 - [date] - [short description]

    FREQUENCY:
    - First occurrence: [date]
    - Last occurrence: [date]
    - Average frequency: [X per week/month]

    RECOMMENDATION:
    Create problem record for root cause investigation.

Step 1.3: Create Problem Record

Using MCP:

Tool: SN-Create-Record
Parameters:
  table_name: problem
  data:
    short_description: "[CI/Service] - [Brief description of recurring issue]"
    description: |
      PROBLEM STATEMENT:
      [Clear description of the problem being investigated]

      RELATED INCIDENTS:
      - INC001234 - [date] - [description]
      - INC001235 - [date] - [description]
      - INC001240 - [date] - [description]

      BUSINESS IMPACT:
      - Number of incidents: [count]
      - Total downtime: [hours]
      - Users affected: [count]
      - Business cost: [estimate]

      INITIAL HYPOTHESIS:
      [Initial theory about root cause]
    priority: 2
    impact: 2
    urgency: 2
    category: [category]
    cmdb_ci: [ci_sys_id]
    assignment_group: [team_sys_id]

Using REST API:

POST /api/now/table/problem
Content-Type: application/json

{
  "short_description": "Email Server - Intermittent connection failures",
  "description": "PROBLEM STATEMENT:\nUsers experiencing intermittent email connectivity...",
  "priority": "2",
  "impact": "2",
  "urgency": "2",
  "category": "software",
  "cmdb_ci": "ci_sys_id",
  "assignment_group": "group_sys_id"
}

Phase 2: Root Cause Investigation

Step 2.1: Investigation Task Creation

Create investigation tasks for different areas:

Tool: SN-Create-Record
Parameters:
  table_name: problem_task
  data:
    parent: [problem_sys_id]
    short_description: "RCA - [Area] Investigation"
    description: |
      Investigate [specific area] as potential root cause.

      INVESTIGATION SCOPE:
      - [Item 1 to investigate]
      - [Item 2 to investigate]
      - [Item 3 to investigate]

      EXPECTED DELIVERABLES:
      - Findings documented in work notes
      - Evidence collected (logs, screenshots)
      - Recommendation for next steps
    assignment_group: [specialist_team]
    priority: 2

Step 2.2: 5-Whys Analysis

Document in problem work notes:

Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === 5-WHYS ROOT CAUSE ANALYSIS ===
    Analyst: [name]
    Date: [date]

    PROBLEM STATEMENT:
    Email service experiencing intermittent connection failures

    WHY 1: Why are connection failures occurring?
    Answer: The email server is running out of available connections.

    WHY 2: Why is the server running out of connections?
    Answer: Connection pool is exhausted due to connections not being released.

    WHY 3: Why are connections not being released?
    Answer: A memory leak in the email client integration is holding connections.

    WHY 4: Why is there a memory leak?
    Answer: The integration code doesn't properly handle error conditions.

    WHY 5: Why doesn't the code handle error conditions?
    Answer: Code review process didn't catch the missing error handling.

    ROOT CAUSE:
    Missing error handling in email client integration code, combined with
    insufficient code review process for integration components.

    CONTRIBUTING FACTORS:
    - No connection timeout configured
    - Monitoring didn't alert on connection pool
    - Documentation gap on error handling standards

Step 2.3: Fishbone (Ishikawa) Analysis

Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === FISHBONE ANALYSIS ===
    Problem: [Problem statement]

    PEOPLE:
    - [Factor 1]
    - [Factor 2]

    PROCESS:
    - [Factor 1]
    - [Factor 2]

    TECHNOLOGY:
    - [Factor 1]
    - [Factor 2]

    ENVIRONMENT:
    - [Factor 1]
    - [Factor 2]

    DATA:
    - [Factor 1]
    - [Factor 2]

    EXTERNAL:
    - [Factor 1]
    - [Factor 2]

    PRIMARY ROOT CAUSES:
    1. [Root cause from analysis]
    2. [Contributing root cause]

Step 2.4: Update Problem with Root Cause

Using MCP:

Tool: SN-Update-Record
Parameters:
  table_name: problem
  sys_id: [problem_sys_id]
  data:
    state: 103  # Root Cause Analysis
    root_cause: |
      ROOT CAUSE IDENTIFIED:

      Primary Root Cause:
      [Detailed description of the root cause]

      Contributing Factors:
      1. [Factor 1]
      2. [Factor 2]
      3. [Factor 3]

      Evidence:
      - [Log entry/data supporting conclusion]
      - [Test result]
      - [Other evidence]

      Analysis Method: [5-Whys/Fishbone/Fault Tree/Other]
      Analysis Date: [date]
      Analyst: [name]

Phase 3: Known Error Documentation

Step 3.1: Create Known Error Record

Once root cause is confirmed, document as known error:

Using MCP:

Tool: SN-Create-Record
Parameters:
  table_name: known_error
  data:
    short_description: "[CI] - [Error description]"
    description: |
      KNOWN ERROR DESCRIPTION:
      [Clear description of the error condition]

      SYMPTOMS:
      - [Symptom 1]
      - [Symptom 2]
      - [Symptom 3]

      ROOT CAUSE:
      [Root cause description]

      AFFECTED SERVICES/CIS:
      - [Service/CI 1]
      - [Service/CI 2]
    workaround: |
      WORKAROUND INSTRUCTIONS:

      When to use: [Condition when workaround applies]

      Steps:
      1. [Step 1]
      2. [Step 2]
      3. [Step 3]

      Expected Result: [What user should see]

      Limitations:
      - [Limitation 1]
      - [Limitation 2]

      Contact [team] if workaround does not resolve the issue.
    problem: [problem_sys_id]
    cmdb_ci: [ci_sys_id]
    u_permanent_fix_planned: true
    u_fix_date: [target date]

Using REST API:

POST /api/now/table/known_error
Content-Type: application/json

{
  "short_description": "Email Server - Connection timeout during peak hours",
  "description": "KNOWN ERROR DESCRIPTION:\nEmail connections may timeout...",
  "workaround": "WORKAROUND INSTRUCTIONS:\n1. Close and reopen email client...",
  "problem": "problem_sys_id",
  "cmdb_ci": "ci_sys_id"
}

Step 3.2: Link Known Error to Incidents

Update related incidents:

Tool: SN-Update-Record
Parameters:
  table_name: incident
  sys_id: [incident_sys_id]
  data:
    problem_id: [problem_sys_id]
    work_notes: "Linked to Known Error [KERR#] - Workaround available. See KB article [KB#]."

Batch update multiple incidents:

Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: cmdb_ci=[ci_sys_id]^problem_idISEMPTY^stateIN1,2,3
  fields: sys_id,number
  limit: 50

Then for each:

Tool: SN-Update-Record
Parameters:
  table_name: incident
  sys_id: [each_incident_sys_id]
  data:
    problem_id: [problem_sys_id]

Phase 4: Workaround Management

Step 4.1: Document Workaround

Detailed workaround documentation:

Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === WORKAROUND DOCUMENTED ===

    WORKAROUND ID: WA-[number]
    Effective Date: [date]
    Author: [name]

    APPLICABILITY:
    - Applies to: [specific conditions]
    - Does NOT apply to: [exclusions]

    PREREQUISITES:
    - [Prerequisite 1]
    - [Prerequisite 2]

    PROCEDURE:
    1. [Detailed step 1]
       Note: [Important note if applicable]

    2. [Detailed step 2]
       Expected Result: [What to expect]

    3. [Detailed step 3]

    VERIFICATION:
    - [How to verify workaround worked]

    KNOWN LIMITATIONS:
    - [Limitation 1]
    - [Limitation 2]

    ROLLBACK PROCEDURE:
    If workaround causes issues:
    1. [Rollback step 1]
    2. [Rollback step 2]

    SUPPORT CONTACT:
    If workaround fails, contact [team/person] at [contact info]

Step 4.2: Communicate Workaround to Service Desk

Tool: SN-Add-Problem-Comment
Parameters:
  sys_id: [problem_sys_id]
  comment: |
    === WORKAROUND AVAILABLE FOR SERVICE DESK ===

    Problem: [PRB#]
    Known Error: [KERR#]

    QUICK REFERENCE FOR AGENTS:

    Customer Reports: "[Common customer description]"

    Solution: [Brief description of workaround]

    Steps for Customer:
    1. [Simple step 1]
    2. [Simple step 2]
    3. [Simple step 3]

    Escalate if: [Condition for escalation]

    Related KB: [KB article link]

Phase 5: Permanent Fix

Step 5.1: Plan Permanent Fix

Create change request for fix:

Tool: SN-Create-Record
Parameters:
  table_name: change_request
  data:
    short_description: "Fix: [Problem description]"
    description: |
      CHANGE PURPOSE:
      Implement permanent fix for problem [PRB#]

      ROOT CAUSE ADDRESSED:
      [Root cause from problem record]

      PROPOSED SOLUTION:
      [Technical description of fix]

      EXPECTED OUTCOME:
      - [Outcome 1]
      - [Outcome 2]

      TESTING PLAN:
      - [Test 1]
      - [Test 2]

      ROLLBACK PLAN:
      - [Rollback step 1]
      - [Rollback step 2]
    type: normal
    priority: 2
    assignment_group: [development_team]
    u_related_problem: [problem_sys_id]

Step 5.2: Track Fix Progress

Tool: SN-Update-Record
Parameters:
  table_name: problem
  sys_id: [problem_sys_id]
  data:
    state: 104  # Fix in Progress
    fix: |
      PERMANENT FIX PLAN:

      Solution: [Description of permanent fix]

      Implementation Method:
      - Change Request: [CHG#]
      - Target Date: [date]
      - Implementation Team: [team]

      Technical Details:
      [Detailed technical fix description]

      Validation Criteria:
      - [ ] [Criterion 1]
      - [ ] [Criterion 2]
      - [ ] [Criterion 3]

Phase 6: Problem Closure

Step 6.1: Verify Fix Effectiveness

Post-fix monitoring:

Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: cmdb_ci=[ci_sys_id]^sys_created_on>=javascript:gs.daysAgoStart(14)^problem_id=[problem_sys_id]
  fields: sys_id,number,short_description,sys_created_on
  limit: 50

Document verification:

Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === FIX VERIFICATION ===

    Verification Period: [date range]

    METRICS:
    - Incidents before fix: [count] per [period]
    - Incidents after fix: [count] per [period]
    - Reduction: [percentage]

    MONITORING DATA:
    - [Metric 1]: [before] → [after]
    - [Metric 2]: [before] → [after]

    USER FEEDBACK:
    - [Feedback item 1]
    - [Feedback item 2]

    VERIFICATION STATUS: [Pass/Fail/Partial]

    RECOMMENDATION: [Close problem/Continue monitoring/Additional action]

Step 6.2: Close Problem Record

Using MCP:

Tool: SN-Close-Problem
Parameters:
  sys_id: [problem_sys_id]
  close_code: Fix Applied
  close_notes: |
    PROBLEM CLOSURE SUMMARY:

    Root Cause: [Summary]

    Resolution: [What was done]

    Implementation:
    - Change: [CHG#]
    - Date: [implementation date]

    Effectiveness:
    - Incident reduction: [percentage]
    - Monitoring period: [dates]
    - No recurrence confirmed

    Documentation:
    - Known Error: [KERR#]
    - Knowledge Article: [KB#]

    Lessons Learned:
    - [Lesson 1]
    - [Lesson 2]

Using REST API:

PATCH /api/now/table/problem/{sys_id}
Content-Type: application/json

{
  "state": "107",
  "close_code": "Fix Applied",
  "close_notes": "PROBLEM CLOSURE SUMMARY:\n\nRoot Cause: Memory leak in email integration...",
  "resolved_at": "2024-01-15 14:30:00",
  "resolved_by": "admin"
}

Problem States Reference

┌────────────┐     ┌────────────────┐     ┌─────────────┐
│    New     │────►│ Root Cause     │────►│ Fix in      │
│   (101)    │     │ Analysis (103) │     │ Progress    │
└────────────┘     └────────────────┘     │   (104)     │
                           │               └──────┬──────┘
                           ▼                      │
                   ┌───────────────┐              │
                   │ Known Error   │              │
                   │   (102)       │◄─────────────┘
                   └───────┬───────┘
                           │
                           ▼
           ┌───────────────────────────────┐
           │          Resolved             │
           │           (106)               │
           └───────────────┬───────────────┘
                           │
                           ▼
           ┌───────────────────────────────┐
           │           Closed              │
           │           (107)               │
           └───────────────────────────────┘

Tool Usage Summary

MCP Tools

Tool Purpose Phase
SN-List-Problems List existing problems 1
SN-Query-Table Find incident patterns, verify fix 1, 6
SN-Create-Record Create problem, known error, tasks 1, 2, 3
SN-Update-Record Update problem status, root cause, fix 2, 5
SN-Add-Work-Notes Document analysis and findings All
SN-Add-Problem-Comment Customer/Service Desk communication 4
SN-Close-Problem Close resolved problem 6

REST API Endpoints

Endpoint Method Purpose
/api/now/table/problem GET List problems
/api/now/table/problem POST Create problem
/api/now/table/problem/{sys_id} PATCH Update problem
/api/now/table/problem_task POST Create investigation task
/api/now/table/known_error POST Create known error
/api/now/table/incident GET Query related incidents

Best Practices

  • Data-Driven: Use incident data to identify problems, not assumptions
  • Structured Analysis: Always use formal RCA techniques (5-Whys, Fishbone)
  • Document Everything: Future investigators benefit from detailed notes
  • Workaround First: Get temporary relief while working on permanent fix
  • Verify Effectiveness: Don't close problems without confirming fix works
  • Knowledge Management: Convert findings into KB articles
  • ITIL Alignment: Problem Management aims to reduce incident volume and impact

Troubleshooting

"No related incidents found"

Cause: Query criteria too restrictive or wrong CI Solution: Broaden date range; verify CI sys_id; try keyword search

"Root cause unclear after analysis"

Cause: Insufficient data or multiple contributing factors Solution: Gather more data; involve additional SMEs; consider environmental factors

"Workaround not effective"

Cause: Workaround doesn't address all scenarios Solution: Refine workaround; document limitations; create separate workaround for other scenarios

"Problem keeps reopening"

Cause: Root cause not fully addressed; new variation of same issue Solution: Review if truly same problem; may need new problem record for variation

RCA Templates

5-Whys Template

Problem: [Statement]

Why 1: [Question]
Because: [Answer]

Why 2: [Question based on Why 1 answer]
Because: [Answer]

Why 3: [Question based on Why 2 answer]
Because: [Answer]

Why 4: [Question based on Why 3 answer]
Because: [Answer]

Why 5: [Question based on Why 4 answer]
Because: [Answer - typically root cause]

Root Cause: [Summary]

Fishbone Template

Problem: _________________________________

Categories:
PEOPLE    PROCESS    TECHNOLOGY    ENVIRONMENT
  |          |           |              |
  +-- [cause] +-- [cause] +-- [cause]   +-- [cause]
  +-- [cause] +-- [cause] +-- [cause]   +-- [cause]

Root Causes Identified:
1. [Primary root cause]
2. [Secondary root cause]

Related Skills

  • itsm/incident-lifecycle - Incident management
  • itsm/incident-triage - Incident triage
  • itsm/major-incident - Major incident handling
  • itsm/change-management - Change for permanent fixes
  • admin/knowledge-management - Converting to KB articles

References

Install via CLI
npx skills add https://github.com/Happy-Technologies-LLC/happy-platform-skills --skill problem-analysis
Repository Details
star Stars 30
call_split Forks 11
navigation Branch main
article Path SKILL.md
More from Creator
Happy-Technologies-LLC
Happy-Technologies-LLC Explore all skills →