hierarchical-clustering-plot

Use when building a sample-level hierarchical clustering dendrogram from a bulk expression matrix and sample annotation table, especially for QC, batch inspection, or sample similarity assessment. Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity. NOT for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.

By aipoch | Updated 5/28/2026

Hierarchical Clustering Plot

When to Use

Use this skill when you need a sample-level hierarchical clustering dendrogram from a bulk expression matrix and a sample annotation table.

  • Good fits: sample QC, batch inspection, sample similarity assessment, checking whether annotated sample groups cluster as expected.
  • Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity.
  • Not for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.

When to Read External Files

| Situation | File to Read | Purpose |
|---|---|---|
| Need algorithm details | references/algorithm.md | Distance calculation, linkage rules, and clustering assumptions |
| Need to run analysis or inspect CLI entrypoint behavior | scripts/main.R | Execute the workflow and inspect argument parsing, defaults, required flags, and sourced modules |
| Need workflow implementation details | scripts/run_analysis.R | See orchestration order, temp workspace handling, and output generation |
| Need logging or warning behavior | scripts/logging_utils.R | See standardized console log formatting and memory usage messages |
| Need file or parameter validation details | scripts/validation_utils.R | See path checks, output-directory checks, and scalar validation |
| Need timeout, temp workspace, or session info behavior | scripts/runtime_utils.R | See timeout control, temp cleanup, output copying, and session-info export |
| Need expression/group input handling | scripts/input_functions.R | See CSV loading, sample matching, and label extraction |
| Need clustering logic | scripts/clustering_functions.R | See distance calculation and hclust() generation |
| Need output-writing logic | scripts/output_utils.R | See CSV export and PDF rendering |
| Encounter errors, warnings, or unexpected clustering patterns | references/troubleshooting.md | Common failures, warning follow-up, and interpretation guidance |
| Need CLI examples or common parameter combinations | references/cli-guide.md | Detailed command patterns for standard, variant, and test runs |
| Need example input files or schema-concrete fixtures | tests/data/ | Inspect sample CSV layouts for expression and group inputs |
| Need expected output names or artifact formats | Output Files section of this document and references/cli-guide.md | Confirm the files the workflow writes and inspect documented example previews |
| Need to run regression tests | tests/run_tests.R | Execute the automated test suite |
| Need exact test assertions or edge cases | tests/testthat/test-clustering.R | Inspect validation, reproducibility, and output checks |

Usage

Rscript scripts/main.R \
  --input_file ./expression_matrix.csv \
  --group_file ./sample_groups.csv \
  --output_dir ./output/ \
  --distance_method euclidean \
  --linkage_method complete \
  --label_column batch \
  --timeout_seconds 300 \
  --seed 42

Arguments

| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -i | --input_file | character | required | Expression matrix file (features as rows, samples as columns) |
| -g | --group_file | character | required | Sample annotation file (first column sample ID, one metadata column for labels) |
| -o | --output_dir | character | ./output/ | Output directory |
| -d | --distance_method | character | euclidean | Distance metric for dist(): euclidean, maximum, manhattan, canberra, binary, minkowski |
| -m | --linkage_method | character | complete | Linkage method for hclust(): complete, single, average, mcquitty, median, centroid, ward.D, ward.D2 |
| -l | --label_column | character | second column | Column used as dendrogram labels |
| -c | --label_cex | numeric | 0.8 | Dendrogram label size, must be > 0 |
| -t | --timeout_seconds | integer | 300 | Elapsed time limit in seconds, must be > 0 |
| -s | --seed | integer | 42 | Random seed for reproducibility |
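
The allowed values and the "must be > 0" constraints listed for the arguments can be checked before any data is read. The following is a minimal sketch in base R, not the code in scripts/validation_utils.R: the helper names `validate_choice()` and `validate_positive()` and the message format are assumptions made for illustration.

```r
# Allowed values as documented for --distance_method and --linkage_method.
distance_methods <- c("euclidean", "maximum", "manhattan",
                      "canberra", "binary", "minkowski")
linkage_methods <- c("complete", "single", "average", "mcquitty",
                     "median", "centroid", "ward.D", "ward.D2")

# Hypothetical helper: stop with a readable message on unsupported values.
validate_choice <- function(value, choices, arg_name) {
  if (length(value) != 1L || is.na(value) || !(value %in% choices)) {
    stop(sprintf("SKILL_INVALID_PARAMETER: --%s must be one of: %s",
                 arg_name, paste(choices, collapse = ", ")),
         call. = FALSE)
  }
  invisible(value)
}

# Hypothetical helper for the numeric arguments that must be > 0.
validate_positive <- function(value, arg_name) {
  if (!is.numeric(value) || length(value) != 1L || is.na(value) || value <= 0) {
    stop(sprintf("SKILL_INVALID_PARAMETER: --%s must be > 0", arg_name),
         call. = FALSE)
  }
  invisible(value)
}

validate_choice("euclidean", distance_methods, "distance_method")
validate_choice("ward.D2", linkage_methods, "linkage_method")
validate_positive(0.8, "label_cex")
validate_positive(300, "timeout_seconds")
```

Checking against an explicit vector keeps the error message in sync with the documented choices, which is easier to debug than an error raised later inside dist() or hclust().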

Input Format

Expression Matrix (input_file)

Features as rows, samples as columns, CSV format with feature IDs in the first column.

,Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793

Requirements:

  • The first column contains unique feature IDs.
  • All sample columns must be numeric.
  • Sample column names must be unique and non-empty.
  • At least two samples must match between the expression matrix and the annotation file.
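
The requirements above can be expressed as a handful of base R checks. This is a sketch under stated assumptions, not the loader in scripts/input_functions.R: `check_expression()` is a hypothetical helper, the error messages are illustrative, and the sample-count check here only counts matrix columns because matching against the annotation file happens later.

```r
# Inline copy of the example matrix; in practice this would be --input_file.
csv_text <- ",Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793"

raw <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: validate the parsed table and return a numeric matrix.
check_expression <- function(raw) {
  feature_ids <- as.character(raw[[1]])
  expr <- raw[, -1, drop = FALSE]
  sample_ids <- colnames(expr)

  if (anyDuplicated(feature_ids) > 0) {
    stop("feature IDs must be unique")
  }
  if (any(is.na(sample_ids) | sample_ids == "") || anyDuplicated(sample_ids) > 0) {
    stop("sample column names must be unique and non-empty")
  }
  if (!all(vapply(expr, is.numeric, logical(1)))) {
    stop("all sample columns must be numeric")
  }
  if (ncol(expr) < 2) {
    stop("at least two samples are required")
  }

  mat <- as.matrix(expr)
  rownames(mat) <- feature_ids
  mat
}

expr_mat <- check_expression(raw)
dim(expr_mat)  # features x samples
```

Reading with `check.names = FALSE` keeps sample IDs exactly as written in the header, which matters because the annotation file must match them exactly.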

Sample Annotation (group_file)

CSV with sample IDs in the first column. Leaf labels come from the second column unless --label_column names a different one.

sample,batch
Sample01,batch1
Sample02,batch2
Sample03,batch1

Requirements:

  • Sample IDs must match expression matrix column names exactly.
  • The selected label column must exist and contain no empty values.
  • The file must contain at least one metadata column in addition to sample IDs.
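
Label selection follows directly from these rules. The sketch below is one possible shape for it, assuming a hypothetical helper `select_labels()`; the actual label extraction in scripts/input_functions.R may be organized differently.

```r
groups <- read.csv(
  text = "sample,batch\nSample01,batch1\nSample02,batch2\nSample03,batch1",
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the label column, defaulting to the second column,
# and return labels named by sample ID.
select_labels <- function(groups, label_column = NULL) {
  if (ncol(groups) < 2) {
    stop("need at least one metadata column besides sample IDs")
  }
  if (is.null(label_column)) {
    label_column <- colnames(groups)[2]
  }
  if (!(label_column %in% colnames(groups))) {
    stop(sprintf("label column '%s' not found", label_column))
  }
  labels <- as.character(groups[[label_column]])
  if (any(is.na(labels) | trimws(labels) == "")) {
    stop("label column contains empty values")
  }
  stats::setNames(labels, as.character(groups[[1]]))
}

select_labels(groups)            # labels taken from 'batch'
select_labels(groups, "sample")  # sample IDs used as labels
```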

Output Files

| File | Description |
|---|---|
| hierarchical_clustering_plot.pdf | Sample dendrogram plot |
| sample_distance_matrix.csv | Pairwise sample distance matrix |
| clustering_order.csv | Leaf order shown in the dendrogram |
| matched_samples.csv | Sample-to-label table used for plotting |
| session_info.txt | R session and package version info |

Workflow

Step 1: Validate Input

WHEN checking file or parameter validation, READ: scripts/validation_utils.R

WHEN checking expression/group CSV handling, READ: scripts/input_functions.R

  • Check file existence
  • Reject empty files before parsing
  • Read the expression matrix and sample annotation CSV files
  • Validate required columns, unique IDs, and numeric expression values

Step 2: Align Samples

WHEN checking sample matching logic, READ: scripts/input_functions.R

  • Match sample IDs between the annotation file and expression matrix
  • Reorder matrix columns to the annotation file order
  • Select the label column used for plotting
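
Matching and reordering can be sketched with `match()`. The toy data below is made up, and treating any annotation ID that is missing from the matrix as an error is an assumption of this sketch; see scripts/input_functions.R for how the workflow actually handles unmatched samples.

```r
# Matrix columns deliberately in a different order than the annotation rows.
expr <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("Sample03", "Sample01", "Sample02")))
groups <- data.frame(sample = c("Sample01", "Sample02", "Sample03"),
                     batch  = c("batch1", "batch2", "batch1"),
                     stringsAsFactors = FALSE)

# match() gives the matrix column index for each annotation row,
# or NA when an annotated sample is absent from the matrix.
idx <- match(groups$sample, colnames(expr))
if (anyNA(idx)) {
  stop("SKILL_SAMPLE_MISMATCH: ",
       paste(groups$sample[is.na(idx)], collapse = ", "))
}

# Reorder matrix columns to the annotation file order.
aligned <- expr[, idx, drop = FALSE]
colnames(aligned)
```

After this step, column i of the matrix and row i of the annotation table describe the same sample, so labels can be attached by position.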

Step 3: Build Hierarchical Clustering

WHEN interpreting distance or linkage behavior, READ: references/algorithm.md

WHEN checking clustering implementation, READ: scripts/clustering_functions.R

  • Transpose the expression matrix to sample-by-feature form
  • Compute pairwise sample distances with dist()
  • Build the dendrogram with hclust()
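
The three bullets above map onto three base R calls. A minimal sketch on simulated data (the matrix values and sample names are made up; the method names are the documented defaults):

```r
set.seed(42)  # only fixes the simulated values below
expr <- matrix(rnorm(5 * 4), nrow = 5,
               dimnames = list(paste0("gene", 1:5),
                               paste0("Sample0", 1:4)))

sample_by_feature <- t(expr)                         # rows are now samples
d  <- dist(sample_by_feature, method = "euclidean")  # pairwise sample distances
hc <- hclust(d, method = "complete")                 # agglomerative clustering

# Leaf order from left to right in the dendrogram.
hc$labels[hc$order]
```

The transpose matters because `dist()` computes distances between rows; without it the result would cluster features instead of samples.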

Step 4: Save Outputs

WHEN checking output staging and cleanup behavior, READ: scripts/run_analysis.R

WHEN checking PDF/CSV export behavior, READ: scripts/output_utils.R

WHEN checking timeout, session info, or final file copy behavior, READ: scripts/runtime_utils.R

  • Stage outputs in a temporary workspace
  • Export the pairwise distance matrix
  • Export the plotted leaf order
  • Render the dendrogram as PDF
  • Copy finalized outputs into the requested output directory
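
The stage-then-copy pattern described above can be sketched as follows. This is an illustration, not the code in scripts/output_utils.R or scripts/runtime_utils.R: `write_outputs()` is a hypothetical helper, the column layout of clustering_order.csv is assumed, and only three of the documented output files are written.

```r
# Hypothetical helper: write outputs to a temp staging directory, then copy
# them to output_dir so a failed run does not leave partial results behind.
write_outputs <- function(hc, d, labels, output_dir) {
  stage <- tempfile("hclust_stage_")
  dir.create(stage)
  on.exit(unlink(stage, recursive = TRUE), add = TRUE)  # always clean up staging

  write.csv(as.matrix(d), file.path(stage, "sample_distance_matrix.csv"))
  write.csv(data.frame(position = seq_along(hc$order),
                       sample = hc$labels[hc$order]),
            file.path(stage, "clustering_order.csv"), row.names = FALSE)

  pdf(file.path(stage, "hierarchical_clustering_plot.pdf"))
  tryCatch(
    plot(hc, labels = labels, cex = 0.8,
         main = "Sample clustering", xlab = "", sub = ""),
    finally = dev.off()  # close the device even if plotting fails
  )

  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  staged <- list.files(stage, full.names = TRUE)
  copied <- file.copy(staged, output_dir, overwrite = TRUE)
  if (!all(copied)) stop("SKILL_WRITE_ERROR: could not copy outputs")
  invisible(file.path(output_dir, basename(staged)))
}

# Toy input: three samples described by two made-up features.
m <- matrix(c(1, 2, 1.1, 2.1, 5, 6), ncol = 2, byrow = TRUE,
            dimnames = list(c("Sample01", "Sample02", "Sample03"), NULL))
d  <- dist(m)
hc <- hclust(d, method = "complete")
out_dir <- file.path(tempdir(), "hclust_output")
written <- write_outputs(hc, d, labels = c("batch1", "batch2", "batch1"), out_dir)
basename(written)
```

Registering the cleanup with `on.exit()` means the staging directory is removed whether the function returns normally or stops with an error.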

Methods

Distance Matrix

Sample distances are computed from the transposed expression matrix using base R dist().
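
For the default Euclidean metric, the distance between two samples is the square root of the summed squared differences across features. A small worked example with made-up values shows the arithmetic and the conversion to a full matrix, which is one natural form for a CSV export (an assumption; see scripts/output_utils.R for the actual layout):

```r
# Two features (rows) measured in three samples (columns); values are made up.
expr <- matrix(c(1, 2,   # Sample01
                 4, 6,   # Sample02
                 1, 2),  # Sample03 (identical to Sample01)
               nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("Sample01", "Sample02", "Sample03")))

d <- dist(t(expr), method = "euclidean")
dist_mat <- as.matrix(d)  # full symmetric matrix with a zero diagonal

# Sample01 vs Sample02: sqrt((1 - 4)^2 + (2 - 6)^2) = sqrt(9 + 16) = 5
dist_mat["Sample01", "Sample02"]

# Identical profiles are at distance 0.
dist_mat["Sample01", "Sample03"]
```

Because Euclidean distance is dominated by features with large values, it is worth checking how the expression matrix was scaled before interpreting the tree.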

Hierarchical Clustering

The clustering tree is built with base R hclust(). The default linkage method is complete, matching the source analysis script.
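
The linkage method decides how the distance between a merged cluster and the remaining samples is defined, which changes merge heights and sometimes the tree shape. A small illustration with three made-up samples on a single feature, so the pairwise distances are 1, 4, and 5:

```r
x <- matrix(c(0, 1, 5), ncol = 1,
            dimnames = list(c("Sample01", "Sample02", "Sample03"), "geneA"))
d <- dist(x)  # Sample01-Sample02 = 1, Sample02-Sample03 = 4, Sample01-Sample03 = 5

hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_single   <- hclust(d, method = "single")

# Sample01 and Sample02 merge first at height 1 in every case.
# The second merge joins {Sample01, Sample02} with Sample03:
hc_complete$height  # 1 5    (largest pairwise distance, max(5, 4))
hc_average$height   # 1 4.5  (mean pairwise distance, (5 + 4) / 2)
hc_single$height    # 1 4    (smallest pairwise distance, min(5, 4))

# Leaf order as drawn in the dendrogram.
hc_complete$labels[hc_complete$order]
```

Complete linkage tends to produce compact clusters, while single linkage can chain samples together, so comparing two methods is a quick check on whether a grouping is robust.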


Examples

Basic Usage

Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output/ \
  -t 300

Use Sample IDs as Labels

Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output_sample_labels/ \
  -l sample

Use Average Linkage

Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output_average/ \
  -m average

Error Handling

Common Errors

| Error | Cause | Solution | Read More |
|---|---|---|---|
| SKILL_DEPENDENCY_MISSING | Required R package is not installed | Install the missing package and rerun | references/troubleshooting.md#skill_dependency_missing |
| SKILL_FILE_NOT_FOUND | Input file does not exist or output directory could not be created | Check the path and permissions | references/troubleshooting.md#skill_file_not_found |
| SKILL_EMPTY_FILE | Input file is empty | Re-export the CSV and confirm it contains data | references/troubleshooting.md#skill_empty_file |
| SKILL_EMPTY_DATA | CSV parsed successfully but contains no data rows | Confirm the CSV has at least one data row | references/troubleshooting.md#skill_empty_data |
| SKILL_PARSE_ERROR | CSV parsing failed | Check encoding, delimiters, and CSV structure | references/troubleshooting.md#skill_parse_error |
| SKILL_MISSING_COLUMNS | Expected columns or headers are missing | Check CSV headers and metadata columns | references/troubleshooting.md#skill_missing_columns |
| SKILL_INVALID_TYPE | Expression values or parameters have the wrong type | Ensure numeric fields are numeric | references/troubleshooting.md#skill_invalid_type |
| SKILL_SAMPLE_MISMATCH | Sample IDs do not match | Ensure the first column in group_file matches matrix column names | references/troubleshooting.md#skill_sample_mismatch |
| SKILL_INVALID_DATA | Expression or annotation data is malformed | Check duplicate IDs, missing labels, and numeric values | references/troubleshooting.md#skill_invalid_data |
| SKILL_INVALID_PARAMETER | Unsupported distance, linkage, or label parameter | Use one of the documented parameter values | references/troubleshooting.md#skill_invalid_parameter |
| SKILL_TIMEOUT | Analysis exceeded the time limit | Increase --timeout_seconds and rerun | references/troubleshooting.md#skill_timeout |
| SKILL_PLOT_ERROR | Plot device failed while writing PDF | Check output directory permissions and rerun | references/troubleshooting.md#skill_plot_error |
| SKILL_WRITE_ERROR | Output or intermediate files could not be written | Check output directory permissions and free disk space | references/troubleshooting.md#skill_write_error |
| SKILL_WARNING | Non-fatal warning occurred during execution | Inspect console warnings and verify output quality | references/troubleshooting.md#skill_warning |
| SKILL_MEMORY_WARNING | Memory usage exceeded the warning threshold | Reduce input size or rerun with more memory | references/troubleshooting.md#skill_memory_warning |

IF error persists, READ: references/troubleshooting.md
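
One way to attach SKILL_* codes to failures in R is to signal custom condition classes and dispatch on them in `tryCatch()`. The sketch below shows that general technique only; `skill_error()`, `run_step()`, and the "UNCLASSIFIED" fallback label are hypothetical names, and the workflow's own classification logic may work differently.

```r
# Hypothetical constructor: an error condition whose first class is the code.
skill_error <- function(code, message) {
  structure(
    list(message = message, call = NULL),
    class = c(code, "skill_error", "error", "condition")
  )
}

# Hypothetical runner: log classified errors with their code and return the
# code, falling back to a generic label for any other error.
run_step <- function(step_fun) {
  tryCatch(
    step_fun(),
    skill_error = function(e) {
      message(sprintf("[%s] %s", class(e)[1], conditionMessage(e)))
      invisible(class(e)[1])
    },
    error = function(e) {
      message("[UNCLASSIFIED] ", conditionMessage(e))
      invisible("UNCLASSIFIED")
    }
  )
}

code <- run_step(function() {
  stop(skill_error("SKILL_SAMPLE_MISMATCH",
                   "Sample09 not found in expression matrix"))
})
print(code)
```

Handlers in `tryCatch()` are tried in the order given, so the more specific `skill_error` handler must come before the generic `error` handler.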


Testing

Test with Sample Data

# Check help
Rscript scripts/main.R --help

# Run with sample data
Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output/

# Run unit tests (requires testthat and data.table)
Rscript tests/run_tests.R

Validation Commands

# Check main output plot exists
ls -la ./output/hierarchical_clustering_plot.pdf

# Inspect clustering order
wc -l ./output/clustering_order.csv

Implementation Checklist

  • CLI parsing with optparse
  • set.seed() for reproducibility
  • Input validation (file existence, emptiness, types, required columns)
  • Try-catch based fatal error handling
  • Standardized SKILL_* error classification
  • Timeout control with setTimeLimit()
  • Standardized console-only logging
  • Base R clustering implementation
  • Session info recording with sink()
  • Temporary workspace cleanup with on.exit()
  • Memory usage reporting with gc()
  • File reading instructions in SKILL.md
  • Modular script structure across scripts/
  • Test template added under tests/testthat/
  • Test data provided
  • Error handling with SKILL_* codes
  • get_script_dir() defined before use
  • Scripts in scripts/ directory
  • References in references/ directory

Last updated: 2026-04-16 | Version: 1.0.0

Install via CLI
npx skills add https://github.com/aipoch/medical-research-skills --skill hierarchical-clustering-plot