name: hierarchical-clustering-plot
description: "Use when building a sample-level hierarchical clustering dendrogram from a bulk expression matrix and sample annotation table, especially for QC, batch inspection, or sample similarity assessment. Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity. NOT for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows."
Hierarchical Clustering Plot
When to Use
Use this skill when you need a sample-level hierarchical clustering dendrogram from a bulk expression matrix and a sample annotation table.
- Good fits: sample QC, batch inspection, sample similarity assessment, checking whether annotated sample groups cluster as expected.
- Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity.
- Not for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.
When to Read External Files
| Situation | File to Read | Purpose |
|---|---|---|
| Need algorithm details | references/algorithm.md | Distance calculation, linkage rules, and clustering assumptions |
| Need to run analysis or inspect CLI entrypoint behavior | scripts/main.R | Execute the workflow and inspect argument parsing, defaults, required flags, and sourced modules |
| Need workflow implementation details | scripts/run_analysis.R | See orchestration order, temp workspace handling, and output generation |
| Need logging or warning behavior | scripts/logging_utils.R | See standardized console log formatting and memory usage messages |
| Need file or parameter validation details | scripts/validation_utils.R | See path checks, output-directory checks, and scalar validation |
| Need timeout, temp workspace, or session info behavior | scripts/runtime_utils.R | See timeout control, temp cleanup, output copying, and session-info export |
| Need expression/group input handling | scripts/input_functions.R | See CSV loading, sample matching, and label extraction |
| Need clustering logic | scripts/clustering_functions.R | See distance calculation and hclust() generation |
| Need output-writing logic | scripts/output_utils.R | See CSV export and PDF rendering |
| Encounter errors, warnings, or unexpected clustering patterns | references/troubleshooting.md | Common failures, warning follow-up, and interpretation guidance |
| Need CLI examples or common parameter combinations | references/cli-guide.md | Detailed command patterns for standard, variant, and test runs |
| Need example input files or schema-concrete fixtures | tests/data/ | Inspect sample CSV layouts for expression and group inputs |
| Need expected output names or artifact formats | ## Output Files and references/cli-guide.md | Confirm the files the workflow writes and inspect documented example previews |
| Need to run regression tests | tests/run_tests.R | Execute the automated test suite |
| Need exact test assertions or edge cases | tests/testthat/test-clustering.R | Inspect validation, reproducibility, and output checks |
Usage
Rscript scripts/main.R \
--input_file ./expression_matrix.csv \
--group_file ./sample_groups.csv \
--output_dir ./output/ \
--distance_method euclidean \
--linkage_method complete \
--label_column batch \
--timeout_seconds 300 \
--seed 42
Arguments
| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -i | --input_file | character | required | Expression matrix file (features as rows, samples as columns) |
| -g | --group_file | character | required | Sample annotation file (first column sample ID, one metadata column for labels) |
| -o | --output_dir | character | ./output/ | Output directory |
| -d | --distance_method | character | euclidean | Distance metric for dist(): euclidean, maximum, manhattan, canberra, binary, minkowski |
| -m | --linkage_method | character | complete | Linkage method for hclust(): complete, single, average, mcquitty, median, centroid, ward.D, ward.D2 |
| -l | --label_column | character | second column | Column used as dendrogram labels |
| -c | --label_cex | numeric | 0.8 | Dendrogram label size, must be > 0 |
| -t | --timeout_seconds | integer | 300 | Elapsed time limit in seconds, must be > 0 |
| -s | --seed | integer | 42 | Random seed for reproducibility |
Input Format
Expression Matrix (input_file)
Features as rows, samples as columns, CSV format with feature IDs in the first column.
,Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793
Requirements:
- The first column contains unique feature IDs.
- All sample columns must be numeric.
- Sample column names must be unique and non-empty.
- At least two matched samples are required.
Sample Annotation (group_file)
CSV with sample IDs in the first column. The second column is used by default for leaf labels unless --label_column is provided.
sample,batch
Sample01,batch1
Sample02,batch2
Sample03,batch1
Requirements:
- Sample IDs must match expression matrix column names exactly.
- The selected label column must exist and contain no empty values.
- The file must contain at least one metadata column in addition to sample IDs.
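The requirements for both inputs can be checked in a few lines of base R. The sketch below is illustrative only: the helper name `check_inputs` and its messages are hypothetical, not taken from scripts/validation_utils.R, and it assumes partial overlap between the two files is tolerated as long as at least two samples match.

```r
# Hypothetical helper that checks the documented input requirements.
# Not the skill's actual implementation. Named stopifnot() messages need R >= 4.0.
check_inputs <- function(expr, groups, label_column = names(groups)[2]) {
  feature_ids <- expr[[1]]
  sample_cols <- expr[-1]
  stopifnot(
    "feature IDs must be unique" = !anyDuplicated(feature_ids),
    "sample columns must be numeric" =
      all(vapply(sample_cols, is.numeric, logical(1))),
    "sample names must be unique and non-empty" =
      !anyDuplicated(names(sample_cols)) && all(nzchar(names(sample_cols))),
    "annotation needs at least one metadata column" = ncol(groups) >= 2,
    "label column must exist" = label_column %in% names(groups)
  )
  labels <- groups[[label_column]]
  stopifnot(
    "labels must not be empty" =
      !anyNA(labels) && all(nzchar(as.character(labels)))
  )
  # Assumption: samples are matched by exact name between the two inputs.
  matched <- intersect(as.character(groups[[1]]), names(sample_cols))
  stopifnot("at least two matched samples are required" = length(matched) >= 2)
  invisible(matched)
}

# Toy inputs built from the example rows shown in this section.
expr <- data.frame(
  feature  = c("TSPAN6", "TNMD"),
  Sample01 = c(1.847876677, 0.034919984),
  Sample02 = c(1.831755661, 0.053250385),
  Sample03 = c(3.827625975, 1.388850793)
)
groups <- data.frame(
  sample = c("Sample01", "Sample02", "Sample03"),
  batch  = c("batch1", "batch2", "batch1")
)
matched <- check_inputs(expr, groups)
```

Each failed check stops with the message attached to it, which keeps the rule and its explanation side by side.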
Output Files
| File | Description |
|---|---|
hierarchical_clustering_plot.pdf |
Sample dendrogram plot |
sample_distance_matrix.csv |
Pairwise sample distance matrix |
clustering_order.csv |
Leaf order shown in the dendrogram |
matched_samples.csv |
Sample-to-label table used for plotting |
session_info.txt |
R session and package version info |
Workflow
Step 1: Validate Input
WHEN checking file or parameter validation, READ: scripts/validation_utils.R
WHEN checking expression/group CSV handling, READ: scripts/input_functions.R
- Check file existence
- Reject empty files before parsing
- Read the expression matrix and sample annotation CSV files
- Validate required columns, unique IDs, and numeric expression values
Step 2: Align Samples
WHEN checking sample matching logic, READ: scripts/input_functions.R
- Match sample IDs between the annotation file and expression matrix
- Reorder matrix columns to the annotation file order
- Select the label column used for plotting
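The alignment step can be pictured with a small in-memory example. This is a minimal sketch under assumptions: the variable names are made up for illustration, and the real logic in scripts/input_functions.R may differ.

```r
# Toy expression matrix (features x samples) using the example values above.
expr <- matrix(
  c(1.847876677, 0.034919984,
    1.831755661, 0.053250385,
    3.827625975, 1.388850793),
  nrow = 2,
  dimnames = list(c("TSPAN6", "TNMD"), c("Sample01", "Sample02", "Sample03"))
)

# Annotation rows deliberately listed in a different order than the matrix.
groups <- data.frame(
  sample = c("Sample03", "Sample01", "Sample02"),
  batch  = c("batch1", "batch1", "batch2")
)

# Keep only samples present in both inputs, in annotation-file order.
keep    <- groups$sample %in% colnames(expr)
matched <- groups[keep, , drop = FALSE]
aligned <- expr[, matched$sample, drop = FALSE]

# Select the label column (the second column by default).
label_column <- names(groups)[2]
labels <- setNames(matched[[label_column]], matched$sample)
```

After this step `aligned` has its columns in annotation order, and `labels` maps each sample ID to the text drawn on its dendrogram leaf.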
Step 3: Build Hierarchical Clustering
WHEN interpreting distance or linkage behavior, READ: references/algorithm.md
WHEN checking clustering implementation, READ: scripts/clustering_functions.R
- Transpose the expression matrix to sample-by-feature form
- Compute pairwise sample distances with dist()
- Build the dendrogram with hclust()
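The three operations in this step map directly onto base R calls. The sketch below uses simulated data and assumed variable names; it shows the general pattern rather than the code in scripts/clustering_functions.R.

```r
set.seed(42)  # only affects the simulated toy data below

# Simulated features x samples matrix: 10 features, 4 samples.
expr <- matrix(
  rnorm(40), nrow = 10,
  dimnames = list(paste0("gene", 1:10), paste0("Sample0", 1:4))
)

sample_by_feature <- t(expr)                        # samples become rows
d    <- dist(sample_by_feature, method = "euclidean")  # pairwise sample distances
tree <- hclust(d, method = "complete")              # agglomerative clustering

# Left-to-right leaf order as it would appear in the plotted dendrogram.
leaf_order <- tree$labels[tree$order]
```

The transpose matters because dist() computes distances between rows; without it the result would describe features rather than samples. Calling `plot(tree, cex = 0.8)` on the result draws the dendrogram.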
Step 4: Save Outputs
WHEN checking output staging and cleanup behavior, READ: scripts/run_analysis.R
WHEN checking PDF/CSV export behavior, READ: scripts/output_utils.R
WHEN checking timeout, session info, or final file copy behavior, READ: scripts/runtime_utils.R
- Stage outputs in a temporary workspace
- Export the pairwise distance matrix
- Export the plotted leaf order
- Render the dendrogram as PDF
- Copy finalized outputs into the requested output directory
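One way to stage outputs and copy them only once everything has been written is sketched below. The function name `write_outputs` and the column names in clustering_order.csv are assumptions for illustration; the file names follow the Output Files table, and the actual behavior lives in scripts/run_analysis.R and scripts/output_utils.R.

```r
# Hypothetical sketch of stage-then-copy output handling.
write_outputs <- function(tree, d, output_dir) {
  staging <- tempfile("hclust_stage_")
  dir.create(staging)
  # Remove the staging directory even if a later step fails.
  on.exit(unlink(staging, recursive = TRUE), add = TRUE)

  write.csv(as.matrix(d), file.path(staging, "sample_distance_matrix.csv"))
  write.csv(
    data.frame(position = seq_along(tree$order),
               sample   = tree$labels[tree$order]),
    file.path(staging, "clustering_order.csv"),
    row.names = FALSE
  )

  pdf(file.path(staging, "hierarchical_clustering_plot.pdf"))
  # Close the device whether or not plotting succeeds.
  tryCatch(plot(tree, cex = 0.8), finally = dev.off())

  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  staged <- list.files(staging, full.names = TRUE)
  ok <- file.copy(staged, output_dir, overwrite = TRUE)
  stopifnot(all(ok))
  invisible(file.path(output_dir, basename(staged)))
}

# Usage with a tiny three-sample example.
toy  <- matrix(c(1, 2, 3, 4, 6, 9), nrow = 2,
               dimnames = list(c("g1", "g2"), c("A", "B", "C")))
d    <- dist(t(toy))
tree <- hclust(d, method = "complete")
out  <- write_outputs(tree, d, file.path(tempdir(), "output"))
```

Staging first means a failure partway through leaves the requested output directory without half-written files.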
Methods
Distance Matrix
Sample distances are computed from the transposed expression matrix using base R dist().
Hierarchical Clustering
The clustering tree is built with base R hclust(). The default linkage method is complete, matching the source analysis script.
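To make the two methods concrete, the following sketch recomputes one Euclidean distance by hand and checks the complete-linkage merge height against the distance matrix. It uses the three example samples from the Input Format section; the variable names are illustrative.

```r
# Sample-by-feature matrix (already transposed) from the documented example rows.
x <- rbind(
  Sample01 = c(TSPAN6 = 1.847876677, TNMD = 0.034919984),
  Sample02 = c(TSPAN6 = 1.831755661, TNMD = 0.053250385),
  Sample03 = c(TSPAN6 = 3.827625975, TNMD = 1.388850793)
)

d  <- dist(x, method = "euclidean")
dm <- as.matrix(d)

# Euclidean distance by hand: square root of the summed squared differences.
manual <- sqrt(sum((x["Sample01", ] - x["Sample02", ])^2))
same_distance <- isTRUE(all.equal(dm["Sample01", "Sample02"], manual))

# Complete linkage merges clusters at the largest pairwise distance between
# their members. Sample01 and Sample02 are closest, so they merge first;
# Sample03 then joins at max(d(03, 01), d(03, 02)).
tree <- hclust(d, method = "complete")
final_height    <- max(tree$height)
expected_height <- max(dm["Sample03", c("Sample01", "Sample02")])
```

Switching `method` to "single" would instead use the smallest such distance, and "average" the mean, which is why the choice of linkage can change the tree shape for the same distance matrix.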
Examples
Basic Usage
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output/ \
-t 300
Use Sample IDs as Labels
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output_sample_labels/ \
-l sample
Use Average Linkage
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output_average/ \
-m average
Error Handling
Common Errors
| Error | Cause | Solution | Read More |
|---|---|---|---|
| SKILL_DEPENDENCY_MISSING | Required R package is not installed | Install the missing package and rerun | references/troubleshooting.md#skill_dependency_missing |
| SKILL_FILE_NOT_FOUND | Input file does not exist or output directory could not be created | Check the path and permissions | references/troubleshooting.md#skill_file_not_found |
| SKILL_EMPTY_FILE | Input file is empty | Re-export the CSV and confirm it contains data | references/troubleshooting.md#skill_empty_file |
| SKILL_EMPTY_DATA | CSV parsed successfully but contains no data rows | Confirm the CSV has at least one data row | references/troubleshooting.md#skill_empty_data |
| SKILL_PARSE_ERROR | CSV parsing failed | Check encoding, delimiters, and CSV structure | references/troubleshooting.md#skill_parse_error |
| SKILL_MISSING_COLUMNS | Expected columns or headers are missing | Check CSV headers and metadata columns | references/troubleshooting.md#skill_missing_columns |
| SKILL_INVALID_TYPE | Expression values or parameters have the wrong type | Ensure numeric fields are numeric | references/troubleshooting.md#skill_invalid_type |
| SKILL_SAMPLE_MISMATCH | Sample IDs do not match | Ensure the first column in group_file matches matrix column names | references/troubleshooting.md#skill_sample_mismatch |
| SKILL_INVALID_DATA | Expression or annotation data is malformed | Check duplicate IDs, missing labels, and numeric values | references/troubleshooting.md#skill_invalid_data |
| SKILL_INVALID_PARAMETER | Unsupported distance, linkage, or label parameter | Use one of the documented parameter values | references/troubleshooting.md#skill_invalid_parameter |
| SKILL_TIMEOUT | Analysis exceeded the time limit | Increase --timeout_seconds and rerun | references/troubleshooting.md#skill_timeout |
| SKILL_PLOT_ERROR | Plot device failed while writing PDF | Check output directory permissions and rerun | references/troubleshooting.md#skill_plot_error |
| SKILL_WRITE_ERROR | Output or intermediate files could not be written | Check output directory permissions and free disk space | references/troubleshooting.md#skill_write_error |
| SKILL_WARNING | Non-fatal warning occurred during execution | Inspect console warnings and verify output quality | references/troubleshooting.md#skill_warning |
| SKILL_MEMORY_WARNING | Memory usage exceeded the warning threshold | Reduce input size or rerun with more memory | references/troubleshooting.md#skill_memory_warning |
IF error persists, READ: references/troubleshooting.md
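For readers unfamiliar with classified errors in R, the sketch below shows one common way to attach a SKILL_* code to a condition so that callers can handle it by class. The constructor `skill_error` and the validator `validate_linkage` are hypothetical names; how the skill's scripts actually build and report these codes is not shown in this document.

```r
# Hypothetical constructor: a condition object carrying the code as its class.
skill_error <- function(code, message) {
  structure(
    class = c(code, "skill_error", "error", "condition"),
    list(message = paste0("[", code, "] ", message), call = NULL)
  )
}

# Hypothetical validator using the linkage methods listed under Arguments.
validate_linkage <- function(method) {
  allowed <- c("complete", "single", "average", "mcquitty",
               "median", "centroid", "ward.D", "ward.D2")
  if (!is.character(method) || length(method) != 1 || !method %in% allowed) {
    stop(skill_error("SKILL_INVALID_PARAMETER",
                     paste("Unsupported linkage method:", method)))
  }
  invisible(method)
}

# A handler named after the code catches only that class of failure.
result <- tryCatch(
  validate_linkage("wardd"),
  SKILL_INVALID_PARAMETER = function(e) conditionMessage(e)
)
# result is "[SKILL_INVALID_PARAMETER] Unsupported linkage method: wardd"
```

Because the code is part of the condition's class vector, a generic `error = function(e) ...` handler still catches it as well, so existing try-catch logic keeps working.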
Testing
Test with Sample Data
# Check help
Rscript scripts/main.R --help
# Run with sample data
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output/
# Run unit tests (requires testthat and data.table)
Rscript tests/run_tests.R
Validation Commands
# Check main output plot exists
ls -la ./output/hierarchical_clustering_plot.pdf
# Inspect clustering order
wc -l ./output/clustering_order.csv
Implementation Checklist
- CLI parsing with optparse
- set.seed() for reproducibility
- Input validation (file existence, emptiness, types, required columns)
- Try-catch based fatal error handling
- Standardized SKILL_* error classification
- Timeout control with setTimeLimit()
- Standardized console-only logging
- Base R clustering implementation
- Session info recording with sink()
- Temporary workspace cleanup with on.exit()
- Memory usage reporting with gc()
- File reading instructions in SKILL.md
- Modular script structure across scripts/
- Test template added under tests/testthat/
- Test data provided
- Error handling with SKILL_* codes
- get_script_dir() defined before use
- Scripts in scripts/ directory
- References in references/ directory
Last updated: 2026-04-16 | Version: 1.0.0