hierarchical-clustering-plot

name: hierarchical-clustering-plot
description: "Use when building a sample-level hierarchical clustering dendrogram from a bulk expression matrix and sample annotation table, especially for QC, batch inspection, or sample similarity assessment. Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity. NOT for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows."

Hierarchical Clustering Plot

When to Use

Use this skill when you need a sample-level hierarchical clustering dendrogram from a bulk expression matrix and a sample annotation table.

Good fits: sample QC, batch inspection, sample similarity assessment, checking whether annotated sample groups cluster as expected.
Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity.
Not for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.

When to Read External Files

Situation	File to Read	Purpose
Need algorithm details	`references/algorithm.md`	Distance calculation, linkage rules, and clustering assumptions
Need to run analysis or inspect CLI entrypoint behavior	`scripts/main.R`	Execute the workflow and inspect argument parsing, defaults, required flags, and sourced modules
Need workflow implementation details	`scripts/run_analysis.R`	See orchestration order, temp workspace handling, and output generation
Need logging or warning behavior	`scripts/logging_utils.R`	See standardized console log formatting and memory usage messages
Need file or parameter validation details	`scripts/validation_utils.R`	See path checks, output-directory checks, and scalar validation
Need timeout, temp workspace, or session info behavior	`scripts/runtime_utils.R`	See timeout control, temp cleanup, output copying, and session-info export
Need expression/group input handling	`scripts/input_functions.R`	See CSV loading, sample matching, and label extraction
Need clustering logic	`scripts/clustering_functions.R`	See distance calculation and `hclust()` generation
Need output-writing logic	`scripts/output_utils.R`	See CSV export and PDF rendering
Encounter errors, warnings, or unexpected clustering patterns	`references/troubleshooting.md`	Common failures, warning follow-up, and interpretation guidance
Need CLI examples or common parameter combinations	`references/cli-guide.md`	Detailed command patterns for standard, variant, and test runs
Need example input files or schema-concrete fixtures	`tests/data/`	Inspect sample CSV layouts for expression and group inputs
Need expected output names or artifact formats	`## Output Files` and `references/cli-guide.md`	Confirm the files the workflow writes and inspect documented example previews
Need to run regression tests	`tests/run_tests.R`	Execute the automated test suite
Need exact test assertions or edge cases	`tests/testthat/test-clustering.R`	Inspect validation, reproducibility, and output checks

Usage

```bash
Rscript scripts/main.R \
  --input_file ./expression_matrix.csv \
  --group_file ./sample_groups.csv \
  --output_dir ./output/ \
  --distance_method euclidean \
  --linkage_method complete \
  --label_column batch \
  --timeout_seconds 300 \
  --seed 42
```

Arguments

Short	Long	Type	Default	Description
`-i`	`--input_file`	character	required	Expression matrix file (features as rows, samples as columns)
`-g`	`--group_file`	character	required	Sample annotation file (sample IDs in the first column, plus at least one metadata column usable as labels)
`-o`	`--output_dir`	character	`./output/`	Output directory
`-d`	`--distance_method`	character	`euclidean`	Distance metric for `dist()`: euclidean, maximum, manhattan, canberra, binary, minkowski
`-m`	`--linkage_method`	character	`complete`	Linkage method for `hclust()`: complete, single, average, mcquitty, median, centroid, ward.D, ward.D2
`-l`	`--label_column`	character	second column	Column used as dendrogram labels
`-c`	`--label_cex`	numeric	`0.8`	Dendrogram label size, must be `> 0`
`-t`	`--timeout_seconds`	integer	`300`	Elapsed time limit in seconds, must be `> 0`
`-s`	`--seed`	integer	`42`	Random seed for reproducibility
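
The constraints in the table above can be checked up front before any data is read. The sketch below is only an illustration in base R: `validate_args` is a hypothetical helper name, and the real checks in `scripts/validation_utils.R` may be structured differently. The allowed values are the ones listed in the table, which are the methods accepted by `dist()` and `hclust()`.

```r
# Hypothetical helper mirroring the constraints in the Arguments table.
validate_args <- function(distance_method = "euclidean",
                          linkage_method = "complete",
                          label_cex = 0.8,
                          timeout_seconds = 300) {
  distance_choices <- c("euclidean", "maximum", "manhattan",
                        "canberra", "binary", "minkowski")
  linkage_choices <- c("complete", "single", "average", "mcquitty",
                       "median", "centroid", "ward.D", "ward.D2")

  # Exact matching only; abbreviations such as "euc" are rejected.
  if (!distance_method %in% distance_choices) {
    stop("Unsupported distance_method: ", distance_method)
  }
  if (!linkage_method %in% linkage_choices) {
    stop("Unsupported linkage_method: ", linkage_method)
  }

  is_positive_scalar <- function(x) {
    is.numeric(x) && length(x) == 1 && !is.na(x) && x > 0
  }
  if (!is_positive_scalar(label_cex)) stop("label_cex must be > 0")
  if (!is_positive_scalar(timeout_seconds)) stop("timeout_seconds must be > 0")

  invisible(TRUE)
}

validate_args(distance_method = "manhattan", linkage_method = "ward.D2")
```

Failing fast on an unsupported method name gives a clearer message than the error raised later inside `dist()` or `hclust()`.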

Input Format

Expression Matrix (`input_file`)

Features as rows, samples as columns, CSV format with feature IDs in the first column.

```csv
,Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793
```

Requirements:

The first column contains unique feature IDs.
All sample columns must be numeric.
Sample column names must be unique and non-empty.
At least two samples must match between the expression matrix and the annotation file.

Sample Annotation (`group_file`)

CSV with sample IDs in the first column. The second column is used by default for leaf labels unless `--label_column` is provided.

```csv
sample,batch
Sample01,batch1
Sample02,batch2
Sample03,batch1
```

Requirements:

Sample IDs must match expression matrix column names exactly.
The selected label column must exist and contain no empty values.
The file must contain at least one metadata column in addition to sample IDs.

Output Files

File	Description
`hierarchical_clustering_plot.pdf`	Sample dendrogram plot
`sample_distance_matrix.csv`	Pairwise sample distance matrix
`clustering_order.csv`	Leaf order shown in the dendrogram
`matched_samples.csv`	Sample-to-label table used for plotting
`session_info.txt`	R session and package version info

Workflow

Step 1: Validate Input

WHEN checking file or parameter validation, READ: scripts/validation_utils.R

WHEN checking expression/group CSV handling, READ: scripts/input_functions.R

Check file existence
Reject empty files before parsing
Read the expression matrix and sample annotation CSV files
Validate required columns, unique IDs, and numeric expression values
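
The checks listed above could look roughly like this in base R. This is a minimal sketch, not the contents of `scripts/validation_utils.R` or `scripts/input_functions.R`; the helper name `read_expression_matrix` and the error messages are assumptions made for illustration.

```r
# Hypothetical helper: load an expression CSV and apply the checks listed above.
read_expression_matrix <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  if (file.info(path)$size == 0) stop("File is empty: ", path)

  # check.names = FALSE keeps sample IDs exactly as written in the header.
  df <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
  if (ncol(df) < 3) stop("Need a feature ID column plus at least two samples")

  feature_ids <- df[[1]]
  if (anyDuplicated(feature_ids) > 0) stop("Feature IDs must be unique")

  sample_ids <- colnames(df)[-1]
  if (any(is.na(sample_ids) | sample_ids == "") || anyDuplicated(sample_ids) > 0) {
    stop("Sample column names must be unique and non-empty")
  }

  values <- df[, -1, drop = FALSE]
  if (!all(vapply(values, is.numeric, logical(1)))) {
    stop("All sample columns must be numeric")
  }

  # Return a numeric matrix with features as rows and samples as columns.
  mat <- as.matrix(values)
  rownames(mat) <- feature_ids
  mat
}
```

Reading with `check.names = FALSE` matters here because the default would rewrite headers such as `Sample-01` to `Sample.01`, which would then fail the exact-match requirement against the annotation file.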

Step 2: Align Samples

WHEN checking sample matching logic, READ: scripts/input_functions.R

Match sample IDs between the annotation file and expression matrix
Reorder matrix columns to the annotation file order
Select the label column used for plotting
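
A sketch of the alignment step, using toy values. `align_samples` is a hypothetical helper, and it assumes that annotation rows without a matching matrix column are silently dropped; the real logic in `scripts/input_functions.R` may instead treat mismatches as warnings or errors.

```r
# Toy inputs (illustrative values only); annotation order differs from matrix order.
expr <- matrix(
  c(1.8, 0.03, 1.8, 0.05, 3.8, 1.4),
  nrow = 2,
  dimnames = list(c("TSPAN6", "TNMD"), c("Sample01", "Sample02", "Sample03"))
)
groups <- data.frame(
  sample = c("Sample03", "Sample01", "Sample02"),
  batch = c("batch1", "batch1", "batch2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep annotated samples present in the matrix,
# in annotation-file order, and pick the label column.
align_samples <- function(expr, groups, label_column = NULL) {
  if (ncol(groups) < 2) stop("Annotation needs at least one metadata column")
  if (is.null(label_column)) label_column <- colnames(groups)[2]  # default: second column
  if (!label_column %in% colnames(groups)) {
    stop("Label column not found: ", label_column)
  }

  sample_ids <- as.character(groups[[1]])
  keep <- sample_ids %in% colnames(expr)
  if (sum(keep) < 2) stop("At least two matched samples are required")

  labels <- as.character(groups[[label_column]][keep])
  if (any(is.na(labels) | labels == "")) stop("Label column contains empty values")

  list(
    expr = expr[, sample_ids[keep], drop = FALSE],
    matched = data.frame(sample = sample_ids[keep], label = labels,
                         stringsAsFactors = FALSE)
  )
}

aligned <- align_samples(expr, groups, label_column = "batch")
colnames(aligned$expr)  # columns now follow the annotation file order
```

Indexing the matrix by the annotation's sample IDs performs the match and the reorder in one step, so labels and matrix columns stay in the same order downstream.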

Step 3: Build Hierarchical Clustering

WHEN interpreting distance or linkage behavior, READ: references/algorithm.md

WHEN checking clustering implementation, READ: scripts/clustering_functions.R

Transpose the expression matrix to sample-by-feature form
Compute pairwise sample distances with dist()
Build the dendrogram with hclust()
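
The three steps above map onto a few lines of base R. The sketch below uses randomly generated toy data and the default `euclidean` and `complete` settings; it is an illustration, not the code in `scripts/clustering_functions.R`.

```r
set.seed(42)  # only makes the toy data reproducible
expr <- matrix(
  rnorm(40), nrow = 10,
  dimnames = list(paste0("gene", 1:10), paste0("Sample0", 1:4))
)

sample_by_feature <- t(expr)                         # samples become rows
d <- dist(sample_by_feature, method = "euclidean")   # pairwise sample distances
hc <- hclust(d, method = "complete")                 # agglomerative clustering

# Leaf order from left to right in the plotted dendrogram.
leaf_order <- hc$labels[hc$order]
leaf_order
```

`hc$order` is a permutation of sample indices, so indexing `hc$labels` with it gives the order in which samples appear along the dendrogram axis.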

Step 4: Save Outputs

WHEN checking output staging and cleanup behavior, READ: scripts/run_analysis.R

WHEN checking PDF/CSV export behavior, READ: scripts/output_utils.R

WHEN checking timeout, session info, or final file copy behavior, READ: scripts/runtime_utils.R

Stage outputs in a temporary workspace
Export the pairwise distance matrix
Export the plotted leaf order
Render the dendrogram as PDF
Copy finalized outputs into the requested output directory
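
The staging pattern described above can be sketched as follows. The file names come from the Output Files table; everything else is an assumption for illustration, including the CSV column names (`position`, `sample`, `label`), the PDF dimensions, and the use of a directory under `tempdir()` as a stand-in for `--output_dir`. The actual behavior is defined in `scripts/run_analysis.R`, `scripts/output_utils.R`, and `scripts/runtime_utils.R`.

```r
# Toy clustering result standing in for the real one.
expr <- matrix(
  c(1, 2, 3, 4, 8, 9, 1.5, 2.5, 3.5), nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("Sample01", "Sample02", "Sample03"))
)
labels <- c(Sample01 = "batch1", Sample02 = "batch2", Sample03 = "batch1")
d <- dist(t(expr))
hc <- hclust(d)

# Stage everything in a temporary workspace first.
staging <- tempfile("hclust_stage_")
dir.create(staging)

write.csv(as.matrix(d), file.path(staging, "sample_distance_matrix.csv"))

ordered_samples <- hc$labels[hc$order]
write.csv(
  data.frame(position = seq_along(ordered_samples),
             sample = ordered_samples,
             label = unname(labels[ordered_samples])),
  file.path(staging, "clustering_order.csv"),
  row.names = FALSE
)

pdf(file.path(staging, "hierarchical_clustering_plot.pdf"), width = 8, height = 6)
plot(hc, labels = unname(labels[hc$labels]), cex = 0.8,
     main = "Sample clustering", xlab = "", sub = "")
invisible(dev.off())

# Copy finished files to the destination, then clean up the staging area.
output_dir <- file.path(tempdir(), "output")  # stand-in for --output_dir
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
copied <- file.copy(list.files(staging, full.names = TRUE), output_dir,
                    overwrite = TRUE)
unlink(staging, recursive = TRUE)
```

Writing to a staging directory and copying only at the end means a run that fails or times out partway through does not leave a partial set of files in the output directory.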

Methods

Distance Matrix

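Sample distances are computed from the transposed expression matrix using base R `dist()`. The transpose is needed because `dist()` measures distances between rows, and the input matrix stores samples as columns.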
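
As a small worked example, the snippet below uses the two example rows from the Input Format section and checks one Euclidean distance by hand. It is an illustration of how `dist()` behaves, not an excerpt from the skill's scripts.

```r
expr <- matrix(
  c(1.847876677, 0.034919984,
    1.831755661, 0.053250385,
    3.827625975, 1.388850793),
  nrow = 2,
  dimnames = list(c("TSPAN6", "TNMD"), c("Sample01", "Sample02", "Sample03"))
)

d <- dist(t(expr), method = "euclidean")  # 3 samples give 3 pairwise distances
dist_matrix <- as.matrix(d)               # symmetric 3 x 3 matrix, zero diagonal

# Manual check for one pair: square root of summed squared differences across features.
manual <- sqrt(sum((expr[, "Sample01"] - expr[, "Sample03"])^2))
all.equal(dist_matrix["Sample01", "Sample03"], manual)  # TRUE
```

Calling `dist(expr)` without the transpose would instead return a single distance between the two features, which is not what a sample-level dendrogram needs.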