name: lynk-evaluate
description: >
Evaluate the Lynk semantic layer in .lynk/ — judge whether it is good enough
for the AI agent to use, not just whether the YAML parses. Checks description
quality, cross-file consistency, content placement, reference integrity, and
SQL dialect compatibility against the user's warehouse engine.
Use this skill whenever the user asks to evaluate, audit, review, assess, or
diagnose the semantic layer or any part of it. Trigger on phrases like
"evaluate the semantics", "is this good enough for the agent", "audit my
entities", "check description quality", "any contradictions in my context",
"will my SQL run on Snowflake", "will this work on BigQuery", "review the
glossary", "check my evaluations against instructions", "evaluate player",
"is the semantic layer well structured", or any request to assess the quality
of files inside .lynk/.
lynk-evaluate-semantics
Steps
1. Determine what to evaluate
If the user has not specified what to evaluate, use the AskUserQuestion tool to ask them. Before presenting options, check git history to surface recently edited artifacts:
! git log --oneline --diff-filter=M --name-only -20 -- .lynk/ | head -40
Use that output to identify the last 3 distinct .lynk/ artifacts that were modified (entity YAML, knowledge file, task instructions, glossary, evaluations, etc.). Strip the domain path and file extension to present a clean artifact name (e.g., player entity, nba_glossary, evaluations).
Present these options to the user:
- Option 1: Last edited artifact (e.g., player entity)
- Option 2: Second-to-last edited artifact
- Option 3: Third-to-last edited artifact
- Option 4: Evaluate the entire semantic graph end-to-end
If git history doesn't yield 3 distinct artifacts, fill remaining slots with sensible defaults (e.g., the glossary, evaluations, or the largest entity in the domain).
Once the user selects, continue to Step 2 with the chosen target.
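The artifact extraction described above can be sketched in Python. This is a minimal illustration, not part of the skill: the function name `recent_artifacts` and the sample paths are hypothetical, and it assumes every path line in the `git log --name-only` output starts with `.lynk/`. It returns only the bare artifact name; appending a type label such as "entity" is left out.

```python
from pathlib import PurePosixPath


def recent_artifacts(git_log_output: str, limit: int = 3) -> list[str]:
    """Return the most recently modified distinct artifact names, newest first."""
    seen: list[str] = []
    for line in git_log_output.splitlines():
        line = line.strip()
        # Commit subject lines start with a hash, so only path lines match here.
        if not line.startswith(".lynk/"):
            continue
        # Strip the domain path and the file extension: player.yml -> player.
        name = PurePosixPath(line).stem
        if name not in seen:
            seen.append(name)
        if len(seen) == limit:
            break
    return seen


# Hypothetical log output, for illustration only.
sample = """a1b2c3d tweak player entity
.lynk/nba/entities/player.yml
9f8e7d6 update glossary
.lynk/nba/nba_glossary.md
.lynk/nba/entities/player.yml
1234567 add evaluations
.lynk/nba/evaluations.yml
.lynk/nba/entities/team.yml
"""
print(recent_artifacts(sample))
```

If fewer than three names come back, the remaining option slots are filled with the defaults described above.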
2. Locate the target files
Scan the semantic layer file tree:
! find ./.lynk -type f | sort
Based on the user's selection, identify the relevant files:
| Evaluation target | Files to read |
|---|---|
| Specific entity | Entity YAML + its knowledge file + its task instructions file + domain knowledge + domain task instructions + glossary |
| Entity + related entities | Same as above, but for the seed entity AND every entity it is related to (via entities_relationships.yml) |
| Glossary | The glossary file only |
| Full semantic graph | All entity YAMLs + all knowledge/task instruction files + glossary + domain context + evaluations + relationships |
| Evaluations | evaluations.yml + all entity YAMLs (to verify references) |
If the user asked about a specific metric, feature, or relationship on an entity, still evaluate the full entity context — but lead your response with the specific item they asked about.
3. Read the target files
Read only the files identified in Step 2. For entity evaluation, read in this order:
- Entity YAML
- Entity knowledge file
- Entity task instructions file
- Domain knowledge + domain task instructions
- Glossary
For multi-entity evaluation (seed + related), read entities_relationships.yml first to determine the related entity set, then read each entity's files.
4. Read the relevant docs and detect the SQL engine
- Always fetch `https://docs.getlynk.ai/llms.txt` first to see the doc tree; the placement check (Rule 2 of `references/content-rules.md`) in Step 6 depends on knowing what file-type specs exist. Then `WebFetch` only the `concepts/<concept>` and `file-types/<type>` pages relevant to the targets in Step 2. (The doc-navigation convention is in `references/lynk-docs.md`.)
- Detect the engine. Read `.lynk/config.json` and look for an `engine`, `dialect`, or `warehouse` field. Common values: `bigquery`, `snowflake`, `postgres`, `redshift`, `databricks`. If the field is missing, empty, or the file doesn't exist, ask the user via `AskUserQuestion`; do not guess. Record the dialect; every SQL check in Step 6 keys off it.
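A minimal sketch of the engine-detection rule, under stated assumptions: the three fields are looked up at the top level of the JSON object (the real file layout may nest them), and the value is lower-cased for comparison. The function name `detect_engine` is hypothetical. A return value of `None` means "ask the user", never "guess".

```python
import json
from pathlib import Path

# Field names come from the step above; their position in the file is an assumption.
ENGINE_FIELDS = ("engine", "dialect", "warehouse")


def detect_engine(config_path: str = ".lynk/config.json"):
    """Return the configured dialect, or None when the user must be asked."""
    path = Path(config_path)
    if not path.is_file():
        return None  # file doesn't exist
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # unreadable config is treated like a missing field
    if not isinstance(config, dict):
        return None
    for field in ENGINE_FIELDS:
        value = config.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip().lower()
    return None  # field missing or empty
```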
5. Run the backend validity check
Run the lynk-validate flow — its steps 1–5 (branch detection, dirty-tree handling, origin check, token check, API call) — without producing validate's report. Capture the raw issue list for merging into the unified report in Step 7.
If validate skips (no token, user cancelled at the dirty-tree prompt, or branch not on origin), record the skip reason as one of: no token, user cancelled, branch not on origin. Do not abort the evaluation — local checks in Step 6 still run regardless.
6. Evaluate locally
For each finding, record: severity (error / warning / needs-client-input / suggestion), location (file + field or feature name), what's wrong, how to fix it.
Apply these check groups against the target files:
- Content rules: apply every rule in `references/content-rules.md` (Rules 1–12; the rule index is at the top) against the target files, and tag each finding `local/content-rules-<N>` with its rule number. The Quick check at the bottom of `content-rules.md` is the minimum coverage; skip nothing.
- YAML & SQL structure: required fields present, `{}` placeholders in metric SQL, `METRIC()` wrapping where required, no aggregates inside formula features, no circular formula dependencies, no duplicate feature / metric / relationship keys, and every `keys:` entry resolves to a feature `name:` whose `field` (source column) equals that feature name (Rule 6c). Two failure modes, both invisible to backend `validate` and only caught by a build / query: (a) a raw-column key (`keys: [ID]`) that is not a feature at all → "Key 'ID' is not defined as a feature"; (b) a renamed key feature (`keys: [send_id]` where `send_id` is `field: ID`), a confirmed live engine bug that either errors `invalid identifier KEYS.ID` or, once the build proceeds, flags every derived feature on the entity (`'<feature>' cannot be queried`) and any relationship off it (`cannot be joined`). Fix both by giving the key column a feature whose name matches it (`id` on `ID`) and keying on that. Severity: `error`.
- Examples & evaluations quality: covers every `examples:` entry in entity YAMLs, every `evaluations.yml` case, and every SQL example in task-instruction / knowledge markdown; the agent learns its query patterns from all three, so a broken one teaches a broken pattern. Four sub-checks (detection detail lives in the cited rules):
  - Validity & queryability [`local/content-rules-10`]: apply Rule 10 (dialect, canonical surface, references exist and are queryable; the mechanical `_`-prefix scan is in the rule). Severity: `error`.
  - Input ↔ expected_output coherence [`local/examples-quality`]: the SQL's filters, grouping, metric/feature selection, and time window match what the `input` asks (Rule 10 point 4). Severity: `warning`, escalate to `error` when the divergence changes which entity / metric / dimension is tested.
  - Description ↔ test alignment [`local/examples-quality`]: the `description` states what the case actually tests. Severity: `warning`. Pure description red flags (tautological, placeholder) belong under Rule 4.
  - Default-filter consistency [`local/content-rules-5`]: every default filter the task-instructions declare for that question type appears in `expected_output` (Rule 5). `needs-client-input` if the omission might be intentional, otherwise `error`.
When examples and task instructions disagree on the intended behavior, mark it needs-client-input rather than picking a side (see Rule 5 of content-rules.md).
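The key check from the YAML & SQL structure group (Rule 6c) can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the engine's logic: the entity is taken to be an already-parsed dict with a `keys` list and a `features` list of `name` / `field` pairs, the name-to-column comparison is assumed to be case-insensitive (following the `id` on `ID` example), and features with no explicit `field` are skipped rather than judged. The function name `check_keys` is hypothetical.

```python
def check_keys(entity: dict) -> list[dict]:
    """Flag keys that are raw columns or that point at a renamed key feature."""
    features = {f["name"]: f for f in entity.get("features", [])}
    findings = []
    for key in entity.get("keys", []):
        feature = features.get(key)
        if feature is None:
            # Failure mode (a): the key is a raw column, not a feature.
            problem = f"Key '{key}' is not defined as a feature"
        elif feature.get("field") is not None and str(feature["field"]).lower() != key.lower():
            # Failure mode (b): the key feature renames its source column.
            problem = f"Key feature '{key}' renames source column '{feature['field']}'"
        else:
            continue
        findings.append({"severity": "error", "location": f"keys: {key}", "problem": problem})
    return findings


# Hypothetical entity: ID is a raw column, send_id is renamed, id is fine.
entity = {
    "keys": ["ID", "send_id", "id"],
    "features": [
        {"name": "send_id", "field": "ID"},
        {"name": "id", "field": "ID"},
    ],
}
for finding in check_keys(entity):
    print(finding["location"], "->", finding["problem"])
```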
7. Optionally execute every example and evaluation (gated, calls query-engine)
Some failures only surface when the SQL actually runs — a column the engine doesn't expose, a relationship that doesn't resolve, a metric whose sql: produces a warehouse error, or a private _-prefixed feature referenced in generated SQL (Rule 10 point 3), which passes a naive "is it declared?" check but fails at plan time. The static checks in Step 6 catch what's reasonably checkable from the YAML alone (including the mechanical _-prefix scan); this step is the authoritative confirmation — it executes each expected_output against the warehouse. Scope: every entry under examples: in entity YAMLs and every evaluations.yml test case. SQL examples embedded in task-instruction / knowledge markdown often carry placeholders (<company>, a dimension stand-in) and are covered by the Step 6 static Rule 10 check rather than executed here; run one only if you substitute realistic literals first.
Key-uniqueness probe (Rule 11). This same gated, warehouse-calling pass is also where a suspected fabricated key gets confirmed. When the user opts in (any answer but Skip), then for each in-scope entity that Step 6 flagged needs-client-input under Rule 11 — or whose source the catalog reports no keys for — delegate to lynk-sources to run SELECT COUNT(*) AS rows, COUNT(DISTINCT keys) AS distinct_rows FROM <key_source> LIMIT 1. distinct_rows < rows → the declared key doesn't uniquely identify a row → error, tagged local/content-rules-11; quote both counts in the finding. This is the authoritative confirmation of the Rule 11 static suspicion (Step 6 can only suspect; uniqueness is a property of the data).
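How the two counts from the probe turn into a finding can be sketched as follows. The function name `uniqueness_finding` and the finding dict shape are hypothetical; only the comparison, the severity, and the tag come from the paragraph above.

```python
def uniqueness_finding(entity_name: str, rows: int, distinct_rows: int):
    """Return an error finding when the declared key is not unique, else None."""
    if distinct_rows < rows:
        return {
            "severity": "error",
            "tag": "local/content-rules-11",
            "location": entity_name,
            # Quote both counts, as the step requires.
            "problem": f"declared key is not unique: {distinct_rows} distinct of {rows} rows",
        }
    return None
```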
Ask the user first via AskUserQuestion:
"Run every example and evaluation against the warehouse via `query-engine/query` (with `LIMIT 1` so each call is fast)? Yes / Yes, evaluations only / Yes, examples only / Skip."
This is opt-in. Never run it without asking — it dispatches real queries to the warehouse and takes seconds to tens of seconds per call.
If the user opts in, for each expected_output in scope:
- Apply `LIMIT 1`. If the SQL already has a `LIMIT`, leave it; if it has none, append `LIMIT 1`. Do not wrap the query in a subquery; subqueries change the `semantics_used` shape returned by the engine and complicate error reporting.
- Delegate to `lynk-sources` with the "Run Lynk SQL" action (`POST query-engine/query`, body is the SQL as a JSON-encoded string; see lynk-sources Step 3 for invocation, and `references/rest-api.md` for the full endpoint shape). Use the same branch and domain `lynk-evaluate` is operating on.
- Record the outcome:
  - `200` → pass. Optionally capture `metadata.query_metadata.semantics_used` so the report can compare resolved entities / features against what the case claims to test (a case named "tests refund metric" whose `semantics_used.metrics` doesn't include the refund metric is a real finding).
  - `422` with `error_type: SemanticsConsumptionError` → error, tagged `local/examples-runtime`. Surface the `message` verbatim ("Feature 'X' does not exist in entity 'Y'") and quote the offending SQL line.
  - `500` with `error_type: InternalError` and `message: "SQL error: ParserError(...)"` → error, tagged `local/examples-runtime`. Quote the parser message.
  - `500` with bare `"Request failed"` and no `detail` envelope → needs-client-input, tagged `local/examples-runtime`. The branch's semantic layer is itself in a broken state; running examples isn't meaningful until the underlying layer validates. Recommend running `lynk-validate` on the same branch first.
- Cap the dispatch. If more than ~30 queries are in scope, re-prompt the user to narrow scope ("Run all 87, or only the ones in the entity we're evaluating?"). Don't silently dispatch 100+ warehouse calls.
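The `LIMIT 1` rule and the outcome mapping above can be sketched in Python. Both helpers are hypothetical illustrations: `apply_limit` only recognises a trailing `LIMIT <n>` clause (a `LIMIT ... OFFSET ...` form would need real SQL parsing), and `classify` assumes that `error_type` and `message` sit either at the top level of the response body or under a `detail` key, which the source does not confirm.

```python
import re

# Matches a trailing "LIMIT <n>" with an optional semicolon; an assumption, not a parser.
_TRAILING_LIMIT = re.compile(r"\blimit\s+\d+\s*;?\s*$", re.IGNORECASE)


def apply_limit(sql: str) -> str:
    """Append LIMIT 1 unless the query already ends in a LIMIT clause."""
    if _TRAILING_LIMIT.search(sql.rstrip()):
        return sql  # leave an existing LIMIT untouched
    # Append directly; never wrap the query in a subquery.
    return sql.rstrip().rstrip(";").rstrip() + "\nLIMIT 1"


def classify(status: int, body) -> tuple:
    """Map an engine response to (severity, message) per the outcome list."""
    if status == 200:
        return ("pass", "")
    envelope = body.get("detail", body) if isinstance(body, dict) else {}
    if not isinstance(envelope, dict):
        envelope = {}
    error_type = envelope.get("error_type")
    message = envelope.get("message", "")
    if status == 422 and error_type == "SemanticsConsumptionError":
        return ("error", message)
    if status == 500 and error_type == "InternalError" and message.startswith("SQL error: ParserError"):
        return ("error", message)
    if status == 500 and not error_type:
        return ("needs-client-input", "semantic layer may be broken; run lynk-validate first")
    return ("unclassified", message)
```

Any response that falls outside the four documented shapes is returned as `unclassified` here so it is surfaced rather than silently treated as a pass.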
Merge runtime findings into the Step 8 report alongside the static ones — same severity tiers, separate source tag (local/examples-runtime).
If the user picks Skip or the step was bypassed (no examples / no evaluations in scope), record the skip reason and continue to Step 8.
8. Produce the evaluation report
Merge the backend issues from Step 5 with the static local findings from Step 6 and the runtime findings from Step 7 (if that step ran) into one unified report. Each issue carries a source tag so the user knows where it came from.
If the user asked about a specific metric / feature / relationship: lead with a focused section on that item — its evaluation result, issues, and suggested fixes — before the broader entity report.
Report structure:
## Evaluation Report — [Target Name] (engine: [dialect]) · Backend: [ok | <n> errors, <m> warnings | skipped: <reason>]
### Summary
[1-2 sentences: overall health, number of issues by severity, dialect applied, backend status]
### Errors (must fix)
- **[Location]** [backend/<scope>/<category> | local/<check-group>]: [What's wrong] → [How to fix]
### Warnings (should fix)
- **[Location]** [backend/... | local/...]: [What's wrong] → [How to fix]
### Needs client input
- **[Location]** [local/...]: [Conflict between instructions and examples / unresolvable intent] → [What you need from the user]
### Suggestions (nice to have)
- **[Location]** [local/...]: [What could be improved] → [Suggested improvement]
### What looks good
- [Bullet list of things that are well-modeled — be specific]
Source tag values. Every finding carries one tag so the user can tell at a glance where it came from — the backend API, a local rule, a content-quality check, or the runtime execution. Five shapes:
- `backend/<scope>/<category>`: raised by the `lynk-validate` API call in Step 5. `<scope>` is `entity` / `relationship` / `context`; `<category>` is `schema` (declarative YAML check) or `warehouse` (the backend ran a `LIMIT 0` probe and the engine rejected it). The `warehouse` category replaces the legacy `semantic` value; the tag shape is the same, the enum just changed when validate moved to the builds endpoint.
- `local/content-rules-<N>`: raised by a content rule in Step 6; `<N>` is the rule number (see the rule index at the top of `references/content-rules.md`). Note: Rule 3 is action protocol, not a detection; misplacement findings get tagged `content-rules-2` and cite Rule 3 in the suggested fix.
- `local/yaml-sql-structure`: raised by the structural validation group in Step 6 (required fields, `{}` placeholders, `METRIC()` wrapping, no aggregates in formulas, no circular formulas, no duplicate keys).
- `local/examples-quality`: raised by the examples & evaluations quality group in Step 6 for the semantic-alignment sub-checks (input ↔ expected_output coherence, description ↔ test alignment). The same group's validity & queryability sub-check is tagged `local/content-rules-10`, and its default-filter-consistency sub-check is tagged `local/content-rules-5`.
- `local/examples-runtime`: raised by the optional runtime-execution pass in Step 7 (the `expected_output` SQL didn't execute against the warehouse). Carries the `error_type` and `message` from the engine response.
When the backend was skipped, the summary's Backend: field reads skipped: <reason> and the report contains only [local/...] issues. Mention the skip reason explicitly in the Summary paragraph so the user knows backend issues weren't checked.
If no issues are found in a severity tier, omit that section entirely.
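A minimal sketch of how merged findings might be rendered into the severity sections, omitting empty tiers. The finding dict keys (`severity`, `location`, `tag`, `problem`, `fix`) and the function name `render_findings` are hypothetical; the headings and bullet shape follow the report structure above.

```python
# Severity tiers in report order, paired with their section headings.
TIERS = [
    ("error", "Errors (must fix)"),
    ("warning", "Warnings (should fix)"),
    ("needs-client-input", "Needs client input"),
    ("suggestion", "Suggestions (nice to have)"),
]


def render_findings(findings: list[dict]) -> str:
    """Render findings grouped by severity, skipping tiers with no issues."""
    lines = []
    for severity, heading in TIERS:
        tier = [f for f in findings if f["severity"] == severity]
        if not tier:
            continue  # omit the whole section when the tier is empty
        lines.append(f"### {heading}")
        for f in tier:
            lines.append(f"- **{f['location']}** [{f['tag']}]: {f['problem']} → {f['fix']}")
    return "\n".join(lines)
```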
9. Offer fixes and re-evaluate (bounded loop, hard cap 3)
If the report has errors or warnings, this skill — and only this skill — drives the fix-and-recheck loop. Build's Step 8 and validate's Output Format defer here; never run a parallel fix loop in those skills.
For each iteration (max 3):
- Offer fixes via `AskUserQuestion`:
  - If errors exist: single option `Fix all <N> errors and ask about warnings`, plus `Stop (accept remaining issues)`.
  - If only warnings exist: present them as `multiSelect: true` so the user picks which to fix. Include `Stop`.
  - Suggestions are never auto-fixed; mention them but don't include them in the offer.
- If the user opts in: delegate the edits to `lynk-build` Steps 6–7 (plan and confirm, then execute). Build re-reads the relevant docs as part of its normal flow, so every fix attempt stays doc-grounded.
- Re-check locally: re-run Steps 3 and 6 only (re-read the edited files; re-do local checks). Skip Step 4 (engine/docs unchanged), Step 5 (backend won't see uncommitted changes), and Step 7 (don't re-dispatch warehouse calls inside the loop; runtime issues that were already reported will still be reported on commit, and running mid-loop would burn the user's warehouse credits per iteration).
- Repeat with the iteration number in the prompt (`Attempt 2 of 3: <N> issues remain. Fix? Stop?`). After iteration 3, exit unconditionally even if issues remain. Tell the user: "Reached the 3-attempt cap. <N> issues remain; fix manually, or re-run `lynk-evaluate` to start a fresh loop." This cap is non-negotiable; it prevents runaway loops if the agent can't converge.
- If the user picks Stop at any iteration, exit immediately and leave the remaining issues in the final report.
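The bounded loop above can be sketched as a small Python driver. The callables `ask_user`, `apply_fixes`, and `recheck` are hypothetical stand-ins for the `AskUserQuestion` prompt, the `lynk-build` delegation, and the local re-run of Steps 3 and 6; only the control flow (hard cap of 3, immediate exit on Stop) reflects the text.

```python
MAX_ATTEMPTS = 3  # hard cap from the step above


def fix_loop(issues, ask_user, apply_fixes, recheck):
    """Run at most MAX_ATTEMPTS fix-and-recheck rounds; return remaining issues."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if not issues:
            break  # nothing left to fix
        # ask_user returns False when the user picks Stop.
        if not ask_user(attempt, len(issues)):
            return issues
        apply_fixes(issues)
        issues = recheck()  # local checks only; no backend or warehouse calls
    return issues  # after the third attempt, exit even if issues remain
```

Whatever `fix_loop` returns is what stays in the final report, whether the loop ended by convergence, by Stop, or by hitting the cap.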
Backend re-check (post-loop). If Step 5 reported backend issues and any fixes were applied during the loop, ask the user once: "Commit your fixes and re-run the backend check?". If yes, run the full lynk-validate flow (it handles commit-and-push and the API call). Otherwise, leave the original backend findings in the report annotated (initial check; may be stale after local fixes).
Runtime re-check (post-loop). If Step 7 ran and reported local/examples-runtime issues and fixes were applied, ask the user once: "Re-run the affected examples / evaluations against the warehouse to confirm they now execute?". If yes, re-dispatch only the previously-failing queries — not the full set.
For full graph evaluation, group findings by entity/file rather than by severity tier, so the user can focus on one entity at a time.
For evaluations evaluation, group findings by evaluation name and add a coverage summary at the top showing entity distribution.
Output Format
- Always state the detected engine on the summary line so the user knows which dialect rules were applied.
- Use code blocks when quoting YAML field names, feature names, or SQL snippets.
- Reference exact file paths so the user can navigate directly.
- Be specific about locations: say `player.yml → feature: career_points → metric: nonexistent_metric`, not just "a feature has an issue".
- Lead with the most important findings; don't bury critical errors at the bottom.
- If you find issues in the files, offer to fix them — but only after completing the full report.