name: mfs-find
version: 0.4.0
mfs_compat: ">=0.4,<0.5"
description: >-
Search, grep, browse, and read across registered MFS data sources via the
mfs CLI — codebases, docs, PDFs, web crawls, databases
(postgres/mysql/mongo/snowflake/bigquery), issue trackers
(jira/linear/github), CRMs (hubspot), chat (slack/discord/gmail/feishu),
object stores (s3/gdrive). Use whenever the user asks to find, locate,
look up, look across, or read something out of an already-configured MFS
index. Trigger phrases include "search the codebase for", "find anywhere
about", "where is X mentioned", "look across our [slack/jira/postgres/etc]",
"any past tickets/RFCs/commits about", "what does our wiki say about",
"cat / head / tail / ls / tree this MFS path". Do NOT use for:
registering a NEW data source (use mfs-ingest), changing connector config,
kicking off re-ingest, or any write/delete operation — mfs is read-only.
MFS — find / read across configured sources
1. What MFS is
A retrieval layer that exposes many kinds of content as a unified path tree and makes that tree searchable through one hybrid index:
- One CLI (
mfs), one mental model. Local dir, Postgres, GitHub repo, Slack workspace, S3 bucket, BigQuery dataset — all addressed as paths under their<scheme>://URI. Same verbs everywhere:ls / tree / cat / head / tail / grep / search / export. - One hybrid index. Dense vectors (semantic) + BM25 (keyword) fused per query — covers conceptual recall and exact-token recall in one call.
- POSIX-style locators. Every search hit carries a
locatorthat reopens the exact unit:{"lines":[s,e]}for text/code, a PK dict for rows/issues/threads.
2. When to use MFS — and when NOT to
| Situation | Use MFS? |
|---|---|
| 1000+ files / rows / pages, you don't know where the answer is | ✅ |
| Cross-source question ("any past tickets / commits / RFCs about X") | ✅ --all |
| Concept-style query that won't match literally | ✅ --mode semantic |
| You already know the exact file + roughly where to look | ❌ plain cat/grep |
| Exact identifier / error code in 5 files you can list | ❌ plain grep/rg |
| Real-time tailing of a live log | ❌ index lags ingest |
| The source isn't in MFS yet | wrong skill, use mfs-ingest to register first |
Rule: use the smallest tool that answers the question. MFS pays off
when the scope is too big for rg.
Borderline — ASK the user:
| Ask | Likely answer | Why |
|---|---|---|
| "Summarise these 10 PDFs" | ✅ mfs search + cat --peek per hit |
each PDF gets a converted_md artifact + searchable chunks |
| "Find similar tickets to this one" | ✅ paste the ticket text as the search query | semantic over row_text does similarity matching |
| "Watch for new slack messages" | ❌ no watch capability; use Slack's API |
index lags ingest |
| "Look up user 12345" | ❌ mfs cat <source> --locator '{"id":12345}' directly (skip search) |
one-record-by-id doesn't need ranking |
3. Pre-flight — confirm the source is indexed
Before running any query, especially on cross-source asks:
mfs status # server up? any connectors registered?
mfs connector inspect <uri> # this connector's object/job summary
mfs ls <uri> --json # per-entry capabilities + indexable / search_status
- Server unreachable → tell user to start it (
mfs serve startif self-hosted), or this skill can't proceed. connectorsempty → user hasn't ingested anything yet. Redirect tomfs-ingest— don't try to search nothing.search_status: unavailablefor the target URI → onlygrep/ls/catwork; offer those or redirect tomfs-ingestfor a re-sync.building→ sync in flight; fall back tomfs grep(works without an index) until done.partial→ recall incomplete but usable; flag the caveat to the user.
4. The core workflow: search → locate → browse
search locate browse
┌──────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ semantic + BM25 │ → │ result has lines │ → │ cat --range / cat │
│ finds candidates│ │ or a locator │ │ --peek to confirm │
└──────────────────┘ └──────────────────┘ └─────────────────────┘
On large corpora this loop is the whole point: read only the part that matters. On small corpora it's still fine, just lighter.
Concrete:
- Search:
mfs search "<what the user actually wants>" <path-or-uri> --top-k 10 - Locate — every hit's envelope carries
locator:- text/code →
{"lines":[start,end]}→mfs cat <source> --range start:end - structured (row/issue/thread) → PK dict →
mfs cat <source> --locator '{...}' - once-per-object (dir/schema summary, image VLM) →
null→mfs cat <source>
- text/code →
- Browse — verify only what's needed:
mfs cat --peek <file> # outline (headings / function signatures) mfs cat --skim <file> # peek + one-line summaries per section mfs head -n 20 <uri> # first records of a structured object mfs tree <uri> -L 2 # subtree shape
5. Index requirement rules of thumb
mfs searchrequires an index.mfs grepworks without — pushdown → BM25 → linear scan fallback.mfs ls / tree / cat / head / tailbrowse without an index.
6. Search modes
mfs search defaults to hybrid. Override only when you know why.
--mode |
Mechanic | When |
|---|---|---|
hybrid (default) |
dense + BM25 fused with RRF | almost always |
semantic |
dense only | conceptual query, wording won't match literally |
keyword |
BM25 only | exact-term (config key, error code) without semantic drift |
Other useful flags:
--top-k N— default 10; raise to 20-30 on a weak first round.--all— search every registered connector. Otherwise scope to a path/URI prefix.--kind <list>— restrict chunk kinds (row_text,thread_aggregate,chunk_body,summary,vlm_description, …).--collapse— keep only the top-scoring chunk per object; later chunks from the same source are dropped, not merged. Recall stays as-is (the query still hits the same candidates), but the visible result count can fall below--top-k— collapse is a post-filter, not a re-rank. If you want N distinct objects, raise--top-k(e.g.--top-k 30 --collapse).
--all: when yes, when no
- ✅ Cross-source recall — "any past tickets / commits / RFCs / slack about X".
- ❌ You know the source — scope to
slack://; postgres + jira + docs together aren't comparable. - ⚠ More than ~5 registered connectors — ASK the user whether to fan out widely or scope to the 2-3 likeliest sources first.
7. Decision tree — pick the smallest useful tool
| Signal in the ask | Sub-task | Use |
|---|---|---|
| natural-language question / sentence | exploratory | mfs search "<q>" <scope> |
| paraphrased / conceptual wording | semantic-only | mfs search --mode semantic |
| exact identifier / config key / unique phrase | literal anchor | mfs grep "<lit>" <path> (or rg) |
| filename / directory pattern | path lookup | find / shell glob / fd |
| known file + needs outline | structural overview | mfs cat --peek <file> |
| known file + compact summary | dense overview | mfs cat --skim <file> |
| search hit + surrounding context | reopen | mfs cat <file> --range s:e |
| structured hit (row/issue/thread) | reopen by PK | mfs cat <source> --locator '{...}' |
| several close candidates | compare | mfs cat --peek each, then pick |
| single record + known key | no-search lookup | mfs cat <source> --locator '{"id":12}' |
| first / last N | sample | mfs head -n N / mfs tail -n N |
| subtree shape | orient | mfs tree -L 2 <uri> |
| full object for offline tooling | export | mfs export <uri> <file> |
mfs search requires an explicit scope or --all.
8. Command cheat sheet
Search
mfs search "<query>" <path-or-uri> # default: hybrid, top-k=10
mfs search "<query>" --all # whole namespace
mfs search "<query>" <path> --top-k 20 # more candidates
mfs search "<query>" <path> --mode semantic # dense-only
mfs search "<query>" <path> --mode keyword # BM25-only
mfs search "<query>" <path> --kind row_text # restrict chunk kinds
mfs search "<query>" <path> --collapse # keep top hit per object (post-filter, may return < top-k)
Grep
mfs grep "<pattern>" <path> # pushdown -> BM25 -> linear
mfs grep is not grep. The three-tier dispatch is:
- Pushdown — for structured connectors (
postgres,mongo,jira, …) the pattern is shipped to the source as a LIKE / regex filter. Literal-exact, token-level; no regex on most structured connectors. - BM25 over indexed objects — for
body/code/documentchunks already in Milvus, the pattern is fed through the same sparse indexsearch --mode keyworduses. That is a tokenized, ranked lookup, not a literal substring scan: an analyzer split likegetUserId→get,user,idwill rankuserIdas a hit; a CJK pattern with no analyzer match returns nothing even when the literal bytes are present. If you need "does this exact byte string appear anywhere?",mfs exportthe object and runrglocally —mfs grephas no "force linear over indexed objects" flag. - Linear scan — only for not-indexed files in scope (file
connector before
mfs add). True substring / regex.
For exact-exhaustive on a huge structured object, mfs export then
local grep / rg.
Read
mfs cat <path> # full content (refused if "lazy")
mfs cat <path> --range A:B # lines A..B-1 (1-based, end-exclusive)
mfs cat <path> --locator '{"id":12}' # reopen a structured record
mfs cat <path> --peek # outline only
mfs cat <path> --skim # peek + per-section summaries
mfs cat <path> --meta # stat-style, not content
Density ladder:
| Mode | Use it when |
|---|---|
--peek |
"show me the outline" |
--skim |
+ one-line summary per section, still concise |
| (default) | full content; small file or really need it |
--range A:B |
already know which lines matter (e.g. search hit) |
mfs head -n 50 <path> # first 50 lines/records
mfs tail -n 50 <path> # last 50; native-accel reverse read
For a lazy rows.jsonl / messages.jsonl, head is how to see record
shape without paying full-scan cost.
Browse
mfs ls <uri> # one level
mfs tree <uri> -L 2 # depth-bounded recursive
NOT a substitute for search when the target is unknown and conceptual.
Export
mfs export <uri> <out-file> # full object to disk for jq/awk pipelines
cat of a huge lazy object is refused — use export for bulk processing.
Status (useful before AND during search work)
mfs status # server + all connectors
mfs connector inspect <uri> # one connector's object/job summary
mfs connector list # list registered connectors
mfs job list # recent indexing jobs (background re-syncs)
Always prefer --json when output will be parsed.
9. Weak results → recover, don't thrash
If top hits look off-topic:
- Rewrite with synonyms / domain terms. ASK the user for the domain term they'd actually use if vague. One clarifier beats five blind queries.
- Raise
--top-kto compare distinct candidates. mfs cat --peekthe top few to compare structure.- Switch mode — semantic if hybrid was keyword-noisy; keyword if specific terms should be the anchor.
- Then literal
grep— only if the task has a real literal anchor (error code, config key, identifier).
Literal search is a different tool, not a stronger version of semantic.
10. Candidate selection
Think at object level, not just chunk level:
- Merge repeated hits from the same object into one candidate.
- Compare the top distinct candidates'
--peekwhen titles or snippets look adjacent. - Prefer an object whose main topic directly matches the request over a broad overview that mentions it.
- Multi-part prompts (two entities, setup + troubleshooting, migration source + target) — check whether more than one object is needed.
mfs search "<query>" <path> --top-k 20
mfs cat --peek <candidate-a>
mfs cat --peek <candidate-b>
mfs cat <best> --range <start>:<end>
11. Anti-patterns
- Don't grep to "confirm" a successful semantic hit. The hit's snippet IS the source content; trust it.
- Don't read a whole large file when
--peek/--skim/--rangecan answer. - Don't blindly pick rank #1 when #1-#3 are clearly different objects.
- Don't stop at one match if the prompt mentions multiple entities.
- Don't search the same vague words after a weak first round — fix the query or escalate to literal anchors.
- Don't
cata lazy object (DBrows.jsonl, SaaSrecords.jsonl, chatmessages.jsonl). Usehead,--range,--locator, orexport. - Don't use MFS for sources you'd just clone/download anyway — pull
locally and use the
fileconnector.
12. When search returns nothing on a freshly indexed connector
This is the most common diagnostic case. Walk this ladder, stop on first hit:
# 1. THIS connector's object/job summary
mfs connector inspect <uri>
# 2. failed sync jobs?
mfs job list
mfs job show <job-id> # if any failed, read the job's error field
# 3. per-object granularity
mfs ls <uri> --json # each entry's search_status
# 4. JSON error code path
# If --json output carried a `code` field, see reference/error-codes.md
# 5. otherwise treat as query-level — §9 above
| Signal | Meaning | Action |
|---|---|---|
building |
sync in flight | wait, or mfs grep until done |
partial |
chunks dropped (chunk_max / max_read_rows) |
usable but incomplete; user may want to re-ingest with raised caps (→ redirect to mfs-ingest) |
unavailable |
nothing indexed | only grep / ls / cat work; redirect to mfs-ingest |
available but smoke search empty |
wrong text_fields / source empty / wrong scope |
check reference/connectors/<scheme>.md for that connector's shape; mfs cat a known object to confirm content |
When the diagnosis points to "ingest config is wrong" or "needs re-sync
with different settings" — don't try to fix it from this skill. Tell
the user to invoke mfs-ingest for that connector.
13. Reference routing
These reference files are loaded ONLY when the situation matches — don't open speculatively.
reference/json-envelope.md— WHEN parsing a--jsonsearch/grep result and thelocatorshape is unfamiliar (composite PKs,thread_ts, nested keys); OR when uncertain how to feed a hit back intomfs cat(range vs locator dispatch).reference/error-codes.md— WHEN anmfscommand returned--jsonerror output with acodefield. Read the message first — don't open just because a command failed.reference/connectors/<scheme>.md— WHEN searching a specific connector and you need its tree shape, record field semantics, locator format, or search-strategy tips. STOP and read the matching one BEFORE guessing how that source enumerates objects or what fields its records carry. Schemes:file,web,s3,gdrive,postgres,mysql,snowflake,bigquery,mongo,github,jira,linear,hubspot,notion,zendesk,slack,discord,gmail,feishu.
Runtime capability for a specific URI is queried structurally via
mfs ls <uri> --json; the static per-connector references describe what
the connector exposes by design.