sourcegraph-explorer

name: sourcegraph-explorer
description: "Search and explore Databricks source code via Sourcegraph to answer questions about how the platform works, why things are implemented a certain way, what the limitations are, and how services interact. Use when the user asks about Databricks internals, platform behavior, service architecture, error causes, or feature limitations."

Sourcegraph Code Explorer

Answer questions about the Databricks platform by searching and reading source code through Sourcegraph.

First-time setup check

Before doing anything, verify the tools are available:

which sg-search sg-read sg-cookie-refresh

If any are missing, tell the user:

The Sourcegraph tools aren't set up yet. Run /sourcegraph-setup first to install them.

Then stop — do not proceed until setup is complete.
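
The check above can be scripted so that the missing tools are named explicitly. A minimal sketch, assuming a POSIX shell; `missing_tools` is a hypothetical helper, not part of the Sourcegraph tooling:

```shell
# Print the name of every argument that is not an executable on PATH.
# missing_tools is a hypothetical helper name used only for illustration.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools sg-search sg-read sg-cookie-refresh` prints nothing when all three are installed; any output means setup still has to be run.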

Tools

`sg-search` — search code

sg-search "query string"

Returns JSON lines: {"repo": "...", "path": "...", "line": N, "text": "..."}. Summary stats (match count, duration) go to stderr.

`sg-read` — read a file

sg-read <repo> <filepath> [start_line] [end_line]

Returns file content with line numbers. Use line ranges for large files.

Examples:

sg-read databricks-eng/universe proto/clusters/v1/clusters.proto
sg-read databricks-eng/runtime sql/core/src/main/scala/SomeFile.scala 50 120

`sg-cookie-refresh` — refresh authentication

sg-cookie-refresh           # auto-detect browser (tries Arc, Chrome, Safari)
sg-cookie-refresh chrome    # force a specific browser

Extracts Sourcegraph session cookies directly from the browser's cookie database and writes them to /tmp/sg_cookie.txt. The script validates that the cookies work before finishing.

Authentication

These scripts authenticate via browser session cookies (Sourcegraph access tokens are admin-disabled). The cookie file at /tmp/sg_cookie.txt is managed by sg-cookie-refresh.

If a search returns an error, empty results, or an HTML redirect, the cookie has most likely expired. Refresh it:

sg-cookie-refresh

If that reports EXPIRED, the user needs to open https://sourcegraph.prod.databricks-corp.com in their browser and log in, then re-run the script.
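
The failure signals listed above can be checked mechanically before retrying. This is a heuristic sketch only: `looks_expired` is a hypothetical helper, and treating output that starts with `<` as an HTML redirect is an assumption about the failure mode, not documented behavior of `sg-search`.

```shell
# Return 0 (true) when captured search output suggests an expired cookie:
# either nothing came back, or the first non-blank line looks like HTML.
# Heuristic only; the real failure output of sg-search may differ.
looks_expired() {
  output=$1
  # First non-blank line, with leading whitespace stripped.
  first_line=$(printf '%s\n' "$output" | sed -n '/[^[:space:]]/{s/^[[:space:]]*//;p;q;}')
  case $first_line in
    '')   return 0 ;;   # empty output
    '<'*) return 0 ;;   # HTML page, e.g. a login redirect
    *)    return 1 ;;   # anything else is treated as real output
  esac
}
```

A caller could capture `out=$(sg-search "..." 2>/dev/null)` and run `sg-cookie-refresh` once when `looks_expired "$out"` succeeds. Because an empty result can also mean the query matched nothing, retry once after refreshing rather than looping.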

Repositories

Repo	Contents	Filter
databricks-eng/universe	Services, APIs, frontend, infrastructure, proto definitions, feature flags, BUILD configs	`repo:databricks-eng/universe`
databricks-eng/runtime	Spark runtime (DBR), cluster components, execution engine	`repo:databricks-eng/runtime`

Search both by default. Narrow to one when the question clearly belongs to a specific repo.

Universe structure (key directories)

proto/ — Protobuf service and message definitions (the API contracts)
feature-flag/ — Feature flag configurations (Jsonnet files)
webapp/web, accounts-ui/web — Frontend applications
common/ — Shared libraries
access-control*, auth* — Authorization and authentication
api-server/, api-client/ — API layer
cluster-* — Cluster management
billing* — Billing and metering
Service directories are generally named after the service (e.g., sql-endpoint/, model-serving/)

Search Strategy

Follow this iterative loop — the same grep-read-grep-read cycle that works locally, but via Sourcegraph.

Step 1: Find entry points

Start broad to locate where the relevant code lives.

sg-search "feature flag evaluation repo:databricks-eng/universe count:10"

For error messages or user-facing strings, search the exact text:

sg-search '"workspace limit exceeded" repo:databricks-eng/universe count:10'

For API endpoints, search the path:

sg-search '"/api/2.0/clusters/create" repo:databricks-eng/universe count:10'

Step 2: Narrow with filters

Once you know the general area, add filters to cut noise:

Exclude tests: -file:test -file:Test -file:spec
Language: lang:scala, lang:python, lang:java, lang:go, lang:rust, lang:typescript
File path: file:\.scala$, file:proto/, file:src/main
Directory: file:sql-endpoint/ to scope to a service

Example progression:

sg-search "ClusterCreateRequest repo:databricks-eng/universe count:10"
sg-search "ClusterCreateRequest repo:databricks-eng/universe lang:scala -file:test count:10"
sg-search "ClusterCreateRequest repo:databricks-eng/universe file:proto/ count:10"
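
The progression above amounts to appending filters to a base query. A small sketch that assembles such query strings; `build_query` is a hypothetical helper, and it only combines filters already listed in this document:

```shell
# Assemble a Sourcegraph query string: search term, repo filter, optional
# extra filters, and a count: cap. build_query is a hypothetical helper.
# Usage: build_query TERM REPO COUNT [FILTER...]
build_query() {
  term=$1
  repo=$2
  count=$3
  shift 3
  query="$term repo:$repo"
  for filter in "$@"; do
    query="$query $filter"
  done
  printf '%s count:%s\n' "$query" "$count"
}
```

For example, `sg-search "$(build_query ClusterCreateRequest databricks-eng/universe 10 file:proto/)"` reproduces the third search in the progression.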

Step 3: Read the code

When you find a relevant file, read it to understand context:

sg-read databricks-eng/universe path/to/File.scala

For large files, use line ranges (read ~100 lines around the match):

sg-read databricks-eng/universe path/to/File.scala 50 150
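
Picking the range is simple arithmetic on the match's line number. A sketch, assuming a default window of 50 lines on each side (about 100 lines in total); `read_window` is a hypothetical helper that only prints the two numbers to pass to `sg-read`:

```shell
# Print "START END" for a window of roughly 100 lines centred on a match.
# The start is clamped to 1 so matches near the top of a file still work.
# Usage: read_window MATCH_LINE [HALF_WIDTH]
read_window() {
  line=$1
  half=${2:-50}
  start=$((line - half))
  if [ "$start" -lt 1 ]; then
    start=1
  fi
  end=$((line + half))
  printf '%s %s\n' "$start" "$end"
}
```

With a match on line 100, `sg-read databricks-eng/universe path/to/File.scala $(read_window 100)` expands to the `50 150` range used above. The command substitution is left unquoted on purpose so the two numbers become separate arguments.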

Step 4: Follow the trail

From what you've read, identify the next thing to search for:

A function is called → search for its definition: "def functionName" lang:scala
A class is used → search for its definition and usages
A proto message is referenced → search file:proto/ for its definition
A feature flag is checked → search file:feature-flag/ for its configuration
A config constant is used → search for where it's defined: "val CONSTANT_NAME"
An error is thrown → trace back to what condition triggers it
A trait/interface is extended → search for "extends TraitName" or "with TraitName"
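
The mapping from "what I just read" to "what to search next" can be written down as a lookup. A sketch covering the cases above that have a fixed query shape; `next_query` and its kind names (`function`, `proto`, `flag`, `constant`, `trait`) are hypothetical labels chosen for illustration:

```shell
# Map what was just read to the next search, following the list above.
# next_query is a hypothetical helper; unknown kinds fall back to the bare name.
# Usage: next_query KIND NAME
next_query() {
  kind=$1
  name=$2
  case $kind in
    function) printf '"def %s" lang:scala\n' "$name" ;;
    proto)    printf 'file:proto/ %s\n' "$name" ;;
    flag)     printf 'file:feature-flag/ %s\n' "$name" ;;
    constant) printf '"val %s"\n' "$name" ;;
    trait)    printf '"extends %s"\n"with %s"\n' "$name" "$name" ;;
    *)        printf '%s\n' "$name" ;;
  esac
}
```

The `trait` case prints two queries, one per line, because a Scala trait can be mixed in with either keyword. Append a `repo:` filter and `count:N` before passing a query to `sg-search`.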

Step 5: Check related artifacts

Depending on the question, also search for:

Proto definitions: file:proto/ MessageName — the API contract
Feature flags: file:feature-flag/ flagName — whether a feature is gated
BUILD files: file:BUILD.bazel serviceName — what depends on what
Config/limits: search for constants, env vars, or config keys
Error messages: search the exact user-facing error string

Step 6: Synthesize

After 2-5 iterations of search-read-follow, present your answer with:

Direct answer to the question
Code references — cite specific files and line numbers (format: repo/path/to/file.ext:L123)
The chain of reasoning — briefly explain how you traced through the code
Caveats — note if you couldn't find definitive proof, if the behavior might be behind a feature flag, or if there are multiple code paths
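
Code references in the format above can be derived directly from `sg-search` output. A sketch, assuming each result line has the shape shown in the Tools section, with keys in the order repo, path, line, and no escaped quotes in the repo or path. `cite_matches` is a hypothetical helper; a real JSON parser is more robust than this `sed` pattern.

```shell
# Read sg-search JSON lines on stdin and print repo/path:L<line> citations.
# Assumes the key order repo, path, line and no escaped quotes in the values;
# lines that do not match that shape are silently skipped.
cite_matches() {
  sed -n 's|^{"repo": *"\([^"]*\)", *"path": *"\([^"]*\)", *"line": *\([0-9][0-9]*\),.*|\1/\2:L\3|p'
}
```

Piping a search through it, as in `sg-search "..." 2>/dev/null | cite_matches`, yields one citation per match, ready to paste into the answer.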

Query Syntax Reference

Syntax	Purpose	Example
`repo:org/name`	Filter to repository	`repo:databricks-eng/universe`
`file:path`	Filter to file path (regex)	`file:\.scala$`
`-file:path`	Exclude file path	`-file:test`
`lang:name`	Filter by language	`lang:scala`
`type:symbol`	Search symbols only	`type:symbol ClusterManager`
`type:diff`	Search diffs/changes	`type:diff removed feature`
`type:commit`	Search commit messages	`type:commit "fix cluster limit"`
`case:yes`	Case-sensitive	`case:yes MAX_NODES`
`"exact phrase"`	Exact match	`"permission denied"`
`/regex/`	Regular expression	`/cluster.limit.\d+/`
`OR`	Boolean OR	`ClusterManager OR ClusterService`
`NOT`	Boolean NOT	`ClusterCreate NOT test`
`count:N`	Max results	`count:20`
`repo:org/name@branch`	Specific branch	`repo:databricks-eng/universe@main`

Common Investigation Patterns

"Why can't customers do X?"

Search for the error message they see
Find the validation/check that produces it
Trace the condition — is it a hard limit? feature flag? permission check?
Check if there's a feature flag that gates it
Look for related config constants or limits

"How does service/feature X work?"

Search for the service name in proto definitions (the API contract)
Find the main handler/controller class
Read the core logic, following key method calls
Check what other services it calls (look for RPC/HTTP client usage)
Identify the data flow: request → validation → processing → response

"What are the limits/constraints of X?"

Search for constants: MAX_, LIMIT_, DEFAULT_
Search for validation methods related to the feature
Check config files and feature flags
Look for error messages about limits being exceeded
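
The constant search in the first step can be generated per prefix. A sketch that prints one case-sensitive query for each of the prefixes named above, scoped to a repository and a service directory; `limit_queries` is a hypothetical helper, and the `count:10` cap follows the default suggested in the Tips section:

```shell
# Print one query per constant prefix, scoped to a repo and directory.
# limit_queries is a hypothetical helper; it only prints query strings.
# Usage: limit_queries REPO DIR
limit_queries() {
  repo=$1
  dir=$2
  for prefix in MAX_ LIMIT_ DEFAULT_; do
    printf 'case:yes %s repo:%s file:%s -file:test count:10\n' "$prefix" "$repo" "$dir"
  done
}
```

Each printed line can be passed to `sg-search` as a single quoted argument, for example by reading the output with `while IFS= read -r q; do sg-search "$q"; done`.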

"What changed recently in X?"

Use type:diff search to find recent changes
Use type:commit to search commit messages
Focus on the relevant directory/service

"How do services X and Y interact?"

Find proto definitions for both services
Search for client/stub usage of one service within the other
Look for shared proto messages or common dependencies
Check BUILD.bazel deps to understand the dependency graph

Tips

Proto files (*.proto) are the best starting point for understanding any API — they define the contract.
Feature flags in feature-flag/ are Jsonnet files. Search there to understand what's gated.
If a search returns too many results, add -file:test -file:mock -file:fake to exclude test infrastructure.
Services in universe typically follow a pattern: proto/ defines the API, a top-level directory contains the implementation, and BUILD.bazel files show dependencies.
When tracing Scala code, look at extends and with clauses (for example, extends ConsoleLogging) to understand which traits are mixed in.
For Spark/runtime questions, start in the runtime repo. For everything else, start in universe.
Always include count:N in searches to control result volume. Start with count:10, increase if needed.
Use sg-read with line ranges when files are large — reading 100 lines at a time keeps context manageable.