
ELE-11: Add YAML-mediated search pipeline to Bergson#246

Open
jammastergirish wants to merge 10 commits into main from add_yaml-mediated_search_pipeline

Conversation

@jammastergirish
Collaborator

@jammastergirish jammastergirish commented Apr 28, 2026

See notes in Linear on the test, which seemed to expose a bug deeper in the library. (I've now dealt with that bug here.)

Adds a generic bergson pipeline <file.yaml> command that runs a sequence of existing bergson subcommands defined in a YAML file. This unblocks pipelines (build_EKFAC.yaml, query_EKFAC.yaml, …) without writing a new Python entry point per pipeline.

  • New module bergson/yaml_pipeline.py: parse_pipeline() reads the YAML once, validates its shape, and hydrates each step into the matching command dataclass via simple_parsing's Serializable.from_dict. Defaults from the existing dataclasses fill in any unspecified fields automatically. All steps are parsed up front, so a config error in step N fails before step 1 burns compute. run_pipeline() parses, then executes in order.
  • bergson/__main__.py: adds a third branch to main() for pipeline mode alongside the existing single-command-from-file and CLI-flag modes. Hoists the {name: class} registry out of the existing branch so both branches share it.
  • pyproject.toml: declares pyyaml as a direct dep. (It was already required transitively by Serializable.load for the existing single-command YAML flow; this just documents that.)
  • Example examples/pipelines/hessian_then_build.yaml: a minimal hessian-then-build pipeline.
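As a rough illustration of the parse-then-run flow described above, here is a minimal sketch. Only parse_pipeline, run_pipeline, and the {name: class} registry idea come from this PR; the stand-in dataclasses and the hydrate() helper are hypothetical flattened substitutes for the real command dataclasses and simple_parsing's Serializable.from_dict.

```python
from dataclasses import dataclass, fields

@dataclass
class HessianConfig:  # hypothetical stand-in for a real command dataclass
    method: str = "kfac"

@dataclass
class BuildConfig:  # hypothetical stand-in; defaults fill unspecified fields
    run_path: str = "runs/example"

REGISTRY = {"hessian": HessianConfig, "build": BuildConfig}

def hydrate(cls, cfg: dict):
    # Naive stand-in for Serializable.from_dict: unknown keys fail fast,
    # missing keys fall back to the dataclass defaults.
    known = {f.name for f in fields(cls)}
    unknown = set(cfg) - known
    if unknown:
        raise ValueError(f"unknown keys for {cls.__name__}: {unknown}")
    return cls(**cfg)

def parse_pipeline(doc) -> list:
    # `doc` is the already-loaded YAML (e.g. the output of yaml.safe_load).
    if not isinstance(doc, list):
        raise TypeError("pipeline must be a YAML list of single-key mappings")
    steps = []
    for i, entry in enumerate(doc):
        if not (isinstance(entry, dict) and len(entry) == 1):
            raise ValueError(f"step {i} must be a single-key mapping")
        (name, cfg), = entry.items()
        cls = REGISTRY.get(name.lower())  # command names are case-insensitive
        if cls is None:
            raise KeyError(f"unknown command {name!r} at step {i}")
        steps.append(hydrate(cls, cfg or {}))
    return steps  # every step validated before any step runs

def run_pipeline(doc):
    for step in parse_pipeline(doc):  # all-or-nothing parse up front
        print(f"running {type(step).__name__}")
```

The point of splitting parse from run is the fail-fast property: a typo in step 3's config raises before step 1 spends any compute.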

YAML format

A pipeline is a list of single-key mappings. Each entry's key is a registered command name (case-insensitive), and its value is the same config tree that command accepts as a single-command YAML, so each step is fully self-contained.

- hessian:
    hessian_cfg:
      method: kfac
    index_cfg:
      run_path: runs/example
      model: gpt2
- build:
    index_cfg:
      run_path: runs/example
      model: gpt2
    preprocess_cfg: {}

A list (not a mapping) is required because YAML mappings can't preserve order or duplicates, and pipelines need both.
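The duplicate-key point is easy to see even without YAML, since loaders such as PyYAML's safe_load hand back plain Python dicts and lists. A dict silently collapses repeated keys (in Python literals the last one wins), so two build steps could never coexist; a list of single-key dicts keeps both, in order:

```python
# A mapping collapses duplicate keys: the first "build" step is silently lost.
as_mapping = {"build": {"run": "a"}, "build": {"run": "b"}}
# A list of single-key mappings preserves order and duplicates.
as_list = [{"build": {"run": "a"}}, {"build": {"run": "b"}}]

print(len(as_mapping))  # 1 -- only one build step survives
print(len(as_list))     # 2 -- both steps survive, in order
```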

What's deferred

  • Shared values across steps (e.g. one run_path injected into every step) are not supported — each step must currently repeat its config. Easy to layer on later via a top-level defaults: block; should we do this, @luciaquirke?
  • Test_Model_Configuration can't be a pipeline step because DiagnoseConfig isn't Serializable. This is a pre-existing limitation — bergson test_model_configuration <yaml> doesn't work today either — and out of scope here.

Follow-up needed for build_EKFAC.yaml and query_EKFAC.yaml

  • build_EKFAC.yaml would work today by wrapping the existing hessian command.
  • query_EKFAC.yaml is partially blocked: step 3 of the EKFAC flow ("apply inverse Hessian to the mean query gradient") is not currently exposed as a top-level CLI command — it lives as apply_worker inside bergson/hessians/pipeline.py. A small follow-up PR will lift it into an apply_hessian command so the full query pipeline can be expressed as YAML.

(@luciaquirke I had a go at this with Claude here but closed that PR so we can focus on this one. Not looked at that yet.)

@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
All committers have signed the CLA.

@luciaquirke
Collaborator

@norabelrose: bug report in the chunk_length logic

When chunk_length > 0, the resulting dataset doesn't have a length column, but hessian_worker and build both rely on
index_cfg.token_batch_size and ds["length"] to allocate batches.

File "/root/bergson/bergson/hessians/hessian_approximations.py", line 159, in hessian_worker
batches = allocate_batches(ds["length"][:], index_cfg.token_batch_size)
ValueError: Column 'length' doesn't exist.

This is a real bug in the chunked path of setup_data_pipeline: tokenize_and_chunk returns columns input_ids and
doc_ids only, but downstream code (hessian_worker, build, score) all batch via ds["length"]. The non-chunked path
gets length for free from tokenizer(..., return_length=True).
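For context, the greedy token-budget packing that a function like allocate_batches performs might look like the sketch below. Only the signature (a list of lengths plus index_cfg.token_batch_size) is taken from the traceback; the actual bergson implementation may differ.

```python
def allocate_batches(lengths, token_batch_size):
    """Greedily group example indices so each batch stays within a token budget.
    Hypothetical sketch; only the signature comes from the traceback above."""
    batches, current, budget = [], [], 0
    for idx, n in enumerate(lengths):
        # Start a new batch once adding this example would exceed the budget.
        if current and budget + n > token_batch_size:
            batches.append(current)
            current, budget = [], 0
        current.append(idx)
        budget += n
    if current:
        batches.append(current)
    return batches
```

This is exactly the call site that crashes when the `length` column is missing: there is nothing to feed the `lengths` argument.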

Cheapest fix: add a constant length column in the chunked branch of setup_data_pipeline (every chunk is chunk_length
tokens by construction).
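A minimal sketch of that fix, shown on a plain dict of columns standing in for the Hugging Face dataset (the real change would live in the chunked branch of setup_data_pipeline, e.g. via Dataset.add_column or inside tokenize_and_chunk):

```python
def add_constant_length(ds: dict, chunk_length: int) -> dict:
    """Synthesize the `length` column for the chunked path.

    Every chunk is exactly chunk_length tokens by construction, so the
    column is a constant. `ds` is a dict of columns standing in for the
    HF dataset; this is a sketch, not the actual bergson code.
    """
    n = len(ds["input_ids"])
    ds["length"] = [chunk_length] * n  # one entry per chunk, all equal
    return ds
```

With that column present, downstream callers (hessian_worker, build, score) can batch via ds["length"] on both the chunked and non-chunked paths.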

@luciaquirke
Collaborator

pyproject.toml: declares pyyaml as a direct dep. (It was already required transitively by Serializable.load for the existing single-command YAML flow; this just documents that.)

Documentation is not why we list direct dependencies

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

Shared values across steps (e.g. one run_path injected into every step) are not supported — each step must currently repeat its config. Easy to layer on later via a top-level defaults: block; should we do this, @luciaquirke?

No, sounds too complex

@jammastergirish
Collaborator Author

jammastergirish commented Apr 28, 2026

pyproject.toml: declares pyyaml as a direct dep. (It was already required transitively by Serializable.load for the existing single-command YAML flow; this just documents that.)

Documentation is not why we list direct dependencies

It's not just documentation. What if, for example, the upstream library drops its pyyaml dependency and replaces it with another YAML library? We import it directly, so we should declare it directly.

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

I see that it's not passing typecheck yet, but the PR looks amazing! Let me know when it's ready for review

Comment thread on bergson/__main__.py (outdated)
@jammastergirish
Collaborator Author

I see that it's not passing typecheck yet, but the PR looks amazing! Let me know when it's ready for review

Thanks!

I think it's ready notwithstanding the bug mentioned in Linear: https://linear.app/eleutherai-interpretability/issue/ELE-11/add-yaml-mediated-search-pipeline-to-bergson#comment-c68c7bfc

@luciaquirke
Collaborator

This is looking really fabulous!! I added what amounts to style nits. With a green build and those addressed, LGTM.

Comment thread on examples/pipelines/hessian_then_build.yaml (outdated):

    token_batch_size: 1024
    data:
      chunk_length: 1024
    preprocess_cfg: {}

Noting that this yaml file doesn't do what you would expect because we don't yet support building a Hessian over compressed gradients, so the Hessian and the (compressed by default) index here would be incompatible


@luciaquirke luciaquirke left a comment


Looking great! Ideally we'd hold off on merging the hessian_then_build.yaml until we have EK-FAC support in score, at which point we'd merge it as a hessian_then_score.yaml. Everything else looks ready to merge.

jammastergirish and others added 3 commits April 29, 2026 10:02
KFAC's _init_covariance_dict (bergson/hessians/sharded_computation.py)
assumes nn.Linear weight ordering (out, in). HuggingFace's Conv1D —
used in GPT-2's c_attn / c_proj / c_fc — stores weight as (in, out),
the transpose, so any KFAC/EKFAC run on GPT-2 (or other Conv1D
architectures) currently crashes on the first forward hook with:

    RuntimeError: The size of tensor a (2304) must match the size of
    tensor b (768) at non-singleton dimension 1

Pythia uses nn.Linear throughout, so it sidesteps the bug. This is a
workaround so the shipped example actually runs end-to-end; the
underlying bug is tracked in issue #254 and the example should be
swapped back to gpt2 once that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two steps previously shared run_path: runs/autocorrelation_and_index,
which made the example only runnable from a clean filesystem. After
step 1 (hessian) wrote into that directory, step 2 (build) hit
validate_run_path's existence check and aborted with FileExistsError
on every subsequent invocation.

The shared run_path also implied data flow between the steps that
doesn't actually exist — the build step has no processor_path /
preconditioner_path referencing the hessian's output, so the steps
are adjacent rather than chained.

Switch each step to its own run_path (runs/example_hessian and
runs/example_build) and set overwrite: true so the example reruns
cleanly without manual `rm -r`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Run a Multi-Step Pipeline" section under Score a Dataset
introducing `bergson pipeline <yaml>` and pointing readers to the
hessian_then_build.yaml example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
