
ELE-11: Add YAML-mediated search pipeline to Bergson#246

Open
jammastergirish wants to merge 10 commits into main from add_yaml-mediated_search_pipeline

Conversation

@jammastergirish
Collaborator

@jammastergirish jammastergirish commented Apr 28, 2026

See notes in Linear on the test, which seemed to expose a bug deeper in the library. (I've now dealt with that bug here.)

Adds a generic bergson pipeline <file.yaml> command that runs a sequence of existing bergson subcommands defined in a YAML file. This unblocks pipelines (build_EKFAC.yaml, query_EKFAC.yaml, …) without writing a new Python entry point per pipeline.

  • New module bergson/yaml_pipeline.py: parse_pipeline() reads the YAML once, validates its shape, and hydrates each step into the matching command dataclass via simple_parsing's Serializable.from_dict. Defaults from the existing dataclasses fill in any unspecified fields automatically. All steps are parsed up front, so a config error in step N fails before step 1 burns compute. run_pipeline() parses, then executes in order.
  • bergson/__main__.py: adds a third branch to main() for pipeline mode alongside the existing single-command-from-file and CLI-flag modes. Hoists the {name: class} registry out of the existing branch so both branches share it.
  • pyproject.toml: declares pyyaml as a direct dep. (It was already required transitively by Serializable.load for the existing single-command YAML flow; this just documents that.)
  • Example examples/pipelines/hessian_then_build.yaml: a minimal hessian-then-build pipeline.
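As a rough illustration of the parse-then-run flow described above, here is a minimal sketch. Only parse_pipeline, run_pipeline, and the {name: class} registry idea come from this PR; the stand-in dataclasses and the hydrate() helper are hypothetical flattened substitutes for the real command dataclasses and simple_parsing's Serializable.from_dict.

```python
from dataclasses import dataclass, fields

@dataclass
class HessianConfig:  # hypothetical stand-in for a real command dataclass
    method: str = "kfac"

@dataclass
class BuildConfig:  # hypothetical stand-in; defaults fill unspecified fields
    run_path: str = "runs/example"

REGISTRY = {"hessian": HessianConfig, "build": BuildConfig}

def hydrate(cls, cfg: dict):
    # Naive stand-in for Serializable.from_dict: unknown keys fail fast,
    # missing keys fall back to the dataclass defaults.
    known = {f.name for f in fields(cls)}
    unknown = set(cfg) - known
    if unknown:
        raise ValueError(f"unknown keys for {cls.__name__}: {unknown}")
    return cls(**cfg)

def parse_pipeline(doc) -> list:
    # `doc` is the already-loaded YAML (e.g. the output of yaml.safe_load).
    if not isinstance(doc, list):
        raise TypeError("pipeline must be a YAML list of single-key mappings")
    steps = []
    for i, entry in enumerate(doc):
        if not (isinstance(entry, dict) and len(entry) == 1):
            raise ValueError(f"step {i} must be a single-key mapping")
        (name, cfg), = entry.items()
        cls = REGISTRY.get(name.lower())  # command names are case-insensitive
        if cls is None:
            raise KeyError(f"unknown command {name!r} at step {i}")
        steps.append(hydrate(cls, cfg or {}))
    return steps  # every step validated before any step runs

def run_pipeline(doc):
    for step in parse_pipeline(doc):  # all-or-nothing parse up front
        print(f"running {type(step).__name__}")
```

The point of splitting parse from run is the fail-fast property: a typo in step 3's config raises before step 1 spends any compute.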

YAML format

A pipeline is a list of single-key mappings. Each entry's key is a registered command name (case-insensitive), and its value is the same config tree that command accepts as a single-command YAML, so each step is fully self-contained.

- hessian:
    hessian_cfg:
      method: kfac
    index_cfg:
      run_path: runs/example
      model: gpt2
- build:
    index_cfg:
      run_path: runs/example
      model: gpt2
    preprocess_cfg: {}

A list (not a mapping) is required because YAML mappings can't preserve order or duplicates, and pipelines need both.
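The duplicate-key point is easy to see even without YAML, since loaders such as PyYAML's safe_load hand back plain Python dicts and lists. A dict silently collapses repeated keys (in Python literals the last one wins), so two build steps could never coexist; a list of single-key dicts keeps both, in order:

```python
# A mapping collapses duplicate keys: the first "build" step is silently lost.
as_mapping = {"build": {"run": "a"}, "build": {"run": "b"}}
# A list of single-key mappings preserves order and duplicates.
as_list = [{"build": {"run": "a"}}, {"build": {"run": "b"}}]

print(len(as_mapping))  # 1 -- only one build step survives
print(len(as_list))     # 2 -- both steps survive, in order
```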

What's deferred

  • Shared values across steps (e.g. one run_path injected into every step) are not supported — each step must currently repeat its config. Easy to layer on later via a top-level defaults: block; should we do this, @luciaquirke?
  • Test_Model_Configuration can't be a pipeline step because DiagnoseConfig isn't Serializable. This is a pre-existing limitation — bergson test_model_configuration <yaml> doesn't work today either — and out of scope here.

Follow-up needed for build_EKFAC.yaml and query_EKFAC.yaml

  • build_EKFAC.yaml would work today by wrapping the existing hessian command.
  • query_EKFAC.yaml is partially blocked: step 3 of the EKFAC flow ("apply inverse Hessian to the mean query gradient") is not currently exposed as a top-level CLI command — it lives as apply_worker inside bergson/hessians/pipeline.py. A small follow-up PR will lift it into an apply_hessian command so the full query pipeline can be expressed as YAML.

(@luciaquirke I had a go at this with Claude here but closed that PR so we can focus on this one. Not looked at that yet.)

@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
All committers have signed the CLA.

@luciaquirke
Collaborator

@norabelrose: bug report in the chunk_length logic

When chunk_length > 0, the resulting dataset doesn't have a length column, but hessian_worker and build both rely on
index_cfg.token_batch_size and ds["length"] to allocate batches.

File "/root/bergson/bergson/hessians/hessian_approximations.py", line 159, in hessian_worker
batches = allocate_batches(ds["length"][:], index_cfg.token_batch_size)
ValueError: Column 'length' doesn't exist.

This is a real bug in the chunked path of setup_data_pipeline: tokenize_and_chunk returns columns input_ids and
doc_ids only, but downstream code (hessian_worker, build, score) all batch via ds["length"]. The non-chunked path
gets length for free from tokenizer(..., return_length=True).
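For context, the greedy token-budget packing that a function like allocate_batches performs might look like the sketch below. Only the signature (a list of lengths plus index_cfg.token_batch_size) is taken from the traceback; the actual bergson implementation may differ.

```python
def allocate_batches(lengths, token_batch_size):
    """Greedily group example indices so each batch stays within a token budget.
    Hypothetical sketch; only the signature comes from the traceback above."""
    batches, current, budget = [], [], 0
    for idx, n in enumerate(lengths):
        # Start a new batch once adding this example would exceed the budget.
        if current and budget + n > token_batch_size:
            batches.append(current)
            current, budget = [], 0
        current.append(idx)
        budget += n
    if current:
        batches.append(current)
    return batches
```

This is exactly the call site that crashes when the `length` column is missing: there is nothing to feed the `lengths` argument.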

Cheapest fix: add a constant length column in the chunked branch of setup_data_pipeline (every chunk is chunk_length
tokens by construction).
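A minimal sketch of that fix, shown on a plain dict of columns standing in for the Hugging Face dataset (the real change would live in the chunked branch of setup_data_pipeline, e.g. via Dataset.add_column or inside tokenize_and_chunk):

```python
def add_constant_length(ds: dict, chunk_length: int) -> dict:
    """Synthesize the `length` column for the chunked path.

    Every chunk is exactly chunk_length tokens by construction, so the
    column is a constant. `ds` is a dict of columns standing in for the
    HF dataset; this is a sketch, not the actual bergson code.
    """
    n = len(ds["input_ids"])
    ds["length"] = [chunk_length] * n  # one entry per chunk, all equal
    return ds
```

With that column present, downstream callers (hessian_worker, build, score) can batch via ds["length"] on both the chunked and non-chunked paths.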

@luciaquirke
Collaborator

pyproject.toml: declares pyyaml as a direct dep. (It was already required transitively by Serializable.load for the existing single-command YAML flow; this just documents that.)

Documentation is not why we list direct dependencies

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

Shared values across steps (e.g. one run_path injected into every step) are not supported — each step must currently repeat its config. Easy to layer on later via a top-level defaults: block; should we do this, @luciaquirke?

No, sounds too complex

@jammastergirish
Collaborator Author

jammastergirish commented Apr 28, 2026

pyproject.toml: declares pyyaml as a direct dep. (It was already required transitively by Serializable.load for the existing single-command YAML flow; this just documents that.)

Documentation is not why we list direct dependencies

It's not just documentation. What if, for example, the upstream library drops its pyyaml dependency and replaces it with another YAML library? We import it directly, so we should declare it directly.

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

I see that it's not passing typecheck yet, but the PR looks amazing! Let me know when it's ready for review

Comment thread on bergson/__main__.py (outdated)
@jammastergirish
Collaborator Author

I see that it's not passing typecheck yet, but the PR looks amazing! Let me know when it's ready for review

Thanks!

I think it's ready notwithstanding the bug mentioned in Linear: https://linear.app/eleutherai-interpretability/issue/ELE-11/add-yaml-mediated-search-pipeline-to-bergson#comment-c68c7bfc

@luciaquirke
Collaborator

This is looking really fabulous!! I added what amounts to style nits. With a green build and those addressed, LGTM.

Comment thread on examples/pipelines/hessian_then_build.yaml (outdated):

    token_batch_size: 1024
    data:
      chunk_length: 1024
    preprocess_cfg: {}

Noting that this yaml file doesn't do what you would expect because we don't yet support building a Hessian over compressed gradients, so the Hessian and the (compressed by default) index here would be incompatible


@luciaquirke luciaquirke left a comment


Looking great! Ideally we'd hold off on merging the hessian_then_build.yaml until we have EK-FAC support in score, at which point we'd merge it as a hessian_then_score.yaml. Everything else looks ready to merge.

jammastergirish and others added 3 commits April 29, 2026 10:02
KFAC's _init_covariance_dict (bergson/hessians/sharded_computation.py)
assumes nn.Linear weight ordering (out, in). HuggingFace's Conv1D —
used in GPT-2's c_attn / c_proj / c_fc — stores weight as (in, out),
the transpose, so any KFAC/EKFAC run on GPT-2 (or other Conv1D
architectures) currently crashes on the first forward hook with:

    RuntimeError: The size of tensor a (2304) must match the size of
    tensor b (768) at non-singleton dimension 1

Pythia uses nn.Linear throughout, so it sidesteps the bug. This is a
workaround so the shipped example actually runs end-to-end; the
underlying bug is tracked in issue #254 and the example should be
swapped back to gpt2 once that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two steps previously shared run_path: runs/autocorrelation_and_index,
which made the example only runnable from a clean filesystem. After
step 1 (hessian) wrote into that directory, step 2 (build) hit
validate_run_path's existence check and aborted with FileExistsError
on every subsequent invocation.

The shared run_path also implied data flow between the steps that
doesn't actually exist — the build step has no processor_path /
preconditioner_path referencing the hessian's output, so the steps
are adjacent rather than chained.

Switch each step to its own run_path (runs/example_hessian and
runs/example_build) and set overwrite: true so the example reruns
cleanly without manual `rm -r`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Run a Multi-Step Pipeline" section under Score a Dataset
introducing `bergson pipeline <yaml>` and pointing readers to the
hessian_then_build.yaml example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
