ELE-11: Add YAML-mediated search pipeline to Bergson #246
jammastergirish wants to merge 10 commits into main from
Conversation
@norabelrose but report in the chunk_length logic
Documentation is not why we list direct dependencies
No, sounds too complex
It's not just documentation. What if, for example, the first library drops the one that's not explicitly named and replaces it with another?
I can't see that it's not passing typecheck yet but the PR looks amazing! Let me know when it's ready for review |
Thanks! I think it's ready notwithstanding the bug mentioned in Linear: https://linear.app/eleutherai-interpretability/issue/ELE-11/add-yaml-mediated-search-pipeline-to-bergson#comment-c68c7bfc |
This is looking really fabulous!! I added what amounts to style nits. With a green build and those addressed, LGTM. |
```yaml
token_batch_size: 1024
data:
  chunk_length: 1024
  preprocess_cfg: {}
```
Noting that this yaml file doesn't do what you would expect because we don't yet support building a Hessian over compressed gradients, so the Hessian and the (compressed by default) index here would be incompatible
luciaquirke left a comment
Looking great! Ideally we'd hold off on merging the hessian_then_build.yaml until we have EK-FAC support in score, at which point we'd merge it as a hessian_then_score.yaml. Everything else looks ready to merge.
KFAC's `_init_covariance_dict` (`bergson/hessians/sharded_computation.py`)
assumes `nn.Linear` weight ordering (out, in). HuggingFace's `Conv1D`
(used in GPT-2's `c_attn` / `c_proj` / `c_fc`) stores weight as (in, out),
the transpose, so any KFAC/EKFAC run on GPT-2 (or other Conv1D
architectures) currently crashes on the first forward hook with:

```
RuntimeError: The size of tensor a (2304) must match the size of
tensor b (768) at non-singleton dimension 1
```
Pythia uses nn.Linear throughout, so it sidesteps the bug. This is a
workaround so the shipped example actually runs end-to-end; the
underlying bug is tracked in issue #254 and the example should be
swapped back to gpt2 once that lands.
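The shape mismatch above comes down to the two storage conventions. A minimal, torch-free sketch of how a KFAC hook could normalize the two orderings (the helper name `kfac_factor_dims` is illustrative, not bergson's actual fix, which is tracked in issue #254):

```python
# nn.Linear stores weight as (out_features, in_features); HuggingFace's
# Conv1D stores the transpose, (in_features, out_features). A hook that
# assumes Linear ordering reads the wrong axis sizes for Conv1D layers.

def kfac_factor_dims(weight_shape, layer_kind):
    """Return (out_dim, in_dim) for covariance factor shapes,
    normalizing Conv1D's transposed storage.

    `layer_kind` is "linear" or "conv1d" (hypothetical tag standing in
    for an isinstance check against transformers.pytorch_utils.Conv1D).
    """
    if layer_kind == "conv1d":
        in_dim, out_dim = weight_shape   # Conv1D: (in, out)
    else:
        out_dim, in_dim = weight_shape   # nn.Linear: (out, in)
    return out_dim, in_dim

# GPT-2's c_attn projects 768 -> 2304, stored as (768, 2304) in Conv1D;
# the same projection as nn.Linear would be stored as (2304, 768).
assert kfac_factor_dims((768, 2304), "conv1d") == (2304, 768)
assert kfac_factor_dims((2304, 768), "linear") == (2304, 768)
```

These are exactly the 2304 and 768 that collide in the `RuntimeError` above when the transpose is not accounted for.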
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two steps previously shared `run_path: runs/autocorrelation_and_index`, which made the example runnable only from a clean filesystem. After step 1 (hessian) wrote into that directory, step 2 (build) hit `validate_run_path`'s existence check and aborted with `FileExistsError` on every subsequent invocation.

The shared `run_path` also implied data flow between the steps that doesn't actually exist: the build step has no `processor_path` / `preconditioner_path` referencing the hessian's output, so the steps are adjacent rather than chained.

Switch each step to its own `run_path` (`runs/example_hessian` and `runs/example_build`) and set `overwrite: true` so the example reruns cleanly without a manual `rm -r`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
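After that change, the example pipeline might look like the following sketch. The `run_path` and `overwrite` field names are taken from the commit message; the step bodies are otherwise illustrative, not the exact contents of `examples/pipelines/hessian_then_build.yaml`:

```yaml
# Sketch: each step gets its own run_path, and overwrite: true lets the
# example rerun without a manual rm -r between invocations.
- hessian:
    run_path: runs/example_hessian
    overwrite: true
- build:
    run_path: runs/example_build
    overwrite: true
```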
Adds a "Run a Multi-Step Pipeline" section under Score a Dataset introducing `bergson pipeline <yaml>` and pointing readers to the hessian_then_build.yaml example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
See notes in Linear on the test, which seemed to expose a bug deeper in the library. (I've now dealt with this bug here.)
Adds a generic `bergson pipeline <file.yaml>` command that runs a sequence of existing bergson subcommands defined in a YAML file. This unblocks pipelines (`build_EKFAC.yaml`, `query_EKFAC.yaml`, …) without writing a new Python entry point per pipeline.

- `bergson/yaml_pipeline.py`: `parse_pipeline()` reads the YAML once, validates its shape, and hydrates each step into the matching command dataclass via `simple_parsing`'s `Serializable.from_dict`. Defaults from the existing dataclasses fill in any unspecified fields automatically. All steps are parsed up-front so a config error in step N fails before step 1 burns compute. `run_pipeline()` parses then executes in order.
- `bergson/__main__.py`: adds a third branch to `main()` for pipeline mode alongside the existing single-command-from-file and CLI-flag modes. Hoists the `{name: class}` registry out of the existing branch so both branches share it.
- `pyproject.toml`: declares `pyyaml` as a direct dep. (It was already required transitively by `Serializable.load` for the existing single-command YAML flow; this just documents that.)
- `examples/pipelines/hessian_then_build.yaml`: minimal hessian-then-build pipeline.

YAML format
A pipeline is a list of single-key mappings. Each entry's key is a registered command name (case-insensitive) and its value is the same config tree that command accepts as a single-command YAML, so each step is fully self-contained.
A list (not a mapping) is required because YAML mappings can't preserve order or duplicates, and pipelines need both.
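The shape validation and up-front hydration described above can be sketched in a few lines. This is a simplified stand-in, not bergson's actual code: the config classes are illustrative, and plain `dataclass(**cfg)` replaces `simple_parsing`'s `Serializable.from_dict`:

```python
from dataclasses import dataclass

# Illustrative stand-ins for bergson's command config dataclasses.
@dataclass
class HessianConfig:
    run_path: str

@dataclass
class BuildConfig:
    run_path: str
    chunk_length: int = 1024  # dataclass defaults fill unspecified fields

# Hypothetical {name: class} registry, shared with single-command mode.
REGISTRY = {"hessian": HessianConfig, "build": BuildConfig}

def parse_pipeline(steps):
    """Hydrate every step before running any, so a config error in
    step N fails before step 1 burns compute."""
    parsed = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or len(step) != 1:
            raise ValueError(f"step {i}: expected a single-key mapping")
        (name, cfg), = step.items()
        cls = REGISTRY.get(name.lower())  # command names are case-insensitive
        if cls is None:
            raise ValueError(f"step {i}: unknown command {name!r}")
        parsed.append(cls(**(cfg or {})))
    return parsed

steps = [
    {"hessian": {"run_path": "runs/example_hessian"}},
    {"build": {"run_path": "runs/example_build", "chunk_length": 512}},
]
configs = parse_pipeline(steps)
assert isinstance(configs[0], HessianConfig)
assert configs[1].chunk_length == 512
```

A mapping such as `{"hessian": …, "build": …}` could not express two `build` steps or guarantee execution order, which is why the top level must be a list.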
What's deferred

- Shared defaults (e.g. a common `run_path` injected into every step) are not supported; each step must currently repeat its config. Easy to layer on later via a top-level `defaults:` block; should we do this, @luciaquirke?
- `Test_Model_Configuration` can't be a pipeline step because `DiagnoseConfig` isn't `Serializable`. This is a pre-existing limitation (`bergson test_model_configuration <yaml>` doesn't work today either) and out of scope here.

Follow-up needed for `build_EKFAC.yaml` and `query_EKFAC.yaml`

- `build_EKFAC.yaml` would work today by wrapping the existing `hessian` command.
- `query_EKFAC.yaml` is partially blocked: step 3 of the EKFAC flow ("apply inverse Hessian to the mean query gradient") is not currently exposed as a top-level CLI command; it lives as `apply_worker` inside `bergson/hessians/pipeline.py`. A small follow-up PR will lift it into an `apply_hessian` command so the full query pipeline can be expressed as YAML.

(@luciaquirke I had a go at this with Claude here but closed that PR so we can focus on this one. Not looked at that yet.)