
Unifies Hessian computation; renames preconditioner → hessian #245

Open
jammastergirish wants to merge 6 commits into EleutherAI:main from jammastergirish:feature/unified-hessian

Conversation

@jammastergirish
Collaborator

Unifies all Hessian estimation methods under one umbrella. The old preconditioners CLI subcommand is now just one method choice on the existing hessian subcommand, and naming across the codebase follows suit: "preconditioner" → "hessian".

What changed:

  1. bergson preconditioners subcommand removed; folded into bergson hessian (1588e35).

     • HessianConfig.method gains "autocorrelation", which becomes the new default. bergson hessian --method autocorrelation does what bergson preconditioners used to do; --method kfac|tkfac|shampoo are unchanged.
     • Breaking: scripts/configs invoking bergson preconditioners ... must move to bergson hessian --method autocorrelation ....
  2. Codebase-wide rename: preconditioner* → hessian* (and the precond shorthand → hess) (546019d). Verb forms (preconditioning, preconditioned) are preserved as descriptive math language. Public API renames worth flagging (see the before/after sketch following this list):

     • GradientProcessor(preconditioners=…, preconditioners_eigen=…) → GradientProcessor(hessians=…, hessians_eigen=…)
     • GradientProcessor.load(skip_preconditioners=…) → GradientProcessor.load(skip_hessians=…)
     • bergson.mix_preconditioners → bergson.mix_hessians
     • Module renames: bergson.process_preconditioners → bergson.process_hessians; bergson.collector.dist_preconditioners_gradient_collector → bergson.collector.dist_hessians_gradient_collector; examples.semantic.preconditioners → examples.semantic.hessians.
  3. GradientProcessor.load() is backward compatible: if hessians.pth / hessians_eigen.pth are absent, it falls back to preconditioners.pth / preconditioners_eigen.pth. It also defensively migrates any legacy preconditioner* keys appearing in processor_config.yaml.
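For concreteness, a before/after sketch of the renamed keyword arguments from item 2 (run_dir is a placeholder here, and other arguments are elided):

# Before this PR (run_dir is a hypothetical stand-in for the real path argument):
processor = GradientProcessor(preconditioners=..., preconditioners_eigen=...)
processor = GradientProcessor.load(run_dir, skip_preconditioners=True)

# After this PR:
processor = GradientProcessor(hessians=..., hessians_eigen=...)
processor = GradientProcessor.load(run_dir, skip_hessians=True)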

This is effectively a find/replace. One wrinkle: some on-disk artifacts were named precondition*.pth. Those files have been renamed too, and if the new names can't be found, gradients.py falls back to looking for the old ones (see the sketch below).
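A minimal sketch of that file-name fallback, with illustrative names (the actual helper in gradients.py may differ):

from pathlib import Path

def resolve_hessian_file(root: Path) -> Path:
    # Prefer the renamed artifact, but fall back to the legacy name so
    # gradients written by older runs still load.
    new_path = root / "hessians.pth"
    return new_path if new_path.exists() else root / "preconditioners.pth"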
@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
All committers have signed the CLA.

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

Looking awesome. Two things:

First, where you insert backwards compatibility, could you please add a

# TODO Lucia Quirke: remove on 28 October 2026

comment so I can sweep through and remove it once people have safely migrated.

Second, there's a user-facing flag in config.py formerly called preconditioner_path that's used here

# Load preconditioners on device one-by-one for memory efficiency
preconditioners = GradientProcessor.load(
    Path(preconditioner_path),
    map_location="cpu",
).preconditioners

I think it should be called "processor_path", because it's actually a path to a GradientProcessor. Its docstring was """Path to a precomputed preconditioner."""; I think it should be """Path to a precomputed gradient processor. Set to apply a Hessian approximation."""
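A sketch of what the suggested rename might look like in config.py (the enclosing class name here is hypothetical):

from dataclasses import dataclass

@dataclass
class AttributionConfig:  # hypothetical name; the real class lives in config.py
    processor_path: str | None = None
    """Path to a precomputed gradient processor. Set to apply a Hessian approximation."""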

I think this is 95% ready

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

@LouisYRYJ @SrGonao @norabelrose we're interested in landing this refactor: a small change that nonetheless touches many files, moving us towards unified handling of our various Hessian approximations. But if this change seems too large or risky, the first commit (unifying the command line tools and nothing else) could be merged now along with the preconditioner_path -> processor_path rename, and we could leave the larger naming refactor for later (post paper submissions). Please shout if you want this! Otherwise we'll make the call in around 24 hours.

Comment thread benchmarks/benchmark_factors.py Outdated


BERGSON_FACTOR_TYPES = {"normalizer", "preconditioner", "kfac", "ekfac"}
BERGSON_FACTOR_TYPES = {"normalizer", "hessian", "kfac", "ekfac"}
Contributor

Here I think we should say something like "auto_correlation_hessian" or maybe "ac_hessian"; probably there are better names out there.

Collaborator

autocorrelation makes sense to me since the others don't have the _hessian suffix

@LouisYRYJ
Copy link
Copy Markdown
Contributor

Thanks a lot Girish, this has been lying around for quite some time and I am happy to see it being picked up.
The changes look good to me! I have a few comments, none very load-bearing, and from my side we are good to merge.

  • In general, I would like to avoid calling the autocorrelation projected Hessian the "hessian"; using it as a variable name is fine, but I left an example where I think it may be a bit misleading.

(Importantly, we are at no point computing the actual true Hessian in this library, but always approximations thereof. This makes it generally difficult to be precise when naming things and talking about them, but I have found no good solution for this and also defaulted to calling some functions/files "hessian_sth_sth" for the sake of brevity even though that is technically not correct.)

  • Out of scope for this PR, but one place where we have some redundancy: GradientCollectorWithDistributedHessians uses different distributed logic than the Hessian approximations I have been using. The former shards tensors according to modules, while the latter shards the tensors themselves (splitting along the 0th dim across ranks). The latter currently doesn't handle the case where tensor.shape[0] is not divisible by world_size gracefully, but in general it is fairer and ensures memory is distributed as evenly as possible (which in this case is pretty important). I would definitely like to handle the "tensor.shape[0] is not divisible by world_size" case in future (presumably here https://github.com/jammastergirish/bergson/blob/c267fcb394ca40814c5d87714c9d0ba6bc10fbd8/bergson/hessians/sharded_computation.py); see the sketch after this list.

  • Small nit: "Hessian" is a noun, so in docs we should capitalize it
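Picking up the sharding point above: a minimal sketch of one way to handle the non-divisible case, assuming a plain dim-0 sharding helper (names are illustrative, not the actual sharded_computation.py API):

import torch

def shard_dim0(tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # torch.tensor_split tolerates sizes not divisible by world_size: the
    # first (size % world_size) ranks get one extra row each, keeping memory
    # as evenly distributed across ranks as possible.
    return torch.tensor_split(tensor, world_size, dim=0)[rank]

For example, shard_dim0(torch.randn(10, 4), rank=0, world_size=3) yields a (4, 4) shard while ranks 1 and 2 get (3, 4) shards.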

Looking forward to your future contributions :)

@jammastergirish
Collaborator Author

Thank you so much for this @LouisYRYJ.

I'll discuss with @luciaquirke the language to use. Agree that hessian may be too broad.

Let's deal with that redundancy in another PR.

And thanks for the note on proper-noun capitalization!

@luciaquirke
Collaborator

Yeah this is somewhat gnarly but it would be really good to find anywhere where we did a "preconditioner" -> "hessian" rename where it would be more appropriate to do "preconditioner" -> "autocorrelation" (like if we're specifically referring to the autocorrelation approximation). I will sweep through and try to do this now.

Re: calling our Hessian approximations "hessian" rather than "hessian_approx", it's definitely not ideal but probably worth it for the conciseness gains.

Comment thread bergson/collector/dist_hessians_gradient_collector.py Outdated
@HookCollectorBase.split_attention_heads
def backward_hook(self, module: nn.Module, g: Float[Tensor, "N S O"]):
"""Compute per-sample gradient, accumulate preconditioner, and store."""
"""Compute per-sample gradient, accumulate hessian, and store."""
Collaborator

@luciaquirke commented Apr 28, 2026

"autocorrelation matrix" in this file

Collaborator Author

"accumulate autocorrelation matrix" or "autocorrelation matrix"?

Collaborator

@luciaquirke commented Apr 29, 2026

Accumulate, for sure; the accumulation is the part where we sum all the outer-product matrices to compute an autocorrelation matrix.
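As a concrete picture of that accumulation (the flattened (N, D) gradient shape here is a simplifying assumption, not the collector's actual "N S O" layout):

import torch

def accumulate_autocorrelation(acc: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # g holds N per-sample gradients of dimension D, so g.T @ g equals the
    # sum of the N outer products g_i g_i^T, added into the accumulator.
    return acc + g.T @ g

Starting from acc = torch.zeros(D, D), summing per batch and dividing by the total sample count at the end gives the autocorrelation matrix.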

@jammastergirish jammastergirish marked this pull request as draft April 29, 2026 04:09
@jammastergirish jammastergirish marked this pull request as ready for review April 29, 2026 15:52
