
Unifies Hessian computation; renames preconditioner → hessian #245

Open
jammastergirish wants to merge 6 commits into EleutherAI:main from jammastergirish:feature/unified-hessian

Conversation

@jammastergirish
Collaborator

Unifies all Hessian estimation methods under one umbrella. The old preconditioners CLI subcommand is now just one method choice on the existing hessian subcommand, and naming across the codebase follows suit: "preconditioner" → "hessian".

What changed:

  1. bergson preconditioners subcommand removed; folded into bergson hessian (1588e35).

     • HessianConfig.method gains "autocorrelation", which becomes the new default. bergson hessian --method autocorrelation does what bergson preconditioners used to do; --method kfac|tkfac|shampoo are unchanged.
     • Breaking: scripts/configs invoking bergson preconditioners ... must move to bergson hessian --method autocorrelation ....
  2. Codebase-wide rename: preconditioner* → hessian* (and the precond shorthand → hess) (546019d). Verb forms (preconditioning, preconditioned) are preserved as descriptive math language. Public API renames worth flagging (see the before/after sketch following this list):

     • GradientProcessor(preconditioners=…, preconditioners_eigen=…) → GradientProcessor(hessians=…, hessians_eigen=…)
     • GradientProcessor.load(skip_preconditioners=…) → GradientProcessor.load(skip_hessians=…)
     • bergson.mix_preconditioners → bergson.mix_hessians
     • Module renames: bergson.process_preconditioners → bergson.process_hessians; bergson.collector.dist_preconditioners_gradient_collector → bergson.collector.dist_hessians_gradient_collector; examples.semantic.preconditioners → examples.semantic.hessians.
  3. GradientProcessor.load() is backward compatible: if hessians.pth / hessians_eigen.pth are absent, it falls back to preconditioners.pth / preconditioners_eigen.pth. It also defensively migrates any legacy preconditioner* keys appearing in processor_config.yaml.
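For concreteness, a before/after sketch of the renamed keyword arguments from item 2 (run_dir is a placeholder here, and other arguments are elided):

# Before this PR (run_dir is a hypothetical stand-in for the real path argument):
processor = GradientProcessor(preconditioners=..., preconditioners_eigen=...)
processor = GradientProcessor.load(run_dir, skip_preconditioners=True)

# After this PR:
processor = GradientProcessor(hessians=..., hessians_eigen=...)
processor = GradientProcessor.load(run_dir, skip_hessians=True)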

This is effectively a find/replace. One wrinkle: some on-disk artifacts were named precondition*.pth. Those files have been renamed too, and if the new names can't be found, gradients.py falls back to looking for the old ones (see the sketch below).
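A minimal sketch of that file-name fallback, with illustrative names (the actual helper in gradients.py may differ):

from pathlib import Path

def resolve_hessian_file(root: Path) -> Path:
    # Prefer the renamed artifact, but fall back to the legacy name so
    # gradients written by older runs still load.
    new_path = root / "hessians.pth"
    return new_path if new_path.exists() else root / "preconditioners.pth"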
@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
All committers have signed the CLA.

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

Looking awesome. Two things:

First, where you insert backwards compatibility, could you please add a

# TODO Lucia Quirke: remove on 28 October 2026

comment so I can sweep through and remove it once people have safely migrated.

Second, there's a user-facing flag in config.py formerly called preconditioner_path that's used here

# Load preconditioners on device one-by-one for memory efficiency
preconditioners = GradientProcessor.load(
    Path(preconditioner_path),
    map_location="cpu",
).preconditioners

I think it should be called "processor_path", because it's actually a path to a GradientProcessor. Its docstring was """Path to a precomputed preconditioner."""; I think it should be """Path to a precomputed gradient processor. Set to apply a Hessian approximation."""
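A sketch of what the suggested rename might look like in config.py (the enclosing class name here is hypothetical):

from dataclasses import dataclass

@dataclass
class AttributionConfig:  # hypothetical name; the real class lives in config.py
    processor_path: str | None = None
    """Path to a precomputed gradient processor. Set to apply a Hessian approximation."""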

I think this is 95% ready

@luciaquirke
Collaborator

luciaquirke commented Apr 28, 2026

@LouisYRYJ @SrGonao @norabelrose we're interested in landing this refactor: a small change that nonetheless touches many files, moving us towards unified handling of our various Hessian approximations. But if this change seems too large or risky, the first commit (unifying the command line tools and nothing else) could be merged now along with the preconditioner_path -> processor_path rename, and we could leave the larger naming refactor for later (post paper submissions). Please shout if you want this! Otherwise we'll make the call in around 24 hours.

Comment thread benchmarks/benchmark_factors.py Outdated


BERGSON_FACTOR_TYPES = {"normalizer", "preconditioner", "kfac", "ekfac"}
BERGSON_FACTOR_TYPES = {"normalizer", "hessian", "kfac", "ekfac"}
Contributor

Here I think we should say something like "auto_correlation_hessian" or maybe "ac_hessian"; probably there are better names out there.

Collaborator

autocorrelation makes sense to me since the others don't have the _hessian suffix

@LouisYRYJ
Copy link
Copy Markdown
Contributor

Thanks a lot Girish, this has been lying around for quite some time and I am happy to see it being picked up.
The changes look good to me! I have a few comments, none very load-bearing, and from my side we are good to merge.

  • In general, I would like to avoid calling the autocorrelation projected Hessian the "hessian"; using it as a variable name is fine, but I left an example where I think it may be a bit misleading.

(Importantly, we are at no point computing the actual true Hessian in this library, but always approximations thereof. This makes it generally difficult to be precise when naming things and talking about them, but I have found no good solution for this and also defaulted to calling some functions/files "hessian_sth_sth" for the sake of brevity even though that is technically not correct.)

  • Out of scope for this PR, but one place where we have some redundancy: GradientCollectorWithDistributedHessians uses different distributed logic than the Hessian approximations I have been using. The former shards tensors according to modules, while the latter shards the tensors themselves (splitting along the 0th dim across ranks). The latter currently doesn't handle the case where tensor.shape[0] is not divisible by world_size gracefully, but in general it is fairer and ensures memory is distributed as evenly as possible (which in this case is pretty important). I would definitely like to handle the "tensor.shape[0] is not divisible by world_size" case in future (presumably here https://github.com/jammastergirish/bergson/blob/c267fcb394ca40814c5d87714c9d0ba6bc10fbd8/bergson/hessians/sharded_computation.py); see the sketch after this list.

  • Small nit: "Hessian" is a noun, so in docs we should capitalize it
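Picking up the sharding point above: a minimal sketch of one way to handle the non-divisible case, assuming a plain dim-0 sharding helper (names are illustrative, not the actual sharded_computation.py API):

import torch

def shard_dim0(tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # torch.tensor_split tolerates sizes not divisible by world_size: the
    # first (size % world_size) ranks get one extra row each, keeping memory
    # as evenly distributed across ranks as possible.
    return torch.tensor_split(tensor, world_size, dim=0)[rank]

For example, shard_dim0(torch.randn(10, 4), rank=0, world_size=3) yields a (4, 4) shard while ranks 1 and 2 get (3, 4) shards.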

Looking forward to your future contributions :)

@jammastergirish
Collaborator Author

Thank you so much for this @LouisYRYJ.

I'll discuss with @luciaquirke the language to use. Agree that hessian may be too broad.

Let's deal with that redundancy in another PR.

And thanks for the note on proper-noun capitalization!

@luciaquirke
Collaborator

Yeah this is somewhat gnarly but it would be really good to find anywhere where we did a "preconditioner" -> "hessian" rename where it would be more appropriate to do "preconditioner" -> "autocorrelation" (like if we're specifically referring to the autocorrelation approximation). I will sweep through and try to do this now.

Re: calling our Hessian approximations "hessian" rather than "hessian_approx", it's definitely not ideal but probably worth it for the conciseness gains.

Comment thread bergson/collector/dist_hessians_gradient_collector.py Outdated
@HookCollectorBase.split_attention_heads
def backward_hook(self, module: nn.Module, g: Float[Tensor, "N S O"]):
"""Compute per-sample gradient, accumulate preconditioner, and store."""
"""Compute per-sample gradient, accumulate hessian, and store."""
Collaborator

@luciaquirke commented Apr 28, 2026

"autocorrelation matrix" in this file

Collaborator Author

"accumulate autocorrelation matrix" or "autocorrelation matrix"?

Collaborator

@luciaquirke commented Apr 29, 2026

Accumulate, for sure; the accumulation is the part where we sum all the outer-product matrices to compute an autocorrelation matrix.
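As a concrete picture of that accumulation (the flattened (N, D) gradient shape here is a simplifying assumption, not the collector's actual "N S O" layout):

import torch

def accumulate_autocorrelation(acc: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # g holds N per-sample gradients of dimension D, so g.T @ g equals the
    # sum of the N outer products g_i g_i^T, added into the accumulator.
    return acc + g.T @ g

Starting from acc = torch.zeros(D, D), summing per batch and dividing by the total sample count at the end gives the autocorrelation matrix.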

@jammastergirish jammastergirish marked this pull request as draft April 29, 2026 04:09
@jammastergirish jammastergirish marked this pull request as ready for review April 29, 2026 15:52
