
bssh-server v2.1.1: SFTP throughput ~50% of OpenSSH despite aws-lc-rs; framework/protocol overhead suspected #187


Summary

Under identical hardware and workload, bssh-server sustains roughly half the SFTP throughput of OpenSSH's sftp-server. The gap does not appear to come from the cryptographic primitives themselves (bssh-russh uses aws-lc-rs for AEAD ciphers by default), but from per-SSH-packet framework and protocol overhead around the crypto path.

This matters because a realistic goal for bssh-server in Backend.AI is to replace dedicated SFTP agents (which currently rely on OpenSSH). On lower-clocked server CPUs such as the Xeon Silver below, the current overhead effectively halves single-connection SFTP throughput.

Environment

  • bssh / bssh-server: v2.1.1 (linux-x86_64-musl build)
  • Client: OpenSSH sftp (Ubuntu 22.04)
  • Transfer: 1 GiB random file via sftp put (SFTP subsystem)
  • Agent hosts tested:
    • "slower" host — Intel Xeon Silver 4214 @ 2.20 GHz, AES-NI, 1 Gbps internal
    • "faster" host — AMD EPYC 7742 @ 2.25 GHz (boost 3.4 GHz), AES-NI + VAES, 1 Gbps internal
  • Client and servers are in the same datacenter; no network bottleneck observed for OpenSSH.

Measurements (1 GiB upload, sftp put, average of 3 runs)

| Server | CPU | Container cores | Cipher | Throughput |
| --- | --- | --- | --- | --- |
| OpenSSH (sftp-server) | Xeon Silver 4214 | 1 | chacha20-poly1305 | 101 MiB/s (NIC-limited) |
| bssh-server | Xeon Silver 4214 | 1 | chacha20-poly1305 | 57 MiB/s |
| bssh-server | Xeon Silver 4214 | 1 | aes256-gcm@openssh.com | 57 MiB/s |
| bssh-server | Xeon Silver 4214 | 1 | aes256-ctr | 41 MiB/s |
| bssh-server | Xeon Silver 4214 | 1 | aes128-ctr | 48 MiB/s |
| bssh-server | Xeon Silver 4214 | 2 | chacha20-poly1305 | 60 MiB/s (≈ +5%) |
| bssh-server | EPYC 7742 | 1 | chacha20-poly1305 | 94 MiB/s |
| bssh-server | EPYC 7742 | 2 | chacha20-poly1305 | 100 MiB/s |

Observations:

  • AEAD ciphers that route through aws-lc-rs (AES-GCM, ChaCha20-Poly1305) cap at the same ~57 MiB/s on Xeon Silver, even though the underlying primitives can do several GB/s. This strongly suggests the bottleneck is outside the AEAD primitive.
  • AES-CTR modes are additionally slowed by going through pure-Rust `aes`/`ctr` (see `crates/bssh-russh/src/cipher/block.rs`) plus a separate HMAC pass.
  • Adding a second core yields only ~5%; a single SSH connection is inherently mostly sequential, so the gain is expected to be small. The CPU architecture (Zen 2 + VAES vs. Cascade Lake) explains most of the host-to-host spread.
  • CPU profile during transfer: one tokio worker thread is pegged at ~90% CPU while the other tokio workers sit near 0%; the process is CPU-bound on this single hot thread.

Why the gap is interesting

`bssh-russh` already uses `aws-lc-rs` for AEAD ciphers (`crates/bssh-russh/src/cipher/gcm.rs`, `crates/bssh-russh/src/cipher/chacha20poly1305.rs`), which is the same family of assembly-optimised code that OpenSSL/OpenSSH use. The crypto primitive therefore cannot be twice as slow as OpenSSH's, yet the end-to-end SFTP throughput is, so the extra cycles are almost certainly being spent in the code around the primitive.
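A back-of-envelope check makes this concrete (assuming ~32 KiB packets and, conservatively, ~3 GB/s single-core AES-GCM on AES-NI hardware): 57 MiB/s is roughly 1,800 packets/s, i.e. a budget of ~550 µs per packet on the hot core, while the AEAD work for one 32 KiB packet at 3 GB/s takes only ~11 µs. Under those assumptions the primitive consumes about 2% of each packet's budget; the other ~98% goes to framing, copies, and scheduling.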

Hypotheses for the per-packet overhead

Unverified; posted for discussion / profiling.

  1. Per-packet AEAD invocation cost (see the sketch after this list). SSH packets default to ~32 KiB, so a 1 GiB transfer issues ~32k encrypt/decrypt calls. Each call goes through `BoundKey` + `NonceSequence` construction and crosses the aws-lc-rs FFI boundary; OpenSSH amortises this with long-lived cipher contexts and direct function calls.
  2. Extra buffer copies. Network buffer → decrypt buffer → russh channel → russh-sftp layer → file write: each stage potentially memcpys the full 32 KiB. OpenSSH's `sftp-server` uses `sendfile`/`splice` where possible and keeps data in fewer buffers.
  3. Tokio channel/task overhead per SSH packet. Small-packet async pipelines are known to have non-trivial cost per message (channel send + await + wake), which becomes dominant when the crypto itself is fast.
  4. russh-sftp protocol layer. Worth profiling to see whether SFTP request/response handling is request-blocked rather than pipelined.
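
To gauge hypothesis 1 in isolation, here is a minimal micro-benchmark sketch (assuming aws-lc-rs 1.x; it uses `LessSafeKey` for brevity where russh's real path goes through `BoundKey`/`NonceSequence`, but the comparison still isolates per-packet key reconstruction):

```rust
// Hypothetical micro-benchmark (Cargo dep assumed: aws-lc-rs = "1").
// Compares rebuilding the AEAD key object for every 32 KiB SSH packet
// against reusing one long-lived key, to estimate per-call setup cost.
use std::time::Instant;

use aws_lc_rs::aead::{Aad, LessSafeKey, Nonce, UnboundKey, AES_256_GCM};

// SSH-GCM style 12-byte nonce: 4-byte fixed part + 8-byte packet counter.
fn nonce_for(counter: u64) -> [u8; 12] {
    let mut n = [0u8; 12];
    n[4..].copy_from_slice(&counter.to_be_bytes());
    n
}

fn main() {
    let key_bytes = [0u8; 32];
    let chunk = vec![0u8; 32 * 1024]; // one SSH packet worth of payload
    let iters: u64 = 32 * 1024; // ~1 GiB total, matching the issue workload

    // Variant A: rebuild the key object per packet (worst-case setup cost).
    let start = Instant::now();
    for i in 0..iters {
        let key = LessSafeKey::new(UnboundKey::new(&AES_256_GCM, &key_bytes).unwrap());
        let mut buf = chunk.clone(); // the clone is common to both variants
        let nonce = Nonce::assume_unique_for_key(nonce_for(i));
        key.seal_in_place_append_tag(nonce, Aad::empty(), &mut buf).unwrap();
    }
    println!("rebuild key per packet: {:?}", start.elapsed());

    // Variant B: one long-lived key, as OpenSSH effectively does.
    let key = LessSafeKey::new(UnboundKey::new(&AES_256_GCM, &key_bytes).unwrap());
    let start = Instant::now();
    for i in 0..iters {
        let mut buf = chunk.clone();
        let nonce = Nonce::assume_unique_for_key(nonce_for(i));
        key.seal_in_place_append_tag(nonce, Aad::empty(), &mut buf).unwrap();
    }
    println!("long-lived key:         {:?}", start.elapsed());
}
```

If variant A is materially slower than variant B at these sizes, caching the bound key across packets is worth pursuing; if the two are within noise, the overhead is elsewhere (copies, tokio, the SFTP layer).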

Suggested investigations / directions

  1. Profile with `cargo flamegraph` on a 1 GiB SFTP upload. This should immediately show whether time is spent in aws-lc-rs, in russh packet framing, in russh-sftp, or in tokio.
  2. Increase SSH/SFTP packet size. Larger packets amortise per-call overhead; a quick experiment with the server's `max_packet_size` / SFTP buffer (and the OpenSSH client's `sftp -B <bytes> -R <requests>` knobs) would confirm whether per-call overhead is the dominant factor.
  3. Zero-copy file I/O on the server side. Teach russh-sftp to use `sendfile`/`splice` when the destination is a regular file. (This mostly helps downloads; upload data still has to pass through the cipher in userspace before it can be written.)
  4. Route AES-CTR through aws-lc-rs (or drop CTR ciphers from the default list). Currently `crates/bssh-russh/src/cipher/block.rs` uses pure-Rust `aes`/`ctr`, which is a secondary but real contributor for users who negotiate CTR.
  5. Build-time optimisations. Confirm LTO / `codegen-units = 1` / `target-cpu` settings used for release binaries; PGO can be meaningful for crypto+I/O hot paths.
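
As a quick complement to flamegraph profiling (item 1), the per-message cost of the async pipeline can be bounded in isolation. A minimal sketch, assuming tokio 1.x; the channel and message size here are illustrative stand-ins for russh's internal plumbing, not its actual types:

```rust
// Hypothetical micro-benchmark (Cargo dep assumed:
// tokio = { version = "1", features = ["rt-multi-thread", "macros", "sync"] }).
// Pushes ~1 GiB of 32 KiB messages through an mpsc channel to bound the
// per-message cost of the async pipeline (hypothesis 3) in isolation.
use std::time::Instant;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let (tx, mut rx) = tokio::sync::mpsc::channel::<Vec<u8>>(64);
    let iters = 32 * 1024usize; // 32k packets of 32 KiB each ≈ 1 GiB

    let start = Instant::now();
    let producer = tokio::spawn(async move {
        for _ in 0..iters {
            // A fresh allocation per message mimics a per-packet buffer.
            tx.send(vec![0u8; 32 * 1024]).await.unwrap();
        }
        // tx drops here, closing the channel and ending the recv loop.
    });

    let mut received = 0usize;
    while let Some(buf) = rx.recv().await {
        received += buf.len();
    }
    producer.await.unwrap();

    let secs = start.elapsed().as_secs_f64();
    let mib = received as f64 / (1024.0 * 1024.0);
    println!("moved {mib:.0} MiB in {secs:.2}s ({:.0} MiB/s)", mib / secs);
}
```

If this loop alone moves multiple GiB/s, tokio channel overhead by itself cannot explain the 57 MiB/s ceiling and the copy/AEAD path becomes the prime suspect; real pipelines add wakeups across more stages, so treat this as a lower bound on the async cost.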

Reproduction

```bash
# Server (inside a container on the slower host)
/tmp/bssh-server gen-host-key --output /tmp/bssh_host_key -t ed25519
/tmp/bssh-server run -b 0.0.0.0 -p 2200 -k /tmp/bssh_host_key -D

# Client
dd if=/dev/urandom of=/tmp/testfile_1G bs=1M count=1024
echo 'put /tmp/testfile_1G testfile_1G' | \
  sftp -i id_container -P 2200 \
       -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
       work@<server>
```

Cross-check with a stock OpenSSH `sftp-server` on the same host/container/user for a baseline.
