Batch console output to one VM exit per print call #2263

stlankes wants to merge 1 commit into hermit-os:main
Increase serial buffer from 256 to 2048 bytes and gate newline-triggered flush on debug_assertions. In release builds, data is only flushed when the buffer is full or explicitly at the end of each _print()/_panic_print() call. This reduces VM exits from ~28 per multi-line info!() to just 1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
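The flush policy described above can be modeled in a few lines. This is a minimal sketch, not Hermit's actual serial driver code; `SerialBuffer`, `write_byte`, and the counter are hypothetical names used for illustration, and the real flush hands bytes to the serial device (causing the VM exit).

```rust
// Minimal model of the buffering policy described in this PR.
// Names are hypothetical, not Hermit's actual internals.
const BUFFER_SIZE: usize = 2048; // raised from 256 in this PR

struct SerialBuffer {
    buf: Vec<u8>,
    vm_exits: usize, // each flush models one VM exit
}

impl SerialBuffer {
    fn new() -> Self {
        Self { buf: Vec::with_capacity(BUFFER_SIZE), vm_exits: 0 }
    }

    fn write_byte(&mut self, b: u8) {
        self.buf.push(b);
        // Always flush when the buffer is full.
        if self.buf.len() == BUFFER_SIZE {
            self.flush();
        }
        // Newline-triggered flush only in debug builds.
        if cfg!(debug_assertions) && b == b'\n' {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.buf.is_empty() {
            self.buf.clear(); // stands in for the actual serial write
            self.vm_exits += 1;
        }
    }

    // Models _print(): buffer the whole message, then flush once.
    fn print(&mut self, s: &str) {
        for &b in s.as_bytes() {
            self.write_byte(b);
        }
        self.flush();
    }
}

fn main() {
    let mut serial = SerialBuffer::new();
    // A multi-line message that previously caused one exit per line.
    serial.print("line 1\nline 2\nline 3\n");
    if !cfg!(debug_assertions) {
        // In a release build the whole message is a single flush.
        assert_eq!(serial.vm_exits, 1);
    }
    println!("VM exits: {}", serial.vm_exits);
}
```

In a release build the three-line message costs a single flush; in a debug build each newline still flushes, which keeps output timely while debugging.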
Benchmark Results
| Benchmark | Current: 84b9af2 | Previous: 04ab3e2 | Performance Ratio |
|---|---|---|---|
| startup_benchmark Build Time | 102.80 s | 98.87 s | 1.04 ❗ |
| startup_benchmark File Size | 0.87 MB | 0.86 MB | 1.01 ❗ |
| Startup Time - 1 core | 0.97 s (±0.03 s) | 0.95 s (±0.03 s) | 1.01 |
| Startup Time - 2 cores | 0.99 s (±0.03 s) | 0.97 s (±0.03 s) | 1.02 |
| Startup Time - 4 cores | 0.98 s (±0.03 s) | 0.96 s (±0.03 s) | 1.02 |
| multithreaded_benchmark Build Time | 101.55 s | 101.80 s | 1.00 ❗ |
| multithreaded_benchmark File Size | 0.97 MB | 0.96 MB | 1.01 ❗ |
| Multithreaded Pi Efficiency - 2 Threads | 91.52 % (±8.68 %) | 91.38 % (±7.84 %) | 1.00 |
| Multithreaded Pi Efficiency - 4 Threads | 44.45 % (±3.82 %) | 44.41 % (±3.31 %) | 1.00 |
| Multithreaded Pi Efficiency - 8 Threads | 26.17 % (±1.71 %) | 25.72 % (±1.88 %) | 1.02 |
| micro_benchmarks Build Time | 110.64 s | 112.45 s | 0.98 ❗ |
| micro_benchmarks File Size | 0.97 MB | 0.96 MB | 1.01 ❗ |
| Scheduling time - 1 thread | 71.18 ticks (±3.56 ticks) | 71.54 ticks (±3.61 ticks) | 0.99 |
| Scheduling time - 2 threads | 41.34 ticks (±6.11 ticks) | 39.74 ticks (±4.96 ticks) | 1.04 |
| Micro - Time for syscall (getpid) | 3.10 ticks (±0.28 ticks) | 3.06 ticks (±0.30 ticks) | 1.01 |
| Memcpy speed - (built_in) block size 4096 | 64380.42 MByte/s (±46026.29 MByte/s) | 64053.45 MByte/s (±45682.00 MByte/s) | 1.01 |
| Memcpy speed - (built_in) block size 1048576 | 29563.33 MByte/s (±24631.69 MByte/s) | 29971.18 MByte/s (±24961.55 MByte/s) | 0.99 |
| Memcpy speed - (built_in) block size 16777216 | 21566.55 MByte/s (±18027.43 MByte/s) | 25003.45 MByte/s (±21013.82 MByte/s) | 0.86 |
| Memset speed - (built_in) block size 4096 | 65283.58 MByte/s (±46683.59 MByte/s) | 64316.32 MByte/s (±45851.10 MByte/s) | 1.02 |
| Memset speed - (built_in) block size 1048576 | 30322.83 MByte/s (±25056.54 MByte/s) | 30752.34 MByte/s (±25389.93 MByte/s) | 0.99 |
| Memset speed - (built_in) block size 16777216 | 22282.76 MByte/s (±18542.65 MByte/s) | 25753.82 MByte/s (±21487.22 MByte/s) | 0.87 |
| Memcpy speed - (rust) block size 4096 | 56785.63 MByte/s (±41866.87 MByte/s) | 59798.08 MByte/s (±44107.11 MByte/s) | 0.95 |
| Memcpy speed - (rust) block size 1048576 | 29478.00 MByte/s (±24564.58 MByte/s) | 29689.71 MByte/s (±24867.97 MByte/s) | 0.99 |
| Memcpy speed - (rust) block size 16777216 | 21453.24 MByte/s (±17887.44 MByte/s) | 24799.81 MByte/s (±20847.11 MByte/s) | 0.87 |
| Memset speed - (rust) block size 4096 | 57415.51 MByte/s (±42321.51 MByte/s) | 60738.74 MByte/s (±44654.61 MByte/s) | 0.95 |
| Memset speed - (rust) block size 1048576 | 30249.83 MByte/s (±24996.10 MByte/s) | 30493.17 MByte/s (±25306.10 MByte/s) | 0.99 |
| Memset speed - (rust) block size 16777216 | 22020.93 MByte/s (±18235.81 MByte/s) | 25549.21 MByte/s (±21318.77 MByte/s) | 0.86 |
| alloc_benchmarks Build Time | 107.76 s | 103.81 s | 1.04 ❗ |
| alloc_benchmarks File Size | 0.95 MB | 0.93 MB | 1.01 ❗ |
| Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Deallocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
| Allocations - Average Allocation time | 9494.54 Ticks (±701.57 Ticks) | 11018.15 Ticks (±189.29 Ticks) | 0.86 ❗ |
| Allocations - Average Allocation time (no fail) | 9494.54 Ticks (±701.57 Ticks) | 11018.15 Ticks (±189.29 Ticks) | 0.86 ❗ |
| Allocations - Average Deallocation time | 3688.50 Ticks (±318.31 Ticks) | 1054.71 Ticks (±332.39 Ticks) | 3.50 ❗ |
| mutex_benchmark Build Time | 107.96 s | 109.97 s | 0.98 ❗ |
| mutex_benchmark File Size | 0.97 MB | 0.96 MB | 1.01 ❗ |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 13.72 ns (±0.66 ns) | 13.32 ns (±0.68 ns) | 1.03 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 16.24 ns (±0.95 ns) | 15.94 ns (±0.83 ns) | 1.02 |
This comment was automatically generated by workflow using github-action-benchmark.
mkroening left a comment:
I have investigated how buffering works in glibc and Rust.
```c
// test.c
#include <stdio.h>
#include <unistd.h>

void wait(void) {
	size_t n = 0x40000000;
	for (size_t i = 0; i < n; ++i) {
		asm("");
	}
}

int main(void) {
	// Does buffer
	printf("printf");
	wait();
	printf(" does buffer\n");
	wait();

	// Does buffer
	fprintf(stdout, "fprintf(stdout)");
	wait();
	fprintf(stdout, " does buffer\n");
	wait();

	// Does not buffer
	fprintf(stderr, "fprintf(stderr)");
	wait();
	fprintf(stderr, " does not buffer\n");
	wait();

	// Does not buffer
	write(STDOUT_FILENO, "write(STDOUT_FILENO)", 20);
	wait();
	write(STDOUT_FILENO, " does not buffer\n", 17);
	wait();

	// Does not buffer
	write(STDERR_FILENO, "write(STDERR_FILENO)", 20);
	wait();
	write(STDERR_FILENO, " does not buffer\n", 17);
	wait();
}
```

Run with `gcc test.c && ./a.out`.
```rust
#!/usr/bin/env -S cargo +nightly -Zscript
---cargo
[dependencies]
log = "0.4"
env_logger = "0.11"
---

use std::io::{self, Write};

use log::error;

fn main() {
	// Does buffer
	print!("print!");
	wait();
	println!(" does buffer");
	wait();

	{
		// Does buffer
		let stdout = io::stdout();
		let mut handle = stdout.lock();
		handle.write_all(b"stdout").unwrap();
		wait();
		handle.write_all(b" does buffer\n").unwrap();
		wait();
	}

	// Does not buffer
	eprint!("eprint!");
	wait();
	eprintln!(" does not buffer");
	wait();

	{
		// Does not buffer
		let stderr = io::stderr();
		let mut handle = stderr.lock();
		handle.write_all(b"stderr").unwrap();
		wait();
		handle.write_all(b" does not buffer\n").unwrap();
		wait();
	}

	env_logger::init();

	// Does buffer
	error!(
		"log\n{}does\n{}buffer\n",
		{
			wait();
			'1'
		},
		{
			wait();
			'2'
		}
	);
}

fn wait() {
	let n = 0x4000000_u64;
	for _ in 0..n {
		std::hint::spin_loop();
	}
}
```

Run with `chmod +x test.rs && ./test.rs`.
These results seem sensible to me, and I think we should try to reproduce them with Hermit.
Specifically, this means that sys_write calls should never buffer. Currently, sys_write(STDOUT_FILENO) and sys_write(STDERR_FILENO) are buffered because we run them through the same console buffer, which we should change. fprintf(stdout) (printf), stdout (print!), and log buffering should happen in userspace. All stdout and stderr output (especially non-text-based output) via sys_write should never be subject to text-based buffering.
Inside the kernel, we should adopt the userspace logic as necessary.
That means we should remove the buffering from the console abstraction and move it into kernel-internal stdout and stderr abstractions that buffer stdout appropriately and don't buffer stderr. For stdout buffering, we should always flush on newlines. We could use a buffer of 8192 bytes there: glibc defines stdio.h's BUFSIZ as 8192, Rust defines sys::io's DEFAULT_BUF_SIZE as 8 * 1024, and musl defines stdio.h's BUFSIZ as 1024.
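Sketched in Rust, the proposed kernel-internal stdout abstraction could look roughly like this (a sketch under the assumptions above; `StdoutWriter`, `write`, and the flush counter are hypothetical names, and the cleared buffer stands in for handing bytes to the console driver, while stderr would bypass the buffer entirely):

```rust
// Sketch of a line-buffered kernel stdout: flush on newline or when the
// buffer is full. Names are hypothetical; BUF_SIZE follows glibc's BUFSIZ.
const BUF_SIZE: usize = 8192;

struct StdoutWriter {
    buf: Vec<u8>,
    flushes: usize, // counts writes that reach the console driver
}

impl StdoutWriter {
    fn new() -> Self {
        Self { buf: Vec::with_capacity(BUF_SIZE), flushes: 0 }
    }

    fn write(&mut self, data: &[u8]) {
        for &b in data {
            self.buf.push(b);
            // Line buffering: flush on newline, or when the buffer fills.
            if b == b'\n' || self.buf.len() == BUF_SIZE {
                self.flush();
            }
        }
    }

    fn flush(&mut self) {
        if !self.buf.is_empty() {
            self.buf.clear(); // would hand the bytes to the console driver
            self.flushes += 1;
        }
    }
}

fn main() {
    let mut out = StdoutWriter::new();
    out.write(b"no newline yet");
    assert_eq!(out.flushes, 0); // still buffered
    out.write(b" ... done\n");
    assert_eq!(out.flushes, 1); // newline triggered the flush
}
```

This mirrors the glibc/Rust userspace behavior measured above: partial lines stay buffered until a newline arrives, while a stderr path would call the driver directly for every write.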
For logging, we should add another buffer in our logging implementation similar to env_logger/src/writer/buffer.rs.
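The env_logger approach referenced above formats the entire record into an in-memory buffer and emits it with a single write. A minimal version of that idea (the `log_record` function and the `Vec<String>` sink are hypothetical stand-ins for the logger and the console device):

```rust
use std::fmt::Write as _;

// Minimal sketch of env_logger-style log buffering: the whole record is
// formatted into one String and emitted as a single write, so a multi-line
// message costs one emission instead of one per line. Names are hypothetical.
fn log_record(target: &str, args: std::fmt::Arguments<'_>, sink: &mut Vec<String>) {
    let mut buf = String::new();
    let _ = write!(buf, "[{target}] {args}");
    // One write per record, regardless of embedded newlines.
    sink.push(buf);
}

fn main() {
    let mut sink = Vec::new(); // stands in for the console/serial device
    log_record("demo", format_args!("line 1\nline 2\nline 3"), &mut sink);
    assert_eq!(sink.len(), 1); // a single emission for the whole record
    println!("{}", sink[0]);
}
```

With this structure, a multi-line `info!()` reaches the console abstraction as one contiguous write, which is what makes the one-exit-per-record behavior possible without buffering in `sys_write` itself.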
I cherry-picked this commit from a PR in @jbreitbart's repository.