If you're looking for a talented Rust developer, or a good senior software engineer, please reach out. I'm looking for a full time role doing something neat, and am available for contract gigs. Contact info is in the footer. Thank you!
First off, I want to say that I've gotten a tremendous amount of feedback on the previous post in this series. People have sent me stories, typos, feedback, and other ideas for things to test in the future. Today is a smaller update to the original post, and I'd encourage you to take a look at that post for descriptions and background on most things I'll discuss here.
The day after I posted originally, the Rust folks published this blog post announcing the release of the parallel compiler frontend in nightly, which can offer some nice speed boosts. I also read about how workspaces are supposed to compile faster, and since Leptos supports that configuration, I was curious whether it would materially change the results.
Methodology
I'm sticking to the same methodology outlined in part one: two warmup runs and six regular runs per test option. Since I've automated the testing at this point, I reran every configuration from part one. This time I only have my site as a test candidate, although that may change in the future.
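For a sense of what a single test looks like, here's a minimal sketch of how one incremental-build configuration could be timed with hyperfine. My actual harness differs, so the flags and the touched file here are purely illustrative.

```bash
# Hypothetical timing of one incremental-build configuration:
# 2 warmup runs, 6 measured runs, and a touch before each run so
# cargo actually has something to rebuild.
hyperfine --warmup 2 --runs 6 \
  --prepare 'touch src/main.rs' \
  'cargo build'
```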
New Configurations
Workspace
Originally my project was structured as a single crate, with different modules for different tasks and a variety of cfg_if blocks to limit what gets compiled for the server binary build step versus the client WebAssembly build step. The workspace configuration separates some of that into separate crates and reduces the number of cfg_if statements. However, since my site is server rendered and hydrated, and I use isomorphic server functions, the app crate still needs to be buildable for both targets. I also ended up having to break out several modules (errors and database models) into their own crates to avoid dependency cycles, and had to set up features along with cfg_if statements to control what is available when. A sketch of what the manifests might look like follows the directory listings below.
We can see what each layout looks like directory-wise with tree.
tree is a pretty handy Unix tool that'll print the directory/file tree of any folder in a human-readable way. It can even respect .gitignore files, and has a number of nice options.
```bash
tree --gitignore -d # Ignore files/folders in my gitignore and only show me directories
```
Here was the project structure as it was in part one, as a single crate.
```
├── benchmarks
├── db
├── migrations
├── public
│   ├── fonts
│   └── img
├── scripts
├── src
│   ├── components
│   ├── functions
│   ├── layouts
│   ├── models
│   ├── providers
│   └── routes
│       └── blog
└── styles
```
Original Structure
Here is the workspace configuration I'll be testing today.
```
├── app
│   └── src
│       ├── components
│       ├── functions
│       ├── layouts
│       ├── providers
│       └── routes
│           └── blog
├── benwis_server
│   ├── migrations
│   └── src
├── db
├── errors
│   └── src
├── frontend
│   └── src
├── models
│   └── src
├── public
│   ├── fonts
│   └── img
├── scripts
└── styles
```
Structure as a Workspace
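For the curious, here's a rough sketch of what the manifests for a layout like this could look like. The crate names mirror the directories above, and the ssr/hydrate feature names follow the usual Leptos convention, but these are illustrations rather than my actual files.

```toml
# Hypothetical root Cargo.toml: the workspace just lists its member crates.
[workspace]
resolver = "2"
members = ["app", "benwis_server", "errors", "frontend", "models"]
```

The shared app crate then exposes features so the server binary and the WebAssembly bundle can each pull in only what they need:

```toml
# Hypothetical app/Cargo.toml excerpt: features gate server-only vs.
# browser-only code paths, forwarded down to Leptos.
[features]
ssr = ["leptos/ssr"]
hydrate = ["leptos/hydrate"]
```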
The theory as to why this may compile faster is that the Rust compiler operates at the crate level, with each crate being a unit of compilation and linking. Cargo typically builds one crate per thread, and if only one crate is changed, only that crate and the crates that depend on it need to be recompiled; everything else can be reused. I have only an outsider's perspective on how the compiler works, so if I've said something wrong here, please let me know.
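If you want to see that crate-level parallelism (and which crates gate the rest of the build) for yourself, cargo can emit a timing report; this is a generic invocation, not part of my benchmark runs:

```bash
# Writes an HTML report under target/cargo-timings/ showing when each
# crate started and finished compiling and how many built in parallel.
cargo build --timings
```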
I'm not sure I buy the argument some workspace users have made that this is simpler: I now have five crates (app, benwis_server, errors, frontend, and models) instead of one. Perhaps it will compile faster, though; we'll soon find out.
Parallel Compiler Front End
The Parallel Rustc Working Group has been hard at work attempting to parallelize different parts of the Rust compiler to make things faster. The compiler can be separated into two parts: the frontend and the backend. The frontend is primarily responsible for parsing, borrow checking, and type checking. The backend performs code generation via LLVM or Cranelift, and is already parallelized.
In the nightly build released Nov. 6, 2023, parallelization for the frontend was added as the default. However, it defaults to running on a single thread, which shouldn't net any benefit, but does let most of their code get tested. To enable more threads, we can set the rustflags like so: RUSTFLAGS="-Z threads=8" cargo build
I chose instead to add it to .cargo/config.toml:
```toml
[build]
rustflags = ["-Z", "threads=8"]
```
Unlike Cranelift, this should work for all profiles and all targets. To read more about what this is, how it works, and the state of parallelization in the Rust compiler, check out the blog post.
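One thing to note: -Z flags are nightly-only, so this only works if the project builds with a nightly toolchain. If you want to pin that per project rather than rely on a global default, a rust-toolchain.toml at the repo root is one way to do it (a sketch, not a file from my repo):

```toml
# rust-toolchain.toml: pin this project to nightly so the unstable
# -Z threads flag in .cargo/config.toml is accepted.
[toolchain]
channel = "nightly"
```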
Results
Whoa, boy. Adding parallel as an option more than doubled the number of tests that needed to be run. Mathematically, I now have four elements to test, and the number of option combinations is the number of non-empty subsets of a set with n elements, which is 2ⁿ − 1 (minus one for the empty set). So we went from 2³ − 1 = 7 to 2⁴ − 1 = 15, times two for test types and times two more for disk types, so 28 and 60 tests respectively.
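If you want to sanity-check that arithmetic, it's just the non-empty subsets multiplied out:

```bash
# (2^n - 1) option combinations, times 2 build types, times 2 disk types.
echo $(( (2**3 - 1) * 2 * 2 ))  # 28 tests with three options
echo $(( (2**4 - 1) * 2 * 2 ))  # 60 tests with four options
```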
So the first question is whether changing to a workspace made a significant difference in compile times.
Difference Between Baselines for Workspace and Single Crate

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
Single Crate Clean Build NVME | 80.06 | 0.27 | 0.00 | 0 |
Single Crate Clean Build 7200RPM | 88.39 | 1.82 | 0.00 | 0 |
Workspace Clean Build NVME | 71.75 | 0.35 | 8.31 | 10 |
Workspace Clean Build 7200RPM | 77.72 | 0.56 | 10.67 | 12 |
Single Crate Incremental Build NVME | 19.10 | 0.08 | 0.00 | 0 |
Single Crate Incremental Build 7200RPM | 20.46 | 0.80 | 0.00 | 0 |
Workspace Incremental Build NVME | 12.87 | 0.17 | 6.23 | 33 |
Workspace Incremental Build 7200RPM | 13.98 | 0.99 | 6.48 | 32 |
And it seems like it did. It improved clean build times by ~10% and incremental builds by ~30%, regardless of disk type. But will these improvements hold for our fastest configuration from before (mold and cranelift) versus our new fastest (mold, cranelift, and parallel)?
Difference Between Fastest Configurations in Single Crate and Workspace

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
Single Crate Clean Build With Cranelift, Mold NVME | 59.78 | 0.19 | 0.00 | 0 |
Single Crate Clean Build With Cranelift, Mold 7200RPM | 65.34 | 1.24 | 0.00 | 0 |
Workspace Clean Build With Parallel, Cranelift, Mold NVME | 56.96 | 0.26 | 2.82 | 5 |
Workspace Clean Build With Parallel, Cranelift, Mold 7200RPM | 60.36 | 0.51 | 4.98 | 8 |
Single Crate Incremental Build With Cranelift, Mold NVME | 4.65 | 0.07 | 0.00 | 0 |
Single Crate Incremental Build With Cranelift, Mold 7200RPM | 4.83 | 0.25 | 0.00 | 0 |
Workspace Incremental Build With Parallel, Cranelift, Mold NVME | 4.05 | 0.04 | 0.60 | 13 |
Workspace Incremental Build With Parallel, Cranelift, Mold 7200RPM | 4.33 | 0.22 | 0.50 | 10 |
Also yes, it seems like we netted some nice boosts here as well: roughly 6% to clean build times and 11% to incremental build times. I wonder if a more complex site would see a larger benefit, although I'm not sure I can convince Houski to do that work. If anyone else wants to try these tests on their Rust app, let me know; I'd love more data points. As it stands, I'm not sure the difference is significant enough to warrant the effort of rewriting your app.
But Ben, you compared the parallel compiler for the workspace with the single crate runs that did not. How do you know the Parallel Frontend isn't responsible for the improvements?
That I did; let's talk about the parallel frontend.
Difference Between Parallel Frontend Enabled and Baseline

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
Default Clean Build NVME | 71.75 | 0.35 | 0 | 0 |
Default Clean Build 7200RPM | 77.72 | 0.56 | 0 | 0 |
Parallel Clean Build NVME | 74.94 | 0.34 | -3.19 | -4 |
Parallel Clean Build 7200RPM | 78.6 | 0.34 | -0.88 | -1 |
Default Incremental Build NVME | 12.87 | 0.17 | 0 | 0 |
Default Incremental Build 7200RPM | 13.98 | 0.99 | 0 | 0 |
Parallel Incremental Build NVME | 13.18 | 0.09 | -0.31 | -2 |
Parallel Incremental Build 7200RPM | 14.07 | 0.84 | -0.09 | -1 |
Compared to the baseline, the parallel frontend showed slight increases in compile time of between 1 and 4 percent. So it's quite unlikely that the parallel frontend is responsible for the improvements, considering we already know the workspace builds are faster by a more significant margin. At best, the parallel frontend seems to be a wash in this project. However, it's impossible to totally discount some interaction between parallel, mold, and cranelift here, so take these numbers with a grain of salt.
Conclusion
This time around, the improvement is a lot smaller and takes more work to implement. It'll be up to individual developers and teams to decide whether a build speed increase of 10% for clean builds and 6% for incremental is worth the cost of refactoring. For smaller sites like my blog, I probably wouldn't bother (if I hadn't already), as it works out to around 0.1 seconds per incremental compile. As for the parallel frontend, it's easy to enable since I'm already on nightly, and it's likely to improve. I'll probably leave it enabled, as it doesn't seem to hurt anything and it's good to give the new frontend some testing.
For those curious, stacking the improvements here on top of the original single-crate incremental-compile baseline from the previous post, we've jumped from 76% to 79% faster. This is probably where I'll stop for a while, unless somebody comes up with even more potential compile time decreases. Members of the Rust compiler team are certainly working to reduce these times, so who knows what the future holds. At the end of the day, I'm happy with the results. As always, feel free to reach out, either by email or on Mastodon at @[email protected], with any questions, comments, or ideas. Have a great week!
Raw Data
Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
clean_0_nvme | 71.75 | 0.35 | 0 | 0 |
clean_0_spin | 77.72 | 0.56 | 0 | 0 |
clean_cranelift_nvme | 61.83 | 0.44 | 9.92 | 14 |
clean_cranelift_spin | 61.85 | 0.28 | 15.87 | 20 |
clean_mold_cranelift_nvme | 59.47 | 0.63 | 12.28 | 17 |
clean_mold_cranelift_spin | 60.32 | 0.77 | 17.4 | 22 |
clean_mold_nvme | 63.7 | 0.44 | 8.05 | 11 |
clean_mold_o3_cranelift_nvme | 102.66 | 1.13 | -30.91 | -43 |
clean_mold_o3_cranelift_spin | 103.14 | 0.69 | -25.42 | -33 |
clean_mold_o3_nvme | 154.69 | 2.25 | -82.94 | -116 |
clean_mold_o3_spin | 154.21 | 0.8 | -76.49 | -98 |
clean_mold_spin | 67.12 | 0.74 | 10.6 | 14 |
clean_o3_cranelift_nvme | 105.5 | 0.34 | -33.75 | -47 |
clean_o3_cranelift_spin | 105.31 | 0.48 | -27.59 | -35 |
clean_o3_nvme | 160.75 | 0.48 | -89 | -124 |
clean_o3_spin | 163.23 | 0.21 | -85.51 | -110 |
clean_parallel_cranelift_mold_nvme | 56.96 | 0.26 | 14.79 | 21 |
clean_parallel_cranelift_mold_spin | 60.36 | 0.51 | 17.36 | 22 |
clean_parallel_cranelift_nvme | 58.49 | 0.15 | 13.26 | 18 |
clean_parallel_cranelift_spin | 62.21 | 0.3 | 15.51 | 20 |
clean_parallel_mold_nvme | 62.59 | 0.16 | 9.16 | 13 |
clean_parallel_mold_spin | 68.73 | 1.51 | 8.99 | 12 |
clean_parallel_nvme | 74.94 | 0.34 | -3.19 | -4 |
clean_parallel_o3_cranelift_mold_nvme | 99.5 | 0.28 | -27.75 | -39 |
clean_parallel_o3_cranelift_mold_spin | 103.19 | 0.54 | -25.47 | -33 |
clean_parallel_o3_cranelift_nvme | 101.95 | 0.46 | -30.2 | -42 |
clean_parallel_o3_cranelift_spin | 105.9 | 0.32 | -28.18 | -36 |
clean_parallel_o3_nvme | 158.78 | 2.28 | -87.03 | -121 |
clean_parallel_o3_spin | 164.63 | 1.13 | -86.91 | -112 |
clean_parallel_spin | 78.6 | 0.34 | -0.88 | -1 |

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
incremental_0_nvme | 12.87 | 0.17 | 0 | 0 |
incremental_0_spin | 13.98 | 0.99 | 0 | 0 |
incremental_cranelift_nvme | 6.75 | 0.09 | 6.12 | 48 |
incremental_cranelift_spin | 6.84 | 0.2 | 7.15 | 51 |
incremental_mold_cranelift_nvme | 4.16 | 0.15 | 8.71 | 68 |
incremental_mold_cranelift_spin | 4.31 | 0.22 | 9.67 | 69 |
incremental_mold_nvme | 4.5 | 0.07 | 8.37 | 65 |
incremental_mold_o3_cranelift_nvme | 4.2 | 0.06 | 8.67 | 67 |
incremental_mold_o3_cranelift_spin | 4.52 | 0.34 | 9.46 | 68 |
incremental_mold_o3_nvme | 4.58 | 0.05 | 8.29 | 64 |
incremental_mold_o3_spin | 5.06 | 0.09 | 8.92 | 64 |
incremental_mold_spin | 4.76 | 0.24 | 9.23 | 66 |
incremental_o3_cranelift_nvme | 6.7 | 0.04 | 6.17 | 48 |
incremental_o3_cranelift_spin | 7.07 | 0.41 | 6.92 | 49 |
incremental_o3_nvme | 10.67 | 0.11 | 2.2 | 17 |
incremental_o3_spin | 12.48 | 1.21 | 1.5 | 11 |
incremental_parallel_cranelift_mold_nvme | 4.05 | 0.04 | 8.83 | 69 |
incremental_parallel_cranelift_mold_spin | 4.33 | 0.22 | 9.65 | 69 |
incremental_parallel_cranelift_nvme | 6.52 | 0.07 | 6.36 | 49 |
incremental_parallel_cranelift_spin | 6.85 | 0.24 | 7.14 | 51 |
incremental_parallel_mold_nvme | 4.52 | 0.06 | 8.36 | 65 |
incremental_parallel_mold_spin | 4.94 | 0.35 | 9.05 | 65 |
incremental_parallel_nvme | 13.18 | 0.09 | -0.3 | -2 |
incremental_parallel_o3_cranelift_mold_nvme | 4.09 | 0.05 | 8.78 | 68 |
incremental_parallel_o3_cranelift_mold_spin | 4.49 | 0.34 | 9.49 | 68 |
incremental_parallel_o3_cranelift_nvme | 6.6 | 0.09 | 6.27 | 49 |
incremental_parallel_o3_cranelift_spin | 7.1 | 0.38 | 6.89 | 49 |
incremental_parallel_o3_nvme | 10.41 | 0.14 | 2.46 | 19 |
incremental_parallel_o3_spin | 12.35 | 1.39 | 1.63 | 12 |
incremental_parallel_spin | 14.07 | 0.84 | -0.09 | -1 |