If you're looking for a talented Rust developer, or a good senior software engineer, please reach out. I'm looking for a full time role doing something neat, and am available for contract gigs. Contact info is in the footer. Thank you!
First off, I want to say that I've gotten a tremendous amount of feedback on the previous post in this series. People have sent me stories, typos, feedback, and other ideas for things to test in the future. Today is a smaller update to the original post, and I'd encourage you to take a look at that post for descriptions and background on most things I'll discuss here.
The day after I posted originally, the Rust folks published this blog post announcing the release of the parallel compiler frontend in nightly, which can offer some nice speed boosts. I also read about how workspaces are supposed to compile faster, and since Leptos supports that configuration, I was curious whether it would materially change the results.
Methodology
I'm sticking to the same methodology outlined in part one: two warmup runs and six regular runs per test option. Since I've automated the testing at this point, I reran every configuration from part one. This time I only have my site as a test candidate, although that may change in the future.
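For a sense of what a single test looks like, here's a minimal sketch of how one incremental-build configuration could be timed with hyperfine. My actual harness differs, so the flags and the touched file here are purely illustrative.

```bash
# Hypothetical timing of one incremental-build configuration:
# 2 warmup runs, 6 measured runs, and a touch before each run so
# cargo actually has something to rebuild.
hyperfine --warmup 2 --runs 6 \
  --prepare 'touch src/main.rs' \
  'cargo build'
```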
New Configurations
Workspace
Originally my project was structured as a single crate, with different modules for different tasks and a variety of cfg_if blocks to limit what gets compiled for the server binary build step versus the client WebAssembly build step. The workspace configuration separates some of that into separate crates and reduces the number of cfg_if statements. However, since my site is server rendered and hydrated, and I use isomorphic server functions, the app crate still needs to be buildable for both targets. I also ended up having to break out several modules (errors and database models) into their own crates to avoid dependency cycles, and had to set up features along with cfg_if statements to control what is available when. A sketch of what the manifests might look like follows the directory listings below.
We can see what each layout looks like directory-wise with tree.
tree is a pretty handy Unix tool that'll print the directory/file tree of any folder in a human-readable way. It can even respect .gitignore files, and has a number of nice options.
```bash
tree --gitignore -d # Ignore files/folders in my gitignore and only show me directories
```
Here was the project structure as it was in part one, as a single crate.
```
├── benchmarks
├── db
├── migrations
├── public
│   ├── fonts
│   └── img
├── scripts
├── src
│   ├── components
│   ├── functions
│   ├── layouts
│   ├── models
│   ├── providers
│   └── routes
│       └── blog
└── styles
```
Original Structure
Here is the workspace configuration I'll be testing today.
```
├── app
│   └── src
│       ├── components
│       ├── functions
│       ├── layouts
│       ├── providers
│       └── routes
│           └── blog
├── benwis_server
│   ├── migrations
│   └── src
├── db
├── errors
│   └── src
├── frontend
│   └── src
├── models
│   └── src
├── public
│   ├── fonts
│   └── img
├── scripts
└── styles
```
Structure as a Workspace
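For the curious, here's a rough sketch of what the manifests for a layout like this could look like. The crate names mirror the directories above, and the ssr/hydrate feature names follow the usual Leptos convention, but these are illustrations rather than my actual files.

```toml
# Hypothetical root Cargo.toml: the workspace just lists its member crates.
[workspace]
resolver = "2"
members = ["app", "benwis_server", "errors", "frontend", "models"]
```

The shared app crate then exposes features so the server binary and the WebAssembly bundle can each pull in only what they need:

```toml
# Hypothetical app/Cargo.toml excerpt: features gate server-only vs.
# browser-only code paths, forwarded down to Leptos.
[features]
ssr = ["leptos/ssr"]
hydrate = ["leptos/hydrate"]
```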
The theory as to why this may compile faster is that the Rust compiler operates at the crate level, with each crate being a unit of compilation and linking. Cargo typically builds one crate per thread, and if only one crate is changed, only that crate and the crates that depend on it need to be recompiled; everything else can be reused. I have only an outsider's perspective on how the compiler works, so if I've said something wrong here, please let me know.
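If you want to see that crate-level parallelism (and which crates gate the rest of the build) for yourself, cargo can emit a timing report; this is a generic invocation, not part of my benchmark runs:

```bash
# Writes an HTML report under target/cargo-timings/ showing when each
# crate started and finished compiling and how many built in parallel.
cargo build --timings
```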
I'm not sure I buy the argument some workspace users have made that this is simpler: I now have five crates (app, benwis_server, errors, frontend, and models) instead of one. Perhaps it will compile faster, though; we'll soon find out.
Parallel Compiler Front End
The Parallel Rustc Working Group has been hard at work attempting to parallelize different parts of the Rust compiler to make things faster. The compiler can be separated into two parts: the frontend and the backend. The frontend is primarily responsible for parsing, borrow checking, and type checking. The backend performs code generation via LLVM or Cranelift, and is already parallelized.
In the nightly build released Nov. 6, 2023, parallelization for the frontend was added as the default. However, it defaults to running on a single thread, which shouldn't net any benefit, but does let most of their code get tested. To enable more threads, we can set the rustflags like so: RUSTFLAGS="-Z threads=8" cargo build
I chose instead to add it to .cargo/config.toml:
```toml
[build]
rustflags = ["-Z", "threads=8"]
```
Unlike Cranelift, this should work for all profiles and all targets. To read more about what this is, how it works, and the state of parallelization in the Rust compiler, check out the blog post.
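One thing to note: -Z flags are nightly-only, so this only works if the project builds with a nightly toolchain. If you want to pin that per project rather than rely on a global default, a rust-toolchain.toml at the repo root is one way to do it (a sketch, not a file from my repo):

```toml
# rust-toolchain.toml: pin this project to nightly so the unstable
# -Z threads flag in .cargo/config.toml is accepted.
[toolchain]
channel = "nightly"
```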
Results
Whoa, boy. Adding parallel as an option more than doubled the number of tests that needed to be run. Mathematically, I now have four elements to test, and the number of option combinations is the number of non-empty subsets of a set with n elements, which is 2ⁿ − 1 (minus one for the empty set). So we went from 2³ − 1 = 7 to 2⁴ − 1 = 15, times two for test types and times two more for disk types, so 28 and 60 tests respectively.
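If you want to sanity-check that arithmetic, it's just the non-empty subsets multiplied out:

```bash
# (2^n - 1) option combinations, times 2 build types, times 2 disk types.
echo $(( (2**3 - 1) * 2 * 2 ))  # 28 tests with three options
echo $(( (2**4 - 1) * 2 * 2 ))  # 60 tests with four options
```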
So the first question is whether changing to a workspace made a significant difference in compile times.
Difference Between Baselines for Workspace and Single Crate

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
Single Crate Clean Build NVME | 80.06 | 0.27 | 0.00 | 0 |
Single Crate Clean Build 7200RPM | 88.39 | 1.82 | 0.00 | 0 |
Workspace Clean Build NVME | 71.75 | 0.35 | 8.31 | 10 |
Workspace Clean Build 7200RPM | 77.72 | 0.56 | 10.67 | 12 |
Single Crate Incremental Build NVME | 19.10 | 0.08 | 0.00 | 0 |
Single Crate Incremental Build 7200RPM | 20.46 | 0.80 | 0.00 | 0 |
Workspace Incremental Build NVME | 12.87 | 0.17 | 6.23 | 33 |
Workspace Incremental Build 7200RPM | 13.98 | 0.99 | 6.48 | 32 |
And it seems like it did. It improved clean build times by ~10% and incremental builds by ~30%, regardless of disk type. But will these improvements hold for our fastest configuration from before (mold and cranelift) versus our new fastest (mold, cranelift, and parallel)?
Difference Between Fastest Configurations in Single Crate and Workspace

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
Single Crate Clean Build With Cranelift, Mold NVME | 59.78 | 0.19 | 0.00 | 0 |
Single Crate Clean Build With Cranelift, Mold 7200RPM | 65.34 | 1.24 | 0.00 | 0 |
Workspace Clean Build With Parallel, Cranelift, Mold NVME | 56.96 | 0.26 | 2.82 | 5 |
Workspace Clean Build With Parallel, Cranelift, Mold 7200RPM | 60.36 | 0.51 | 4.98 | 8 |
Single Crate Incremental Build With Cranelift, Mold NVME | 4.65 | 0.07 | 0.00 | 0 |
Single Crate Incremental Build With Cranelift, Mold 7200RPM | 4.83 | 0.25 | 0.00 | 0 |
Workspace Incremental Build With Parallel, Cranelift, Mold NVME | 4.05 | 0.04 | 0.60 | 13 |
Workspace Incremental Build With Parallel, Cranelift, Mold 7200RPM | 4.33 | 0.22 | 0.50 | 10 |
Also yes, it seems like we netted some nice boosts here as well: roughly 6% to clean build times and 11% to incremental build times. I wonder if a more complex site would see a larger benefit, although I'm not sure I can convince Houski to do that work. If anyone else wants to try these tests on their Rust app, let me know; I'd love more data points. As it stands, I'm not sure the difference is significant enough to warrant the effort of rewriting your app.
But Ben, you compared the parallel compiler for the workspace with the single crate runs that did not. How do you know the Parallel Frontend isn't responsible for the improvements?
That I did; let's talk about the parallel frontend.
Difference Between Parallel Frontend Enabled and Baseline

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
Default Clean Build NVME | 71.75 | 0.35 | 0 | 0 |
Default Clean Build 7200RPM | 77.72 | 0.56 | 0 | 0 |
Parallel Clean Build NVME | 74.94 | 0.34 | -3.19 | -4 |
Parallel Clean Build 7200RPM | 78.6 | 0.34 | -0.88 | -1 |
Default Incremental Build NVME | 12.87 | 0.17 | 0 | 0 |
Default Incremental Build 7200RPM | 13.98 | 0.99 | 0 | 0 |
Parallel Incremental Build NVME | 13.18 | 0.09 | -0.31 | -2 |
Parallel Incremental Build 7200RPM | 14.07 | 0.84 | -0.09 | -1 |
Compared to the baseline, the parallel frontend showed slight increases in compile time of between 1 and 4 percent. So it's quite unlikely that the parallel frontend is responsible for the improvements, considering we already know the workspace builds are faster by a more significant margin. At best, the parallel frontend seems to be a wash in this project. However, it's impossible to totally discount some interaction between parallel, mold, and cranelift here, so take these numbers with a grain of salt.
Conclusion
This time around, the improvement is a lot smaller and takes more work to implement. It'll be up to individual developers and teams to decide whether a build speed increase of 10% for clean builds and 6% for incremental is worth the cost of refactoring. For smaller sites like my blog, I probably wouldn't bother (if I hadn't already), as it works out to around 0.1 seconds per incremental compile. As for the parallel frontend, it's easy to enable since I'm already on nightly, and it's likely to improve. I'll probably leave it enabled, as it doesn't seem to hurt anything and it's good to give the new frontend some testing.
For those curious, stacking the improvements here on top of the original single-crate incremental-compile baseline from the previous post, we've jumped from 76% to 79% faster. This is probably where I'll stop for a while, unless somebody comes up with even more potential compile time decreases. Members of the Rust compiler team are certainly working to reduce these times, so who knows what the future holds. At the end of the day, I'm happy with the results. As always, feel free to reach out, either by email or on Mastodon at @[email protected], with any questions, comments, or ideas. Have a great week!
Raw Data
Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
clean_0_nvme | 71.75 | 0.35 | 0 | 0 |
clean_0_spin | 77.72 | 0.56 | 0 | 0 |
clean_cranelift_nvme | 61.83 | 0.44 | 9.92 | 14 |
clean_cranelift_spin | 61.85 | 0.28 | 15.87 | 20 |
clean_mold_cranelift_nvme | 59.47 | 0.63 | 12.28 | 17 |
clean_mold_cranelift_spin | 60.32 | 0.77 | 17.4 | 22 |
clean_mold_nvme | 63.7 | 0.44 | 8.05 | 11 |
clean_mold_o3_cranelift_nvme | 102.66 | 1.13 | -30.91 | -43 |
clean_mold_o3_cranelift_spin | 103.14 | 0.69 | -25.42 | -33 |
clean_mold_o3_nvme | 154.69 | 2.25 | -82.94 | -116 |
clean_mold_o3_spin | 154.21 | 0.8 | -76.49 | -98 |
clean_mold_spin | 67.12 | 0.74 | 10.6 | 14 |
clean_o3_cranelift_nvme | 105.5 | 0.34 | -33.75 | -47 |
clean_o3_cranelift_spin | 105.31 | 0.48 | -27.59 | -35 |
clean_o3_nvme | 160.75 | 0.48 | -89 | -124 |
clean_o3_spin | 163.23 | 0.21 | -85.51 | -110 |
clean_parallel_cranelift_mold_nvme | 56.96 | 0.26 | 14.79 | 21 |
clean_parallel_cranelift_mold_spin | 60.36 | 0.51 | 17.36 | 22 |
clean_parallel_cranelift_nvme | 58.49 | 0.15 | 13.26 | 18 |
clean_parallel_cranelift_spin | 62.21 | 0.3 | 15.51 | 20 |
clean_parallel_mold_nvme | 62.59 | 0.16 | 9.16 | 13 |
clean_parallel_mold_spin | 68.73 | 1.51 | 8.99 | 12 |
clean_parallel_nvme | 74.94 | 0.34 | -3.19 | -4 |
clean_parallel_o3_cranelift_mold_nvme | 99.5 | 0.28 | -27.75 | -39 |
clean_parallel_o3_cranelift_mold_spin | 103.19 | 0.54 | -25.47 | -33 |
clean_parallel_o3_cranelift_nvme | 101.95 | 0.46 | -30.2 | -42 |
clean_parallel_o3_cranelift_spin | 105.9 | 0.32 | -28.18 | -36 |
clean_parallel_o3_nvme | 158.78 | 2.28 | -87.03 | -121 |
clean_parallel_o3_spin | 164.63 | 1.13 | -86.91 | -112 |
clean_parallel_spin | 78.6 | 0.34 | -0.88 | -1 |

Cargo Configuration | mean(s) | std dev(s) | Delta From Baseline | % Delta From Baseline |
---|---|---|---|---|
incremental_0_nvme | 12.87 | 0.17 | 0 | 0 |
incremental_0_spin | 13.98 | 0.99 | 0 | 0 |
incremental_cranelift_nvme | 6.75 | 0.09 | 6.12 | 48 |
incremental_cranelift_spin | 6.84 | 0.2 | 7.15 | 51 |
incremental_mold_cranelift_nvme | 4.16 | 0.15 | 8.71 | 68 |
incremental_mold_cranelift_spin | 4.31 | 0.22 | 9.67 | 69 |
incremental_mold_nvme | 4.5 | 0.07 | 8.37 | 65 |
incremental_mold_o3_cranelift_nvme | 4.2 | 0.06 | 8.67 | 67 |
incremental_mold_o3_cranelift_spin | 4.52 | 0.34 | 9.46 | 68 |
incremental_mold_o3_nvme | 4.58 | 0.05 | 8.29 | 64 |
incremental_mold_o3_spin | 5.06 | 0.09 | 8.92 | 64 |
incremental_mold_spin | 4.76 | 0.24 | 9.23 | 66 |
incremental_o3_cranelift_nvme | 6.7 | 0.04 | 6.17 | 48 |
incremental_o3_cranelift_spin | 7.07 | 0.41 | 6.92 | 49 |
incremental_o3_nvme | 10.67 | 0.11 | 2.2 | 17 |
incremental_o3_spin | 12.48 | 1.21 | 1.5 | 11 |
incremental_parallel_cranelift_mold_nvme | 4.05 | 0.04 | 8.83 | 69 |
incremental_parallel_cranelift_mold_spin | 4.33 | 0.22 | 9.65 | 69 |
incremental_parallel_cranelift_nvme | 6.52 | 0.07 | 6.36 | 49 |
incremental_parallel_cranelift_spin | 6.85 | 0.24 | 7.14 | 51 |
incremental_parallel_mold_nvme | 4.52 | 0.06 | 8.36 | 65 |
incremental_parallel_mold_spin | 4.94 | 0.35 | 9.05 | 65 |
incremental_parallel_nvme | 13.18 | 0.09 | -0.3 | -2 |
incremental_parallel_o3_cranelift_mold_nvme | 4.09 | 0.05 | 8.78 | 68 |
incremental_parallel_o3_cranelift_mold_spin | 4.49 | 0.34 | 9.49 | 68 |
incremental_parallel_o3_cranelift_nvme | 6.6 | 0.09 | 6.27 | 49 |
incremental_parallel_o3_cranelift_spin | 7.1 | 0.38 | 6.89 | 49 |
incremental_parallel_o3_nvme | 10.41 | 0.14 | 2.46 | 19 |
incremental_parallel_o3_spin | 12.35 | 1.39 | 1.63 | 12 |
incremental_parallel_spin | 14.07 | 0.84 | -0.09 | -1 |