An engineer in a high-vis vest and hard hat with a clipboard inspecting a ridiculously sized pyramid of crates in a warehouse

If you're looking for a talented Rust developer, or a good senior software engineer, please reach out. I'm looking for a full-time role doing something neat, and am available for contract gigs. Contact info is in the footer. Thank you!

First off, I want to say that I've gotten a tremendous amount of feedback on the previous post in this series. People have sent me stories, typos, feedback, and other ideas for things to test in the future. Today is a smaller update to the original post, and I'd encourage you to take a look at that post for descriptions and background on most things I'll discuss here.

The day after I posted originally, the Rust folks published this blog post announcing the release of a parallel compiler frontend on nightly, which can offer some nice speed boosts. I also read about how workspaces are supposed to compile faster, and since Leptos supports that configuration, I was curious if it would materially change the results.

Methodology

I'm sticking to the same methodology outlined in part one: two warmup runs and six regular runs per test option. Since I've automated the testing at this point, I reran every configuration from part one. This time I only have my site as a test candidate, although that may change in the future.
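To give a rough idea of what a single run looks like, an incremental test boils down to something like the sketch below. This is illustrative only, not my actual harness; the touched file and the build command are stand-ins, and the clean runs do a cargo clean first instead of touching a file.

bash
# Rough sketch of one incremental measurement (illustrative, not the real script):
# touch a source file so cargo has something to rebuild, then time the build.
touch src/lib.rs
time cargo build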

New Configurations

Workspace

Originally my project was structured as a single crate, with different modules for different tasks and a variety of cfg_if blocks to limit what gets compiled for the server binary build step versus the client WebAssembly build step. The workspace configuration separates some of that into separate crates and reduces the number of cfg_if statements. However, since my site is server rendered and hydrated, and I use isomorphic server functions, the app crate still needs to be buildable for both. I also ended up having to break out several modules (errors and database models) into their own crates to avoid dependency cycles, and had to set up features along with cfg_if statements to control what is available when.
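As a rough illustration, that gating looks something like the sketch below. The feature names follow the usual Leptos convention of ssr and hydrate; the module contents are hypothetical.

rust
// Illustrative sketch only: "ssr" gates server-only code, "hydrate" gates
// code that only makes sense in the WebAssembly client.
use cfg_if::cfg_if;

cfg_if! {
    if #[cfg(feature = "ssr")] {
        // Server-only module: in the real app this would talk to the database.
        pub mod database {
            pub fn connect() { /* ... */ }
        }
    } else if #[cfg(feature = "hydrate")] {
        // Client-only module: WebAssembly hydration entry points.
        pub mod hydration {
            pub fn start() { /* ... */ }
        }
    }
}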

We can view what each layout looks like directory-wise with tree. tree is a pretty handy Unix tool that'll print the directory/file tree of any folder in a human readable way. It can even respect .gitignore files, and has a number of nice options.

bash
tree --gitignore -d # Ignore files/folders in my gitignore and only show me directories

Here was the project structure as it was in part one, as a single crate.

bash
├── benchmarks
├── db
├── migrations
├── public
│   ├── fonts
│   └── img
├── scripts
├── src
│   ├── components
│   ├── functions
│   ├── layouts
│   ├── models
│   ├── providers
│   └── routes
│       └── blog
└── styles

Original Structure

Here is the workspace configuration I'll be testing today.

bash
├── app
│   └── src
│       ├── components
│       ├── functions
│       ├── layouts
│       ├── providers
│       └── routes
│           └── blog
├── benwis_server
│   ├── migrations
│   └── src
├── db
├── errors
│   └── src
├── frontend
│   └── src
├── models
│   └── src
├── public
│   ├── fonts  
│   └── img
├── scripts
└── styles

Structure as a Workspace
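For the curious, the root Cargo.toml of a workspace like this is mostly just a member list. Here's a minimal sketch, assuming each of the five crates lives in the directory of the same name shown above; the real file may also carry shared dependency and profile settings.

TOML markup
# Workspace root Cargo.toml (sketch): each member becomes its own compilation unit.
[workspace]
resolver = "2"
members = [
    "app",
    "benwis_server",
    "errors",
    "frontend",
    "models",
]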

The theory as to why this may compile faster is that the compiler operates at the crate level, with each crate being a unit of compilation and linking. Cargo typically builds one crate per thread, and if only one crate is changed, crates that don't depend on it may not have to be recompiled. I have only an outsider's perspective on how the compiler works, so if I've said something wrong here, please let me know.
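If you want to see how cargo actually schedules crates on your own project, the --timings flag generates an HTML report under target/cargo-timings/ showing when each crate started and finished compiling, which makes long serial chains in the dependency graph easy to spot.

bash
# Produces an HTML report of per-crate compile times and build parallelism.
cargo build --timings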

I'm not sure I buy the argument some workspace users have made that this is simpler. I now have five crates (app, benwis_server, errors, frontend, and models) instead of one. Perhaps it will compile faster, though; we'll soon find out.

Parallel Compiler Front End

The Parallel Rustc Working Group has been hard at work attempting to parallelize different parts of the Rust compiler to make things faster. The compiler can be separated into two parts: the frontend and the backend. The frontend is primarily responsible for parsing, borrow checking, and type checking. The backend performs code generation via LLVM or Cranelift, and is already parallelized.

In the nightly build released Nov. 6, 2023, parallelization of the frontend was added as the default. However, the number of threads defaults to 0, which shouldn't net any benefit but does let most of their code get tested. To enable more threads, we can set the rustflags like so: RUSTFLAGS="-Z threads=8" cargo build. I chose to instead add it to .cargo/config.toml:

TOML markup
[build]
rustflags = ["-Z", "threads=8"]

Unlike Cranelift, this should work for all profiles and all targets. To read more about what this is, how it works, and the state of parallelization in the Rust compiler, check out the blog post.

Results

Woah, boy. Adding parallel as an option more than doubled the number of tests that needed to be run. Mathematically, I now have four elements to test, and the number of non-empty subsets of a set with n elements is 2ⁿ − 1. So we went from 2³ − 1 = 7 to 2⁴ − 1 = 15 configurations, which get multiplied by two for test types (clean and incremental) and by two more for disk types, so 28 and 60 tests respectively.

So the first question is whether changing to a workspace made a significant difference in compile times.

Difference Between Baselines for Workspace and Single Crate
| Cargo Configuration | Mean (s) | Std Dev (s) | Delta From Baseline (s) | % Delta From Baseline |
| --- | --- | --- | --- | --- |
| Single Crate Clean Build NVME | 80.06 | 0.27 | 0.00 | 0 |
| Single Crate Clean Build 7200RPM | 88.39 | 1.82 | 0.00 | 0 |
| Workspace Clean Build NVME | 71.75 | 0.35 | 8.31 | 10 |
| Workspace Clean Build 7200RPM | 77.72 | 0.56 | 10.67 | 12 |
| Single Crate Incremental Build NVME | 19.10 | 0.08 | 0.00 | 0 |
| Single Crate Incremental Build 7200RPM | 20.46 | 0.80 | 0.00 | 0 |
| Workspace Incremental Build NVME | 12.87 | 0.17 | 6.23 | 33 |
| Workspace Incremental Build 7200RPM | 13.98 | 0.99 | 6.48 | 32 |

And it seems like it did. It improved clean build times by ~10% and incremental builds by ~30%, regardless of disk type. But will these improvements hold for our fastest configuration from before (mold and cranelift) and our new fastest (mold, cranelift, and parallel)?

Difference Between Fastest Configurations in Single Crate and Workspace
| Cargo Configuration | Mean (s) | Std Dev (s) | Delta From Baseline (s) | % Delta From Baseline |
| --- | --- | --- | --- | --- |
| Single Crate Clean Build With Cranelift, Mold NVME | 59.78 | 0.19 | 0.00 | 0 |
| Single Crate Clean Build With Cranelift, Mold 7200RPM | 65.34 | 1.24 | 0.00 | 0 |
| Workspace Clean Build With Parallel, Cranelift, Mold NVME | 56.96 | 0.26 | 2.82 | 5 |
| Workspace Clean Build With Parallel, Cranelift, Mold 7200RPM | 60.36 | 0.51 | 4.98 | 8 |
| Single Crate Incremental Build With Cranelift, Mold NVME | 4.65 | 0.07 | 0.00 | 0 |
| Single Crate Incremental Build With Cranelift, Mold 7200RPM | 4.83 | 0.25 | 0.00 | 0 |
| Workspace Incremental Build With Parallel, Cranelift, Mold NVME | 4.05 | 0.04 | 0.60 | 13 |
| Workspace Incremental Build With Parallel, Cranelift, Mold 7200RPM | 4.33 | 0.22 | 0.50 | 10 |

Also yes, it seems like we netted some nice boosts here as well: ~6% to clean build times and ~11% to incremental build times. I wonder if a more complex site would see a larger benefit, although I'm not sure I can convince Houski to do that work. If anyone else wants to try these tests on their Rust app, let me know. I'd love more data points. As it stands, I'm not sure the difference is significant enough to warrant the effort of rewriting your app.

But Ben, you compared the workspace runs with the parallel compiler enabled against single crate runs that didn't use it. How do you know the parallel frontend isn't responsible for the improvements?

That I did. Let's talk about the parallel frontend.

Difference Between Parallel Frontend Enabled and Baseline
| Cargo Configuration | Mean (s) | Std Dev (s) | Delta From Baseline (s) | % Delta From Baseline |
| --- | --- | --- | --- | --- |
| Default Clean Build NVME | 71.75 | 0.35 | 0 | 0 |
| Default Clean Build 7200RPM | 77.72 | 0.56 | 0 | 0 |
| Parallel Clean Build NVME | 74.94 | 0.34 | -3.19 | -4 |
| Parallel Clean Build 7200RPM | 78.6 | 0.34 | -0.88 | -1 |
| Default Incremental Build NVME | 12.87 | 0.17 | 0 | 0 |
| Default Incremental Build 7200RPM | 13.98 | 0.99 | 0 | 0 |
| Parallel Incremental Build NVME | 13.18 | 0.09 | -0.31 | -2 |
| Parallel Incremental Build 7200RPM | 14.07 | 0.84 | -0.09 | -1 |

Compared to the baseline, the parallel frontend showed slight increases in compile time, of between 1 and 4 percent. So it's quite unlikely that the parallel frontend is responsible for the improvements, considering we already know the workspace builds are faster by a more significant margin. At best, the parallel frontend seems to be a wash on this project. However, it's impossible to totally discount some interaction between parallel, mold, and cranelift here, so...

Conclusion

This time around, the improvement is a lot smaller, and more work to implement. It'll be up to individual developers and teams whether a build speed increase of 10% for clean builds and 6% for incremental is worth the cost of refactoring. For smaller sites like my blog, I probably wouldn't bother (if I hadn't already), as it works out to around 0.1 seconds per incremental compile. As for the parallel frontend, it's easy to enable since I'm already on nightly, and it's likely to improve. I will probably leave it enabled, as it doesn't seem to hurt anything and it's good to give the new frontend some testing.

For those curious, adding on the improvements here, compared to the original single crate incremental compile baseline in the previous post, we've jumped from 76% to 79% faster. This is probably where I'll stop for a while, unless somebody else comes up with even more potential compile time decreases. Members of the Rust compiler team are certainly working to reduce these times, so who knows what the future holds. At the end of the day, I'm happy with the results. As always, feel free to reach out, either by email or on Mastodon at @benwis@hachyderm.io, with any questions, comments, or ideas. Have a great week!

Raw Data

| Cargo Configuration | Mean (s) | Std Dev (s) | Delta From Baseline (s) | % Delta From Baseline |
| --- | --- | --- | --- | --- |
| clean_0_nvme | 71.75 | 0.35 | 0 | 0 |
| clean_0_spin | 77.72 | 0.56 | 0 | 0 |
| clean_cranelift_nvme | 61.83 | 0.44 | 9.92 | 14 |
| clean_cranelift_spin | 61.85 | 0.28 | 15.87 | 20 |
| clean_mold_cranelift_nvme | 59.47 | 0.63 | 12.28 | 17 |
| clean_mold_cranelift_spin | 60.32 | 0.77 | 17.4 | 22 |
| clean_mold_nvme | 63.7 | 0.44 | 8.05 | 11 |
| clean_mold_o3_cranelift_nvme | 102.66 | 1.13 | -30.91 | -43 |
| clean_mold_o3_cranelift_spin | 103.14 | 0.69 | -25.42 | -33 |
| clean_mold_o3_nvme | 154.69 | 2.25 | -82.94 | -116 |
| clean_mold_o3_spin | 154.21 | 0.8 | -76.49 | -98 |
| clean_mold_spin | 67.12 | 0.74 | 10.6 | 14 |
| clean_o3_cranelift_nvme | 105.5 | 0.34 | -33.75 | -47 |
| clean_o3_cranelift_spin | 105.31 | 0.48 | -27.59 | -35 |
| clean_o3_nvme | 160.75 | 0.48 | -89 | -124 |
| clean_o3_spin | 163.23 | 0.21 | -85.51 | -110 |
| clean_parallel_cranelift_mold_nvme | 56.96 | 0.26 | 14.79 | 21 |
| clean_parallel_cranelift_mold_spin | 60.36 | 0.51 | 17.36 | 22 |
| clean_parallel_cranelift_nvme | 58.49 | 0.15 | 13.26 | 18 |
| clean_parallel_cranelift_spin | 62.21 | 0.3 | 15.51 | 20 |
| clean_parallel_mold_nvme | 62.59 | 0.16 | 9.16 | 13 |
| clean_parallel_mold_spin | 68.73 | 1.51 | 8.99 | 12 |
| clean_parallel_nvme | 74.94 | 0.34 | -3.19 | -4 |
| clean_parallel_o3_cranelift_mold_nvme | 99.5 | 0.28 | -27.75 | -39 |
| clean_parallel_o3_cranelift_mold_spin | 103.19 | 0.54 | -25.47 | -33 |
| clean_parallel_o3_cranelift_nvme | 101.95 | 0.46 | -30.2 | -42 |
| clean_parallel_o3_cranelift_spin | 105.9 | 0.32 | -28.18 | -36 |
| clean_parallel_o3_nvme | 158.78 | 2.28 | -87.03 | -121 |
| clean_parallel_o3_spin | 164.63 | 1.13 | -86.91 | -112 |
| clean_parallel_spin | 78.6 | 0.34 | -0.88 | -1 |
| Cargo Configuration | Mean (s) | Std Dev (s) | Delta From Baseline (s) | % Delta From Baseline |
| --- | --- | --- | --- | --- |
| incremental_0_nvme | 12.87 | 0.17 | 0 | 0 |
| incremental_0_spin | 13.98 | 0.99 | 0 | 0 |
| incremental_cranelift_nvme | 6.75 | 0.09 | 6.12 | 48 |
| incremental_cranelift_spin | 6.84 | 0.2 | 7.15 | 51 |
| incremental_mold_cranelift_nvme | 4.16 | 0.15 | 8.71 | 68 |
| incremental_mold_cranelift_spin | 4.31 | 0.22 | 9.67 | 69 |
| incremental_mold_nvme | 4.5 | 0.07 | 8.37 | 65 |
| incremental_mold_o3_cranelift_nvme | 4.2 | 0.06 | 8.67 | 67 |
| incremental_mold_o3_cranelift_spin | 4.52 | 0.34 | 9.46 | 68 |
| incremental_mold_o3_nvme | 4.58 | 0.05 | 8.29 | 64 |
| incremental_mold_o3_spin | 5.06 | 0.09 | 8.92 | 64 |
| incremental_mold_spin | 4.76 | 0.24 | 9.23 | 66 |
| incremental_o3_cranelift_nvme | 6.7 | 0.04 | 6.17 | 48 |
| incremental_o3_cranelift_spin | 7.07 | 0.41 | 6.92 | 49 |
| incremental_o3_nvme | 10.67 | 0.11 | 2.2 | 17 |
| incremental_o3_spin | 12.48 | 1.21 | 1.5 | 11 |
| incremental_parallel_cranelift_mold_nvme | 4.05 | 0.04 | 8.83 | 69 |
| incremental_parallel_cranelift_mold_spin | 4.33 | 0.22 | 9.65 | 69 |
| incremental_parallel_cranelift_nvme | 6.52 | 0.07 | 6.36 | 49 |
| incremental_parallel_cranelift_spin | 6.85 | 0.24 | 7.14 | 51 |
| incremental_parallel_mold_nvme | 4.52 | 0.06 | 8.36 | 65 |
| incremental_parallel_mold_spin | 4.94 | 0.35 | 9.05 | 65 |
| incremental_parallel_nvme | 13.18 | 0.09 | -0.3 | -2 |
| incremental_parallel_o3_cranelift_mold_nvme | 4.09 | 0.05 | 8.78 | 68 |
| incremental_parallel_o3_cranelift_mold_spin | 4.49 | 0.34 | 9.49 | 68 |
| incremental_parallel_o3_cranelift_nvme | 6.6 | 0.09 | 6.27 | 49 |
| incremental_parallel_o3_cranelift_spin | 7.1 | 0.38 | 6.89 | 49 |
| incremental_parallel_o3_nvme | 10.41 | 0.14 | 2.46 | 19 |
| incremental_parallel_o3_spin | 12.35 | 1.39 | 1.63 | 12 |
| incremental_parallel_spin | 14.07 | 0.84 | -0.09 | -1 |