Inline assembly in Rust

Inline assembly in Rust, via the asm! macro, lets developers embed assembly instructions directly in Rust code. This gives finer control over the hardware and access to specific CPU instructions, which can be critical in systems programming and performance-critical code.

In this article, we’ll cover the basics of using the asm! macro in Rust, highlight its syntax, and showcase a working example.

Understanding Inline Assembly with `asm!`

The asm! macro allows Rust programs to embed assembly instructions inline with Rust code. It first appeared as a nightly-only feature, replacing the older llvm_asm! syntax with a more robust and flexible approach to inline assembly, and it was stabilized in Rust 1.59.

Note: On Rust 1.59 or later, asm! works on the stable compiler. Neither a nightly toolchain nor the asm feature gate is needed.

Enabling Inline Assembly in Rust

Import the asm! macro from std::arch (or core::arch in no_std code) at the beginning of your Rust file:

use std::arch::asm;

Basic Syntax of `asm!`

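The basic structure of asm! in Rust is:

asm!(
    "<assembly template>",
    <operands>,
    options(<options>)
);

Assembly Template: One or more string literals containing the assembly instructions. Placeholders such as {0} or {name} are replaced with the registers chosen for the operands.
Operands: Inputs and outputs such as in(reg) x, out(reg) y, or inout(reg) z, which connect Rust variables to registers.
Options: Optional flags that tell the compiler what the assembly does not do, such as preserves_flags, nomem, or nostack.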
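To make the three parts concrete, here is a minimal sketch that copies a value through a register with a single mov. It assumes an x86-64 target and a stable toolchain (Rust 1.59 or later); the function name copy_via_asm and the operand names dst and src are invented for illustration.

```rust
use std::arch::asm;

/// Copies `src` through a register with one `mov` (x86-64 assumed).
fn copy_via_asm(src: u64) -> u64 {
    let dst: u64;
    // SAFETY: `mov` between two compiler-chosen registers touches no memory,
    // does not use the stack, and leaves the flags unchanged.
    unsafe {
        asm!(
            "mov {dst}, {src}",   // template: Intel syntax, destination first
            dst = out(reg) dst,   // operand: output in any general-purpose register
            src = in(reg) src,    // operand: input in any general-purpose register
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    dst
}

fn main() {
    println!("{}", copy_via_asm(42)); // prints 42
}
```

Named placeholders ({dst}, {src}) behave like the positional ones ({0}, {1}) but tend to be easier to read once a block has more than two operands.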

A Simple `asm!` Example

Let’s create a simple example where we use asm! to add two numbers. The goal here is to use the add assembly instruction directly to demonstrate how inline assembly can interact with Rust variables. The example targets x86-64, where add takes two operands and stores the sum in the first one.

use std::arch::asm;

fn main() {
    let x: u32 = 10;
    let y: u32 = 20;
    let result: u32;
    // SAFETY: the instruction only touches the two registers named as operands.
    unsafe {
        asm!(
            // x86 `add` is a two-operand instruction: first += second
            "add {0:e}, {1:e}",
            inout(reg) x => result,
            in(reg) y,
            options(pure, nomem, nostack),
        );
    }
    println!("The result of {} + {} is {}", x, y, result);
}

inout(reg) x => result: Places x in a register as an input and reads the same register back into result after the instruction runs.
in(reg) y: Specifies y as an input register for the add operation.
add {0:e}, {1:e}: This line uses the add assembly instruction, adding the second operand to the first. The :e modifier selects the 32-bit name of each register (for example eax rather than rax), matching the u32 operands.
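The same instruction can be wrapped in a small function, which makes its behaviour easy to check against ordinary Rust arithmetic. The sketch below assumes an x86-64 target, where add takes two operands; add_asm and the operand names acc and rhs are hypothetical. Unlike the + operator in a debug build, the hardware add wraps silently on overflow.

```rust
use std::arch::asm;

/// Adds two `u32` values with the x86 `add` instruction (x86-64 assumed).
fn add_asm(x: u32, y: u32) -> u32 {
    let sum: u32;
    // SAFETY: only the two operand registers are read or written.
    unsafe {
        asm!(
            "add {acc:e}, {rhs:e}",    // acc = acc + rhs, using 32-bit register names
            acc = inout(reg) x => sum, // x goes in, the sum comes back out
            rhs = in(reg) y,
            options(pure, nomem, nostack),
        );
    }
    sum
}

fn main() {
    println!("{}", add_asm(10, 20));      // prints 30
    println!("{}", add_asm(u32::MAX, 1)); // prints 0: the instruction wraps
}
```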

Compiling and Running the Code

To compile and run this code, follow these steps:

Check Your Toolchain: asm! needs Rust 1.59 or later. The nightly compiler is not required.

rustup update stable

Run the Program:

cargo run

If everything works correctly, the output should be:

The result of 10 + 20 is 30

A More Advanced Example: Bitwise Operation with `asm!`

For a more complex example, let’s implement a bitwise XOR operation using inline assembly:

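use std::arch::asm;

fn main() {
    let a: u32 = 0b1100;
    let b: u32 = 0b1010;
    let result: u32;
    // SAFETY: the instruction only touches the two registers named as operands.
    unsafe {
        asm!(
            // On x86-64, `xor` takes two operands: first ^= second
            "xor {0:e}, {1:e}",
            inout(reg) a => result,
            in(reg) b,
            options(pure, nomem, nostack),
        );
    }
    println!("The result of {} XOR {} is {:b}", a, b, result);
}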

This example demonstrates the xor instruction, which performs a bitwise XOR operation. The output should be:

The result of 12 XOR 10 is 110

Let’s explore a real-world example that showcases both the utility and performance benefits of inline assembly. In this case, we’ll use inline assembly to implement a CPU cycle counter to measure the execution time of specific code segments in Rust. Measuring CPU cycles is crucial for profiling performance in embedded systems, high-frequency trading, cryptography, and other performance-critical applications.

Real-World Use Case: CPU Cycle Counter

Counting CPU cycles is essential for precise performance profiling, as it provides a direct measurement of how long code takes to execute at the CPU level. This approach is particularly useful in real-time systems where nanosecond precision is required, or in embedded systems where power and processing resources are limited.

Benefits of Using Inline Assembly

Precision: Inline assembly allows us to read the CPU’s time-stamp counter directly, which offers finer granularity than general-purpose timers like std::time::Instant.
Efficiency: A single RDTSC (Read Time-Stamp Counter) instruction typically costs tens of cycles, whereas OS-level timing calls can involve more overhead.
Platform-Specific Optimization: With inline assembly, we can leverage platform-specific instructions optimized for certain CPU architectures.

Implementing a Cycle Counter with Inline Assembly

In x86 and x86-64 architectures, we can use the RDTSC instruction to access the CPU’s time-stamp counter directly. Let’s see how to implement this using asm! in Rust.

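use std::arch::asm;
use std::hint::black_box;

/// Function to get the current CPU cycle count using the RDTSC instruction (x86 / x86-64 only)
fn get_cpu_cycles() -> u64 {
    let high: u32;
    let low: u32;
    // SAFETY: RDTSC only writes EAX and EDX, and both are declared as outputs.
    unsafe {
        // Read the time-stamp counter into two 32-bit registers
        asm!(
            "rdtsc",
            out("eax") low,   // Lower 32 bits go into `low`
            out("edx") high,  // Higher 32 bits go into `high`
            // `nomem` is left out on purpose so the compiler treats the read
            // as a barrier for memory accesses.
            options(nostack, preserves_flags)
        );
    }
    // Combine the high and low parts to get the full 64-bit counter
    ((high as u64) << 32) | (low as u64)
}

fn main() {
    // Measure CPU cycles taken for a sample code block
    let start = get_cpu_cycles();
    // Sample code block (e.g., complex calculation, simulation, etc.)
    // u64 is required: the total (499_999_500_000) does not fit in an i32.
    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        // black_box keeps the optimizer from folding the loop into a constant
        sum += black_box(i);
    }
    let end = get_cpu_cycles();
    println!("The sum is: {}", sum);
    println!("CPU cycles taken: {}", end.wrapping_sub(start));
}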

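get_cpu_cycles(): This function uses the RDTSC (Read Time-Stamp Counter) instruction to retrieve the CPU's time-stamp counter, which has been incrementing since the last reset. On most modern x86 CPUs the counter ticks at a constant rate regardless of the current core frequency, so the values are reference ticks rather than exact core clock cycles.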

out("eax") low and out("edx") high specify output registers. In x86 assembly, RDTSC places the low 32 bits of the cycle count in EAX and the high 32 bits in EDX.

The high and low parts are combined into a single 64-bit value to represent the full cycle count.

Performance Measurement: The main function demonstrates a simple way to measure CPU cycles for a block of code. We capture the start cycle count before a loop and the end count after, allowing us to calculate the cycles taken for the loop.
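As a cross-check on the hand-written assembly, the standard library exposes the same instruction as the _rdtsc intrinsic in std::arch::x86_64. The sketch below reads the counter both ways; it assumes an x86-64 target, and the function names rdtsc_asm and rdtsc_intrinsic are invented for illustration.

```rust
use std::arch::asm;
use std::arch::x86_64::_rdtsc;

/// Reads the time-stamp counter with inline assembly (x86-64 assumed).
fn rdtsc_asm() -> u64 {
    let low: u32;
    let high: u32;
    // SAFETY: RDTSC only writes EAX and EDX, both declared as outputs.
    unsafe {
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Reads the same counter through the standard library intrinsic.
fn rdtsc_intrinsic() -> u64 {
    // SAFETY: RDTSC is part of the x86-64 baseline instruction set.
    unsafe { _rdtsc() }
}

fn main() {
    let a = rdtsc_asm();
    let b = rdtsc_intrinsic();
    // The two reads happen back to back, so the difference should be small.
    println!("asm: {}, intrinsic: {}, delta: {}", a, b, b.wrapping_sub(a));
}
```

When the intrinsic covers the instruction you need, it is usually the simpler choice; inline assembly earns its place when several instructions have to be issued together in a fixed order.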

Benefits of This Approach

High Precision: Using RDTSC provides a fine-grained, low-overhead way to measure elapsed ticks, as it avoids the typical delays of OS-level timing functions.
Minimal Overhead: Reading the time-stamp counter directly costs far less than higher-level abstractions, making it well suited to profiling short code blocks where every cycle counts.
Consistency Caveat: RDTSC reads directly from the CPU, but the counter keeps running while a thread is preempted or interrupted, so a measurement still includes any time the OS spent elsewhere. Repeating the measurement and taking the minimum or median helps filter out that noise.
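Because a single reading can be inflated by interrupts or a context switch, one common way to stabilise the numbers is to repeat the measurement and keep the smallest delta. The sketch below is one possible shape for that, assuming x86-64; read_tsc and min_ticks are hypothetical helper names, not part of any library.

```rust
use std::arch::asm;
use std::hint::black_box;

/// Reads the time-stamp counter (x86-64 assumed).
fn read_tsc() -> u64 {
    let low: u32;
    let high: u32;
    // SAFETY: RDTSC only writes EAX and EDX, both declared as outputs.
    unsafe {
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Runs `work` `runs` times and returns the smallest tick delta observed.
/// Returns `u64::MAX` if `runs` is zero.
fn min_ticks<F: FnMut()>(runs: usize, mut work: F) -> u64 {
    let mut best = u64::MAX;
    for _ in 0..runs {
        let start = read_tsc();
        work();
        let end = read_tsc();
        best = best.min(end.wrapping_sub(start));
    }
    best
}

fn main() {
    let ticks = min_ticks(100, || {
        let mut sum = 0u64;
        for i in 0..10_000u64 {
            sum = sum.wrapping_add(black_box(i));
        }
        black_box(sum);
    });
    println!("fastest of 100 runs: {} TSC ticks", ticks);
}
```

The minimum approximates the cost of an undisturbed run; a median or a percentile may be a better summary when typical rather than best-case behaviour matters.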

Enhanced Profiling: Using Inline Assembly for More Robust Performance Timing

In the following example, we’ll use both the RDTSC and RDTSCP instructions to count CPU cycles. RDTSC alone can be unreliable on modern out-of-order processors since it is not a serializing instruction: the CPU may execute it before earlier instructions have finished. RDTSCP addresses part of this by waiting until all previous instructions have been executed before it reads the counter, providing a more accurate cycle count.

Improved CPU Cycle Counter Example

The example below shows a cycle counter that uses both RDTSC at the beginning and RDTSCP at the end, ensuring a precise and isolated cycle count of a critical code block.

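use std::arch::asm;
use std::hint::black_box;

/// Function to retrieve CPU cycle count using `RDTSC` at the start and `RDTSCP` at the end (x86-64)
fn get_cpu_cycles_pair() -> (u64, u64) {
    let start_high: u32;
    let start_low: u32;
    let end_high: u32;
    let end_low: u32;
    unsafe {
        // Start cycle count
        asm!(
            "mov {tmp:r}, rbx",   // cpuid overwrites rbx, which cannot be an operand, so save it
            "cpuid",              // Serialize to prevent out-of-order execution
            "rdtsc",              // Read time-stamp counter
            "mov rbx, {tmp:r}",   // Restore rbx
            tmp = out(reg) _,
            inout("eax") 0u32 => start_low,  // cpuid leaf 0 in, lower 32 bits out
            out("edx") start_high,           // Higher 32 bits
            out("ecx") _,                    // Clobbered by cpuid
            options(nostack) // The assembly does not push anything onto the stack
        );
        // Critical code section goes here
        // (simulate work with a lightweight loop or function call)
        let mut sum: u64 = 0; // u64: the total does not fit in an i32
        for i in 0..1_000_000u64 {
            sum += black_box(i); // black_box keeps the loop from being optimized away
        }
        black_box(sum);
        // End cycle count
        asm!(
            "rdtscp",             // Read time-stamp counter with ordering
            "mov {low:e}, eax",   // Save the result before cpuid overwrites eax/edx
            "mov {high:e}, edx",
            "mov {tmp:r}, rbx",   // Save rbx around cpuid, as above
            "xor eax, eax",       // Select cpuid leaf 0
            "cpuid",              // Serialize to prevent out-of-order execution
            "mov rbx, {tmp:r}",
            low = out(reg) end_low,    // Lower 32 bits
            high = out(reg) end_high,  // Higher 32 bits
            tmp = out(reg) _,
            out("eax") _,              // Clobbered by rdtscp and cpuid
            out("ecx") _,
            out("edx") _,
            options(nostack)
        );
    }
    // Combine high and low parts into a single 64-bit value
    let start_cycles = ((start_high as u64) << 32) | (start_low as u64);
    let end_cycles = ((end_high as u64) << 32) | (end_low as u64);
    (start_cycles, end_cycles)
}

fn main() {
    let (start, end) = get_cpu_cycles_pair();
    println!("CPU cycles taken: {}", end.wrapping_sub(start));
}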

Serializing with cpuid: The cpuid instruction is a serializing instruction. It keeps out-of-order execution from moving work across the measurement boundary: all earlier instructions have completed before the counter is read at the start, and later instructions do not begin until the final read has finished. This is crucial on out-of-order, high-performance CPUs to maintain accuracy.
Start and End Counters:

Start Counter (RDTSC): We call RDTSC at the beginning of the code block to capture the start cycle count.
End Counter (RDTSCP): RDTSCP at the end waits for all earlier instructions to execute before reading the counter, providing a more accurate end cycle count.

Critical Code Section: The code you want to measure (e.g., a loop) is placed between the two cycle counter instructions. In practice, this might be a cryptographic function, a data processing loop, or another performance-critical task.
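A lighter-weight alternative to cpuid that is often used for the same purpose is the lfence instruction: Intel's documentation describes placing LFENCE before RDTSC so the read waits for earlier instructions, and after it so later instructions wait for the read. The sketch below shows that variant, which is a different technique from the cpuid pairing described above. It assumes an x86-64 target, and fenced_tsc is a hypothetical name.

```rust
use std::arch::asm;

/// Reads the time-stamp counter with `lfence` on both sides (x86-64 assumed).
fn fenced_tsc() -> u64 {
    let low: u32;
    let high: u32;
    // SAFETY: LFENCE has no register effects; RDTSC only writes EAX and EDX,
    // both declared as outputs.
    unsafe {
        asm!(
            "lfence",  // wait until earlier instructions have executed
            "rdtsc",   // read the counter into EDX:EAX
            "lfence",  // keep later instructions from starting before the read
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

fn main() {
    let start = fenced_tsc();
    let end = fenced_tsc();
    println!("back-to-back fenced reads: {} ticks", end.wrapping_sub(start));
}
```

Unlike cpuid, lfence clobbers no general-purpose registers, so the operand list stays short and rbx does not need to be saved.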

Real-World Scenarios and Benefits

This method is highly beneficial in specific contexts:

Embedded Systems and Real-Time Applications: Precise cycle counting helps developers ensure that code execution times meet strict timing requirements, especially in systems where every microsecond counts, like automotive control units or medical devices.
Cryptographic Algorithms: Cycle-accurate profiling is valuable in cryptography, where timing leaks can potentially expose information about secret data. Precise measurement helps detect data-dependent timing differences as well as unexpected performance bottlenecks.
High-Performance Trading: In financial systems, even minor delays can affect profitability. Cycle counting helps optimize latency-sensitive functions, like order matching or risk calculations.
Performance Optimization: For any CPU-intensive application, cycle-level measurement can reveal exactly which parts of the code consume the most resources, guiding targeted optimizations.

Extending Inline Assembly Usage with `RDTSC` and `RDTSCP` in Rust

Measuring the Impact of Different Code Blocks

Let’s extend our example by measuring two different code blocks to see how they compare in terms of CPU cycles. This technique is common in performance engineering, where you might want to assess the relative cost of different implementations or functions.

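use std::arch::asm;
use std::hint::black_box;

/// Function to retrieve CPU cycle count using `RDTSC` and `RDTSCP` (x86-64)
fn measure_code_cycles<F>(func: F) -> u64
where
    F: FnOnce(),
{
    let start_high: u32;
    let start_low: u32;
    let end_high: u32;
    let end_low: u32;
    unsafe {
        // Start cycle count
        asm!(
            "mov {tmp:r}, rbx",   // cpuid overwrites rbx, which cannot be an operand, so save it
            "cpuid",
            "rdtsc",
            "mov rbx, {tmp:r}",
            tmp = out(reg) _,
            inout("eax") 0u32 => start_low,  // cpuid leaf 0 in, lower 32 bits out
            out("edx") start_high,
            out("ecx") _,                    // Clobbered by cpuid
            options(nostack)
        );
        // Run the passed function
        func();
        // End cycle count
        asm!(
            "rdtscp",
            "mov {low:e}, eax",   // Save the result before cpuid overwrites eax/edx
            "mov {high:e}, edx",
            "mov {tmp:r}, rbx",
            "xor eax, eax",       // Select cpuid leaf 0
            "cpuid",
            "mov rbx, {tmp:r}",
            low = out(reg) end_low,
            high = out(reg) end_high,
            tmp = out(reg) _,
            out("eax") _,
            out("ecx") _,
            out("edx") _,
            options(nostack)
        );
    }
    // Combine high and low parts into a single 64-bit value
    let start_cycles = ((start_high as u64) << 32) | (start_low as u64);
    let end_cycles = ((end_high as u64) << 32) | (end_low as u64);
    end_cycles.wrapping_sub(start_cycles)
}

fn main() {
    // Define two different code blocks to profile
    let cycles_block1 = measure_code_cycles(|| {
        // Block 1: A simple for loop
        let mut sum: u64 = 0;
        for i in 0..1_000_000u64 {
            sum += black_box(i);
        }
        black_box(sum); // keep the result alive so the loop is not removed
    });
    let cycles_block2 = measure_code_cycles(|| {
        // Block 2: Simulating more complex work
        let mut product: u64 = 1;
        for i in 1..1000u64 {
            // wrapping_mul: the factorial overflows u64 long before i reaches 1000
            product = product.wrapping_mul(black_box(i));
        }
        black_box(product);
    });
    println!("Cycles for Block 1: {}", cycles_block1);
    println!("Cycles for Block 2: {}", cycles_block2);
}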

Function as a Parameter: We use a generic function measure_code_cycles that takes a closure, func, allowing any code block to be passed for profiling.
Reusability: This setup allows you to measure any function or block of code, making it easy to compare different algorithms, implementations, or optimizations in a structured and repeatable manner.
Precision and Comparisons: By measuring different blocks, you can directly compare cycle costs and make informed decisions on optimizations.

Output

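Running this code will display the number of CPU cycles taken for each code block, allowing you to compare their performance. The exact numbers depend on the machine, the build profile, and system load, so the values below are only illustrative.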

Cycles for Block 1: 52345678
Cycles for Block 2: 12567890

Limitations and Considerations

Multi-Core and Hyper-Threaded CPUs: Due to variability across CPU cores and threads, RDTSC and RDTSCP might show inconsistent results on multi-threaded systems. Affinity settings or single-threaded execution can help mitigate this.
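Dynamic Frequency Scaling (DFS): Modern CPUs often adjust their frequency dynamically, which can skew cycle counts. On CPUs with an invariant TSC the counter keeps ticking at a fixed rate, so the result reflects elapsed reference ticks rather than actual core clock cycles. Running on a high-performance setting or disabling frequency scaling (if possible) can improve accuracy.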
Platform-Specific: This approach is currently limited to x86 and x86-64 platforms, though similar mechanisms exist for other architectures like ARM (e.g., PMCCNTR for ARM CPUs).
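Since the counter does not necessarily tick at the core clock rate, it can help to estimate how many ticks correspond to one second before converting tick counts into time. The sketch below does this by comparing the counter against std::time::Instant across a short sleep. It assumes an x86-64 target with a constant-rate TSC; read_tsc and estimate_tsc_hz are hypothetical helper names, and the result is only an approximation.

```rust
use std::arch::asm;
use std::thread;
use std::time::{Duration, Instant};

/// Reads the time-stamp counter (x86-64 assumed).
fn read_tsc() -> u64 {
    let low: u32;
    let high: u32;
    // SAFETY: RDTSC only writes EAX and EDX, both declared as outputs.
    unsafe {
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Estimates TSC ticks per second by timing a sleep of roughly `sample`.
fn estimate_tsc_hz(sample: Duration) -> f64 {
    let wall_start = Instant::now();
    let tsc_start = read_tsc();
    thread::sleep(sample);
    let tsc_end = read_tsc();
    let elapsed = wall_start.elapsed();
    tsc_end.wrapping_sub(tsc_start) as f64 / elapsed.as_secs_f64()
}

fn main() {
    let hz = estimate_tsc_hz(Duration::from_millis(50));
    println!("TSC ticks at roughly {:.2} GHz", hz / 1e9);
    // Converting a measured tick delta into nanoseconds:
    let ticks = 12_345u64;
    println!("{} ticks is about {:.0} ns", ticks, ticks as f64 / hz * 1e9);
}
```

A longer sample interval gives a steadier estimate, at the cost of a slower startup.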

Tips for Using `asm!`

Safety: Inline assembly is inherently unsafe. Wrapping asm! in unsafe blocks is required.
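Registers: Use reg to let the compiler choose any available general-purpose register, or name a specific register such as "eax" when an instruction requires it. Use a const operand instead of in to embed a compile-time constant in the assembly text; const operands were stabilized later than asm! itself, so they need a recent toolchain.
Options: The options argument accepts flags such as pure, nomem, preserves_flags, or nostack, which tell the compiler what the assembly does not do so it can optimize around the block. There is no volatile option: an asm! block is treated as having side effects unless it is marked pure.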
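The effect of the options is easiest to see on an instruction that has no side effects at all. The sketch below multiplies a value by five with a single lea, which reads no memory, uses no stack, and leaves the flags untouched, so all four options apply. It assumes an x86-64 target; times_five and the operand names are invented for illustration.

```rust
use std::arch::asm;

/// Computes `x * 5` (wrapping) with one `lea` instruction (x86-64 assumed).
fn times_five(x: u64) -> u64 {
    let result: u64;
    // SAFETY: `lea` only performs address arithmetic on registers.
    unsafe {
        asm!(
            "lea {result}, [{x} + {x}*4]",
            result = out(reg) result,
            x = in(reg) x,
            // pure: the output depends only on the inputs, so unused or
            //       duplicate calls may be removed by the compiler
            // nomem: no memory is read or written
            // nostack: nothing is pushed onto the stack
            // preserves_flags: `lea` does not modify the flags register
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    result
}

fn main() {
    println!("{}", times_five(7)); // prints 35
}
```

Note that pure must be combined with nomem or readonly and needs at least one output. Declaring an option that is not actually true of the assembly is undefined behaviour, so it is safer to leave an option out when in doubt.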

When to Use `asm!`

Inline assembly is powerful, but it’s essential to consider when it’s appropriate:

Low-Level Hardware Interaction: Directly interface with hardware where specific CPU instructions are needed.
Performance-Critical Code: Optimize particular code paths by controlling CPU instructions.
Operating Systems and Embedded Programming: When interacting with the OS or low-level hardware, inline assembly provides precise control.
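Because inline assembly ties code to one architecture, a common pattern is to gate each assembly path with cfg attributes and provide a plain-Rust fallback. The sketch below illustrates this with a tick counter; the function name ticks is hypothetical. On AArch64 it reads the generic timer's CNTVCT_EL0 register, which is normally readable from user space and runs at a fixed frequency, rather than a cycle counter such as PMCCNTR; that choice is an assumption made for this example.

```rust
/// Returns a monotonically increasing tick count from the cheapest source
/// available on the target. Units differ between platforms.
#[cfg(target_arch = "x86_64")]
fn ticks() -> u64 {
    let low: u32;
    let high: u32;
    // SAFETY: RDTSC only writes EAX and EDX, both declared as outputs.
    unsafe {
        std::arch::asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

#[cfg(target_arch = "aarch64")]
fn ticks() -> u64 {
    let value: u64;
    // SAFETY: reading CNTVCT_EL0 has no side effects.
    unsafe {
        std::arch::asm!(
            "mrs {v}, cntvct_el0", // virtual counter of the generic timer
            v = out(reg) value,
            options(nostack, preserves_flags),
        );
    }
    value
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn ticks() -> u64 {
    // Fallback without assembly: nanoseconds since the first call.
    use std::sync::OnceLock;
    use std::time::Instant;
    static START: OnceLock<Instant> = OnceLock::new();
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

fn main() {
    let start = ticks();
    let end = ticks();
    println!("elapsed ticks: {}", end.wrapping_sub(start));
}
```

Callers only see ticks(), so the assembly stays confined to a few small functions that can be reviewed and tested per architecture.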
