# **Transient Execution Attacks**

Mengjia Yan

Spring 2025





# Outline

• Speculative execution



• Spectre and its variations



#### **Recap: 5-stage Pipeline**






- In-order execution:
  - Execute instructions according to the program order
  - One instruction max per pipeline stage

| time         | t0     | t1     | t2     | t3     | t4     | t5     | t6     | t7     | ...    |
|--------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| instruction1 | $IF_1$ | $ID_1$ | $EX_1$ | $MA_1$ | $WB_1$ |        |        |        |        |
| instruction2 |        | $IF_2$ | $ID_2$ | $EX_2$ | $MA_2$ | $WB_2$ |        |        |        |
| instruction3 |        |        | $IF_3$ | $ID_3$ | $EX_3$ | $MA_3$ | $WB_3$ |        |        |
| instruction4 |        |        |        | $IF_4$ | $ID_4$ | $EX_4$ | $MA_4$ | $WB_4$ |        |
| instruction5 |        |        |        |        | $IF_5$ | $ID_5$ | $EX_5$ | $MA_5$ | $WB_5$ |
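The in-order timing above can be sketched in a few lines of C. This is a minimal model, assuming an ideal pipeline with no stalls; the names `stage_cycle` and `total_cycles` are hypothetical and not from the lecture.

```c
#include <assert.h>

/* The five pipeline stages, in the order an instruction passes through them. */
enum stage { IF = 0, ID = 1, EX = 2, MA = 3, WB = 4, NUM_STAGES = 5 };

/*
 * Cycle (t0 = 0) in which instruction `instr` (1-based, program order)
 * occupies `stage`. Assumption: ideal pipeline, no stalls or hazards.
 */
int stage_cycle(int instr, enum stage stage)
{
    assert(instr >= 1);
    return (instr - 1) + (int)stage;
}

/* Cycles needed for n instructions: fill the pipeline, then one per cycle. */
int total_cycles(int n)
{
    assert(n >= 1);
    return NUM_STAGES + (n - 1);
}
```

For example, `stage_cycle(5, WB)` gives 8, i.e. instruction5 writes back in t8, one column past the t7 shown above.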

### **Build High-Performance Processors**

Instruction-Level Parallelism (ILP): available when there is **NO** data dependency or control-flow dependency between instructions.

Example #1:

- FMUL f1, f2, f3 ; 10 cycles
- ADD r4, r4, r1 ; 1 cycle -> repeat 10 times

Example #2:

- LD r3, 0(r2) ; 1-100 cycles
- ADD r4, r4, r1 ; 1 cycle -> repeat 10 times

#### **Technique #1: Add More Functional Units**






| Functional Unit | Busy? | Dest Reg | Src1 Reg | Src2 Reg |
|-----------------|-------|----------|----------|----------|
| Int ALU         |       |          |          |          |
| Mem             |       |          |          |          |
| Fadd            |       |          |          |          |
| Fmul            |       |          |          |          |
| Fdiv            |       |          |          |          |

| Functional Unit | Busy? | Dest Reg | Src1 Reg | Src2 Reg |
|-----------------|-------|----------|----------|----------|
| Int ALU         |       |          |          |          |
| Mem             |       |          |          |          |
| Fadd            |       |          |          |          |
| Fmul            | Y     | f1       | f2       | f3       |
| Fdiv            |       |          |          |          |

→ 2: ADD r4, r4, r1

No dependency on the in-flight FMUL, so the ADD is free to issue.


Read-after-Write (RAW)

- 1: FMUL f1, f2, f3
- 2: FDIV f5, f1, f4

Write-after-Write (WAW)

- 1: FMUL f1, f2, f3 ; 10 cycles

- Upon issuing an instruction, check:
  - 1. Whether any in-flight instruction will generate a value for one of my source registers (RAW)
  - 2. Whether any in-flight instruction will modify my destination register (WAW)
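The issue check above can be sketched against the functional-unit status table as follows. This is a minimal sketch under stated assumptions: the register encoding (`R(n)`, `F(n)`), the struct layouts, and the function names are hypothetical, and the extra "unit busy" test is an assumption suggested by the table's Busy? column rather than something the list spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical register encoding: integer regs 0..31, FP regs 32..63. */
#define R(n) (n)
#define F(n) (32 + (n))

enum fu { INT_ALU, MEM, FADD, FMUL, FDIV, NUM_FU };

/* One row of the functional-unit status table. */
struct fu_status {
    bool busy;
    int dest, src1, src2;
};

struct instr {
    enum fu unit;
    int dest, src1, src2;
};

/* True if `in` may issue given the in-flight instructions recorded in `sb`. */
bool can_issue(const struct fu_status sb[NUM_FU], const struct instr *in)
{
    if (sb[in->unit].busy)
        return false;                    /* assumed: the unit itself is occupied */
    for (int u = 0; u < NUM_FU; u++) {
        if (!sb[u].busy)
            continue;
        if (sb[u].dest == in->src1 || sb[u].dest == in->src2)
            return false;                /* RAW: a source is still being produced */
        if (sb[u].dest == in->dest)
            return false;                /* WAW: destination is still being written */
    }
    return true;
}

/* Record `in` as in flight on its functional unit. */
void issue(struct fu_status sb[NUM_FU], const struct instr *in)
{
    assert(can_issue(sb, in));
    sb[in->unit] = (struct fu_status){ true, in->dest, in->src1, in->src2 };
}
```

With `FMUL f1, f2, f3` in flight, `ADD r4, r4, r1` passes the check, while `FDIV f5, f1, f4` is held back by the RAW dependency on f1.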

We call such a processor: in-order issue, out-of-order completion.

A problem: how to handle interrupts/exceptions?

#### **Exception in OoO Processors: Example #1**

- 1: LD r3, 0(r2) ; Exception in 3 cycles
- 2: ADD r4, r4, r1 ; 1 cycle

| Cycle  | 1  | 2  | 3     | 4     | 5   | 6   | 7   | 8         |
|--------|----|----|-------|-------|-----|-----|-----|-----------|
| 1: LD  | IF | ID | Issue | ALU   | Mem | Mem | Mem | Exception |
| 2: ADD |    | IF | ID    | Issue | ALU | WB  |     |           |

The ADD writes back in cycle 6, before the LD raises its exception in cycle 8, so the ADD's WB needs to be delayed.

#### **Exception in OoO Processors: Example #2**



#### **Technique #3: In-order Commit**



#### Another Way to Draw It



To learn more about advanced out-of-order (OoO) features, take 6.5900 [6.823].

# **Virtual Memory**

Virtual memory (x86\_64 Linux)

| Address      | Space        | Region                              | Example |
|--------------|--------------|-------------------------------------|---------|
| 0x00000000   | User Space   | Code (user code + shared libraries) | user function: `void hello_world(){ printf("hello world!"); }`<br>libc function: `void printf(){ write(); }` |
|              | User Space   | Heap/stack                          |         |
| 0xffff800000 | Kernel Space | Kernel code                         | system call: `void sys_write(){ }` |
|              | Kernel Space | Kernel heaps/stacks                 |         |
| 0xffffffff   |              |                                     |         |

Why map kernel addresses into the user address space?

#### **Recap: Page Mapping**



# **Mapping Kernel Pages**



# **Jumping Between User and Kernel Space**



- Context switch overhead:
  - Switching page tables introduces performance overhead, e.g., a TLB flush on some processors
- Sometimes we enter the kernel only to do something simple, e.g., getpid()
- Performance optimization:
  - Map kernel addresses into user space in a secure way, so there is no need to swap page tables

# Map Kernel Pages Into User Space Securely



# Meltdown




- Meltdown exploits the combined effects of two optimizations
  - Hardware optimization: out-of-order execution
  - Software optimization: mapping kernel addresses into user space
- Attack outcome: user space applications can read arbitrary kernel data

Goal: in user space, pick a kernel\_address and leak its content

.....

- Ld1: uint8\_t secret = \*kernel\_address;
- Ld2: uint8\_t dummy = probe\_array[secret\*64];



# **Meltdown Timing**

- Ld1: uint8\_t secret = \*kernel\_address;
- Ld2: uint8\_t dummy = probe\_array[secret\*64];

**Case 1: Fail.** Ld2 is squashed before the corresponding memory access is issued.

**Case 2: Attack works.** Ld2's request is sent out before the instruction is squashed.



# Meltdown w/ Flush+Reload

- 1. Setup: Attacker allocates probe\_array with 256 cache lines and flushes all of them from the cache
- 2. Transmit: Attacker executes

```
.....
Ld1: uint8_t secret = *kernel_address;
Ld2: uint8_t dummy = probe_array[secret*64];
```

- 3. Receive: After handling the protection fault, the attacker performs a cache side-channel attack to determine which line of probe\_array was accessed → recovers the secret byte
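The decoding part of the Receive step can be sketched as below. It is a minimal sketch that assumes the attacker has already timed a reload of each of the 256 probe\_array lines; the timing itself (cycle counter, flush instruction) is platform-specific and not shown. The function name `recover_byte` and the `hit_threshold` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES 256   /* one probe_array cache line per possible byte value */

/*
 * Given the measured reload latency (in cycles) of every probe_array line,
 * return the byte value whose line was a cache hit. Returns -1 when no line,
 * or more than one line, is faster than `hit_threshold` (a machine-specific
 * assumption that separates cache hits from misses).
 */
int recover_byte(const uint32_t latency[NUM_LINES], uint32_t hit_threshold)
{
    assert(latency != NULL);
    int hit = -1;
    for (int v = 0; v < NUM_LINES; v++) {
        if (latency[v] < hit_threshold) {
            if (hit != -1)
                return -1;          /* ambiguous: more than one fast line */
            hit = v;
        }
    }
    return hit;
}
```

Returning -1 corresponds to a failed or noisy attempt, in which case an attacker would typically flush and retry.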

# Why did it take so long for Meltdown to be discovered?

- Software (SW optimization): map kernel addresses into user space
- **Contract:** Memory accesses go through a page permission check, and permission violations raise exceptions
- Hardware (HW optimization): speculation that delays exception handling

# **Meltdown Mitigations**

- Stopping either optimization should be sufficient
  - SW: Do not let user and kernel share an address space (KPTI) -> broken by several groups (e.g., EntryBleed)
  - HW: Stall speculation; register poisoning

```
.....
Ld1: uint8_t secret = *kernel_address;
Ld2: uint8_t dummy = probe_array[secret*64];
```

• We generally consider Meltdown a design **bug**

Will Liu, EntryBleed, https://www.willsroot.io/2022/12/entrybleed.html?m=1

# Spectre and its Variants



```
int func(int x){
   int val = 0;
   // prevent out-of-bound array access
   if (x < array_size) {
      val = array[x];
   }
   return val;
}
```

#### **Branch Prediction**

- Motivation: control-flow penalty
  - Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!




- Naïve approach: PC+4
- More advanced, predict two things:
  - Direction of a conditional branch (whether a branch is taken or not)
    - blt r1, r2, <label>
    - Idea: 1-bit predictor for loops
  - The target address of a branch
    - jalr <reg>
    - ret
    - Idea: memorize branch source and destination pairs

# A Simple Branch Predictor Unit (BPU)



- When branch instruction commits
  - Update the predictor
- In the fetch stage
  - Use the predictor to decide what address to fetch next
- Limited space?
  - Use selected bits in PC to index into the predictor
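A predictor along these lines can be sketched in C. This is a minimal sketch, not the design from the lost figure: the table size, the choice of index bits, the 4-byte instruction size, and all names (`bpu_predict`, `bpu_update`, and so on) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BPU_ENTRIES 16              /* hypothetical table size */
static_assert((BPU_ENTRIES & (BPU_ENTRIES - 1)) == 0,
              "table size must be a power of two");

struct bpu_entry {
    bool     valid;
    bool     taken;                 /* 1-bit direction: last observed outcome */
    uint64_t target;                /* remembered destination of the branch */
};

struct bpu { struct bpu_entry table[BPU_ENTRIES]; };

/* Limited space: index with selected PC bits (4-byte instructions assumed). */
static unsigned bpu_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) & (BPU_ENTRIES - 1));
}

/* Fetch stage: decide what address to fetch next. */
uint64_t bpu_predict(const struct bpu *b, uint64_t pc)
{
    const struct bpu_entry *e = &b->table[bpu_index(pc)];
    if (e->valid && e->taken)
        return e->target;
    return pc + 4;                  /* naive fall-through prediction */
}

/* Commit stage: update the predictor with the resolved branch outcome. */
void bpu_update(struct bpu *b, uint64_t pc, bool taken, uint64_t target)
{
    struct bpu_entry *e = &b->table[bpu_index(pc)];
    e->valid  = true;
    e->taken  = taken;
    e->target = target;
}
```

Because entries in this sketch carry no tag, two branches whose PCs share the same index bits also share a prediction: after `bpu_update(&b, 0x1000, true, 0x2000)`, a lookup for PC 0x1040 also predicts 0x2000.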

# Spectre V1 – Speculative Out-of-Bound



Attacker's goal: read arbitrary memory.

1. Setup: Train the branch predictor
2. Transmit: Trigger a branch misprediction; &array1[x] maps to some desired kernel address
3. Receive: Attacker probes the cache to infer which line of *array2* was fetched
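The victim code that these steps target commonly has the following shape. This is a sketch under stated assumptions, since the original figure is not available here: the array sizes, the contents of `array1`, the 64-byte `STRIDE`, and the name `victim` are all illustrative. The code is architecturally safe; the concern is what the two loads do transiently when the bounds-check branch is mispredicted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIDE 64                       /* assumed cache-line size in bytes */

uint8_t array1[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
size_t  array1_size = sizeof(array1);
uint8_t array2[256 * STRIDE];           /* probe array shared with the attacker */

static_assert(sizeof(array2) / STRIDE == 256,
              "one probe line per possible byte value");

/*
 * Bounds-checked read. If the branch is predicted taken for an
 * out-of-bounds x, both loads may still execute transiently, and the
 * array2 access brings a secret-dependent line into the cache.
 */
uint8_t victim(size_t x)
{
    uint8_t y = 0;
    if (x < array1_size) {
        uint8_t secret = array1[x];     /* load 1: reads the (possibly secret) byte */
        y = array2[secret * STRIDE];    /* load 2: transmits it via the cache */
    }
    return y;
}
```

Calling `victim` repeatedly with in-bounds values of `x` corresponds to the Setup step; the out-of-bounds call is the Transmit step.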

# Spectre V2 – Speculative JOP

1. Insert <PC\_train, PC\_spectre> into the branch predictor
2. Trigger PC\_victim\_src
3. Speculatively execute PC\_spectre



#### **General Attack Schema**



DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors. Kiriansky et al. MICRO'18

# **Apply the General Attack Schema**

Example: RSA square-and-multiply exponentiation. The attacker aims to leak the exponent e.



```
r = 1
for i = n-1 to 0 do
    r = sqr(r)
    r = mod(r, m)
    if e_i == 1 then
        r = mul(r, b)
        r = mod(r, m)
    end
end
```
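The pseudocode above can be written as runnable C. This is a minimal sketch that assumes operands small enough for 64-bit arithmetic (real RSA uses multi-precision integers); the function name `modexp` and the `mul_count` out-parameter are hypothetical additions that make the secret-dependent work visible.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Left-to-right square-and-multiply: returns b^e mod m, scanning the n
 * low-order bits of e from most to least significant. If mul_count is
 * non-null, it receives the number of conditional multiplies, which equals
 * the number of 1 bits among those n bits of e.
 * Assumption: m < 2^32, so every product fits in 64 bits.
 */
uint64_t modexp(uint64_t b, uint64_t e, uint64_t m, int n, int *mul_count)
{
    assert(m != 0 && m < (1ULL << 32) && n >= 1 && n <= 64);
    uint64_t r = 1 % m;
    int muls = 0;
    b %= m;
    for (int i = n - 1; i >= 0; i--) {
        r = (r * r) % m;            /* sqr + mod: every iteration */
        if ((e >> i) & 1) {         /* branch on secret bit e_i */
            r = (r * b) % m;        /* mul + mod: only when e_i == 1 */
            muls++;
        }
    }
    if (mul_count)
        *mul_count = muls;
    return r;
}
```

The multiply runs only for the 1 bits of e, so anything that reveals whether it ran in a given iteration (timing, cache footprint) reveals that bit of the exponent.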



# **General Attack Schema**



- Traditional (non-transient) attacks: the gadget pre-exists in the victim's address space. Leak data in use
  - Hard to fix
- Transient attacks: the gadget is constructed via speculation. Leak data at rest
  - Meltdown = transient execution + deferred exception handling: "Easy" to fix
  - Spectre = transient execution on wrong paths: Hard to fix

# **Next: Cache Attack Recitation**



