
### Using ACCL Emulation/Simulation Flow

Lucian Petrica

#### Steps to build ACCL-enabled FPGA application

- Clone ACCL repo(s):
  - https://github.com/Xilinx/ACCL
  - https://github.com/Xilinx/pyaccl
- Build and verify your distributed application
  - With or without FPGA acceleration
  - Using ACCL HLS code emulator and RTL simulator
- Build appropriate CCLO kernel and plugins
- Link with Vitis
  - Against platform, protocol offload engine (POE), and any application kernels
- Deploy to FPGA


2 AMD Internal Use Only

#### **ACCL** emulation flow demonstration

Learning objectives:

- Become acquainted with ACCL use cases and API
- Learn how to use the simulator and emulator for building host- or PL-driven applications
  - Part 1: Host-driven applications
  - Part 2: PL-driven, streaming applications

#### Cloning the Repo, Building Simulator and Emulator, Running Tests

#### **Cloning the Repo, building Simulator, Emulator**




#### **Building and running Tests**

user@host:simulator\$ cd//test/xrt
user@host:xrt\$ cmake . && make ← Requires XRT and internet; use /usr/bin/cmake if Xilinx tools loaded
user@host:xrt\$ mpirun -np 3 bin/test
[...] (test starts with 3 processes; exits when done)

user@host:emulator\$ python3 run.py -n 3 [-u] ← Emulator can select POE at start-up
[...] (emulator starts with 3 processes; end with Ctrl-C)


#### Example host-driven ACCL Application: Scatter - Vadd - Gather

#### **Typical ACCL system for host-driven applications**

- Most generic way of using ACCL; the FPGA acts like a smart NIC (moves data between host memories)
- Host configures ACCL
- Host issues ACCL calls
- Data moves via FPGA memories and H2D/D2H copies
- Data may traverse plugins, e.g. for compression
- Relevant examples: ACCL XRT tests



#### **Code for host-driven toy application**

```
// ACCL set-up
std::vector<rank_t> ranks = generate_ranks(true, rank, size);
std::unique_ptr<ACCL::ACCL> accl = initialize_accl(ranks, rank, true, acclDesign::UDP);
accl->set_timeout(1e6); // increase timeout for emulation
// application set-up
unsigned int i, datasize = 8;
auto op_buf = accl->create_buffer<float>(datasize * size, dataType::float32);
for (i = 0; i < datasize * size; i++) op_buf->buffer()[i] = 0.0;
auto scatter_buf = accl->create_buffer<float>(datasize, dataType::float32);
auto res_buf = accl->create_buffer<float>(datasize, dataType::float32);
auto gather_buf = accl->create_buffer<float>(datasize * size, dataType::float32);
MPI_Barrier(MPI_COMM_WORLD);
// application compute
accl->scatter(*op_buf, *scatter_buf, datasize, 0); // scatter inputs from rank 0
for (i = 0; i < datasize; i++) res_buf->buffer()[i] = scatter_buf->buffer()[i] + (i + rank);
accl->gather(*res_buf, *gather_buf, datasize, 0); // gather results to rank 0
```
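To check what the toy application should produce, the scatter, vadd, gather sequence can be modelled on a single process. The sketch below is a plain C++ reference model, not ACCL code: it uses neither ACCL nor MPI, and `reference_scatter_vadd_gather` is a hypothetical helper name. It assumes MPI-like semantics, i.e. rank `r` receives elements `[r*datasize, (r+1)*datasize)` of the root buffer and gather concatenates the per-rank results in rank order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model of the scatter - vadd - gather toy application.
// Plain C++ only (no ACCL, no MPI); assumes MPI-like scatter/gather semantics.
std::vector<float> reference_scatter_vadd_gather(const std::vector<float>& op,
                                                 unsigned int size,
                                                 unsigned int datasize) {
    // The root buffer must hold one slice of datasize elements per rank.
    assert(op.size() == static_cast<std::size_t>(size) * datasize);
    std::vector<float> gathered(op.size());
    for (unsigned int rank = 0; rank < size; rank++) {
        for (unsigned int i = 0; i < datasize; i++) {
            // scatter: each rank receives its own slice of the root buffer
            float scattered = op[rank * datasize + i];
            // vadd: same arithmetic as the host code, adds (i + rank)
            float res = scattered + static_cast<float>(i + rank);
            // gather: the root stores each rank's result at that rank's offset
            gathered[rank * datasize + i] = res;
        }
    }
    return gathered;
}
```

With 3 ranks, `datasize = 8`, and an all-zero input as in the host code, element `i` of the slice belonging to rank `r` is expected to equal `i + r` under these assumptions.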

#### **Components in use for host-driven toy example**

[Figure: block diagram of the components in use, with control and data paths marked. Labels recoverable from the diagram: User Application (host code, replicated with mpirun), (Py)ACCL Driver, CCLO Configuration Calls, ZMQ Client(s), ZMQ Server, Emulated CCLO Subsystem, CCLO BFM, CCLO Kernel Code, POE Code, Arbiter, Switch, AXILite to AXIS, HBM, Plugins (Custom DT, Compression).]


#### **Running host-driven toy application**

Start host code

user@host:simulator\$ cd//test/host-scatter-vadd-gather
user@host:host-scatter-vadd-gather\$ cmake . && make
user@host:host-scatter-vadd-gather\$ mpirun -np 3 bin/scatter-vadd-gather
[...] (application starts with 3 processes; prints result and exits when done)

Start emulator

user@host:emulator\$ python3 run.py -n 3 -u ← Must match acclDesign::UDP setting
[...] (emulator starts with 3 processes; end with Ctrl-C)

Or start simulator

user@host:simulator\$ python3 run.py -n 3 -u -w ← Must match acclDesign::UDP setting and simdll configuration; saves wave
[...] (simulator starts with 3 processes; end with Ctrl-C)



#### Example PL-driven ACCL Application: Scatter – PL vadd – Streaming Gather

#### **Typical ACCL system for PL-driven applications**

- Suitable for low-latency applications
- Host configures ACCL
- PL Kernel issues calls
- PL kernel and CCLO exchange data via streams
- Relevant example: ACCL HLS tests




#### **Components in use for PL-driven toy example**



[Figure: block diagram of the components in use for the PL-driven toy example; legend: Control, Data.]

#### Host code for PL-driven toy application

```
// ACCL set-up as before
// initialize a CCLO BFM and streams
hlslib::Stream<command_word> callreq, callack;
hlslib::Stream<stream_word> data_cclo2krnl, data_krnl2cclo;
std::vector<unsigned int> dest = {9};
CCLO_BFM cclo(5500, rank, size, dest, callreq, callack, data_cclo2krnl, data_krnl2cclo);
cclo.run();
MPI_Barrier(MPI_COMM_WORLD);
// application set-up like before, but no res_buf
// scatter from host
accl->scatter(*op_buf, *scatter_buf, datasize, 0); // scatter inputs from rank 0
// run the HLS kernel, using the global communicator
vadd_mem2stream_gather(
    scatter_buf->buffer(), gather_buf->physical_address(), datasize, rank,
    accl->get_communicator_addr(),
    accl->get_arithmetic_config_addr({dataType::float32, dataType::float32}),
    callreq, callack, data_krnl2cclo, data_cclo2krnl);
// get results from FPGA memory
gather_buf->sync_from_device();
```
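The final `sync_from_device()` call matters because the kernel is handed the physical address of the gather buffer, so results land in FPGA memory rather than in the host-side copy. The sketch below illustrates only that host/device split. `MirroredBuffer` and `mock_kernel_write` are hypothetical names, device memory is modelled as a second `std::vector`, and none of this is the ACCL buffer implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a buffer with separate host and device copies.
class MirroredBuffer {
public:
    explicit MirroredBuffer(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}
    std::size_t size() const { return host_.size(); }
    float* buffer() { return host_.data(); }           // host-side view
    float* device_memory() { return device_.data(); }  // models memory behind the physical address
    void sync_to_device() { device_ = host_; }         // host -> device copy
    void sync_from_device() { host_ = device_; }       // device -> host copy
private:
    std::vector<float> host_;
    std::vector<float> device_;
};

// Models a PL kernel writing `count` results directly into device memory.
void mock_kernel_write(MirroredBuffer& buf, std::size_t count, float value) {
    assert(count <= buf.size());
    for (std::size_t i = 0; i < count; i++) buf.device_memory()[i] = value;
}
```

In this model, after `mock_kernel_write` the host-side view still holds the old values; they only become visible through `buffer()` once `sync_from_device()` has been called.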

#### Kernel code for PL-driven toy application

Hide command streams behind accl\_hls interface
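The idea of hiding the command streams behind an interface can be sketched without any HLS tooling. The code below is a toy model under stated assumptions: `command_word` is modelled as a 32-bit integer, streams are modelled as `std::queue`, each call is assumed to be three words long, and `CommandInterface`, `issue`, `try_ack`, and `mock_cclo_step` are hypothetical names. It is not the `accl_hls` API; the actual types and command format come from the ACCL HLS headers.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-ins for the command word and stream types.
using command_word = std::uint32_t;
using cmd_stream = std::queue<command_word>;

// Hypothetical wrapper: the kernel body talks to this object instead of
// pushing and popping raw words on the request/acknowledge streams.
class CommandInterface {
public:
    CommandInterface(cmd_stream& req, cmd_stream& ack) : req_(req), ack_(ack) {}
    // Queue one call as three words (assumed format: opcode, count, root).
    void issue(command_word opcode, command_word count, command_word root) {
        req_.push(opcode);
        req_.push(count);
        req_.push(root);
    }
    // Consume one acknowledgement; returns false if none is available yet.
    bool try_ack() {
        if (ack_.empty()) return false;
        ack_.pop();
        return true;
    }
private:
    cmd_stream& req_;
    cmd_stream& ack_;
};

// Mock CCLO: consume one three-word command and acknowledge it.
bool mock_cclo_step(cmd_stream& req, cmd_stream& ack) {
    if (req.size() < 3) return false;
    for (int i = 0; i < 3; i++) req.pop();
    ack.push(0);  // 0 stands for success in this mock only
    return true;
}

// Usage: one call issued by the "kernel" and served by the mock CCLO.
inline void example_handshake() {
    cmd_stream req, ack;
    CommandInterface cmd(req, ack);
    cmd.issue(1, 8, 0);  // hypothetical opcode 1, 8 elements, root 0
    bool served = mock_cclo_step(req, ack);
    assert(served);
    bool acked = cmd.try_ack();
    assert(acked);
}
```

The point of the wrapper is that the kernel code never touches the raw streams, so the command format can change without rewriting the kernel body.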



#### **Running PL-driven toy application**

Start host code



Start emulator

user@host:emulator\$ python3 run.py -n 3 -u --no-kernel-loopback ← Allows user kernels to attach to CCLO streams
[...] (emulator starts with 3 processes; end with Ctrl-C)

Or start simulator

user@host:simulator\$ python3 run.py -n 3 -u -w --no-kernel-loopback
[...] (simulator starts with 3 processes; end with Ctrl-C)

#### **Disclaimer & Attribution**

Timelines, roadmaps, and/or product release dates shown in these slides are plans only and subject to change.

The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD's products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.

©2023 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Athlon, CDNA, EPYC, Infinity Fabric, Radeon, RDNA, ROCm, Ryzen, Ryzen Threadripper, Xilinx, the Xilinx logo, Alveo, Artix, Kintex, Spartan, Versal, Vitis, Virtex, and Zynq and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft is a registered trademark of Microsoft Corporation in the US and other jurisdictions. SPEC®, SPECrate®, SPECint and SPEC CPU® are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
