High-Performance and High-Efficiency 512b & 1024b VLEN Vector Processor and AI-Related Accelerator - Nathan Ma, Nuclei System Technology
In this presentation, we delve into the powerful synergy between RISC-V vector processing and AI acceleration, with a spotlight on the transformative RVV1.0 extension (specifically VLEN=512b and 1024b). RISC-V becomes even more impactful with the introduction of RVV1.0, which is designed specifically to elevate vector processing capabilities. In 2024, we released our Intelligence Class Core IP Series, which focuses on AI applications and other workloads that require intensive parallel vector computing capability.
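As a back-of-the-envelope illustration of what these VLEN choices mean in practice (our sketch, not part of the talk), the standard RVV1.0 relation VLMAX = LMUL × VLEN / SEW gives the number of elements a vector register group holds:

```python
# Elements per vector register group under RVV 1.0:
# VLMAX = LMUL * VLEN / SEW  (VLEN and SEW in bits)
def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    return lmul * vlen_bits // sew_bits

for vlen in (512, 1024):
    for sew in (8, 16, 32):
        print(f"VLEN={vlen}b SEW={sew}b LMUL=1 -> {vlmax(vlen, sew)} elements")
# e.g. VLEN=512b holds 64 int8 elements per register; VLEN=1024b holds 128.
```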
Enhancing RISC-V ISA to Support Sub-FP8 Quantization for Machine Learning Models - Mengshiun Yu & Jhih-Kuan Lin, National Tsing Hua University
In this session we'll present our research, which proposes extending the RISC-V Instruction Set Architecture (ISA) to support sub-FP8 quantized data formats, optimizing AI and machine learning models for low-power edge devices. The study develops new instructions that enable the RISC-V CPU core to handle data types below FP8, such as 6-bit and 4-bit formats. These improvements enhance AI workload performance and energy efficiency, allowing complex machine learning tasks to be performed locally on edge devices such as smartphones, IoT devices, and wearables. The proposed ISA extension supports mixed-precision workloads and ensures backward compatibility with existing hardware for easy adoption. The research includes designing a new sub-FP8 extension with computational, configuration, load/store, and conversion instructions. The design is demonstrated with two assembly-code examples: one adding two FP8 (E5M2) values and another performing a saxpy computation with the vector extension.
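The proposed instructions are not reproduced in this abstract, so the following Python sketch only emulates the E5M2 semantics the first example exercises (1 sign bit, 5 exponent bits, 2 mantissa bits, bias 15); helper names such as fp8_add are illustrative, not the extension's mnemonics:

```python
import math

def decode_e5m2(bits: int) -> float:
    """Decode one E5M2 byte: 1 sign, 5 exponent, 2 mantissa bits, bias 15."""
    s = (bits >> 7) & 0x1
    e = (bits >> 2) & 0x1F
    m = bits & 0x3
    sign = -1.0 if s else 1.0
    if e == 0:                      # subnormal: m/4 * 2^(1-15)
        return sign * (m / 4.0) * 2.0 ** -14
    if e == 0x1F:                   # IEEE-style inf/NaN encodings
        return sign * math.inf if m == 0 else math.nan
    return sign * (1.0 + m / 4.0) * 2.0 ** (e - 15)

def encode_e5m2(x: float) -> int:
    # Brute-force round-to-nearest over all finite encodings (sketch only).
    finite = [b for b in range(256) if math.isfinite(decode_e5m2(b))]
    return min(finite, key=lambda b: abs(decode_e5m2(b) - x))

def fp8_add(a: int, b: int) -> int:
    # The semantics a hypothetical E5M2 add instruction would implement:
    # widen both operands, add, then re-round the result to E5M2.
    return encode_e5m2(decode_e5m2(a) + decode_e5m2(b))

print(hex(fp8_add(encode_e5m2(1.5), encode_e5m2(0.25))))  # 0x3f, i.e. 1.75
```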
Towards Generative AI for RISC-V Verification - Sergei Chirkunov, Imagination Technologies
Generative AI has considerable potential in CPU verification. In this work, we adapt networks and techniques developed in the context of large language models (LLMs) for natural language processing to RISC-V assembly sequences to facilitate future applications to CPU verification. In particular, we demonstrate the ability to generate novel assembly sequences of guaranteed-valid instructions with a small, efficient language model. We anticipate that our work will ultimately facilitate a variety of verification tasks such as stimulus generation, assessment of the similarity between sequences, and identification of minimal test batteries that exercise the state space.
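The abstract does not spell out how validity is guaranteed; one common approach, sketched below in Python under our own assumptions, is to mask the model's output distribution so only tokens that keep the sequence legal can be sampled (logits_fn and valid_next_tokens are hypothetical stand-ins for the trained model and an ISA-aware validity oracle):

```python
import math, random

def sample_valid_sequence(logits_fn, valid_next_tokens, eos, max_len=32):
    """Sample a token sequence that is legal at every step.

    logits_fn(prefix) -> {token: score}  # the trained LM (hypothetical name)
    valid_next_tokens(prefix) -> set     # validity oracle, e.g. derived from
                                         # the RISC-V instruction encoding rules
    """
    seq = []
    while len(seq) < max_len:
        scores = logits_fn(seq)
        legal = valid_next_tokens(seq) | {eos}
        toks = [t for t in scores if t in legal]   # mask out illegal tokens
        if not toks:
            break
        weights = [math.exp(scores[t]) for t in toks]
        tok = random.choices(toks, weights=weights)[0]
        if tok == eos:
            break
        seq.append(tok)
    return seq
```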
The Efficient Way to Design a RISC-V Edge AI Processor with Software Hardware Co-Design Methodology - Meng Zhang, Terapines Technology (Wuhan) Co., Ltd
This talk will show you how to improve the performance of an AI model running on a virtualized RISC-V architecture using a software-hardware co-design methodology. The flow covers everything from micro-architecture design, to adding customized instructions to the compiler, debugger, and simulator, to profiling AI model performance on the virtualized platform. A single person can complete it in as little as a few hours, without knowing how to customize the compiler, debugger, or simulator, because all of that is done automatically in our software-hardware co-design flow.
Creating Custom RISC-V Processors Using ASIP Design Tools: A Neural Network Acceleration Case Study - Gert Goossens, Synopsys
The AI revolution has triggered increased awareness of application-specific instruction-set processors (ASIPs). A RISC-V architecture can be extended with specialized datapaths, storages, and custom instructions to accelerate AI workloads. New instructions can be encoded in RISC-V's reserved opcode space or in additional parallel slots of an extended long instruction word. Notwithstanding the specialization, compatibility with and reuse of the RISC-V ecosystem are maintained.
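As a generic aside on the reserved opcode space (not part of the Synopsys flow), custom-0 (0b0001011) is one of the major opcodes the RISC-V base spec leaves free for vendor extensions, and a new R-type instruction can be packed into it as shown below; the MAC-style instruction itself is hypothetical:

```python
CUSTOM0 = 0b0001011  # one of the opcodes RISC-V reserves for custom extensions

def encode_rtype(funct7, rs2, rs1, funct3, rd, opcode=CUSTOM0):
    # Standard R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode

# e.g. a hypothetical MAC-style custom instruction on x10, x11 -> x12:
insn = encode_rtype(funct7=0b0000001, rs2=11, rs1=10, funct3=0b000, rd=12)
print(hex(insn))
```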
Synopsys’ ASIP Designer tool-suite enables the design of custom RISC-V processors. Starting from a formal ISA model, it assists designers in selecting ISA extensions, generates an SDK with an optimizing compiler supporting the extensions, and produces an efficient RTL implementation.
We illustrate this approach with the design of a custom RISC-V processor to accelerate convolutional neural network algorithms for edge AI, with programming support for TensorFlow Lite for Microcontrollers (TFLM). ISA specialization includes the introduction of 4-lane SIMD with a local vector memory, 4 specialized convolution units with 16 multipliers each, dedicated accumulator registers, and 2-way instruction-level parallelism.
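To put those numbers in perspective, a rough Python model (ours; only the unit counts come from the abstract) shows the peak MAC throughput of 4 convolution units × 16 multipliers, and the ideal cycle count it implies for a typical convolution layer:

```python
# Toy throughput model of the datapath described above (unit counts from the
# abstract; the layer shape and 100%-utilization assumption are ours).
UNITS, MULS_PER_UNIT = 4, 16            # 4 convolution units x 16 multipliers
MACS_PER_CYCLE = UNITS * MULS_PER_UNIT  # 64 multiply-accumulates per cycle

def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    # Total multiply-accumulates in a standard convolution layer.
    return out_h * out_w * out_c * k_h * k_w * in_c

# e.g. a 3x3 convolution, 32 -> 64 channels, on a 56x56 output feature map:
macs = conv_macs(56, 56, 64, 3, 3, 32)
print(f"{macs} MACs -> >= {macs / MACS_PER_CYCLE:.0f} cycles at peak")
```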
Towards an Integrated Matrix Extension: Workload Analysis of CNN Inference with QEMU TCG Plugins - Matheus Ferst, Instituto de Pesquisas ELDORADO
Following the gap analysis done in the second half of 2023, SIG-Vector has been working on specifying instructions to accelerate matrix operations, and two Task Groups were proposed to explore different approaches. The "Attached Matrix Extension" (AME) group is working on a set of instructions that is independent of other extensions and requires new registers to hold matrix data. The "Integrated Matrix Extension" (IME) group proposes reusing the Vector Registers introduced by the V extension. The AME approach is similar to how other architectures added matrix operations, such as Intel's AMX and Arm's SME, while the IME proposal resembles how the POWER architecture added them. IME might also help applications that interleave matrix and vector operations by avoiding data movement between different types of registers.
To verify how commonly such interleaving happens in AI/ML workloads, we developed a QEMU TCG plugin to instrument the inference of eight CNN models optimized to use the IME-like POWER10 matrix instructions. The results also reveal several types of vector operations that interact with matrix data and that would be helpful in an AME implementation to avoid sending data back to memory.
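For readers unfamiliar with the POWER10-style approach, the kernel such integrated matrix instructions implement is essentially a rank-1 outer-product update accumulated into a small register tile; a minimal Python sketch of that semantics (ours, not from the paper) follows. Because the tile lives in the same register file as the vector operands under IME, interleaved vector operations can touch it without a round trip through memory:

```python
# Rank-1 outer-product accumulate: C += a * b^T, where a and b come from
# (vector) registers and C is a small tile of accumulators.
def ger_accumulate(acc, a_vec, b_vec):
    for i, a in enumerate(a_vec):
        for j, b in enumerate(b_vec):
            acc[i][j] += a * b
    return acc

# A 4x4 tile built from two rank-1 updates (two steps of a GEMM kernel):
acc = [[0.0] * 4 for _ in range(4)]
ger_accumulate(acc, [1, 2, 3, 4], [1, 0, 1, 0])
ger_accumulate(acc, [1, 1, 1, 1], [0, 1, 0, 1])
print(acc)
```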
Enhancing the Future of AI/ML with Attached Matrix Extension - Jing Qui, Alibaba
We've now updated the Xuantie Attached Matrix Extension ISA to keep pace with rapid advances in AI.
The new matrix ISA uses 64-bit instructions. These self-contained long instructions can support more architectural registers, facilitate sparse operations, and include longer immediates and more metadata. This enhanced encoding scheme increases both the flexibility and efficiency of matrix computations. Another enhancement is the introduction of structured sparsity techniques that allow variable sparsity ratios (N:M sparsity) across the k dimension. The new extension also supports innovative data types, such as int4/fp8, commonly used in large language models. In addition to multi-precision, it also supports mixed-precision operations.
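As an illustration of what N:M structured sparsity means (a generic Python sketch under our own assumptions, not Xuantie's encoding): in every group of M consecutive weights along k, at most N are nonzero, so hardware stores only the N values plus their positions and skips the zero multiplies:

```python
# Sketch of N:M structured sparsity (e.g. 2:4 when n=2, m=4).
def compress_nm(weights, n=2, m=4):
    packed = []
    for g in range(0, len(weights), m):
        group = weights[g:g + m]
        # Keep the n largest-magnitude weights per group, prune the rest.
        nz = sorted(range(len(group)), key=lambda i: -abs(group[i]))[:n]
        packed.append(sorted((i, group[i]) for i in nz))  # (index, value)
    return packed

def sparse_dot(packed, activations, m=4):
    total = 0.0
    for g, group in enumerate(packed):
        for i, w in group:
            total += w * activations[g * m + i]  # only n MACs per m-wide group
    return total

w = compress_nm([0.9, 0.0, -1.2, 0.1, 0.3, 2.0, 0.0, -0.4])
print(sparse_dot(w, [1.0] * 8))
```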