Optimizing BL808 Interrupts

August 27, 2023

I recently got a Pine64 Ox64, a fast but confusing microcontroller board in the form factor of a Pi Pico. Because it's so fast, I wanted to see how quickly I could get it to respond to a GPIO interrupt.

This chip has 3 RISC-V cores, but only 2 that we need to worry about. The first is named "d0", and is a RV64IMAFCV core that runs at 480 MHz. However, it's only useful for selected tasks, because it doesn't connect to the full range of on-chip interrupts or on-chip memories. The second core is named "m0" and is a RV32IMAFCP CPU at 320 MHz, and has more peripheral connections than d0. (Basically, d0 is intended for multimedia processing, while m0 is intended for controlling the wireless features and the on-chip peripherals.)

This means is that you have to use the slightly slower, 32-bit-only core for GPIO interaction.

Method

A common way to measure interrupt response time on microcontrollers is to cycle an output GPIO pin at fairly high speed, connect it to an interrupt input pin, then have the interrupt service routine set a third pin high. Then, if you monitor the first pin and the third pin on an oscilloscope, you can measure the amount of time the microcontroller took to process the interrupt.

Environment and Hardware

I downloaded the Bouffalo SDK and set it up with the configuration instructions (these are only in Chinese, but translate well). I used the version from Git hash cf71fcf, which was HEAD at the time I cloned the repo.

The BL808 downloads code over UART, but Bouffalo somehow made a chip that isn't compatible with just every standard UART adapter. In particular, the FT232RL does not work. I used a Pi Pico with the firmware linked on the Ox64 wiki page.

Attempt 1

Code

There's no "create new project" command in the Bouffalo SDK, so I copied the "helloworld" example to another directory and used that as my project. Since we're only using the m0 core, I edited the Makefile to be:

SDK_DEMO_PATH ?= .
BL_SDK_BASE ?= <SDK base path>

export BL_SDK_BASE

CHIP ?= bl808
BOARD ?= bl808dk
CPU_ID ?= m0
CROSS_COMPILE ?= riscv64-unknown-elf-

# add custom cmake definition
#cmake_definition+=-Dxxx=sss

include $(BL_SDK_BASE)/project.build

This is just the default Makefile, slightly modified. Note the CHIP, BOARD, and CPU_ID definitions, which save me from having to put these variables on the command line every time I run make. (For the record, the Ox64 isn't exactly the same as "bl808dk", but it's close enough for this).

I then made a simple main.c:

#include <stdbool.h>

#include "bflb_mtimer.h"
#include "bflb_gpio.h"
#include "board.h"

#define GPIO_INTERRUPT_PIN (GPIO_PIN_26)
#define GPIO_SOURCE_PIN (GPIO_PIN_27)
#define GPIO_OUTPUT_PIN (GPIO_PIN_24)

struct bflb_device_s* gpio;

void read_mem_isr(int irq, void* arg) {
    bflb_gpio_set(gpio, GPIO_OUTPUT_PIN);
    bflb_gpio_int_clear(gpio, GPIO_INTERRUPT_PIN);
}

int main() {
    board_init();

    gpio = bflb_device_get_by_name("gpio");
    bflb_gpio_int_init(gpio, GPIO_INTERRUPT_PIN, GPIO_INT_TRIG_MODE_SYNC_RISING_EDGE);
    bflb_gpio_int_mask(gpio, GPIO_INTERRUPT_PIN, false); // false means enable
    
    bflb_gpio_init(gpio, GPIO_SOURCE_PIN, GPIO_OUTPUT | GPIO_PULLUP | GPIO_SMT_EN | GPIO_DRV_0);

    bflb_gpio_init(gpio, GPIO_OUTPUT_PIN, GPIO_OUTPUT | GPIO_PULLUP | GPIO_SMT_EN | GPIO_DRV_0);
    
    bflb_irq_attach(gpio->irq_num, read_mem_isr, gpio);
    bflb_irq_enable(gpio->irq_num);

    while (1) {
        bflb_gpio_set(gpio, GPIO_SOURCE_PIN); // trigger the interrupt (source is wired to trigger)
        bflb_mtimer_delay_ms(1);
        // reset both pins
        bflb_gpio_reset(gpio, GPIO_SOURCE_PIN);
        bflb_gpio_reset(gpio, GPIO_OUTPUT_PIN);
        bflb_mtimer_delay_ms(1);
    }
    
    return 0;
}

GPIO_SOURCE_PIN is the pin that goes high periodically, GPIO_INTERRUPT_PIN is the pin that receives the interrupt, and GPIO_OUTPUT_PIN is the pin that gets set high by the interrupt handler. We're just using Bouffalo SDK functions for the first attempt.

To compile, just run make in the project directory. To flash, run make flash COMX=<your serial port>. Because the Makefile defines the chip, board, and CPU ID, we don't need to specify these on the command line.

Result

This is, quite simply, terrible. I wouldn't even expect 37 microseconds from a 20-year-old chip, let alone a 2ish-year old one. This image is a screenshot from my oscilloscope software, and the time at the bottom of the picture is the delta between the left and right white dashed lines. What you can't see in this picture is that the yellow rising edge jumped around a lot, and was on average closer to 45 microseconds.

Something's clearly wrong, let's try again.

Attempt 2

Did you catch the problem in the code for the first attempt? Look at the interrupt configuration again:

bflb_gpio_int_init(gpio, GPIO_INTERRUPT_PIN, GPIO_INT_TRIG_MODE_SYNC_RISING_EDGE);

The BL808 reference manual defines a total of 8 types of GPIO interrupts, but we're interested in only the rising edge ones.

The synchronous interrupt (GPIO_INT_TRIG_MODE_SYNC_RISING_EDGE) will sync the interrupt to a 32 kHz clock, which explains both the delay (1 cycle at 32 kHz = 31 microseconds) and the jitter (the time between the rising edge and the next edge of the clock will vary). The asynchronous interrupt will forward the interrupt as fast as possible, without waiting for a clock.

Here's what I got when I changed the interrupt to async (GPIO_INT_TRIG_MODE_ASYNC_RISING_EDGE):

Result

596 nanoseconds. This is much, much better and far closer to what I would expect for a 320 MHz processor. There's also no more jitter, because there's no sync clock and probably also because the CPU's cache has kicked in.

Attempt 3

Even though this is much better than my first attempt, I still figured I could do better. One way to improve speed on every microcontroller is to bypass whatever SDK libraries are in use. On basically all platforms, there is at least one layer of indirection between the interrupt vector and the actual interrupt service routine, plus, using functions like bflb_gpio_set() adds overhead because of the function call and any error checks, if implemented. I decided to modify the lowest level interrupt handler to set the pin high as fast as possible.

Microcontroller SDKs usually provide basic startup and vector table code to be compiled into your project. For the m0 core on the BL808, these are stored in the Bouffalo SDK directory, at drivers/soc/bl808/std/startup/m0. We want the file vector.S, which has the RISC-V assembly code for the function default_interrupt_handler. This function is called by the RISC-V core itself when an interrupt occurs, and saves registers before jumping to the SDK interrupt handler.

With 3 lines of assembly, we can make this function set the output GPIO pin high and optimize the interrupt response as much as possible. Place these three lines just beneath the default_interrupt_handler: label:

li   t0, 0x20000aec // location of register gpio_cfg138
li   t1, 0x1000000 // set pin 24 high
sw   t1, 0(t0)

(My syntax highlighter doesn't have a RISC-V assembly mode!) This code loads the address of the gpio_cfg138 register into temporary register 0 and the number 0x1000000 into temporary register 1, then writes the contents of t1 to the memory location in t0. gpio_cfg138 is a set/reset register on the BL808 GPIO, so setting bits in this register causes GPIO pins to go high and clearing bits causes pins to go low. 0x1000000 is equal to 1 << 24, which, when written to gpio_cfg138, will make GPIO 24 go high.

I did not modify the interrupt service routine, because we still need to clear the interrupt with bflb_gpio_int_clear().

Results

152 nanoseconds. The timing for this attempt was as stable as attempt 2.

Conclusion and Comparison

There you have it. Like usual, if you get your hands dirty in assembly, you can make the computer go really fast.

BL808 vs. STM32F405

I did this same interrupt speed test a while ago on an STM32F405 Feather running at full speed (168 MHz) under the Cube framework. The fastest I got this chip to respond was 172 ns:

I didn't even have to go into assembly to get this. The STM32Cube framework provides for a versatile interrupt handler, but you can also edit the actual interrupt vector function in C (no assembly required!). I just added a write to the GPIO set/reset register at the start of the interrupt vector function to get this speed.

I think that the BL808 could respond faster than what I got. The m0 core is almost twice as fast as the STM32F405 but is only a little bit faster at responding to interrupts. There are many factors that could cause this: caching, peripheral bus speeds, and the time it takes a specific CPU to respond to an interrupt, to name a few. Hopefully documentation improves for the BL808 and and I'll be able to create a more definitive description of the chip configuration for this experiment.

I hope you enjoyed this exploratory writeup. I'll try to update this post if I get any more cool microcontroller boards.