Sharpie, part 1: generating video signals

January 17, 2025

Last updated March 12, 2025

I have something of an...obsession with neat displays: e-paper, VFD, whatever. Over a year ago, I reverse engineered an e-ink demo kit and promised to make a Linux terminal on it. That never came to pass because affine image transformations are really really hard with zero knowledge of linear algebra.

More recently, I spent multiple months 3D modeling a case and began working on software for a portable device intended to echo the Game Boy Advance, designed for a monochrome Sharp Memory LCD. That project got a lot farther than the e-ink terminal but was much too ambitious, so it died too.

But behold! Not too long ago, I was digging around on Digikey, looking at their display collection, and found something new for sale:

A color Sharp memory display, readily available from a reliable source, for a fairly decent price ($28 at the time of writing)? Somehow, yes.

Aside

When looking for a hobby project to work on, I usually try to find a problem to solve. Not this time: whatever comes out the other end of this will be, at best, a demo. That's ok! The learning and the fun are the real motivation.

This project is called Sharpie. What you're reading now is part 1 of a saga of unknown length documenting the creative process behind a project with no explicit purpose.

Fair warning: this blog post has a lot of technical things smashed together. I've tried to make them clear but it might not be for everyone. In particular, you will need some beginner-to-moderate embedded knowledge to completely understand this post.

The display itself

This is a Sharp model LS021B7DD02 (which I call simply the "LS021"), a 240x320 64-color memory display. Compared to traditional TFT displays, Sharp memory displays require far less power and don't need backlights, and they can be updated much faster than e-ink screens. The Pebble Time smartwatch from circa 2015 used this same type of screen, only smaller, at 1.3 inches diagonal versus the 2.1 inches diagonal of the LS021.

By Frmorrison at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=42694236. Isn't it pretty?

The downside of being cool color e-paper is that the LS021 requires a decent number of carefully timed signals to display an image. Sharp's monochrome displays generally use slightly modified SPI, but this one requires you to generate a total of 12 synchronous data, clock, and pulse signals. The rest of this post will be about making these signals from a microcontroller.

Video signals

Another warning: I haven't bought any screens yet, nor have I tried to send them signals and display images. Everything you are about to read is based only on datasheet information. Take it with the appropriate amount of salt.

To begin with: the datasheet is the best documentation for the display interface. Unfortunately, it's also incomprehensible, so I'll restate what it says in much cleaner terminology.

Like any display, the LS021 has pixels arranged in a grid (see the image below from section 6-6, PDF page 33 in the datasheet). There are 240 pixels in the X direction, and 320 in the Y direction, so the display is "upright" when it's in portrait orientation. Each pixel is composed of three subpixels, one each for the red, green, and blue component of every pixel. Because this display doesn't have a controller chip integrated, we have to generate sync and data signals ourselves.

Aside

Astute readers will note that 64 colors means 2 bits per color channel (6 bits per pixel), yet this display has three subpixel elements per color. Be patient, young grasshopper, we'll get there in another post.

Here are the signals the display needs. You can reference the datasheet (Table 6-3-2) but you won't need to if you keep reading.

Signal	Purpose
BCK	"Binary-Driver" (horizontal) clock: edges trigger the screen to sample data
BSP	"Binary-Driver" start signal: rises briefly twice per line to separate groups of data
INTB	Initial start signal for the whole display: this stays high for most of each frame
GCK	"Gate-Driver" (vertical) clock: edges trigger the screen twice per line
GSP	"Gate-Driver" start signal: rises briefly once per frame
GEN	"Gate Enabled" signal: I don't know what this does but it seems to be related to completing a line
R[0-1], G[0-1], B[0-1]	Pixel data: be aware, the signal format on these lines is quite strange---we'll look at that another time

What Sharp calls the "Binary-Driver" is what other displays call "horizontal" or "source": signals that control pixels within a single line of a frame. In particular, "source" and "gate" are terms that seem to be associated most commonly with E-ink screens. Maybe Sharp took some inspiration. (Fortunately, the similarity ends there: this display does not need waveform LUTs or any other complex driving compensation.)

BSP is a short pulse that rises twice on every line, used to indicate the start of portions of data, and BCK is a clock that tells the horizontal subsystem to sample the data lines. BCK could be considered the "dot clock" of this display, except that most dot clocks are not double-edged like BCK. I consider it the most fundamental signal of the LS021's interface.

All 12 signals must follow certain timing guidelines (Table 6-3-3, PDF page 25). BCK specifically must have high and low periods between 660 and 680 nanoseconds. BCK, and most other signals, are 50% duty cycle, so this equates to a period T = 1320 to 1360 ns. All the other signals except for GEN are based on the BCK clock (and even GEN can be linked to BCK), so producing BCK is the first step.

Terminology Note

All LS021 clock signals (not pulses) are 50% duty cycle, and the pulses are generally described by the amount of time they are high. Their allowed timing ranges are specified in the datasheet as individual times for high and low periods---because within one period of a square wave, there are two edges. I refer to these high and low periods as "h/l", or plural "h/ls". For example: within the period T of a wave, there are two h/ls.

The LS021 datasheet shows us how many BCK pulses are in one GCK pulse, how many GEN pulses need to be generated, and how all these signal edges need to line up. BCK is our ticket to a complete video signal.

Generating LS021 signals

I have chosen the RP2350 microcontroller for Sharpie, because like the RP2040 before it, the 2350 includes RP2 programmable I/O (PIO). LS021 signals are much too fast to be reliably bit-banged, and because I intend to make actual Sharpie hardware, I don't want to include an FPGA (the Pebble Time mentioned earlier used an STM32 and an FPGA).

PIO are extremely basic but very deterministic state machines that execute a handful of instructions on a pre-determined clock. This means you can use them to accurately make digital signals. RP2350 has a default system clock of 150 MHz, and PIO can be driven at up to system clock speed. With a little bit of Python, I found two potential integer clock dividers to generate BCK h/ls within spec (fractional clock division would probably be fine but I'm not comfortable taking that risk):

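Divider	Resulting BCK timing
198	150e6 / 198 = 757.6 kHz (T = 1320 ns) => BCK h/l = 660 ns
200	150e6 / 200 = 750 kHz (T = 1333.33 ns) => BCK h/l = 666.67 ns

The only option here that isn't on the very edge of spec is 200. That divider spans one full BCK period, though. The PIO code below actually uses a clock divider of 25 (1/8 of 200), so that each PIO cycle is 1/4 of a BCK h/l, for reasons that will become clear shortly.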
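A minimal sketch of this arithmetic in Python, for anyone who wants to poke at the numbers. The function names are illustrative, and it assumes the dividers in the table span one full BCK period (two h/ls) at a 150 MHz system clock. It also checks the divider of 25 that the C listing further down passes to sm_config_set_clkdiv.

```python
from fractions import Fraction

SYS_CLK_HZ = 150_000_000         # RP2350 default system clock
HL_MIN_NS, HL_MAX_NS = 660, 680  # BCK h/l limits quoted from Table 6-3-3

def cycle_ns(divider):
    """Length of one PIO cycle in ns for an integer clock divider."""
    return Fraction(divider * 1_000_000_000, SYS_CLK_HZ)

def bck_hl_ns(period_divider):
    """BCK h/l in ns, assuming the divider spans one full BCK period."""
    return cycle_ns(period_divider) / 2

def in_spec(hl_ns):
    """True if an h/l length sits inside the quoted limits (inclusive)."""
    return HL_MIN_NS <= hl_ns <= HL_MAX_NS

for div in (198, 200):
    hl = bck_hl_ns(div)
    print(div, float(hl), in_spec(hl))

# Divider 25 from the C listing: four PIO cycles make up one BCK h/l.
print(float(cycle_ns(25)), float(4 * cycle_ns(25)))
```

Exact fractions are used instead of floats so that the "on the edge of spec" case (660 ns exactly) compares cleanly.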

Presenting PIO code

We now have a basic understanding of the LS021 video signals, their timing, and how we could configure PIO to match this timing. Now, it is time for code. I'm going to explain my code under the assumption that you, my dear reader, already have a decent understanding of programming PIO.

.pio_version 0

.program sharpie_horiz_data
.side_set 2

; side-set: BCK and BSP
;           bit 1   bit 0
; this program controls BSP, BCK, and the data pins (6 data bits but
; shift 8 is fine)

; each instruction is 166 ns (1/4 of a BCK h/l)

; counter in X and backup in Y are charged by the CPU via forced instructions
.wrap_target

wait 1 irq 2      side 0b00     ; wait for GCK1 fall (waits take two cycles)
restart:
mov x, isr        side 0b01 [1] ; BSP rises 333 ns after GCK1 falls and charge X for this loop
pull              side 0b11 [1] ; BCK1 rises 333 ns after BSP rises, also fill OSR 
                                ; (this will only actually happen on the first loop, because the OSR 
                                ; has just been cleared by a forced instruction)
out pins, 8       side 0b11 [1] ; hold BCK1, BSP still high, set data out
nop               side 0b01 [1] ; fall BCK1, BSP still high
loop:
out pins, 8       side 0b00     ; fall BSP, next data out, middle of BCK2
jmp !x, exit      side 0b00     ; exit the loop if it's the last iteration (data goes to 0 on BCK121)
nop               side 0b10 [1] ; rise BCK
out pins, 8       side 0b10 [1] ; hold BCK, data out
jmp x--, loop     side 0b00 [1] ; fall BCK, jump

exit:
nop               side 0b10 [1] ; rise BCK121
mov pins, null    side 0b10 [1] ; set data pins to zero
nop               side 0b00 [3] ; fall BCK122 and hold for all of 122
nop               side 0b10 [3] ; rise BCK123
jmp y--, restart  side 0b00 [1] ; fall BCK124, reach middle, restart

.wrap

huh?

PIO version 0? but isn't Sharpie targeting RP2350 with PIO version 1? yes well I already had a Pico 1 with headers and I didn't feel like soldering to my only RP2350 board. The RP2040 can be overclocked to 150 MHz with a single

set_sys_clock_khz(150000, true);

so we can pretend it's an RP2350.

And the supporting C code (I chose to put it in the PIO file):

#include "hardware/gpio.h"
static inline void sharpie_horiz_data_pio_init(PIO pio, uint sm, uint offset, uint bsp_pin, uint r0_pin) {
  pio_gpio_init(pio, bsp_pin); // BSP on PIO
  pio_gpio_init(pio, bsp_pin + 1); // BCK on PIO

  for (int i = 0; i < 6; i++) {
    pio_gpio_init(pio, r0_pin + i); // all color data pins on PIO
  }

  pio_sm_set_consecutive_pindirs(pio, sm, bsp_pin, 2, true); // BCK and BSP as output
  pio_sm_set_consecutive_pindirs(pio, sm, r0_pin, 6, true); // color data pins as output

  pio_sm_config c = sharpie_horiz_data_program_get_default_config(offset);

  // BCK, BSP are side-set pins
  sm_config_set_sideset_pins(&c, bsp_pin);
  sm_config_set_out_pins(&c, r0_pin, 6);
  // shift right, autopull enabled, autopull threshold 32 bits 
  // (entire OSR has been shifted out)
  sm_config_set_out_shift(&c, true, true, 32); 
  // 150 MHz / 25 = 6 MHz => T = 166.66666... ns
  sm_config_set_clkdiv(&c, 25);

  pio_sm_init(pio, sm, offset, &c);
  pio_sm_set_enabled(pio, sm, true);
}

This code is intended to be mapped to state machine 2 (SM 2) in any PIO block, with a clock divider of 25 given a system clock of 150 MHz, which means that each instruction takes 1/4 of a BCK h/l. The C code attached to it connects GPIO pins to the PIO through the chip multiplexer, sets pin directions to output, configures the side-set, output, autopull, and clock divider, then enables the state machine. Autopull is configured to 32 bits, so that the OSR will only be refilled after it has been entirely shifted out. This is so that we can connect a DMA stream to the TX FIFO and send data directly from memory to the display (again, we'll get there).

Let's walk through the instructions.

wait 1 irq 2      side 0b00 ; wait for GCK1 fall (waits take two cycles)

A different state machine will assert IRQ 2 within this PIO to indicate to SM2 that it is time to do stuff. Specifically, SM 0 will assert IRQ 2 on the first falling edge of GCK1. The diagram below shows this relationship (Diagram 6-3-3 from the datasheet with the bottom cut off):

The time between the GCK falling edge and the BSP rising edge is labeled tsGCK2 and is equal to 1/2 of a BCK h/l. The reason that SM 2 is driven at 4x the speed it seems like it would need is because the wait instruction takes two cycles to complete...or something. This isn't well explained in the PIO documentation. I initially tried to get SM 2 to work with just 1/2 BCK h/l as its clock speed, but ultimately gave in and used 1/4. This ended up being useful later anyway.

restart:
mov x, isr        side 0b01 [1] ; BSP rises 333 ns after GCK1 falls and charge X for this loop
pull              side 0b11 [1] ; BCK1 rises 333 ns after BSP rises, also fill OSR

After IRQ 2 is set, BSP rises with the side-set on the mov instruction. This move copies the contents of the ISR (here used only as a backup) into the X register to be used as a counter in just a bit. The ISR itself is loaded by some forced instructions run by the CPU, shown below, while SM 2 is waiting for IRQ 2 to arrive. Data is pushed onto the FIFO and read out into the necessary registers:

// we'll get a total of 2(x+1)+4 h/l so this should be 59 => 2(59+1)+4 = 124
pio_sm_put(pio, horiz_data_sm, 59);
// pull
pio_sm_exec(pio, horiz_data_sm, pio_encode_pull(false, false));
// out isr, 32 (make backup of counter value and clear OSR for autopull)
pio_sm_exec(pio, horiz_data_sm, pio_encode_out(pio_isr, 32));
// mov x, isr (load X with counter)
pio_sm_exec(pio, horiz_data_sm, pio_encode_mov(pio_x, pio_isr));

// now charge Y for total loop counter

// should be 640 for 641 loops (see 6-3-2, the last loop 
// has data all zeros, which is why we have a chained DMA channel later)
pio_sm_put(pio, horiz_data_sm, 640); 
// pull
pio_sm_exec(pio, horiz_data_sm, pio_encode_pull(false, false));
// out y, 32 (also clears OSR)
pio_sm_exec(pio, horiz_data_sm, pio_encode_out(pio_y, 32));

After loading X and rising BSP, the state machine delays one cycle and then grabs data from the TX FIFO as it rises BCK, then delays another cycle. In fact, almost all of the horizontal-data state machine code has a 1-cycle delay, because it's running at 2x the speed it should be to make up for the slow wait. Using forced instructions was the only way I found to get all this code to fit within the 32-instruction memory of a single PIO block. Every time we want to send data to the screen, we have to run these instructions to reset counters. It's not a big deal, but it adds a little complexity.

out pins, 8       side 0b11 [1] ; hold BCK1, BSP still high, set data out
nop               side 0b01 [1] ; fall BCK1, BSP still high
loop:
out pins, 8       side 0b00     ; fall BSP, next data out, middle of BCK2
jmp !x, exit      side 0b00     ; exit the loop if it's the last iteration (data goes to 0 on BCK121)
nop               side 0b10 [1] ; rise BCK
out pins, 8       side 0b10 [1] ; hold BCK, data out
jmp x--, loop     side 0b00 [1] ; fall BCK, jump

Finally, we can start sending data! 8 bits of data are shifted out (there are only 6 pins mapped to the out instruction, but each 6-bit value is stored as a byte in memory, so the upper two bits are discarded) while BSP and BCK are still high, then BCK falls, then we enter the main data loop. In other words: before the loop, we have already taken care of BCK h/l 1---see the diagram above, which helpfully numbers h/ls.
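To make the "upper two bits are discarded" part concrete, here is a small Python model of the OSR. It is only a sketch of the shift-right, 32-bit-autopull configuration from the C code above: pack_word and shift_out are made-up names, and the assumption is that DMA hands the state machine little-endian 32-bit words, so the first byte in memory ends up in the lowest bits of the OSR.

```python
def pack_word(pixels):
    """Pack four byte-per-pixel values into one little-endian 32-bit word."""
    assert len(pixels) == 4
    return int.from_bytes(bytes(pixels), "little")

def shift_out(word, out_bits=8, pin_count=6):
    """Model four `out pins, 8` instructions with a right-shifting OSR."""
    pin_mask = (1 << pin_count) - 1
    values = []
    for _ in range(32 // out_bits):
        values.append(word & pin_mask)  # only the low 6 bits reach the pins
        word >>= out_bits               # shift right: the next byte moves down
    return values

pixels = [0x3F, 0x01, 0x2A, 0x15]
word = pack_word(pixels)
print(hex(word), shift_out(word))
```

In this model the pins see the bytes in the same order they sit in memory, and anything stored in bits 6 and 7 of a byte never reaches a pin.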

Terminology Note

h/ls of signals are numbered in the LS021 datasheet. The diagram above shows BCK1, BCK2, and so on, to BCK124. This terminology just references this h/l index as shown in some part of the datasheet.

In the data loop, BCK falls, then data is shifted out, then BCK rises, then we shift data again. Within the loop, we also check if it's time to exit, which helps with a small code size optimization, and we loop until X reaches zero. This loop is why we copied ISR to X at the start.

After the loop, which runs 60 times (see the FIFO push code above), we have to finish BCK h/ls 121, 122, 123, and 124. This is just a handful of instructions:

exit:
nop               side 0b10 [1] ; rise BCK121
mov pins, null    side 0b10 [1] ; set data pins to zero
nop               side 0b00 [3] ; fall BCK122 and hold for all of 122
nop               side 0b10 [3] ; rise BCK123
jmp y--, restart  side 0b00 [1] ; fall BCK124, reach middle, restart

We rise BCK121, fall all six data pins, then complete the rest of BCK and jump to the restart: label to repeat the whole process. This will continue for as long as there is data; in reality, this means the state machine will produce signal for as long as a DMA stream provides it data. After that, it will stall until acted upon by an outside force. Note in the diagram above that there is no pause between BCK1-124 sequences: after BCK124, BCK1 rises immediately, as if it were just another BCK pulse. This continues for the rest of the frame, which means that once this state machine has been started, it will continue to send data (synchronized with the other signals, because they're derived from the same clock) until the stall occurs.
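The cycle counting above is easy to get wrong, so here is a rough Python model of one pass through the program, from restart: to jmp y--. It is not a PIO emulator. It only records the BCK side-set level and the delay of each instruction as I read them from the listing, and horiz_pass is a name invented for this sketch.

```python
def horiz_pass(x):
    """Return (bck_levels, out_count) for one restart..jmp y-- pass.

    Each entry of bck_levels is the BCK level during one PIO cycle.
    """
    levels, outs = [], 0

    def run(bck, cycles, is_out=False):
        nonlocal outs
        levels.extend([bck] * cycles)
        outs += is_out

    run(0, 2)              # mov x, isr       side 0b01 [1]
    run(1, 2)              # pull             side 0b11 [1]
    run(1, 2, True)        # out pins, 8      side 0b11 [1]
    run(0, 2)              # nop              side 0b01 [1]
    while True:
        run(0, 1, True)    # out pins, 8      side 0b00
        run(0, 1)          # jmp !x, exit     side 0b00
        if x == 0:
            break
        run(1, 2)          # nop              side 0b10 [1]
        run(1, 2, True)    # out pins, 8      side 0b10 [1]
        run(0, 2)          # jmp x--, loop    side 0b00 [1]
        x -= 1
    run(1, 2)              # nop              side 0b10 [1]
    run(1, 2)              # mov pins, null   side 0b10 [1]
    run(0, 4)              # nop              side 0b00 [3]
    run(1, 4)              # nop              side 0b10 [3]
    run(0, 2)              # jmp y--, restart side 0b00 [1]
    return levels, outs

levels, outs = horiz_pass(59)
# Passes repeat back to back, so count level changes across the wrap as well.
hls = sum(levels[i] != levels[i - 1] for i in range(len(levels)))
print(len(levels), hls, outs)
```

With X = 59 this model gives 496 PIO cycles, 124 BCK h/ls, and 120 out instructions per pass, which agrees with the 2(x+1)+4 formula in the forced-instruction comments.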

Vertical signals

That was about two-thirds of the signals necessary to drive the LS021. Let's look now at INTB, GSP, and GCK. We'll do the same approach, stepping through PIO code.

.pio_version 0

.program sharpie_vertical
.side_set 3

; side-set: GCK, GSP, INTB
; setting side-set pin is setting the pin mapped
; by the least-significant bit of the side-set value

.wrap_target

wait 1 irq 0   side 0b000      ; wait for CPU to set PIO IRQ 0
pull           side 0b001      ; rise INTB, pull counter (CPU put value in FIFO)
mov x, osr     side 0b011 [1]  ; rise GSP one PIO cycle after INTB, then delay another cycle and move OSR to X
nop            side 0b111 [3]  ; rise GCK1, hold for all of GCK1
irq set 2      side 0b011 [1]  ; fall GCK1 and set IRQ 2 for horiz-data code, wait until halfway through GCK2
irq set 1      side 0b001 [1]  ; fall GSP and set IRQ 1 (halfway through GCK2), wait until end of GCK2
loop:
jmp !x,exit    side 0b101 [3]  ; rise GCK3,4,etc, and leave the loop if the counter is zero (last iteration)
jmp x--,loop   side 0b001 [3]  ; fall GCK3,4,etc and jump based on X 

exit:
nop            side 0b001      ; fall GCK645
nop            side 0b000 [2]  ; fall INTB at 1/4 through GCK646 and hold rest of 646
nop            side 0b100 [3]  ; rise GCK647 (or last)
nop            side 0b000 [3]  ; fall GCK647

.wrap

#include "hardware/gpio.h"
static inline void sharpie_vertical_pio_init(PIO pio, uint sm, uint offset, uint intb_pin) {
  pio_gpio_init(pio, intb_pin);
  pio_gpio_init(pio, intb_pin + 1);
  pio_gpio_init(pio, intb_pin + 2);

  pio_sm_set_consecutive_pindirs(pio, sm, intb_pin, 3, true); // set all pins output
  
  pio_sm_config c = sharpie_vertical_program_get_default_config(offset);

  // set side-set pins starting from intb_pin
  sm_config_set_sideset_pins(&c, intb_pin);
  sm_config_set_clkdiv(&c, 3100); // 150 MHz / 3100

  pio_sm_init(pio, sm, offset, &c);
  pio_sm_set_enabled(pio, sm, true);
}

Starting with the C code: this code should be on state machine 0, on the PIO already used for the horizontal/data code. The only pins accessed by the state machine are INTB, GSP, and GCK, and they're all on side-set. SM 0 has a clock divider of 3100, for an instruction time of 20.67 μs, or 1/4 of the GCK h/l period. The BCK-derived GCK h/l in this system is 82.67 μs, on the lower end of the allowed range but still within spec.
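The divider of 3100 is not arbitrary: it falls out of the horizontal divider and the 124 BCK h/ls that fit in one GCK h/l. A short Python sketch of that relationship, with constant names chosen only for this example:

```python
from fractions import Fraction

SYS_CLK_HZ = 150_000_000
HORIZ_DIVIDER = 25        # SM 2: one PIO cycle = 1/4 of a BCK h/l
BCK_HL_PER_GCK_HL = 124   # BCK1..BCK124 fit in one GCK h/l

def cycle_us(divider):
    """Length of one PIO cycle in microseconds for an integer divider."""
    return Fraction(divider * 1_000_000, SYS_CLK_HZ)

# SM 0 also runs at 1/4 of its own h/l, so its divider scales by 124.
vertical_divider = HORIZ_DIVIDER * BCK_HL_PER_GCK_HL
gck_hl_us = 4 * cycle_us(vertical_divider)

print(vertical_divider, float(cycle_us(vertical_divider)), float(gck_hl_us))
```

Because both dividers are integers and one is an exact multiple of the other, the two state machines cannot drift relative to each other once their dividers are restarted together.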

This is the timing diagram for the entire set of LS021 signals:

We have already seen how to generate the "Data" and "Horizontal Control Pulse" (see, even Sharp can't avoid the word "horizontal"). The signals that remain are INTB, GSP, GCK, and GEN.

Now the PIO code:

wait 1 irq 0   side 0b000      ; wait for CPU to set PIO IRQ 0
pull           side 0b001      ; rise INTB, pull counter (CPU put value in FIFO)
mov x, osr     side 0b011 [1]  ; rise GSP one PIO cycle after INTB, then delay 
                               ; another cycle and move OSR to X

SM 0 also waits for an IRQ, but this one is set by the CPU when the CPU has all other components prepared. After IRQ 0 is set, SM 0 pulls one value from the TX FIFO at the same time it rises INTB. For this state machine, we don't use forced instructions to load the counter register, because we are already fiddling with so many pins---it's more practical simply to have the main code load the counter.

After INTB rises, GSP rises thsGSP (1/4 GCK h/l) later, which happens at the same time the OSR is moved to X. Now GCK clocking can begin.

nop            side 0b111 [3]  ; rise GCK1, hold for all of GCK1
irq set 2      side 0b011 [1]  ; fall GCK1 and set IRQ 2 for horiz-data code, wait until halfway through GCK2
irq set 1      side 0b001 [1]  ; fall GSP and set IRQ 1 (halfway through GCK2), wait until end of GCK2
loop:
jmp !x,exit    side 0b101 [3]  ; rise GCK3,4,etc, and leave the loop if the counter is zero (last iteration)
jmp x--,loop   side 0b001 [3]  ; fall GCK3,4,etc and jump based on X

GCK1 rises, and the state machine waits until it has to fall. When it falls, it sets IRQ 2 for SM 2, to begin the horizontal/data process. Then, at halfway through GCK2, GSP falls and the state machine sets IRQ 1 to configure the GEN state machine, which we'll see in a bit. By now, GCK1 and GCK2 have both passed, and the main loop can start. This loop is simply rising and falling GCK, while keeping INTB high, until the X register reaches zero. Note again that we're using a check for X==0 to exit the loop after GCK rises for the last time in the loop.

exit:
nop            side 0b001      ; fall GCK645
nop            side 0b000 [2]  ; fall INTB at 1/4 through GCK646 and hold rest of 646
nop            side 0b100 [3]  ; rise GCK647 (or last)
nop            side 0b000 [3]  ; fall GCK647

After the loop, SM 0 falls GCK645, then falls INTB 1/4 GCK h/l later. In the diagram above, INTB falls at that strange interval---it's not on a GCK edge or halfway through an h/l. This is why SM 0 has a 1/4 clock just like SM 2 before.

After INTB falls, SM 0 holds for the rest of GCK646, then rises GCK647, then falls GCK647, and the frame is done! There's only one state machine left to look at, and it's dead simple.
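The same kind of back-of-the-envelope model works for the vertical program. This sketch tracks only the GCK bit of the side-set (the top bit) and the instruction delays from the listing. vertical_frame is a hypothetical helper, and the counter value of 321 is the one the CPU code later puts on the FIFO.

```python
from itertools import groupby

def vertical_frame(x):
    """Return the GCK level for every PIO cycle of one frame."""
    gck = []

    def run(level, cycles):
        gck.extend([level] * cycles)

    run(0, 1)          # pull          side 0b001
    run(0, 2)          # mov x, osr    side 0b011 [1]
    run(1, 4)          # nop           side 0b111 [3]  (GCK1)
    run(0, 2)          # irq set 2     side 0b011 [1]
    run(0, 2)          # irq set 1     side 0b001 [1]
    while True:
        run(1, 4)      # jmp !x, exit  side 0b101 [3]
        if x == 0:
            break
        run(0, 4)      # jmp x--, loop side 0b001 [3]
        x -= 1
    run(0, 1)          # nop           side 0b001
    run(0, 3)          # nop           side 0b000 [2]
    run(1, 4)          # nop           side 0b100 [3]
    run(0, 4)          # nop           side 0b000 [3]
    return gck

gck = vertical_frame(321)
frame = gck[gck.index(1):]   # start counting at the GCK1 rising edge
runs = [len(list(group)) for _, group in groupby(frame)]
print(len(runs), set(runs))
```

If the model is right, there are 648 GCK h/ls in a frame and every one of them is exactly four PIO cycles long, including the h/l in which INTB falls a quarter of the way through.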

Generating GEN

I mentioned before that GEN wants to run on a different clock than any other signal. In the whole-frame diagram above, you can see that GEN pulses 640 times, centered in each GCK h/l. The datasheet is cagey about GEN and only specifies a minimum high time and a minimum setup and hold time. The minimum setup/hold time (on either side of the GEN pulse, between the edges of a GCK h/l) is 16.37 μs, or 1/5 of a GCK h/l. This means that the remaining 3/5 of the time for each GCK h/l is spent with GEN high.

These numbers don't make much sense, and I failed to find an integer divider to handle this weird fifths situation. Luckily, if we use the same clock as SM 0, we can create setup/hold times of 20.67 μs (within spec) and high period times of 41.33 μs (also within spec).
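As a sanity check on those two figures, here is the quarter-h/l arithmetic in Python. The only datasheet number used is the 16.37 μs minimum setup/hold quoted above. The split of one cycle low, two cycles high, one cycle low is my reading of how the GEN program divides each GCK h/l, so treat it as an assumption.

```python
from fractions import Fraction

SYS_CLK_HZ = 150_000_000
GEN_DIVIDER = 3100                      # same divider as SM 0
MIN_SETUP_HOLD_US = Fraction("16.37")   # minimum quoted from the datasheet

cycle_us = Fraction(GEN_DIVIDER * 1_000_000, SYS_CLK_HZ)
setup_hold_us = 1 * cycle_us   # one cycle low on each side of the pulse
high_us = 2 * cycle_us         # GEN held high for two cycles
gck_hl_us = 4 * cycle_us       # one GCK h/l is four cycles

print(float(setup_hold_us), float(high_us))
print(setup_hold_us >= MIN_SETUP_HOLD_US)
print(2 * setup_hold_us + high_us == gck_hl_us)
```

The margin over the minimum setup/hold time is roughly 4.3 μs, which is one reason quarters look like a safer choice than chasing exact fifths.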

Warning

Remember that I haven't tested this with a real LS021. The GEN pulses are the most spec-questioning part of the output these state machines generate, so they might not work. In fact, this entire post might be me talking out of my ass.

Here's the code:

.pio_version 0

.program sharpie_gen
.side_set 1

; side-set: GEN
; this program is separate because it uses a distinct counter

.wrap_target
; the counter in X will be loaded by forced instructions from the CPU
wait 1 irq 1   side 0 [1]   ; wait for IRQ 1 (halfway through GCK 2)
label:
nop            side 1 [1]   ; rise GEN
jmp x--, label side 0 [1]   ; fall and jump

.wrap

#include "hardware/gpio.h"
static inline void sharpie_gen_pio_init(PIO pio, uint sm, uint offset, uint gen_pin) {
  pio_gpio_init(pio, gen_pin);

  pio_sm_set_consecutive_pindirs(pio, sm, gen_pin, 1, true); // set GEN pin output

  pio_sm_config c = sharpie_gen_program_get_default_config(offset);
  
  sm_config_set_sideset_pins(&c, gen_pin);
  sm_config_set_clkdiv(&c, 3100);

  pio_sm_init(pio, sm, offset, &c);
  pio_sm_set_enabled(pio, sm, true);
}

This code should be loaded into state machine 1 on the same PIO. After loading and initialization, the CPU will load the X register with a counter value. SM 1 waits for IRQ 1, then raises and lowers GEN via side-set and delays until the loop ends and the program wraps. That's it! With this PIO code, and a little bit of C to configure the system, we can generate a complete LS021 video frame.

The homestretch

I promise, this is almost the end. The CPU is responsible for tying this entire system together. This assumes that there's already a 320*240 = 76800 byte framebuffer somewhere in memory, accessible by DMA, and that the system clock is at 150 MHz. The code, truncated, is shown below. I will not walk through this but I will add some commentary below.

PIO pio = pio0;
uint vertical_sm = 0;
uint gen_sm = 1;
uint horiz_data_sm = 2;
  
int offset = pio_add_program(pio, &sharpie_vertical_program); // signed, so the error check can fire
if (offset < 0) {
  printf("failed to add sharpie_vertical_program\n");
  error_handler();
}
  
/**** Initialize SM 0: ****/
// INTB on pin 0, GSP on pin 1, GCK on pin 2
sharpie_vertical_pio_init(pio, vertical_sm, offset, 0);
  
// the number you put on the FIFO is the number of times the loop
// will run minus 1
pio_sm_put(pio, vertical_sm, 321); // 321 => loop runs 322 times, for 648 h/l total


/**** Initialize SM 1: ****/
offset = pio_add_program(pio, &sharpie_gen_program);
if (offset < 0) {
  printf("failed to add sharpie_gen_program\n");
  error_handler();
}
// GEN on pin 3, start state machine
sharpie_gen_pio_init(pio, gen_sm, offset, 3);


/**** Initialize SM 2: ****/
offset = pio_add_program(pio, &sharpie_horiz_data_program);
if (offset < 0) {
  printf("failed to add sharpie_horiz_data_program\n");
  error_handler();
}

// BSP on pin 4, BCK on pin 5, data from pin 6 to 11 inclusive
sharpie_horiz_data_pio_init(pio, horiz_data_sm, offset, 4, 6);


// Configure counters for GEN state machine and horiz-data state machine
// This is all done through forced instructions---you saw this earlier!

// GEN: run 640 times (counter value + 1)
pio_sm_put(pio, gen_sm, 639); // this should be 639 for 640 high pulses
pio_sm_exec(pio, gen_sm, pio_encode_pull(false, false)); // just a basic pull
pio_sm_exec(pio, gen_sm, pio_encode_mov(pio_x, pio_osr)); // mov x, osr
// GEN counter is now charged

// horiz-data: charge X, make a backup in ISR (unused by any other part of the code)
pio_sm_put(pio, horiz_data_sm, 59); // we'll get a total of 2(x+1)+4 h/l so this should be 59 => 2(59+1)+4 = 124
pio_sm_exec(pio, horiz_data_sm, pio_encode_pull(false, false));  // pull
pio_sm_exec(pio, horiz_data_sm, pio_encode_out(pio_isr, 32)); // out isr, 32 (make backup of counter value and clear OSR for autopull)
pio_sm_exec(pio, horiz_data_sm, pio_encode_mov(pio_x, pio_isr)); // mov x, isr (load X with counter)
// charge Y for total loop counter
pio_sm_put(pio, horiz_data_sm, 640); // should be 640 for 641 loops (see 6-3-2, the last loop has data all zeros, which is why we have a chained DMA channel below)
pio_sm_exec(pio, horiz_data_sm, pio_encode_pull(false, false)); // pull
pio_sm_exec(pio, horiz_data_sm, pio_encode_out(pio_y, 32)); // out y, 32 (also clears OSR)

int dma_channel = dma_claim_unused_channel(true); // true -> required
if (dma_channel < 0) {
  printf("failed to claim dma channel\n");
  error_handler();
}
// DMA chaining is instantaneous, this does not affect timing.
int dma_channel_zero = dma_claim_unused_channel(true);
if (dma_channel_zero < 0) {
  printf("failed to claim dma zero channel\n");
  error_handler();
}

// configure DMA AFTER we charge the loop registers
dma_channel_config c = dma_channel_get_default_config(dma_channel);
channel_config_set_read_increment(&c, true); // increment reads (from the formatted framebuffer)
channel_config_set_write_increment(&c, false); // no increment writes (into the FIFO)
channel_config_set_transfer_data_size(&c, DMA_SIZE_32); // four byte transfers (one byte doesn't work)
channel_config_set_dreq(&c, pio_get_dreq(pio, horiz_data_sm, true)); // true for sending data to SM
channel_config_set_chain_to(&c, dma_channel_zero); // chain to zero channel to start zero channel when this finishes
dma_channel_configure(dma_channel, &c,
		&pio->txf[horiz_data_sm], // destination (TX FIFO of SM 2)
		formatted_framebuffer_32, // source (formatted framebuffer)
		19200, // transfer size = 320*240/4 = 19200
		true); // start now

// configure zero channel, which starts when the main framebuffer
// channel completes.
dma_channel_config c_zero = dma_channel_get_default_config(dma_channel_zero);
channel_config_set_read_increment(&c_zero, false); // just send zeros
channel_config_set_write_increment(&c_zero, false); // write into FIFO
channel_config_set_transfer_data_size(&c_zero, DMA_SIZE_32);
channel_config_set_dreq(&c_zero, pio_get_dreq(pio, horiz_data_sm, true));
dma_channel_configure(dma_channel_zero, &c_zero,
		&pio->txf[horiz_data_sm],
		&global_32bit_zero, // this is a 32-bit zero in RAM
		240/4, // 240 bytes but 4-byte transfers
		false); // wait for chain start
  
// all state machines have to be running in lockstep for
// synchronization via interrupts to work
pio_clkdiv_restart_sm_mask(pio, 0b111); // restart SM 0, 1, and 2

// and set IRQ 0 to start the frame generation!
pio0->irq_force = 0b1;

The code's job is to load PIO programs, point state machines at these programs, configure two DMA channels, restart all state machines, and set IRQ 0. The two DMA channels are there so that we can match the signal diagram, shown below again for good measure:

You may not have noticed the first time, but look at the bottom-most "Horizontal Control Pulse": it lasts for one GCK h/l longer than the data. In that final GCK h/l, only BSP and BCK should be clocked, while the data stays low. Chaining the main framebuffer DMA stream to another non-incrementing stream lets us send just 240 zeros to SM 2.

Both DMA channels transfer 32 bits at a time, because pixel data is stored as 6 bits in an 8-bit int, as mentioned. 32-bit transfers reduce bus traffic. The first stream is a basic memory-to-peripheral, where the read address increments but the write address does not, and the second stream is memory-to-peripheral with no increment, so that it always writes zero.
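The transfer counts can be cross-checked on a desktop machine. This Python sketch builds a stand-in framebuffer (one byte per pixel, 6 bits used), views it as little-endian 32-bit words the way the first DMA channel would read it, and compares the totals against the 120 bytes that one BCK1-124 pass appears to consume. to_words and the constants are names made up for the example, and the 120 figure comes from counting the out instructions in the horizontal program, so it is only as good as that count.

```python
import struct

WIDTH, HEIGHT = 240, 320
BYTES_PER_PASS = 120   # `out pins, 8` executions per BCK1-124 pass

def to_words(framebuffer):
    """View a byte-per-pixel framebuffer as little-endian 32-bit words."""
    assert len(framebuffer) % 4 == 0
    return list(struct.unpack(f"<{len(framebuffer) // 4}I", framebuffer))

framebuffer = bytes([0x3F]) * (WIDTH * HEIGHT)   # all six colour bits set
words = to_words(framebuffer)

data_passes = len(framebuffer) // BYTES_PER_PASS
print(len(framebuffer), len(words), data_passes, hex(words[0]))
```

The 19200 words match the transfer size given to the first channel, and 640 data passes plus one all-zero pass matches the Y counter of 640 (641 loops).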

After all the configuration is done, the CPU restarts the clock dividers for SMs 0, 1, and 2 so that they run in sync. This has to be done so that setting an IRQ in one place and listening for it in another place can create consistent, deterministic output.

Then, the CPU sets IRQ 0 with the IRQ_FORCE register, and the state machines run!

Verifying the output

sweet christ is it almost over

yeah

After verifying up-close timing with an oscilloscope, I pulled out my SparkFun-branded logic analyzer and PulseView to capture a complete frame:

And here's the start and half of a line:

All data being written is 1s, from a buffer in RAM set entirely to 0xff. Compare the graph to the diagrams above---although the PulseView timing thing is disabled for these screenshots, I have confirmed multiple times that all signals are being clocked correctly.

Here's the end of a frame:

Look at how there is one last BSP rise and set of BCK pulses, with no data, at the end of the transmission, to match the datasheet diagram. The time between this pulse and INTB falling is also in spec---worth checking!

With the PulseView "Counter" decoder, I also checked the number of pulses, and they all match: 648 h/l for GCK, 641 pulses for BSP, and 124*641 = 79484 h/l for BCK.
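For reference, the expected totals are simple to compute ahead of time, which makes it obvious when a capture is off by one. A sketch in Python, assuming the 666.67 ns BCK h/l that a divider of 25 and four cycles per h/l should produce (the names are illustrative):

```python
from fractions import Fraction

BCK_HL_NS = Fraction(2000, 3)   # 666.67 ns: divider 25, four cycles per h/l
BCK_HL_PER_PASS = 124
BSP_PULSES = 641                # 640 data passes plus the final all-zero pass
GCK_HLS = 648

bck_hls = BCK_HL_PER_PASS * BSP_PULSES
gck_hl_us = BCK_HL_PER_PASS * BCK_HL_NS / 1000
frame_ms = GCK_HLS * gck_hl_us / 1000

print(bck_hls, float(gck_hl_us), float(frame_ms))
```

Under these assumptions one complete frame, from the GCK1 rising edge to the end of the last GCK h/l, takes a little under 54 ms.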

All these samples were recorded at 24 MHz, the maximum sample rate of my logic analyzer. Any slower and you lose the ability to accurately measure signal intervals.

Conclusion to Part 1

That's all! In theory, we can generate a complete LS021 video frame with zero CPU intervention. Part 2 will probably be hardware design for the Sharpie board, and this time there really will be a Part 2.