If you don’t need the additional GPIO’s on the Arduino Mega, you can instead use the Teensy adapter for more memory and faster speeds.
This blog post talks about adding 128 KBytes of RAM to the Arduino Mega over SPI, and some of what I learned through the process.
SPI-RAM (23LC1024, 128KBytes, SPI)
I chose Arduino Mega because of its 5V GPIO’s but it has only 8KB RAM. My goal was to keep the design simple (no level shifters) so I decided to solve the memory problem later on. That time has come now; I want to run OS’s and want to design the RetroShield 68008.
- Use the Arduino Mega's external memory interface. I did not know this early enough: it is possible to add parallel RAM to Arduino Mega as external memory and modify Atmel compiler to use that. Slightly complicated but doable. It’s too late for us because I had to use those pins for CPU signals.
- Add Parallel Memory to RetroShield. I was serious about this earlier and added the two rows of header pins along the CPU pins so one could design a board that plugs into RetroShield. At that time I didn’t know how to support all processors with one PCB design but that was a problem I was going to solve later on.
- Parallel RAM connected to Arduino and bit-banged. Requires many GPIO’s (8-bit data, 16~18-bit address, 2 control).
- SPI-RAM connected to Arduino.
The last option was the simplest and cheapest:
- 23LC1024 from Microchip has 128 KBytes of RAM.
- It uses SPI interface and supports 1-, 2-, 4-bit SPI modes.
- Bit-banging the 4-bit SQI mode is fast enough (the right-most mode in the picture).
Commands required to read/write from memory.
With sequential mode (enabled by default), after you send the initial address, you can continue to clock data bytes back to back and the 23LC1024 will auto-increment the address pointer. This is very handy for page reads/writes during cache operations.
You can also switch to SQI (4-bit) mode. In this mode, we send 8 bits of data in two clock cycles. Note that an SQI read needs a dummy byte read after the address, before the real data arrives. You don’t have to do this for SQI write transactions.
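To make the transaction layout concrete, here is a minimal sketch of how the header bytes could be built and split into nibbles for SQI. The opcode values (0x03 for READ, 0x02 for WRITE) and the high-nibble-first ordering are my reading of the 23LC1024 datasheet, so verify them before relying on this. The function names `sram_build_header` and `sram_to_nibbles` are hypothetical and are not taken from the RetroShield code.

```cpp
#include <cstdint>
#include <cstddef>

// Opcodes as I read them in the 23LC1024 datasheet (assumption: verify).
const uint8_t SRAM_CMD_READ  = 0x03;
const uint8_t SRAM_CMD_WRITE = 0x02;

// Fill `frame` with the opcode followed by a 24-bit address, MSB first.
// Returns the number of bytes written (always 4).
size_t sram_build_header(uint8_t cmd, uint32_t addr, uint8_t frame[4])
{
    frame[0] = cmd;
    frame[1] = (uint8_t)((addr >> 16) & 0xFF);
    frame[2] = (uint8_t)((addr >> 8) & 0xFF);
    frame[3] = (uint8_t)(addr & 0xFF);
    return 4;
}

// Split bytes into 4-bit nibbles, high nibble first.
// In SQI mode one nibble goes out on the four data lines per clock.
// `nibbles` must have room for 2 * n entries.
size_t sram_to_nibbles(const uint8_t *bytes, size_t n, uint8_t *nibbles)
{
    for (size_t i = 0; i < n; i++) {
        nibbles[2 * i]     = (uint8_t)(bytes[i] >> 4);
        nibbles[2 * i + 1] = (uint8_t)(bytes[i] & 0x0F);
    }
    return 2 * n;
}
```

For a read, the header would be followed by one dummy byte and then the data bytes; for a write, the data bytes follow the header directly.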
Implementing a Cache
As you notice, even though we use 4-bit mode, reading or writing a byte from SPI-RAM is expensive because we have to send 4 bytes (cmd + 24-bit address) for 1 byte of data. Doing this for every CPU read/write will be slow. We can implement a simple cache to speed things up.
A cache is basically a small fast memory area where we fetch data from slower memory and hope the CPU will be accessing this fast memory frequently instead of the slow memory. If things go well, memory looks fast to the CPU on average.
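A rough way to see the overhead is to count clocks. The sketch below assumes SQI mode (2 clocks per byte) and a read transaction made of 1 command byte, 3 address bytes and 1 dummy byte before the data. It ignores the Arduino instructions spent bit-banging each clock, so treat the numbers as relative, not absolute. `sqi_read_clocks` is a hypothetical helper.

```cpp
#include <cstdint>

// Clocks needed for one SQI read transaction that returns `nbytes` data bytes.
// Assumptions: 2 clocks per byte, overhead = command + 24-bit address + dummy.
uint32_t sqi_read_clocks(uint32_t nbytes)
{
    const uint32_t overhead_bytes = 1 + 3 + 1;
    return 2 * (overhead_bytes + nbytes);
}
```

Under these assumptions a single-byte read costs 12 clocks, while a 256-byte sequential read costs 522 clocks, or roughly 2 clocks per byte. That difference is what a page-based cache tries to exploit.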
Since the size of this fast memory is much smaller than the slow memory, we bring data in “blocks” or “pages” and keep track of which page is in cache or not.
When the CPU tries to access a memory location, we check if the page containing that address is in the cache. If the page is in the cache, we access the page in the cache (cache-hit). If the page is not in the cache, then we need to bring the page from SPI-RAM into the cache and then complete the access (cache-miss). The expectation is that cache-hits will far outnumber cache-misses.
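The trade-off between hits and misses can be written as a weighted average. This is a minimal sketch; the function name and the cost units are hypothetical, and the real costs depend on how the bit-banged transfers are implemented.

```cpp
// Average cost of one memory access, given the fraction of accesses that hit
// the cache and the cost of a hit and of a miss (same units for both costs).
double average_access_cost(double hit_ratio, double hit_cost, double miss_cost)
{
    return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_cost;
}
```

Because a miss is so much more expensive than a hit, even a small drop in the hit ratio raises the average noticeably, which is why the parameters discussed next matter.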
There are a few parameters to tune for an effective cache:
- Page Size: Page size defines the size of the block copied back and forth between the memory and the cache. Larger page sizes might mean a higher probability of cache-hits, but also a higher penalty for cache-misses (it takes longer to copy a page from memory). Page sizes that are too small might reduce cache effectiveness due to a higher number of cache-misses and more memory transactions.
- Cache size: more cache is better from a hit point of view, but more cache means we will spend more time checking if the cache contains the address the CPU is trying to access. We can parallelize this search with hardware (as most modern CPU’s do) but with Arduino, searching the cache means a for-loop :) So we need to watch this.
- Write-Policy: there is the question of write-thru or write-back mode. During a memory write, do you write the data to slow memory immediately or not? The advantage of not writing immediately is speed, but if we delay writes, then the page becomes dirty and we need to write the whole page back to memory at some point (or incur data loss/conflicts if power is lost and/or if multiple processors access the same memory).
- Cache Policy: This is the key algorithm to pick how to replace cache pages w/ data from slow memory. If not done right, you will be replacing pages unnecessarily, wasting time. One method is Least-Recently-Used (LRU), in which you keep track of when each page was last accessed and discard the one that has gone unused the longest.
Tuning of these parameters will depend on the program you are running and its memory access pattern.
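As an illustration of the replacement policy, here is a minimal LRU sketch that uses a for-loop search over a tiny cache. It is not the scheme used in the code further down (that one is direct-mapped); the struct, the function names and the 4-page size are all hypothetical. It tracks only tags and timestamps, and loading the page data is left out.

```cpp
#include <cstdint>

const int LRU_PAGES = 4;               // hypothetical, deliberately small

struct LruCache {
    uint16_t tag[LRU_PAGES];           // which memory page each slot holds
    uint32_t lastUsed[LRU_PAGES];      // access counter value at last use
    bool     valid[LRU_PAGES];         // false until the slot is first filled
    uint32_t clock;                    // incremented on every access
};

void lru_init(LruCache &c)
{
    for (int i = 0; i < LRU_PAGES; i++) {
        c.tag[i] = 0;
        c.lastUsed[i] = 0;
        c.valid[i] = false;
    }
    c.clock = 0;
}

// Look up `tag`. Returns the slot that holds it after the call and sets `hit`.
// On a miss the slot with the smallest lastUsed value is replaced; empty slots
// have lastUsed == 0, so they are filled before any valid page is evicted.
int lru_access(LruCache &c, uint16_t tag, bool &hit)
{
    c.clock++;
    int victim = 0;
    for (int i = 0; i < LRU_PAGES; i++) {
        if (c.valid[i] && c.tag[i] == tag) {
            c.lastUsed[i] = c.clock;   // refresh: this page was just used
            hit = true;
            return i;
        }
        if (c.lastUsed[i] < c.lastUsed[victim]) {
            victim = i;
        }
    }
    c.tag[victim] = tag;               // a real cache would load the page here
    c.valid[victim] = true;
    c.lastUsed[victim] = c.clock;
    hit = false;
    return victim;
}
```

The cost of this flexibility is the loop itself: every access scans all slots, which is the search overhead mentioned under cache size.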
This is a good write-up that explains cache concepts.
On Arduino, I initially chose a Direct-Mapped Cache because it is easy and a good starting point. A Direct-Mapped Cache uses part of the address bits to map to a fixed cache location and uses the rest of the address as the tag.
Let’s go over the code together:
////////////////////////////////////////////////////////////////////
// Cache for SPI-RAM
////////////////////////////////////////////////////////////////////
byte cacheRAM[16][256];
byte cachePage[16];
So,
- cacheRAM is the cache itself: 16 pages, each 256 bytes long.
- cachePage keeps track of which memory page is currently copied into the corresponding cache page (it stores the tag).
Let’s see how we read using cache:
inline __attribute__((always_inline))
byte cache_read_byte(word addr)           // 0x1234
{
  byte a = (addr & 0xFF00) >> 8;          // a = 0x12
  byte p = a >> 4;                        // p = 0x01
  byte n = a & 0x0F;                      // n = 0x02
  byte r = (addr & 0x00FF);               // r = 0x34
  if (cachePage[n] == p)
  {
    // Cache Hit !!!
    return cacheRAM[n][r];
  }
  else
  {
    // Need to fill cache from SPI-RAM
    digitalWrite2f(LED2, HIGH);
    spi_read_byte_array_quad(0, addr & 0xFF00, 256, cacheRAM[n]);
    cachePage[n] = p;
    digitalWrite2f(LED2, LOW);
    return cacheRAM[n][r];
  }
}
Looking at code above, we split the 16-bit address into three pieces: p, n, r. For example, address of 0x1234 becomes p=0x1, n=0x2, r=0x34.
We use n to find the corresponding cacheRAM page, cacheRAM[n][...]. r points to byte in that page, cacheRAM[n][r]. Last, we use p as tag, which shows what address is saved in this cache area, and saved under cachePage[n] = p.
If cachePage[n] == p, then we have the page in the cache, hence a cache-hit. Otherwise, cache-miss which results in copying data from SPI-RAM.
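The same split can be tried on a PC to see which addresses compete for a cache page. This is a minimal sketch with hypothetical names (`CacheAddr`, `split_address`, `same_slot_different_page`); it assumes the 4-bit tag, 4-bit index, 8-bit offset layout used by the code above.

```cpp
#include <cstdint>

struct CacheAddr {
    uint8_t tag;      // p: top 4 bits, stored in cachePage[n]
    uint8_t index;    // n: selects one of the 16 cache pages
    uint8_t offset;   // r: byte position inside the 256-byte page
};

CacheAddr split_address(uint16_t addr)
{
    CacheAddr a;
    a.tag    = (uint8_t)(addr >> 12);
    a.index  = (uint8_t)((addr >> 8) & 0x0F);
    a.offset = (uint8_t)(addr & 0xFF);
    return a;
}

// True when two addresses map to the same cache page but carry different
// tags, meaning that accessing one evicts the page holding the other.
bool same_slot_different_page(uint16_t x, uint16_t y)
{
    CacheAddr a = split_address(x);
    CacheAddr b = split_address(y);
    return a.index == b.index && a.tag != b.tag;
}
```

For example, 0x1234 and 0x5234 share n=0x2 but have different tags, so a program that alternates between them would miss on every access with a direct-mapped cache.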
cache_write_byte is the same concept except we write the data to SPI-RAM immediately. (I will experiment with dirty caches later on).
inline __attribute__((always_inline))
void cache_write_byte(word addr, byte din) // 0x1234
{
  byte a = (addr & 0xFF00) >> 8;          // a = 0x12
  byte p = a >> 4;                        // p = 0x01
  byte n = a & 0x0F;                      // n = 0x02
  byte r = (addr & 0x00FF);               // r = 0x34
  if (cachePage[n] == p)
  {
    // Cache Hit !!!
    cacheRAM[n][r] = din;
    spi_write_byte_quad(0, addr, din);    // Write-thru cache :)
    return;
  }
  else
  {
    // Need to fill cache from SPI-RAM
    digitalWrite2f(LED1, HIGH);
    spi_write_byte_quad(0, addr, din);
    spi_read_byte_array_quad(0, addr & 0xFF00, 256, cacheRAM[n]);
    cachePage[n] = p;
    digitalWrite2f(LED1, LOW);
    return;
  }
}
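void cache_init()
{
  // Mark every cache page as invalid. Tags are 4 bits (0..15), so 0xFF
  // never matches and the first access to each page loads it from SPI-RAM.
  for(int p=0; p<16; p++)
  {
    cachePage[p] = 0xFF;
  }
  Serial.println("RAM Cache - Initialized.");
}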
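For comparison, here is a rough sketch of what a write-back ("dirty") variant of the same direct-mapped cache could look like. This is not the code from the repository. To keep it self-contained and runnable on a PC, a plain array (`fakeSpiRam`) stands in for the 23LC1024, and `memcpy` stands in for the page read/write transfers; all `wb`-prefixed names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

uint8_t  fakeSpiRam[65536];            // stand-in for the SPI-RAM (assumption)
uint8_t  wbCacheRAM[16][256];
uint8_t  wbCachePage[16];
bool     wbDirty[16];                  // page modified since it was loaded?
unsigned wbPageWrites = 0;             // counts page write-backs, for testing
const uint8_t WB_INVALID = 0xFF;       // tags are 4-bit, so 0xFF never matches

void wb_init()
{
    for (int i = 0; i < 16; i++) {
        wbCachePage[i] = WB_INVALID;
        wbDirty[i] = false;
    }
    wbPageWrites = 0;
}

// Replace cache page n with memory page (p, n); flush the old page if dirty.
void wb_load(uint8_t n, uint8_t p)
{
    if (wbCachePage[n] != WB_INVALID && wbDirty[n]) {
        uint16_t oldBase = (uint16_t)((wbCachePage[n] << 12) | (n << 8));
        std::memcpy(&fakeSpiRam[oldBase], wbCacheRAM[n], 256);
        wbPageWrites++;
    }
    uint16_t base = (uint16_t)((p << 12) | (n << 8));
    std::memcpy(wbCacheRAM[n], &fakeSpiRam[base], 256);
    wbCachePage[n] = p;
    wbDirty[n] = false;
}

uint8_t wb_read_byte(uint16_t addr)
{
    uint8_t p = (uint8_t)(addr >> 12);
    uint8_t n = (uint8_t)((addr >> 8) & 0x0F);
    uint8_t r = (uint8_t)(addr & 0xFF);
    if (wbCachePage[n] != p) wb_load(n, p);
    return wbCacheRAM[n][r];
}

void wb_write_byte(uint16_t addr, uint8_t din)
{
    uint8_t p = (uint8_t)(addr >> 12);
    uint8_t n = (uint8_t)((addr >> 8) & 0x0F);
    uint8_t r = (uint8_t)(addr & 0xFF);
    if (wbCachePage[n] != p) wb_load(n, p);
    wbCacheRAM[n][r] = din;            // only the cache is updated here
    wbDirty[n] = true;                 // slow memory is updated on eviction
}
```

The trade-off is visible in `wb_load`: writes that hit become as cheap as reads, but a miss on a dirty page now costs two page transfers instead of one, and the data in slow memory is stale until the page is evicted.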
The code is checked into the GitLab repository.
That’s all folks.