Bringing Multicore To The Arduino World With ShieldBuddy TC275
OK, the ShieldBuddy is basically just another Arduino, but is it? While it looks a bit like an Arduino MEGA or Due, you might notice that the CPU is a bit bigger (176 pins) and the PCB is red for danger. Certainly the connector arrangement is the same and it works with the Arduino IDE. However, what is not obvious is that the processor is running at 200MHz and hidden inside the LQFP package there are in fact three of them, along with 4MB of FLASH, 128kb of data flash and 500k of RAM.
Most Arduino-style boards use AVR or ARM/Cortex processors which are fine for basic messing about with micros - these chips are everywhere in consumer gadgets and devices. The ShieldBuddy is different, having the mighty Infineon Aurix TC275 processor. These are normally only to be found in state of the art engine management systems, ABS systems and industrial motor drives in your favourite German car. They rarely make it out into the daylight of the normal hobbyist/maker world and to date have only been known to a select few at Bosch, BMW, Audi, Daimler-Benz etc.
Hitex UK decided to change all that and bring this awesomely powerful chip to a wider audience. A few years ago we had a placement student with us who was using the Arduino Uno to control 4 servos in a model aeroplane. Unfortunately the combination of real time servo control and serial comms to the RC receiver was overloading the 16MHz AVR Atmega 328p, resulting in a lot of grief and frustration. To fix this problem and give more processing power than he could ever need, we decided to drop the Aurix TC275 from a confidential automotive project we had underway onto an Arduino Due format board. The ShieldBuddy was born and the placement student’s problems were solved using just one CPU core. The other 2 were just twiddling their thumbs. Having done the equivalent of putting a 27 litre V12 in a Fiat 500, we went off in search of other challenges. So what exactly is in the TC275 and what makes it so powerful?
Essentially we have three near-identical 200MHz 32-bit CPU cores on a shared bus, each with their own local RAM but sharing a common FLASH ROM. The peripherals (timers, port pins, Ethernet, serial ports etc.) are also shared, with each core having full access to any peripheral.
The TC275 CPU core design has a basic 5ns cycle time which means you can get typically around 150 to 200 32-bit instructions per microsecond. This is seriously fast when you consider that the Arduino Uno’s Atmega328P only manages around sixteen 8-bit instructions/us! In addition, there is a floating point unit on each core so using floating point variables does not slow things down significantly.
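The arithmetic behind these figures is easy to check. The snippet below is plain C++ that builds on a desktop rather than anything ShieldBuddy-specific, and the helper names are made up for illustration. It only counts clock cycles; treating one cycle as one instruction is an upper bound, which is why the real-world figure is quoted as a range.

```cpp
#include <cassert>
#include <cstdint>

// Length of one clock cycle in nanoseconds (integer division).
// Illustrative helper, not part of any ShieldBuddy library.
inline std::uint32_t cycleTimeNs(std::uint32_t clockHz) {
    assert(clockHz > 0);                 // avoid dividing by zero
    return 1000000000u / clockHz;
}

// Clock cycles available in one microsecond at the given clock frequency.
inline std::uint32_t cyclesPerMicrosecond(std::uint32_t clockHz) {
    return clockHz / 1000000u;
}
```

Feeding in 200000000 (200MHz) gives a 5ns cycle and 200 cycles per microsecond, while 16000000 (the Uno's 16MHz) gives 16 cycles per microsecond.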
With so much computing horsepower available, the TC275 can manage a huge range of peripherals. Besides commonplace peripherals like CAN, ADC, I2C, Ethernet, SPI etc. the TC275 has possibly the most powerful signal measurement and generation block to be found on any microcontroller (GTM) plus an advanced super-fast delta-sigma analog to digital converter.
The Generic Timer Module (GTM) is the main source of pulse generation and measurement functions containing over 200 IO channels. It is designed primarily for automotive powertrain control and electric motor drives. Unlike conventional timer blocks, time-processing units, CAPCOM units etc. it can work in both the time and angle domains without restriction. This is particularly useful for mechanical control systems, switch-reluctance motor commutation, crankshaft synchronisation etc.
Under the bonnet the GTM has around 3000 SFRs but fortunately you do not need to know any of these to realize useful functions! It is enormously powerful and the culmination of 25 years of meeting the needs of high-end automotive control systems. However it can be, and indeed has been, successfully applied to more general industrial applications, particularly in the field of motor control where it can drive up to 4 three-phase motors. The Arduino analogWrite() function makes use of it in a simple way to generate PWM. It can also flash an LED. There is a second timer block (GPT12) that can be used for encoder interfaces. Usefully, most port pins can generate direct interrupts.
With 176 pins required to get these peripherals out and only 100 pins on the Arduino Due form factor, some functions have had to be limited. The 32 ADC channels have been limited to 12 and the 48 potential PWM channels are also limited to 12, although more channels can be found on the double row expansion connector, if needed.
The View From The ShieldBuddy Driving Seat
So with 3 cores and loads of peripherals, how do you actually program it?
The standard Arduino IDE can be used, provided that the ShieldBuddy add-in has been installed. Programs can be written in exactly the same way as on an ordinary Arduino. However to make best use of the multicore TC275 processor, there are some specially implemented macros and functions available.
The Arduino IDE has been extended to allow the generation of Aurix instructions using the Hightec GCC compiler, available for free download. Anybody used to the default Arduino sketch might notice though that in addition to the familiar setup() and loop(), there is now a setup1(), loop1() and setup2(), loop2(). These new functions are for CPU cores 1 and 2 respectively. So while Core0 can be used as on any ordinary Arduino, the lucky programmer can now run three applications simultaneously.
Core0 can be regarded as the master core in the context of the Arduino as it has to launch the other two cores and then do all the initialisation of the Arduino IO, timer tick (for millis() and micros() and delay() etc.). Thus setup1() and setup2() are reached before setup()!
Although all three cores are notionally the same, in fact cores 1 and 2 are about 25% faster than core 0 as they have an extra pipeline stage. Thus it is usually best to put any heavyweight number crunching tasks on these cores.
Writing for a multicore processor can be a bit mind-bending at first. The first thing to realise is that there is only one ROM and the Arduino IDE just compiles source code. It has no idea (and does not need to know) which core a particular function will run on. It is only when the program runs that this becomes fixed. Any function called from setup() and loop() will run on core 0; any called from setup1() and loop1() will execute on core 1 and so on. Thus it is perfectly possible for the same function you wrote to execute simultaneously on all three cores. As there is only one image of this function in the FLASH, the internal bus structure of the Aurix allows all three cores to access the same instructions at the same addresses (worst case) at exactly the same time. Note that if this extreme case happens, there will be a slight loss of performance.
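As a rough illustration of the six entry points and the start-up order described above, the skeleton below can be compiled on a desktop as plain C++. The names setup(), loop(), setup1(), loop1(), setup2() and loop2() come from the ShieldBuddy sketch template; everything else (startOrder, noteStarted() and simulateStartup()) is a hypothetical stand-in for what the ShieldBuddy runtime does on real hardware, where each setup/loop pair runs on its own core. The relative order of setup1() and setup2() is an assumption.

```cpp
#include <cassert>

// Host-only log of which setup function ran, in order. On real hardware the
// three cores run concurrently, so a shared log like this would need a lock.
static int startOrder[3] = {-1, -1, -1};
static int started = 0;
static unsigned long core1Iterations = 0;

static void noteStarted(int core) {
    assert(started < 3);            // each setup function should run once
    startOrder[started++] = core;
}

// Core 0: behaves like an ordinary Arduino sketch.
void setup()  { noteStarted(0); }
void loop()   { /* light housekeeping, user interface, etc. */ }

// Core 1: one of the two faster cores, a good home for number crunching.
void setup1() { noteStarted(1); }
void loop1()  { ++core1Iterations; }

// Core 2: the other faster core.
void setup2() { noteStarted(2); }
void loop2()  { }

// Hypothetical stand-in for the start-up sequence: setup1() and setup2()
// are reached before setup().
void simulateStartup() {
    setup1();
    setup2();
    setup();
}
```

After simulateStartup() the last entry in startOrder is core 0, matching the point that the master core finishes its Arduino initialisation after the other two cores have been launched.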
Sharing of functions between cores is easy, provided that they do not make use of the peripherals! Whilst there are three cores, there are only two ADCs. If all three cores want to access the same ADC result register, there is no particular problem with this. However if you want a timer to generate an interrupt and call a shared function, then that function might need to know which core it is currently running on! This is easy to do as there is a macro defined to return the core number.
if (GetCpuCoreID() == 2)
{
    /* We must be running on core 2! */
}
Fortunately it is rare to have to do this but it is used extensively in the ShieldBuddy to Arduino translation layer.
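To make the idea concrete, here is a minimal sketch of a shared function that keeps separate bookkeeping for each core. It is written to build on a desktop, so GetCpuCoreID() is replaced by a stub driven by a variable; on the ShieldBuddy the real one is supplied for you. The names simulatedCoreId, eventsSeenByCore and sharedTimerHandler() are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in: on the ShieldBuddy, GetCpuCoreID() is provided by the
// board support and reports the core that is executing the call.
static thread_local std::uint32_t simulatedCoreId = 0;
static std::uint32_t GetCpuCoreID() { return simulatedCoreId; }

// One slot per core, so each core only ever writes its own counter.
static std::uint32_t eventsSeenByCore[3] = {0, 0, 0};

// A function that may be called from any core, e.g. from a timer interrupt.
void sharedTimerHandler() {
    const std::uint32_t core = GetCpuCoreID();
    assert(core < 3);               // the TC275 has cores 0, 1 and 2
    ++eventsSeenByCore[core];
}
```

Because each core indexes its own slot, the same function can run on all three cores at once without the cores treading on each other's data.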
One of the aims of the AURIX multicore design is to avoid the awkward programming issues that can arise in multicore processors and make the system architect’s job easier. The three independent cores exist within a single memory space (0x00000000 – 0xFFFFFFFF), so they are all able to access any address without restriction. This includes all the peripherals and importantly all FLASH and RAM areas.
Having a single global address space when accessing RAM can considerably ease the passing of data between cores using shared structures. Supporting high performance when doing this is achieved by the implementation of a crossbar bus system to connect the cores, memories and DMA systems. Of course there are protection mechanisms that can generate traps for such accesses if the application requires it, as they may indicate a program malfunction which would need to be handled in an orderly manner.
The upshot of this is that the programmer does not need to worry about cores accessing the same memory location (i.e. variable) at the same time. In some multicore processors (e.g. LPC4285) this would cause an exception and is regarded as an error. Certainly if you are new to multicore programming, this makes life much easier. Of course there could be a contention at the lowest level and this can result in extra cycles being inserted but given the speed of the CPU, this is unlikely to be an issue with Arduino-style applications.
With an application split across three cores, the immediate problem is how to synchronise operations. As the Aurix design allows RAM locations to be accessed by any core at any time, this is no problem. In the simplest case, global variables can be used to allow one core to send a signal to another. As an example, we might want to use the SerialASC.print() function (equivalent to Arduino Serial.print()) to allow each core to send a message to the Arduino Serial Monitor – something like “Hello From Core 0”, “Hello From Core 1” etc. However, this simple approach is unreliable and in this case can sometimes result in jumbled messages being printed.
What we need to do is make sure that each core waits in turn for the other cores to finish writing to the serial port. On the face of it this is quite easy using some global variables. However with true multicore programming, weird things can happen that don’t occur in single core and the obvious approach such as having a global variable that tells everybody whether the SerialASC port is being used may not work. The problem is that other cores can do anything at any time relative to each other. To solve this tricky problem the new uint32 Htx_LockResource(uint32 *ResourcePtr) function is used. This allows a peripheral or global variable to be “claimed” by a core and be inaccessible to others. To support this, the ShieldBuddy serial port classes have been extended by adding a “PortInUse” variable so that multicore support is now built in.
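Why does a plain global flag fail? Between one core reading "free" and writing "in use", another core can read "free" too, and both carry on. The fix is to make the read and the write a single indivisible step. The sketch below shows that idea in standard C++ using std::atomic so it can be tried on a desktop; it is an analogue of the technique, not the implementation of Htx_LockResource(), and the names tryClaim(), release(), sendMessage() and messagesSent are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins, not the ShieldBuddy API.
static std::atomic<std::uint32_t> portInUse{0};   // 0 = free, 1 = claimed
static std::uint32_t messagesSent = 0;            // stands in for the serial port

// Try to claim the resource. exchange() reads the old value and writes 1 as
// one indivisible step, so two cores can never both see "free".
bool tryClaim(std::atomic<std::uint32_t>& resource) {
    return resource.exchange(1u, std::memory_order_acquire) == 0u;
}

// Give the resource back so that another core can claim it.
void release(std::atomic<std::uint32_t>& resource) {
    assert(resource.load() == 1u);  // only a holder should release
    resource.store(0u, std::memory_order_release);
}

// Called by any core that wants to print. Spins until the port is free.
void sendMessage() {
    while (!tryClaim(portInUse)) {
        // another core holds the port; try again
    }
    ++messagesSent;       // only the core holding the claim reaches this line
    release(portInUse);
}
```

A second tryClaim() on a port that is already claimed returns false, which is exactly the answer the naive global-variable version cannot give reliably.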
Another way is to get one core to create an interrupt in another core to tell it to do something. The Arduino language has been extended to allow you to trigger an interrupt in another core. This means that core 0 can trigger an interrupt in, say, core 1. That interrupt might tell core 1 that a resource is now free or perhaps tell it to go and read a global variable that core 0 has just updated.
/* Create an interrupt in core 1 */
Here Core1IntService is a function written by the user that Core 1 will execute when Core 0 requests it to do so. Functions are available to let any core request an interrupt function to run in any other core.
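The pattern can be modelled on a desktop with a request flag that the target core checks. To be clear, this is a polled software stand-in and not a hardware interrupt, and it does not use the ShieldBuddy functions: CoreRequestLine, attach(), request() and service() are all invented for this sketch. Only the name Core1IntService is borrowed from the text.

```cpp
#include <atomic>
#include <cassert>

using Handler = void (*)();

// Hypothetical software model of "core 0 asks core 1 to run a function".
// On the real board this is a hardware interrupt; here it is a polled flag.
struct CoreRequestLine {
    Handler handler = nullptr;
    std::atomic<bool> pending{false};

    // Called once by the target core to register its service function.
    void attach(Handler h) {
        assert(h != nullptr);
        handler = h;
    }

    // Called by the requesting core.
    void request() { pending.store(true, std::memory_order_release); }

    // Called by the target core; runs the handler if a request is waiting.
    bool service() {
        if (handler == nullptr) return false;
        if (!pending.exchange(false, std::memory_order_acquire)) return false;
        handler();
        return true;
    }
};

static int sharedValue = 0;       // written by "core 0" before the request
static int valueSeenByCore1 = 0;  // filled in by the handler on "core 1"

// The user-written function that "core 1" runs when asked to.
void Core1IntService() { valueSeenByCore1 = sharedValue; }
```

In use, core 0 would write sharedValue and then call request(); the next time core 1 calls service(), Core1IntService() runs and picks up the new value.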
Multicore Memory Support
The Arduino IDE gives no clue as to which address any variable goes to or even what memory is available. If you are not bothered about execution speed or are only using core 0, then variables can be declared just as on any other Arduino board. However, if you are using cores 1 & 2, having some idea how the physical memory is arranged inside the TC275 can make a huge difference to the maximum performance that can be obtained. A global variable declared in the usual way will end up in the core 0 SRAM (“DSPR0”). If this is only used by core 0 then the access time will be very fast. This is because each of the RAMs appears at two addresses in the memory map. Core 0’s DSPR RAM appears to be at 0xD0000000 where it is considered to be local and is directly on core 0’s local internal bus. It is also visible to the other cores at 0x70000000 so that they can read and write it freely. The penalty is that the access will be via a bus system that all cores can access (the SRI) which unfortunately is much slower and can be influenced by other traffic between cores. Thus all the cores have local RAM that is visible to the other cores, albeit at reduced speed. There is a fourth RAM area (“LMU”) which is not tied directly to any core and which all cores have fast access to. This is useful for shared variables that are heavily used by all cores.
As cores 1 & 2 are the fast cores, it makes sense to put their variables into their local RAMs but as standard, the Arduino IDE has no support for this. For the ShieldBuddy, a series of ready-made macros are available that allow you to put variables into any of these SRAM areas easily.
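The two views of core 0's RAM differ only in their base address, so converting between them is simple arithmetic. The helper below is a small illustration using the two base addresses quoted above; the function and constant names are invented, and because the size of the RAM window is not stated here, it simply treats any 0xDxxxxxxx address as local, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Base addresses quoted in the text for core 0's data RAM (DSPR).
constexpr std::uint32_t kCore0DsprLocalBase  = 0xD0000000u;  // fast, local bus
constexpr std::uint32_t kCore0DsprGlobalBase = 0x70000000u;  // via SRI, any core

// True if the address lies in the 0xDxxxxxxx local window.
inline bool isCore0LocalAddress(std::uint32_t addr) {
    return (addr & 0xF0000000u) == kCore0DsprLocalBase;
}

// Convert core 0's local view of an address into the view other cores use.
inline std::uint32_t core0LocalToGlobal(std::uint32_t localAddr) {
    assert(isCore0LocalAddress(localAddr));
    return localAddr - kCore0DsprLocalBase + kCore0DsprGlobalBase;
}
```

So a variable that core 0 sees at 0xD0000010 is the same physical location that cores 1 and 2 would reach, more slowly, at 0x70000010.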
Using these macros for core 1 and 2 data will give a significant increase in performance and is highly recommended.
What To Do With Three 200MHz 32-bit Cores?
For new programmers, it is probably best to stick to just core 0 and treat the ShieldBuddy like an ordinary but very fast Arduino Uno. You can develop your application on just one core. If it starts to get too big or if you have finished one complete functional unit you can move it onto another core and then start on the next bit on core 0. If you have a big system that runs on more than one Arduino Uno, you can run the individual programs on separate cores in the ShieldBuddy. You could just set up each 32-bit, 200MHz core to flash the user LED in sequence for no particular reason.
Another approach when starting out on what could be a major program development is to try and split the software into three parts and then run each part on its own core. You might have an application which needs to drive an LCD graphics screen (e.g. the TFT Touch Shield), run some motors and perform a lot of serial communications like UART and Ethernet. Here each of these blocks can have their own core and communicate using the methods given earlier.
In existing applications where the software is split across a number of Arduino Unos, the software from each board could be merged and be run on the three cores in the ShieldBuddy.
Perhaps not a mainstream application but the ShieldBuddy TC275 has the distinction of holding the world record for a machine solution to the famous Rubik's cube at just 637 milliseconds! The previous record holder used an Arduino Uno. The original Arduino program was dropped into the ShieldBuddy's extended Arduino IDE, resulting in a huge speed increase.
For serious users, the programming approach required for multicore can be experimented with and learned. With an increasing number of multicore processors around these days, being able to fiddle with multicore programming in a friendly Arduino-based environment is very useful. In higher education, the ShieldBuddy is a great platform for teaching as it spans the first attempts to flash an LED on one core up to PhD-level software engineering on all three cores, all using just one set of tools.
I Want A ShieldBuddy!
To get your hands on a ShieldBuddy, it is now available at RS Stock No