
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip

by A D Reid, K Flautner, E Grimley-Evans, Yuan Lin

Abstract

The architectures of system-on-chip (SoC) platforms found in high-end consumer devices are getting more and more complex as designers strive to deliver increasingly compute-intensive applications on near-constant energy budgets. Workloads running on these platforms require the exploitation of heterogeneous parallelism and increasingly irregular memory hierarchies. The conventional approach to programming such hardware is very low-level, but this yields software which is intimately and inseparably tied to the details of the platform it was originally designed for, limiting the software's portability and, ultimately, the architectural choices available to designers of future platform generations. The key insight of this paper is that many of the problems experienced in mapping applications onto SoC platforms come not from deciding how to map a program onto the hardware but from the need to restructure the program and the number of interdependencies introduced in the process of implementing those decisions. We tackle this complexity with a set of language extensions which allows the programmer to introduce pipeline parallelism into sequential programs, manage distributed memories, and express the desired mapping of tasks to resources. The compiler takes care of the complex, error-prone details required to implement that mapping. We demonstrate the effectiveness of SoC-C and its compiler with a "software defined radio" example (the PHY layer of a Digital Video Broadcast receiver) achieving a 3.4x speedup on 4 cores.


Available from portal.acm.org

SoC-C: Efficient Programming Abstractions for Heterogeneous Multicore Systems on Chip

Alastair D. Reid, Krisztian Flautner, Edmund Grimley-Evans (ARM Ltd)
Yuan Lin (University of Michigan)

ABSTRACT

The architectures of system-on-chip (SoC) platforms found in high-end consumer devices are getting more and more complex as designers strive to deliver increasingly compute-intensive applications on near-constant energy budgets. Workloads running on these platforms require the exploitation of heterogeneous parallelism and increasingly irregular memory hierarchies. The conventional approach to programming such hardware is very low-level but this yields software which is intimately and inseparably tied to the details of the platform it was originally designed for, limiting the software's portability, and, ultimately, the architectural choices available to designers of future platform generations. The key insight of this paper is that many of the problems experienced in mapping applications onto SoC platforms come not from deciding how to map a program onto the hardware but from the need to restructure the program and the number of interdependencies introduced in the process of implementing those decisions. We tackle this complexity with a set of language extensions which allows the programmer to introduce pipeline parallelism into sequential programs, manage distributed memories, and express the desired mapping of tasks to resources. The compiler takes care of the complex, error-prone details required to implement that mapping. We demonstrate the effectiveness of SoC-C and its compiler with a "software defined radio" example (the PHY layer of a Digital Video Broadcast receiver) achieving a 3.4x speedup on 4 cores.

Categories and Subject Descriptors: D.3.3 [Software]: Programming Languages

General Terms: Languages

1. INTRODUCTION

In the next five years the peak available bandwidth to mobile phones is expected to increase from less than 5 Mbps today to 100 Mbps in 2012. The signal-processing throughput to implement these protocols is expected to increase to beyond 25 giga-operations per second. Commodity cameras on phones already support 10M pixel resolution which further drives the need for high-speed multimedia image processing, high-definition video coding and 3D graphics. To maintain the same form-factor, this massive performance must be achieved without increasing battery size which limits the power consumption to around 1 Watt.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES'08, October 19-24, 2008, Atlanta, Georgia, USA. Copyright 2008 ACM 978-1-60558-469-0/08/10 ...$5.00.

Modern DSP designs are starting to achieve the required energy efficiency. For example, ARM's prototype data processing engine can sustain over 10 GMAC/s at less than 300mW in 65nm technology. The main problem is not creating energy-efficient hardware but creating efficient, maintainable programs to run on them. In order to save energy and, to some extent, silicon area, high performance embedded systems eschew features that characterize today's high-end multiprocessor systems:

  • Homogeneous processors are replaced by a heterogeneous mix of specialized processors tuned to particular parts of the expected workload
  • General-purpose processors programmed in C, C++, etc. are supplemented by special-purpose accelerator engines which may be fixed-function, configurable or programmable using a C subset
  • Shared memory is replaced by multiple private memories to decrease latency and energy and increase bandwidth and
  • Hardware cache coherency is omitted to save area and power consumed by cache coherence protocols.

Omitting these features from high performance embedded systems requires programmers to adopt a very low-level, error-prone programming style that limits portability and maintainability.

The key insight of this paper is that these problems come not from deciding how to map the application onto the hardware but from the restructuring of the code and the number of interdependencies introduced in the process of implementing those decisions. Rather than abandon features because of their hardware cost, SoC-C moves their implementation into the language so that the programmer can reason about and optimize the mapping at a high level while the compiler takes care of the complex, error-prone details required to implement that mapping.

SoC-C is a set of language extensions that enables programmers to express efficient system-on-chip programs that exploit the parallelism available in the platform, provides programmers with control over how the many different processing elements in the platforms are used, and requires little or no restructuring when the application is subsequently ported within a family of platform architectures.

This paper makes the following contributions:

  • We describe channel-based decoupling: a novel spin on existing ways to automatically introduce pipeline parallelism that allows programmers to trade off determinism for scheduling freedom and is capable of handling the complex control flow that real applications require.
  • We propose a novel way of expressing the data copying that must happen in a distributed memory system. Our annotations express the programmer's intent allowing the compiler to detect missing or incorrect copy operations.
  • We describe an inference mechanism that significantly reduces the amount of annotation required to map an application onto a hardware platform.
  • We identify the critical optimizations required to support the high level programming model. With these optimizations, SoC-C can achieve accelerator utilization levels of 94% and a speedup of 3.4x on a platform with 4 accelerators on a real workload.

    // Data placement
    declaration ::= type variable @ { memory1, ... memoryn }
    expression  ::= variable @ memory
    statement   ::= SYNC(variable[,memory[,memory]] ) @ processor

    // Code placement
    expression  ::= identifier( expression, ... expression ) @ processor

    // Fork-join parallelism
    statement   ::= parallel sections {
                      section { compound-statement }
                      . . .
                      section { compound-statement }
                    }

    // Pipeline parallelism
    statement   ::= pipeline { compound-statement }
    statement   ::= FIFO ( variable )

Figure 1: SoC-C syntax extensions.

The paper is structured as follows. Section 2 describes a set of obvious minimal extensions to C to support heterogeneous, distributed parallel systems and introduces an example to illustrate why these extensions are necessary but insufficient for programming complex SoCs. Thus motivated, Sections 3-6 make a series of improvements showing how each extension improves the running example and we evaluate the expressiveness of the extensions in Section 7. Sections 8 and 9 discuss optimizations and performance. Section 10 discusses related work and Section 11 concludes.

This paper does not address how the best application mapping can be generated automatically using program analysis, profiling, iterative compilation, etc. for two reasons. The first is that the mechanism used to choose a mapping is largely orthogonal to the mechanism used to act on those decisions. The second is that there is no single obvious property to optimize for in embedded systems.
Depending on the system, one may want to optimize for some combination of battery life, low-latency user experience, meeting real-time deadlines, reducing the number of retransmits, code size, etc.

2. A MINIMAL EXTENSION TO C

This section considers minimal extensions to C to support heterogeneous multiprocessor systems with distributed memory and shows that whilst these or similar extensions are necessary (and form the basis of SoC-C), they are not sufficient for creating high performance, maintainable programs. This sets the stage for later sections which describe further extensions and optimizations to tackle these problems.

The extensions considered in this section are those required to introduce parallelism, control sharing of resources and variables, communicate between threads, map data onto memories and map code onto processors/accelerators. Our descriptions of the extensions are brief because they are based on extensions found in other languages such as OpenMP (which inspired our notation), Concurrent Pascal, etc. Figure 1 summarizes all the extensions discussed in this paper.

    complex_t samples[2048];
    bool bits[3024];
    int8_t bytes[378];
    int timing_correction = 0;
    while (1) {
      ADC_get(&adc,&samples,2048);
      AdjustTiming(timing_correction,samples);
      FFT(samples);
      timing_correction += FindTimeOffset(samples);
      Demodulate(bits,samples);
      ErrorCorrect(bytes,bits);
    }

Figure 2: A simplified OFDM radio receiver.

Parallel sections introduce fork-join parallelism where a single master thread forks multiple child tasks (which may also fork child tasks) and waits for all children to complete. The statement

    parallel_sections{
      section{ statement1 }
      section{ statement2 }
    }

executes statement1 and statement2 in parallel and completes when both statements complete. Parallel sections can be implemented by forking one thread per section and then waiting for all threads to complete.
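The fork-one-thread-per-section, then-join implementation mentioned above can be sketched in plain C with POSIX threads. This is only an illustration of the idea, not the SoC-C compiler's output: the `section_t` type, the `parallel_sections` function, the `MAX_SECTIONS` limit and the two demo bodies are all hypothetical names introduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_SECTIONS 16  /* assumed upper bound, for this sketch only */

/* Hypothetical descriptor for one section: a body and its argument. */
typedef struct {
    void (*body)(void *);
    void *arg;
} section_t;

/* Thread entry point: run one section's body. */
static void *run_section(void *p) {
    section_t *s = (section_t *)p;
    s->body(s->arg);
    return NULL;
}

/* Fork one thread per section, then wait for every started thread.
 * Returns 0 on success, -1 if n is too large, or a pthread error code. */
int parallel_sections(section_t *sections, size_t n) {
    pthread_t tids[MAX_SECTIONS];
    size_t started = 0;
    int err = 0;

    assert(sections != NULL);
    if (n > MAX_SECTIONS) return -1;

    for (; started < n; started++) {
        err = pthread_create(&tids[started], NULL, run_section,
                             &sections[started]);
        if (err != 0) break;  /* stop forking, but still join below */
    }
    for (size_t i = 0; i < started; i++) {
        int join_err = pthread_join(tids[i], NULL);
        if (err == 0) err = join_err;
    }
    return err;
}

/* Two example section bodies. Each writes a different variable,
 * so the sections do not race with each other. */
static void set_to_one(void *arg) { *(int *)arg = 1; }
static void set_to_two(void *arg) { *(int *)arg = 2; }

/* Runs the two bodies as parallel sections and returns the result
 * of parallel_sections. */
int demo_parallel_sections(int *a, int *b) {
    section_t s[2] = { { set_to_one, a }, { set_to_two, b } };
    return parallel_sections(s, 2);
}
```

Calling `demo_parallel_sections(&a, &b)` returns only after both bodies have finished, which is the join half of fork-join. Nothing in the sketch prevents two sections from touching the same data; that remains the caller's problem.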
Since this is the basic mechanism for expressing all parallelism, it is the programmer's responsibility to avoid race conditions, deadlock, etc.

Channels synchronize/communicate between threads. FIFO channels provide two operations: 'fifo_put' atomically transfers data into the channel and 'fifo_get' atomically transfers data out; each blocks if the channel is full or empty, respectively. These atomic-transfer semantics ensure that each thread has exclusive access to the data.

Data placement annotations map variables to memories. A variable declaration of the form

    type V @ M

instructs the SoC-C compiler and linker to place the variable 'V' in memory 'M'.

Code placement annotations perform RPCs. A function call of the form

    function(expr1, ... exprm) @ P

is compiled into a synchronous remote procedure call: the function is invoked on processing element 'P'. Unlike most RPC implementations, the call-frame (i.e., which function to call and any scalar and pointer arguments) is copied to the processing element but bulk data structures are not copied. This reflects our design goal of giving the programmer control over data copying to let them tune memory use and the impact on timing.

To illustrate these minimal extensions, consider mapping the sequential program in Figure 2 onto the architecture shown in Figure 3. This program displays two different types of data dependency which must be handled when parallelizing the program. There is forward dataflow within a loop iteration carrying complex samples from the ADC through timing correction, an FFT, demodulation and error correction. There is also a feedback loop from one iteration to the next which continuously monitors changes in the timing offset between the transmitter and the receiver (caused by slight differences in clock rates, Doppler effects, etc.) and which is used to control timing correction in future iterations.
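The blocking behaviour described for FIFO channels earlier in this section can be sketched as a bounded queue guarded by a mutex and two condition variables. The operation names `fifo_put` and `fifo_get` come from the text, but their signatures here, the `fifo_t` type, `fifo_init`, the capacity of 4 and the use of `int` items are assumptions made for illustration. In particular, this sketch copies small values, whereas the channels in the paper hand over exclusive access to data.

```c
#include <assert.h>
#include <pthread.h>

#define FIFO_CAPACITY 4  /* assumed channel depth */

/* Hypothetical channel type: a ring buffer of ints plus its locks. */
typedef struct {
    int slots[FIFO_CAPACITY];
    int head;   /* index of the next item to take */
    int tail;   /* index of the next free slot */
    int count;  /* number of items currently queued */
    pthread_mutex_t lock;
    pthread_cond_t not_full;
    pthread_cond_t not_empty;
} fifo_t;

void fifo_init(fifo_t *f) {
    f->head = f->tail = f->count = 0;
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->not_full, NULL);
    pthread_cond_init(&f->not_empty, NULL);
}

/* Transfer one item into the channel, blocking while it is full. */
void fifo_put(fifo_t *f, int item) {
    pthread_mutex_lock(&f->lock);
    while (f->count == FIFO_CAPACITY)
        pthread_cond_wait(&f->not_full, &f->lock);
    f->slots[f->tail] = item;
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->lock);
}

/* Transfer one item out of the channel, blocking while it is empty. */
int fifo_get(fifo_t *f) {
    int item;
    pthread_mutex_lock(&f->lock);
    while (f->count == 0)
        pthread_cond_wait(&f->not_empty, &f->lock);
    assert(f->count > 0);
    item = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->lock);
    return item;
}

/* Producer thread: sends 1..100 through the channel. */
static void *producer(void *arg) {
    fifo_t *f = (fifo_t *)arg;
    for (int i = 1; i <= 100; i++)
        fifo_put(f, i);
    return NULL;
}

/* Consumer side: receives 100 items and returns their sum,
 * or -1 if the producer thread could not be created. */
int demo_fifo_sum(void) {
    fifo_t f;
    pthread_t t;
    int sum = 0;
    fifo_init(&f);
    if (pthread_create(&t, NULL, producer, &f) != 0)
        return -1;
    for (int i = 0; i < 100; i++)
        sum += fifo_get(&f);
    pthread_join(t, NULL);
    return sum;
}
```

Because the producer sends 100 items through a 4-slot channel, it must block repeatedly until the consumer drains entries, so `demo_fifo_sum()` exercises both the full and the empty case. A forward dataflow edge such as the one between `FFT` and `Demodulate` in Figure 2 is the kind of dependency a channel like this would carry between pipeline stages.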
For simplicity, the example in Figure 2 deals with fine timing correction (errors less than half the sample rate which are dealt with by applying a rotation to the complex samples) but ignores coarse timing correction (which would adjust the ADC in-

Readership Statistics

10 Readers on Mendeley

by Academic Status

  • 80% Ph.D. Student
  • 10% Lecturer
  • 10% Student (Postgraduate)

by Country

  • 30% United States
  • 10% United Kingdom
  • 10% Italy
