### **Digital Platforms** (Past, Present and Future?)

### Francois Kapp

Including input from Andre Gunst, Rob Halsall, Philip Gibbs, Rachel Domagalski, Dan Werthimer, Jason Manley, Sias Malan, etc.

2016-03-09







## Radio Astronomy DSP – Prior Art

- A relatively small community by international standards
  - SKA1 MID CSP CBF is < 0.2% of the forecast annual Software Defined Radio market for 2020
- Common for each group to design and build complex custom boards, interconnected by complex custom backplanes
- The software, firmware and gateware were also custom











## A spirit of Reuse

- A spirit of reuse simultaneously emerged from multiple sources...
- Sharing the common building blocks helps us to build more telescope for our money
- Reuse from instrument to instrument
- Reuse from board to board
- Reuse from one generation to the next
- Common interfaces enable efficient reuse



















## SNAP

- Xilinx Kintex 7 FPGA
- 4x4 x onboard ADCs support sampling up to 1 GHz
- 2x 10 Gigabit Ethernet
- KATCP control via a Raspberry Pi running tcpborphserver
- Small and portable









### SNAP2

Prof. Jie Hao, Institute of Automation





# UniBoard<sup>2</sup>

- IO (total data throughput ~2 Tbps):
  - 24x QSFP+ 40GBASE (for SR, LR)
  - Backplane interface with:
    - 192x 10GBASE interfaces
    - 192x JESD204B ADC interfaces
- Processing (total 5 TMAC/s):
  - Four Altera Arria 10 FPGAs
  - 1288 MAC (multiply-accumulate) per FPGA
- Memory: 16 GByte DDR4 SODIMM, 1066 MT/s
  - Total memory per UniBoard: 128 GByte









### SKA1 Low Processing board (Proposed)

- Single FPGA
- Parallel optical transceivers for I/O (4x12=48 fibres total)
- 4x HMC for bulk data storage
- SFP+ 10GbE for M&C
- Prototyping Apr-May-June



Science & technology Department: Science and Technology REPUBLIC OF SOUTH AFRICA




### Virtex® UltraScale+™ FPGAs

| Category      | Resource                      | VU9P      | VU11P     | VU13P     |
|---------------|-------------------------------|-----------|-----------|-----------|
| Logic         | System Logic Cells (K)        | 2,586     | 2,822     | 3,763     |
|               | CLB Flip-Flops (K)            | 2,364     | 2,580     | 3,441     |
|               | CLB LUTs (K)                  | 1,182     | 1,290     | 1,720     |
| Memory        | Max. Distributed RAM (Mb)     | 36.1      | 38.7      | 51.6      |
|               | Total Block RAM (Mb)          | 75.9      | 70.9      | 94.5      |
|               | UltraRAM (Mb)                 | 270.0     | 270.0     | 360.0     |
| Clocking      | Clock Management Tiles (CMTs) | 30        | 12        | 16        |
| Integrated IP | DSP Slices                    | 6,840     | 8,928     | 11,904    |
|               | PCIe® Gen3 x16 / Gen4 x8      | 6         | 3         | 4         |
|               | 150G Interlaken               | 9         | 6         | 8         |
|               | 100G Ethernet w/ RS-FEC       | 9         | 9         | 12        |
| I/O           | Max. Single-Ended HP I/Os     | 832       | 624       | 832       |
|               | GTY 32.75Gb/s Transceivers    | 120       | 96        | 128       |
| Speed Grades  | Extended                      | -1 -2L -3 | -1 -2L -3 | -1 -2L -3 |
|               | Industrial                    | -1 -1L -2 | -1 -1L -2 | -1 -1L -2 |

Footprints compatible with 20nm UltraScale devices. Each cell lists HP I/Os, GTY 32.75Gb/s transceivers.

| Footprint | Dimensions (mm) | VU9P     | VU11P   | VU13P    |
|-----------|-----------------|----------|---------|----------|
| C1517     | 40x40           |          |         |          |
| F1924     | 45x45           |          | 624, 64 |          |
| A2104     | 47.5x47.5       | 832, 52  |         |          |
| A2104     | 52.5x52.5       |          |         | 832, 52  |
| B2104     | 47.5x47.5       | 702, 76  | 624, 76 |          |
| B2104     | 52.5x52.5       |          |         | 702, 76  |
| C2104     | 47.5x47.5       | 416, 104 | 416, 96 |          |
| C2104     | 52.5x52.5       |          |         | 416, 104 |
| A2577     | 52.5x52.5       | 448, 120 | 448, 96 | 448, 128 |

Slide annotations: Prototype / Production?







## LFAA TPM

- Tile Processor Module
- 32x 1 GS/s 8-bit ADCs
- 16x AD9680 JESD204B Dual ADCs
- 2x Xilinx KU040 FPGAs
- 2x 40G QSFP
- 128K antennas
- 256K ADC channels
- 8K ADC boards
- 128 Racks
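The board-level and system-level figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check: it assumes K means 1024, and the function name `tpm_budget` and the dictionary keys are made up for illustration.

```python
# Figures copied from the slide; "K" is assumed to mean 1024.
ADCS_PER_TPM = 32              # 32x 1 GS/s 8-bit ADCs per Tile Processor Module
SAMPLE_RATE_SPS = 10**9        # 1 GS/s
BITS_PER_SAMPLE = 8
QSFP_PORTS = 2                 # 2x 40G QSFP
QSFP_RATE_BPS = 40 * 10**9
ANTENNAS = 128 * 1024
ADC_CHANNELS = 256 * 1024
RACKS = 128


def tpm_budget():
    """Return derived per-board and per-system numbers."""
    raw_adc_bps = ADCS_PER_TPM * SAMPLE_RATE_SPS * BITS_PER_SAMPLE
    link_bps = QSFP_PORTS * QSFP_RATE_BPS
    boards = ADC_CHANNELS // ADCS_PER_TPM
    return {
        "raw_adc_gbps": raw_adc_bps // 10**9,        # data produced by the ADCs
        "link_gbps": link_bps // 10**9,              # capacity of the QSFP links
        "channels_per_antenna": ADC_CHANNELS // ANTENNAS,
        "boards": boards,
        "boards_per_rack": boards // RACKS,
    }


if __name__ == "__main__":
    for name, value in tpm_budget().items():
        print(f"{name}: {value}")
```

Under these assumptions the board count comes out at 8192, which matches the "8K ADC boards" bullet, and the raw ADC rate per board (256 Gbps) exceeds the 80 Gbps of QSFP capacity, so the FPGAs must reduce the data rate before it leaves the board.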





















### PowerMX





### **NETFPGA Sume**

- Xilinx Virtex-7 XC7V690T
- PCIe Gen3 x8 (8Gbps/lane)
- Three x36 72Mbits QDR II SRAM
- Two 4GB DDR3 SODIMM
- QTH Connector (8 GTH transceivers)
- Four SFP+ interfaces (4 GTH transceivers) supporting 10Gbps
- Two SATA-III ports
- One HPC FMC Connector (10 GTH transceivers)







### NETFPGA Sume - \$7k



www.xilinx.com







### Xilinx VC709 - \$5k








### And GPUs...









### Intel?









## Some things we like in hardware

- Memory bandwidth / compute balance is critical
- Unified interconnect, both inter- and intra-board/chassis/rack
- Keep it simple! 1 FPGA per board
- Being able to develop on/deploy a single board as a system allows more collaboration
- Remote reboot/reload/hw management
- SW support system: think ecosystem
- Reuse, reuse, reuse









### Digital is more than hardware

- Much of the cost of development is in Gateware, Firmware and Software
- FPGA design has typically been the domain of somewhat rare experts
- Higher-level design tools are required



















Figure: `pfb_fir_real` block (taps=4, add_latency=1) with inputs `sync`, `pol1_in1` to `pol1_in4` and outputs `sync_out`, `pol1_out1` to `pol1_out4`, shown with its Function Block Parameters dialog.

Mask description: "Fold adders into DSPs" causes adders to be absorbed into DSP blocks (supported in Virtex5). "Adder implementation" selects cores using fabric or DSP48, or behavioral HDL.

| Parameter                            | Value   |
|--------------------------------------|---------|
| Size of PFB (2^? pnts)               | 12      |
| Total Number of Taps                 | 4       |
| Windowing Function                   | hamming |
| Number of Simultaneous Inputs (2^?)  | 2       |
| Make Biplex                          | 0       |
| Input Bitwidth                       | 8       |
| Output Bitwidth                      | 18      |
| Coefficient Bitwidth                 |         |



## Toolflow +

- Fantastic data-oriented design language
- Rapid Application Development
- GUI environment
- Cross-platform (OS) support for development
- Configurable, parameterised, modular library
- Powerful MATLAB scripting environment
- Clock-cycle accurate simulations
- Tunable: can trade resources between DSP/Logic/BRAM
- Abstract away low-level functions
  - Clocks
  - HW/SW interfaces
- One-click building







### Toolflow -

- GUI-based third-party software: changes are outside our control
- Vendor lock-in is hard to avoid, requires investment
- No effective multi-clock domain support
- Verification
- Library Maintenance
- Revision Control
- IP management: Open Source model may not be acceptable to all?









## HLS, OpenCL, etc

- HPC power challenges are being addressed by introducing FPGAs in traditional data centres
  - Google, Bing, Baidu, Amazon, Yahoo, IBM
- New programming options coming from industry
- Beyond HLS: System Level Synthesis?







### And then it must all be connected









### Ethernet +/-

- Likes:
  - Multicasting support
  - Cheap to implement: FPGAs provide hard macros
  - Resilient to errors
  - Scalable and Flexible
  - Interface to many diverse technologies
  - Simplified debugging and development
  - Simplified Interfacing to adjacent systems
  - Long lifetime enables modular upgrades: > 30-year compatibility lifetime
  - Multiplexing and demultiplexing signal streams is trivial
- Dislikes:
  - Inherently asynchronous: complicates FPGA development
  - Some quirks to deal with (e.g. packet to self)
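To illustrate why multiplexing and demultiplexing packetised signal streams is simple, here is a minimal sketch that sorts received payloads into per-stream queues using a small header. The 8-byte header layout (stream id, reserved field, sequence number) and the names `make_packet` and `demultiplex` are hypothetical, chosen for illustration only and not taken from any telescope's packet format.

```python
import struct
from collections import defaultdict

# Hypothetical header: stream id (uint16), reserved (uint16),
# sequence number (uint32), all in network byte order.
HEADER = struct.Struct("!HHI")


def make_packet(stream_id, sequence, payload):
    """Prefix a payload with the illustrative header."""
    return HEADER.pack(stream_id, 0, sequence) + payload


def demultiplex(packets):
    """Group payloads by stream id, ordered by sequence number."""
    streams = defaultdict(list)
    for packet in packets:
        stream_id, _reserved, sequence = HEADER.unpack_from(packet)
        streams[stream_id].append((sequence, packet[HEADER.size:]))
    # Packets may arrive out of order, so sort each stream by sequence number.
    return {
        stream_id: [payload for _, payload in sorted(items, key=lambda it: it[0])]
        for stream_id, items in streams.items()
    }
```

Because every packet carries its own stream id, any receiver can pick out the streams it needs without knowing how the sender interleaved them.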







### Lessons learned

- Standard Interfaces are key
- HW is valuable, but short-lived
- SW and IP investment is much larger
- On-FPGA processors come and go: SW investment unpredictable
- Must enable re-use, across institutions, devices and generations (parameterisation)
- Turnkey solution required to enable designers to implement instruments
- Production yield must be considered
- Scaling limits must be eliminated in both directions if possible







### The Future...



#### Image: Eva Kröcher, <u>CC license</u>







#### Transistor Count vs Year (FPGAs)









### Moore's 1<sup>st</sup> "Law"

- Dennard Scaling ended ~2005, power density no longer constant, clock rates constrained
- "our cadence today is closer to two and a half years than two" - Brian Krzanich (CEO Intel), June 2015
- 10nm in 2H2017
- 7nm in 2020?
- No promises beyond that yet
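To put the quoted slowdown in numbers, the sketch below compares the density gain over a fixed period for a two-year and a two-and-a-half-year cadence. It assumes the common reading of Moore's law, in which density doubles once per cadence period. The function name `density_gain` is made up for illustration.

```python
def density_gain(years, cadence_years):
    """Density multiplier after `years` if density doubles every `cadence_years`."""
    return 2.0 ** (years / cadence_years)


if __name__ == "__main__":
    for cadence in (2.0, 2.5):
        gain = density_gain(10, cadence)
        print(f"cadence {cadence} years: x{gain:.0f} over a decade")
```

Under that assumption, a decade at the two-year cadence gives a 32-fold gain, while the two-and-a-half-year cadence gives 16-fold, which is half as much.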









### Moore's 2<sup>nd</sup> "Law"

- "the cost of a semiconductor chip fabrication plant doubles every four years"
- This affects the cost of ASICs at advanced nodes

- Also known as Rock's Law
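Written as a formula, with $C(t)$ the cost of a fabrication plant in year $t$ and $t_0$ a reference year (both symbols are introduced here for illustration), the quoted doubling every four years reads:

$$
C(t) = C(t_0)\,2^{(t - t_0)/4}
$$

Taken at face value, this means a plant costs about $2^{10/4} \approx 5.7$ times as much after a decade.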







### **FPGA** Directions











## **Disruptions?**

- Silicon photonics
- Optical backplanes
- Packaging: 2.5D, 3D, heterogeneous on package
- Memories: 3D XPoint, memristor
- HPC and FPGAs







## How to plan for uncertainty?

- We're not big enough to push industry around (but we can nudge them)
- Use your favourite existing platform today
- Pick interfaces that may live long
- Do the system level work really well
- Look carefully at boundaries between CSP and SDP
- Don't down-select anything digital...
- Plan to upgrade the processing













### Thank You





