As the flagship of TI's industry-leading TMS320 DSP family, the TMS320C8x generation is a true breakthrough in digital signal processing that promises to change the way we process information. The 'C80, the first member of the 'C8x generation, is the highest-performance and most highly integrated DSP ever produced by Texas Instruments. Four advanced DSPs and a RISC master processor are integrated on a single chip to deliver over two billion RISC-like operations per second (BOPS).

The latest addition to the 'C8x generation is the 'C82, a scaled down version of the 'C80. The 'C82 provides two advanced DSPs coupled with a RISC master processor for high-performance, cost-sensitive applications.

'C8x Key specifications ('C82 specifications shown in parentheses):

  • 32-bit RISC master processor with IEEE-754 floating-point hardware
  • Four 32-bit parallel, advanced DSPs (Two 32-bit parallel, advanced DSPs)
  • 50 Kbytes of on-chip SRAM (44 Kbytes of on-chip SRAM)
  • Video controller with dual frame timers (No video controller)
  • Built-in internal emulation and boundary scan paths through an IEEE 1149.1 test access port
  • Transfer controller for cache servicing and transferring data between external memory and internal SRAM
  • Each DSP uses a 32-bit local port and global access to on-chip SRAM data
  • Instruction ports consist of a 64-bit port for each DSP and a 32-bit port for the master processor
  • On-chip cache or data RAM is accessed via a 64-bit port
  • On-chip processors use crossbar switching to access on-chip RAM
  • Direct interface to DRAM, SDRAM, SRAM and VRAM

'C8x Key applications:

  • Large-scale video conferencing ('C80)
  • Desktop video conferencing ('C82)
  • Video phones ('C82)
  • Digital switching for cellular base stations ('C82)
  • Image processing
  • Video processing
  • Multimedia workstations
  • 2-D and 3-D graphics accelerators
  • Virtual reality
  • Real-time compression systems
  • Security
  • Radar/sonar systems
  • Cable TV video compression
  • Document imaging

TMS320C8x Master Processor (MP)

The master processor (MP) is a 32-bit RISC processor with an integral IEEE-754 floating-point unit. As with other RISC processors, all accesses to memory are performed with load and store instructions, and most integer and logical operations are performed on registers in a single cycle. The floating-point instructions are pipelined, so a single-precision multiply or any floating-point add instruction can start on each clock cycle. As a result, the floating-point unit approaches 100 MFLOPS at a 50-MHz internal clock rate.

Floating-point operations use the same register file as the integer and logic unit. A register scoreboard ensures that correct register-access sequences are maintained.

The MP is structured for efficient execution of C code. For example, the MP provides an R0 register, often called a zero register, that C compilers use heavily. The MP instruction set is also tailored to the operations that C compilers commonly generate.

Features of the master processor include:

  • 32-bit RISC CPU delivering 50 MIPS @ 50 MHz
    • Targeted for high-level languages
  • IEEE-754 100-MFLOP floating-point unit
    • Parallel multiply, add, and load/store
  • 31 32-bit registers
    • Single file for integer and floating point
    • Loads and FPU results are scoreboarded
  • Instruction and data cache control
    • 4K-byte instruction cache
    • 4K-byte data cache
    • 2K-byte parameter RAM ('C80), 4K-byte parameter RAM ('C82)

TMS320C8x MP Floating-Point Unit

The MP's floating-point unit is capable of performing IEEE-754 floating-point operations in 32-bit single-precision and 64-bit double-precision floating point. Conversion between different formats is also supported. In addition, the floating-point unit provides vector floating-point operations with the option of performing a parallel load or store to improve program efficiency.

Hardware support for the floating-point unit consists of a full double-precision floating-point add unit and a 32-bit single-precision floating-point multiply unit:

  • IEEE-754 floating point
    • Hardware exception handling
  • FP add unit with double-precision ALU
    • 1-cycle adds/subs/compares (single and double) and conversions
    • 6-cycle single- and 20-cycle double-precision divide
    • 9-cycle single- and 26-cycle double-precision square root
  • The floating-point multiply unit performs all multiplies (integer and floating-point), divides, and square roots.
    • 1-cycle single-precision multiply
    • 4-cycle double-precision multiply
  • Pipelined: can start a new instruction every cycle
    • 3-stage pipeline
    • Register file scoreboard prevents "races"
  • Vector FP for 100-MFLOP operation
    • Parallel multiply, add, and 64-bit load (p++) in one cycle
    • 4 double-precision accumulator registers support pipelining
    • Supports matrix multiplies, DCTs, and FFTs
  • FP status and interrupt-enable registers
    • MP's test-and-branch instructions access FP status
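
The vector floating-point bullets above describe a multiply, an add, and a load overlapping in each cycle. The inner loop of a dot product has exactly that shape, so the sketch below uses it as an illustration. It is portable C, not MP assembly or a TI compiler intrinsic; the function names `dot_product` and `peak_mflops` are hypothetical, and the second function only restates the arithmetic behind the 100-MFLOPS figure under the assumption of one multiply plus one add per cycle.

```c
#include <stddef.h>

/* Illustrative only: portable C, not MP assembly.
 * Each iteration needs one multiply, one add, and loads of the next
 * operands, which is the work the vector FP unit is described as
 * overlapping in a single cycle. */
double dot_product(const float *a, const float *b, size_t n)
{
    double acc = 0.0;                     /* double-precision accumulator */
    for (size_t i = 0; i < n; i++) {
        acc += (double)a[i] * (double)b[i];
    }
    return acc;
}

/* Peak rate implied by one multiply plus one add per cycle (assumption). */
unsigned peak_mflops(unsigned clock_mhz)
{
    return 2u * clock_mhz;                /* 2 floating-point ops per cycle */
}
```

With a 50-MHz clock, `peak_mflops(50)` evaluates to 100, which matches the figure quoted for the floating-point unit.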

TMS320C8x Parallel Processing Advanced Digital Signal Processors (PP)

The parallel processing advanced digital signal processors (PPs) provide much of the 'C8x's performance. The PPs are designed to perform digital signal processing along with bit-field and multiple-pixel manipulation. These processors have advanced features that are not found in any other DSP or general-purpose processor and can perform in excess of ten RISC-like operations in each cycle.

To specify the multiple parallel operations that the PPs can perform, each instruction is a wide 64-bit word. The instruction has fields that independently control the data unit and the two address units. All instructions execute in a minimum of a single cycle.

Each PP has a register file of 44 user-visible registers. All registers can be the source or destination of ALU or memory operations. The register set is divided into files according to each register's function.

PP features include:

  • Two address units
    • Up to two memory operations/cycle
  • Single-cycle multiplier
    • One 16-bit or two 8-bit results/cycle
  • Splittable 3-input ALU
    • Multiple operations in each pass
    • Up to four 8-bit results/cycle
  • Pixel and bit-field hardware
  • 3-input ALU with mixed arithmetic and Boolean operations
    • Can perform masking at the same time as an add or subtract
  • Flexible data path feeding 3-input ALU
    • Fast bit and field processing
  • Address data paths can be used for general-purpose arithmetic
  • Byte/halfword multiple arithmetic
    • Single instruction stream, multiple data stream (SIMD) processing within each processor
    • Better handling of pixels and Z-buffers than in other DSPs or general-purpose processors
  • Eight primary data registers, d0 to d7 (D registers), that can perform up to seven reads and four writes
    • Two multiplier sources, three ALU sources, one multiplier result, one ALU result, and three LD/ST/MOVE
  • Splittable multiplier for fast pixel math
    • Any D register can be used on a multiply-with-parallel-add
  • Three levels of zero-overhead loops
  • Conditional operations (for ALU, load/store, and/or register source)

TMS320C8x PP Data Unit

The parallel-processing advanced DSP (PP) data unit has two data paths; each data path has its own set of hardware that functions independently of the other data path.

The ALU data path includes a barrel rotator, mask generator, 1-bit to n-bit expander, and a 3-input ALU that can combine the mask or expander output with register data to create over 2,000 different processing options. The 3-input ALU can perform 512 logical and/or mixed logical and arithmetic operations that support masking or merging and addition/subtraction in a single pass. The ALU can also be split to perform multiple 8-bit or 16-bit operations in parallel.
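
To make the split-ALU and mixed arithmetic/Boolean ideas concrete, the following is a minimal sketch in portable C that emulates them in software. It is not PP assembly, and the function names `add_4x8` and `add_masked` are hypothetical. The first function adds four 8-bit lanes packed in a 32-bit word without letting carries cross lane boundaries; the second shows one plausible three-input operation, chosen here as an assumption, that masks one operand while adding.

```c
#include <stdint.h>

/* Emulates a 32-bit ALU split into four independent 8-bit lanes.
 * Each lane wraps modulo 256; no carry crosses a lane boundary. */
uint32_t add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;       /* low 7 bits of every lane */
    const uint32_t msb  = 0x80808080u;       /* top bit of every lane */
    uint32_t sum = (a & low7) + (b & low7);  /* carries stay inside lanes */
    return sum ^ ((a ^ b) & msb);            /* add top bits, drop carry-out */
}

/* One example of a three-input mixed operation: mask b, then add to a. */
uint32_t add_masked(uint32_t a, uint32_t b, uint32_t mask)
{
    return a + (b & mask);
}
```

In hardware the lane split and the masking happen inside a single ALU pass; the C version needs several operations to get the same result, which is the point of the comparison.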

The PP data unit features are:

  • 3-input ALU (512 operations)
    • Mixed arithmetic and Boolean in one cycle (mask and add/sub in one pass)
    • Mask/merge and field processing
    • Splittable for multibyte operations
  • 16-bit × 16-bit multiplier (32-bit results)
    • Rounding for DCT accuracy
    • Splittable into two 8-bit × 8-bit multipliers (16-bit results)
  • Flexible data path
    • Barrel rotator
    • Mask generator
    • N-to-1 and 1-to-N translations via mf register
    • Left/rightmost one and bit-change
  • 44 user-visible registers
    • Any register can be operand of ALU
  • Eight D registers
  • Conditional operations
    • Conditional choice of register pair source
    • Conditional save of result
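
The expander and mask/merge features listed above can be sketched in software as follows. This is an illustration in portable C under stated assumptions, not a description of the PP instruction encoding: the function names `expand_4_to_32` and `merge` are hypothetical, and the lane ordering (bit 0 controls the least significant byte) is an arbitrary choice made for the example.

```c
#include <stdint.h>

/* 1-to-N expansion: each of the low 4 bits of `bits` fills one 8-bit lane.
 * Assumed ordering: bit 0 -> least significant byte, bit 3 -> most. */
uint32_t expand_4_to_32(uint32_t bits)
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (bits & (1u << lane)) {
            mask |= 0xFFu << (8 * lane);
        }
    }
    return mask;
}

/* Merge: take lanes from fg where the mask is set, from bg elsewhere. */
uint32_t merge(uint32_t fg, uint32_t bg, uint32_t mask)
{
    return (fg & mask) | (bg & ~mask);
}
```

A typical use would be selecting between foreground and background pixels from a 1-bit-per-pixel pattern: `merge(fg, bg, expand_4_to_32(pattern))` picks four 8-bit pixels at once.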

TMS320C8x Transfer Controller (TC)

The transfer controller (TC) is a combined DMA machine and memory interface that intelligently queues, prioritizes, and services the data requests and cache misses of the MP and the PPs. The transfer controller interfaces directly with the on-chip SRAMs. Through the TC, all of the processors can access the system external to the chip. In addition, data-cache or instruction-cache misses are automatically handled by the TC.

Data transfers are specifically requested by the PPs or the MP in the form of linked-list packet transfers, which are handled by the TC. These requests allow multidimensional blocks of information to be transferred between a source and destination, either of which can be on-chip or off-chip. Packet-oriented data transfers offer compatibility with several local area network standards, such as ATM.
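
As a rough picture of what a linked list of multidimensional transfer requests looks like, the sketch below models it in portable C. The descriptor layout and every field name are invented for illustration and do not match the TC's real packet-transfer parameter format; `run_packets` is a software stand-in for work that the TC performs autonomously in hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor: field names and layout are assumptions. */
struct packet {
    const uint8_t *src;     /* start of source block */
    uint8_t *dst;           /* start of destination block */
    size_t row_bytes;       /* bytes copied per row */
    size_t rows;            /* number of rows in the block */
    size_t src_pitch;       /* bytes between row starts at the source */
    size_t dst_pitch;       /* bytes between row starts at the destination */
    struct packet *next;    /* next request in the list, or NULL */
};

/* Walks the list and copies each 2-D block row by row.
 * Returns the total number of bytes moved. */
size_t run_packets(const struct packet *p)
{
    size_t moved = 0;
    for (; p != NULL; p = p->next) {
        for (size_t r = 0; r < p->rows; r++) {
            memcpy(p->dst + r * p->dst_pitch,
                   p->src + r * p->src_pitch,
                   p->row_bytes);
            moved += p->row_bytes;
        }
    }
    return moved;
}
```

Separate source and destination pitches are what let a request pull a small tile out of a wide frame buffer and pack it contiguously in on-chip memory, or the reverse.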

The TC performs:

  • Cache fills and writes
  • Direct loads and stores from/to off-chip memory via DEA request
  • Block movement of data via packet transfers
  • Refresh and SRT (shift register transfer) cycles needed to maintain DRAMs and VRAM capture/display buffers, respectively

Features of the TC include:

  • 400 Mbytes/s external bandwidth
  • Direct DRAM, VRAM, SRAM, and SDRAM control
  • Dynamic bus sizing (64, 32, 16, or 8 bits)
  • Packet transfers controlled autonomously by transfer controller
  • Linear x/y addressing
    • Independent source and destination
    • Automatic byte alignment
  • Intelligent request
    • Queuing and prioritization

The 'C82 TC includes a memory configuration cache that consists of six 32-bit words that describe the properties of the six most recently used banks of memory. The cache automatically loads a configuration word each time an access to a new bank is made, and it can be locked into a set high or low priority. The configuration cache reduces the number of pins necessary in the 'C82 and in support chips.
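
The behavior of a six-entry cache that tracks the most recently used banks can be modeled in a few lines of C. This is a sketch under assumptions, not the 'C82's actual mechanism: the least-recently-used replacement policy is inferred from the phrase "most recently used", the structure and function names are hypothetical, `fetch_config` returns an arbitrary placeholder value, and the priority-locking feature is not modeled.

```c
#include <stdint.h>

#define CFG_ENTRIES 6

struct cfg_entry {
    int valid;
    uint32_t bank;        /* hypothetical bank identifier */
    uint32_t config;      /* 32-bit configuration word */
    unsigned last_used;   /* timestamp for LRU replacement */
};

struct cfg_cache {
    struct cfg_entry e[CFG_ENTRIES];
    unsigned clock;       /* incremented on every lookup */
    unsigned loads;       /* configuration words fetched so far */
};

/* Placeholder for fetching a bank's configuration word. */
static uint32_t fetch_config(uint32_t bank)
{
    return 0xC0000000u | bank;
}

uint32_t cfg_lookup(struct cfg_cache *c, uint32_t bank)
{
    int victim = 0;
    c->clock++;
    for (int i = 0; i < CFG_ENTRIES; i++) {
        if (c->e[i].valid && c->e[i].bank == bank) {
            c->e[i].last_used = c->clock;          /* hit: refresh age */
            return c->e[i].config;
        }
    }
    for (int i = 0; i < CFG_ENTRIES; i++) {        /* miss: pick a slot */
        if (!c->e[i].valid) { victim = i; break; } /* prefer an empty slot */
        if (c->e[i].last_used < c->e[victim].last_used) victim = i;
    }
    c->e[victim].valid = 1;
    c->e[victim].bank = bank;
    c->e[victim].config = fetch_config(bank);
    c->e[victim].last_used = c->clock;
    c->loads++;
    return c->e[victim].config;
}
```

A zero-initialized `struct cfg_cache` starts empty. Repeated accesses to the same six banks then cost six loads in total, and only an access to a seventh bank triggers another fetch.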