Cookbook
Table of Contents
- 1 Table of Contents
- 2 Table of figures
- 3 Table of tables
- 4 Digital design methodology
- 4.1 Synchronous design methodology
- 4.2 Recommended clocking schemes
- 4.3 RESET methodology
- 4.3.1 Global reset
- 4.3.2 Local reset
- 4.4 Things to avoid
- 4.4.1 Don’t use both clock rising and falling edges if not strictly necessary
- 4.4.2 Don’t use internally generated clocks if not strictly necessary
- 4.4.3 Resynchronize the RESET signal on the clock domain
- 4.4.4 Avoid using asynchronous RESET if possible
- 4.4.5 Don’t use asynchronous SET
- 4.4.6 Don’t use asynchronous initialization from a given value (signal or constant)
- 5 Writing efficient HDL source code
- 5.1 Avoid using combinatorial processes if not necessary
- 5.2 Avoid declaring and using un-necessary combinatorial signals
- 5.3 Use appropriate sensitivity list
- 5.4 Be careful when using relational operators
- 5.5 Avoid using un-necessary variable in the synthesizable source code
- 5.6 Don’t declare ports and signals as integers – if not necessary
- 5.7 Don’t use inactive or redundant assignments
- 5.8 Other tricks and tips for a more compact and re-usable source code
- 5.9 Design hierarchy
- 5.10 Naming rules of entities, component labels and signals:
- 5.11 Inference vs instantiation:
- 6 NG-MEDIUM architecture survey
- 7 NG-MEDIUM inputs and outputs (IOs)
- 7.1 Available user’s IOs
- 7.2 Simple and complex banks IO features
- 7.3 IO Standards usage
- 7.3.1 LVCMOS
- 7.3.2 SSTL and HSTL
- 7.3.3 LVDS:
- 7.4 Basic IO logical structure (simple and complex banks)
- 7.5 Simple banks advanced IO configuration
- 7.6 SERializers and DESerializers on complex banks
- 7.6.1 Introduction
- 7.6.2 SERDES architecture overview
- 7.6.3 DPA : Dynamic Phase Adjustment
- 7.7 IO features inference and instantiation:
- 7.7.1 Inference:
- 7.7.2 Instantiation:
- 7.8 IO blocks assignments with “nxpython” script
- 8 NG-MEDIUM clocks
- 9 FPGA core logic
- 9.1 Routing resources overview
- 9.2 Tile logic
- 9.2.1 Functional element (FE)
- 9.2.2 Carry logic:
- 9.2.3 X-LUT:
- 9.2.4 Register file (synchronous simple dual port RAM):
- 9.2.5 ClocK Switch (CKS):
- 9.3 RAM blocks (48Kb True Dual Port RAM)
- 9.4 DSP blocks
- 9.4.1 Architecture overview
- 9.4.2 Frequently used DSP functions
- 9.4.2.1 Multiplier 24 x 30 (with two DSP blocks) 250 MHz, 4 clock latency
- 9.4.2.2 Sequential FIR filter, based on a single multiplier / accumulator (MAC):
- 9.4.2.3 Sequential symmetric FIR filter, based on a single Pre-adder / MAC:
- 9.4.2.4 Parallel FIR filter implementation considerations :
- 9.4.2.5 Direct form parallel FIR filters:
- 9.4.2.6 Transpose structure for parallel FIR filters:
- 9.4.2.7 Systolic structure for parallel FIR filters:
- 9.4.2.8 Symmetric systolic structure for parallel FIR filters:
- 9.4.2.9 Symmetric transpose structure for parallel FIR filters:
- 9.5 Synchronous design methodology
- 9.6 Recommended clocking schemes
- 9.7 RESET methodology
- 9.7.1 Global reset
- 9.7.2 Local reset
- 9.8 Things to avoid
- 9.8.1 Don’t use both clock rising and falling edges if not strictly necessary
- 9.8.2 Don’t use internally generated clocks if not strictly necessary
- 9.8.3 Resynchronize the RESET signal on the clock domain
- 9.8.4 Avoid using asynchronous RESET if possible
- 9.8.5 Don’t use asynchronous SET
- 9.8.6 Don’t use asynchronous initialization from a given value (signal or constant)
- 10 Writing efficient HDL source code
- 10.1 Avoid using combinatorial processes if not necessary
- 10.2 Avoid declaring and using un-necessary combinatorial signals
- 10.3 Use appropriate sensitivity list
- 10.4 Be careful when using relational operators
- 10.5 Avoid using un-necessary variable in the synthesizable source code
- 10.6 Don’t declare ports and signals as integers – if not necessary
- 10.7 Don’t use inactive or redundant assignments
- 10.8 Other tricks and tips for a more compact and re-usable source code
- 10.9 Design hierarchy
- 10.10 Naming rules of entities, component labels and signals:
- 10.11 Inference vs instantiation:
- 11 NG-MEDIUM architecture survey
- 12 NG-MEDIUM inputs and outputs (IOs)
- 12.1 Available user’s IOs
- 12.2 Simple and complex banks IO features
- 12.3 IO Standards usage
- 12.3.1 LVCMOS
- 12.3.2 SSTL and HSTL
- 12.3.3 LVDS:
- 12.4 Basic IO logical structure (simple and complex banks)
- 12.5 Simple banks advanced IO configuration
- 12.6 SERializers and DESerializers on complex banks
- 12.6.1 Introduction
- 12.6.2 SERDES architecture overview
- 12.6.3 DPA : Dynamic Phase Adjustment
- 12.7 IO features inference and instantiation:
- 12.7.1 Inference:
- 12.7.2 Instantiation:
- 12.8 IO blocks assignments with “nxpython” script
- 13 NG-MEDIUM clocks
- 13.1 Semi-dedicated clock inputs
- 13.2 Clock management and distribution
- 13.3 Clocks distribution
- 13.3.1 Overview
- 13.3.2 Primary global clocks
- 13.3.3 Alternate global clocks or fast IO clocks
- 13.3.4 Local fast IO clocks
- 13.3.5 Clocks distribution summary
- 13.4 Application examples
- 14 FPGA core logic
- 14.1 Routing resources overview
- 14.2 Tile logic
- 14.2.1 Functional element (FE)
- 14.2.2 Carry logic:
- 14.2.3 X-LUT:
- 14.2.4 Register file (synchronous simple dual port RAM):
- 14.2.5 ClocK Switch (CKS):
- 14.3 RAM blocks (48Kb True Dual Port RAM)
- 14.4 DSP blocks
- 14.4.1 Architecture overview
- 14.4.2 Frequently used DSP functions
- 14.4.2.1 Multiplier 24 x 30 (with two DSP blocks) 250 MHz, 4 clock latency
- 14.4.2.2 Sequential FIR filter, based on a single multiplier / accumulator (MAC):
- 14.4.2.3 Sequential symmetric FIR filter, based on a single Pre-adder / MAC:
- 14.4.2.4 Parallel FIR filter implementation considerations :
- 14.4.2.5 Direct form parallel FIR filters:
- 14.4.2.6 Transpose structure for parallel FIR filters:
- 14.4.2.7 Systolic structure for parallel FIR filters:
- 14.4.2.8 Symmetric systolic structure for parallel FIR filters:
- 14.4.2.9 Symmetric transpose structure for parallel FIR filters:
- 14.4.3 DSP blocks inference
- 15 NG-MEDIUM configuration and interface pins
- 16 “nxpython“ synthesis and implementation tools
- 16.1 Introduction
- 16.2 Synthesis attributes
- 16.2.1 syn_noprune & syn_keep:
- 16.2.2 syn_preserve:
- 16.2.3 NX_USE:
- 16.2.4 NX_PORT:
- 16.2.5 NX_INIT:
- 16.3 “nxpython” features
- 16.4 “nxpython” script example
- 16.5 “nxpython” main reports:
- 17 Simulation
- 18 Introduction to NXcore
Table of figures
Typical example of physical path between two FFs
Possible FF meta-stability when setup/hold time violation occurs
Simple anti-meta-stability method using an additional FF
Safely resampling an asynchronous bus
Example of internally generated local clock
Illustration of possible timing diagrams
RESET distribution and associated timing diagrams
IO banks distribution and user’s IOs availability on LGA/CGA625 packages
IO banks distribution and user’s IOs availability on CQFP352 package
Impedance adaptation resistor connected to VTO for single ended SSTL or HSTL inputs
Input impedance adaptation for differential SSTL/HSTL inputs
SSTL/HSTL differential output buffer pair
Bidirectional SSTL/HSTL input/output buffer pair
Input impedance adaptation for LVDS inputs
Basic IO configuration (simple and complex banks)
SERDES data path simplified diagram
SERDES delay lines control block simplified diagram
Writing and reading delay registers (note that DIG = ‘1’)
Simplified clock distribution diagram
Simplified ClocK Generator (CKG) diagram
Simplified WaveForm Generator diagram
Diagram for WFGs synchronization
Synchronized WFGs timing diagram example
PLL divided outputs timing diagram (1)
PLL divided outputs timing diagram (2)
Multiple clocks generation with basic WFG configurations
Multiple clocks generation using WFG input inverter
Optimized multiple clock generation
NG-MEDIUM clock distribution overview
NG-MEDIUM global clock distribution (FPGA fabric & IOs)
NG-MEDIUM alternate global clock distribution (FPGA fabric) or complex banks fast IO clocks
NG-MEDIUM very fast local IO clock distribution
NG-MEDIUM central clock switch
NG-MEDIUM Input and output paths with classic clock distribution
Output timing with classic clock distribution
Input timing with classic clock distribution
NG-MEDIUM input and output paths improvement using PLL
Output timing improvement using PLL with feedback by clock tree
Input timing improvement using PLL with feedback by clock tree
Zero delay clock generation with additional clocks
Timing diagram of the Zero delay clock generation
Functional Element (FE) simplified diagram
Distribution of the logic resources available in a tile
Carry logic directly connected to 4 of the 8 neighboring FEs
X-LUT combines the output of the 4 neighboring FE’s LUTs
Register_File includes 64x16 SDP RAM array + 32 associated FEs
Register_File simplified internal diagram
RAM block organization and physical/logical mapping without EDAC
RAM block organization and physical/logical mapping with EDAC
Chaining DSP blocks in a same CGB row
Chaining DSP blocks in a same CGB row is from right to left on the
High performance Pipelined Multiplier 24 x 30 with rounding
Sequential FIR filter implementation with a single MAC
Sequential symmetrical FIR filter with a single Pre-add / MAC
Direct form parallel FIR filter
Direct form parallel FIR filter with adder tree
Direct form parallel FIR filter with adder chain
Transpose structure FIR filter
Transpose structure FIR filter and associated DSP blocks configuration
Systolic structure FIR filter and associated DSP blocks configuration
Symmetric systolic structure FIR filter
Symmetric transpose structure FIR filter
Table of tables
LGA/CGA 625 packages – available IOs
CQFP 352 package – available IOs
Simple and complex banks IO features summary
Simple IO banks electrical parameters and performance
Complex IO banks electrical parameters and performance
Recommended “termination” parameter values for 50, 75 and 100 Ohms impedance
Digital design methodology
NG-FPGAs offer a very flexible architecture that allows the implementation of a wide range of applications. However, the user must understand that a safe and reproducible behavior can be guaranteed only if some simple but efficient and necessary design rules have been adopted during the design steps.
In an FPGA, just like in any other electronic component, the internal logic and routing delays vary with Process, Voltage and Temperature (PVT).
However, NG-MEDIUM and the “nxmap” synthesis and implementation tools provide a robust architecture and implementation procedures that ensure safe and reproducible behavior despite those delay variations, provided a few very simple design rules are followed.
To guarantee the correct behavior, the clock(s) must be distributed by using dedicated routing resources in order to guarantee very low skew across the FPGA die. The NG-MEDIUM FPGA fabric is split in two zones or clock regions. Each zone can use up to 12 low skew signals.
Clock(s) and other low skew routing resources: “nxmap” automatically assigns low skew routing resources to the clocks of your design, to ensure that the delay at the clock input of all FFs will be controlled, and the skew (maximum delay difference at the destinations) will be low enough and predictable. The maximum clock skew across the FPGA fabric is then controlled by construction, and limited to some tens of picoseconds.
Note that the low skew network can optionally be used for other high fanout signals. This can be the case for heavily loaded signals such as RESET and LOAD_ENABLE, if they are applied to a large number of FFs across the FPGA. In this case the “low skew” feature is less important than the maximum delay allowed to reach a large number of destinations. This maximum delay can also be guaranteed by the use of the low skew network.
Applying timing constraints to your design: The user can constrain the design to support the required clock(s) frequency. The timing constraints are specified in the “nxpython” script file.
By applying a “Period” constraint to the clock, the user specifies the maximum delay allowed from any FF output to any FF input (using the same clock edge) after implementation. The “Timing Analyzer” is embedded in the synthesis and implementation tools. It interacts with each of the synthesis and implementation processes in order to find, if possible, a solution that meets the user-defined timing requirements.
The user can generate timing reports to check that the timing constraints were met during the implementation process.
The synthesis and implementation tools manage two different kinds of timing:
Silicon process related timing: The maximum values are fixed by the foundry silicon process (always the same for a specified device and speed-grade). Most timing parameters are documented in the datasheet.
Clock skew: difference of timing delays at the destination FF and the source FF clock inputs. This delay difference can be positive or negative, but always limited to some tens of picoseconds by silicon process. The timing analyzer gives the exact clock skew value for each analyzed path.
Clock skew = Clock_delay@FF_dest – Clock_delay@FF_src
Tco: FF clock-to-output delay. Defines the delay from the active clock edge at the FF clock input until the Q output is stable and valid.
Tsu: FF setup time. Defines the amount of time during which the data at the D input of the destination FF must be stable before the active clock edge, so that the value is safely sampled.
Tcomb: combinatorial logic and routing delays (LUTs, carry logic and other combinatorial elements).
Implementation related timing: The routing delays on the connections (nets) between sources and destinations. Those delays are defined by the “Route” process (depending on the routing resources used). Note that the routing depends on the placement.
“nxmap” synthesis and implementation algorithms try to find a solution that meets the user’s timing requirements specified by constraints. The user can generate timing reports to analyze the implemented results.
To guarantee a stable behavior over PVT variations, the following condition must be met:
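Period >= Tco(source) + ∑Tnets + ∑Tcomb + Tsu(destination) - Clock skew
With the clock skew defined above as the clock delay at the destination FF minus the clock delay at the source FF, a positive skew relaxes this condition and a negative skew tightens it.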
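As an illustration of this condition, the following budget uses purely hypothetical delay values (they are not NG-MEDIUM datasheet figures) for a path constrained at 100 MHz. The clock skew is taken in its worst case, where it reduces the available time by 0.05 ns.

```text
Period constraint                 : 10.00 ns  (100 MHz)
Tco(source)                       :  0.50 ns
Sum of Tnets                      :  3.00 ns
Sum of Tcomb                      :  2.50 ns
Tsu(destination)                  :  0.30 ns
Clock skew penalty (worst case)   :  0.05 ns
---------------------------------------------
Required time                     :  6.35 ns
Slack = 10.00 ns - 6.35 ns        : +3.65 ns  (constraint met)
```

With these assumed values, the path could in principle run with a period as short as 6.35 ns (about 157 MHz).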
Among the analysis tools:
Timing Analyzer: allows the user to generate timing reports. The timing analyzer commands are documented in the “NanoXplore nxmap Python API” documentation. Among the timing report information:
Identification of the clock domains detected in the design
Slack (timing margin) to meet the specified constraint(s). If the timing constraints are met, the slack is positive. Otherwise, the slack is negative.
Detailed delays on selected path(s)
“nxmap” GUI: Graphical User’s Interface to analyze placement and routing results. See NXmap User Manual documentation for detailed information. The user can observe the location of each IO port, tile logic elements, BRAMs, DSP blocks, PLL and WaveForm Generators and have a detailed or simplified view of the used routing resources.
Synchronous design methodology
Did you ever face a “haunted” design? This kind of design that surprisingly works “most of the time”… but fails “sometimes”, particularly during demonstrations or inaugurations?
Most of the time, if not due to PCB or power supplies issues, those random problems are due to:
Inadequate clock distribution (e.g.: local clock distributed by general routing resources), generally related with internal clock generation with logic.
Inadequate reset methodology
Local asynchronous SET and/or RESET conditions in the source code.
Lack of anti-meta-stability and resynchronization stages when crossing asynchronous clocks domains.
Missing timing constraints. “nxmap” ignores the timing on unconstrained paths, and therefore does not warn about possible timing errors on them.
Recommended clocking schemes
Single clock rising edge:
Whenever possible, this is the recommended clock distribution scheme: it is the simplest and the safest. It provides the easiest way for both the user and the tools to implement a safe design working at any frequency from DC up to a maximum frequency determined by the timing analyzer (period limited by the longest path from FF to FF).
The internal timing can be constrained with a single timing constraint: “Period”. However, the inputs and outputs of the design might require additional constraints to specify the timing required on the inputs (setInputDelay) and the outputs (setOutputDelay), according to the datasheets of the external components. See the timing constraints chapter for more information.
In order to reduce the FPGA internal clock propagation delay, the PLL + WFG (WaveForm Generator) can be used to generate a ZERO delay clock distribution. Cancelling the clock distribution delay reduces the FPGA clock-to-output delay at the pads, and reduces or eliminates potential hold time problems on the FPGA inputs.
Multiple synchronous clocks: Using dedicated clock management resources
NG-MEDIUM provides 4 sets of clock management resources (ClocK Generators or CKG), located at the corners of the die. Each CKG includes one PLL and 8 WaveForm Generators.
The PLL reference input frequency can come from single ended or differential semi-dedicated clock input pins. At its outputs, the PLL can generate a wide range of clock outputs by applying frequency multiplication and/or division factors on the incoming input.
In order to reduce the FPGA internal clock propagation delay, the PLL + WFG can be used to generate a ZERO delay clock distribution. Cancelling the clock distribution delay reduces the FPGA clock-to-output delay at the pads, and reduces or eliminates potential hold time problems on the FPGA inputs.
The WaveForm Generators (WFG) can be used as clock buffers. They provide direct routing to the low skew network. The WFG can also be used to generate clock dividers and user programmable patterns. See NG-MEDIUM datasheet for more information.
By combining PLL and WaveFormGenerators, the user can generate internally synchronous clocks, like for example:
Main_clock: same phase and frequency as the input clock pad
Higher_frequency_clock: a multiple of the input frequency (ex: Fin x 2)
Lower_frequency_clock: divided input clock frequency (ex: Fin / 2)
These 3 clocks are synchronous with each other. Clock domain changes will be easily managed by the synthesis and implementation tools. There are no meta-stability or resynchronization issues as long as the timing constraints are met (more details on meta-stability issues in chapter 1.2.3).
Multiple asynchronous clocks: Meta-stability issues and resynchronization
When a signal synchronous to one clock is resampled by another, asynchronous clock, it must be resynchronized to avoid unstable behavior due to meta-stability.
Meta-stability is an invalid logic level at a FF output, caused by a transition on the D input of the FF during its setup/hold window. This invalid logic level can cause incorrect behavior of your design.
When registering an asynchronous signal, the meta-stability phenomenon can’t be avoided, but simple design rules make it possible to cancel its negative and unreproducible effects.
Fortunately, meta-stability does not propagate from one FF to the next, provided that the connection delay to the second FF is limited to a small fraction of the clock period.
Two cases must be considered.
Case 1: Resynchronizing a single signal (one bit):
In this example, the first FF will be subject to meta-stability. However, this invalid logic level will not be propagated to the next FF, particularly if the propagation delay between both FFs is short.
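The two-FF scheme of Case 1 can be sketched in VHDL as follows. This is a minimal illustration rather than a reference implementation; the entity and signal names (sync_bit, ASYNC_IN, META, SYNC_OUT) are hypothetical and do not come from the NanoXplore documentation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sync_bit is
  port (
    CLK      : in  std_logic;  -- destination clock
    ASYNC_IN : in  std_logic;  -- signal coming from another clock domain
    SYNC_OUT : out std_logic   -- resynchronized signal
  );
end entity sync_bit;

architecture rtl of sync_bit is
  signal META : std_logic;     -- first FF: may become meta-stable
begin
  process(CLK) begin
    if rising_edge(CLK) then
      META     <= ASYNC_IN;    -- first stage samples the asynchronous input
      SYNC_OUT <= META;        -- second stage samples the first stage one period later
    end if;
  end process;
end architecture rtl;
```

No logic is placed between the two FFs, so that the connection delay between them stays as short as possible.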
Case 2: Resynchronizing a multibit bus (two or more bits):
Multibit buses cannot be directly resynchronized just by applying the same technique to each one of the bus bits. Fortunately, a bus is generally qualified by an additional signal such as “DATA_VALID” or any other signal that indicates when the bus has a stable value. Thus, the user can safely resynchronize this control signal, and use the resynchronized version to sample the bus value.
The user must make sure that the clock frequency is high enough to sample the bus value while its value is still stable.
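A possible sketch of Case 2 is given below. It assumes an 8-bit bus and a DATA_VALID signal that goes high once the bus is stable, with the bus held stable for at least a few destination clock periods afterwards. All names other than DATA_VALID (sync_bus, VALID_SYNC, NEW_DATA, DATA_IN, DATA_OUT) are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sync_bus is
  port (
    CLK        : in  std_logic;                     -- destination clock
    DATA_VALID : in  std_logic;                     -- qualifier from the source clock domain
    DATA_IN    : in  std_logic_vector(7 downto 0);  -- bus from the source clock domain
    DATA_OUT   : out std_logic_vector(7 downto 0);  -- bus value captured in the CLK domain
    NEW_DATA   : out std_logic                      -- one-period pulse when DATA_OUT is updated
  );
end entity sync_bus;

architecture rtl of sync_bus is
  -- bit 0: first (possibly meta-stable) stage, bit 1: clean value, bit 2: previous clean value
  signal VALID_SYNC : std_logic_vector(2 downto 0);
begin
  process(CLK) begin
    if rising_edge(CLK) then
      VALID_SYNC <= VALID_SYNC(1 downto 0) & DATA_VALID;
      NEW_DATA   <= '0';
      -- rising edge of the resynchronized DATA_VALID: the bus is assumed stable here
      if VALID_SYNC(2 downto 1) = "01" then
        DATA_OUT <= DATA_IN;
        NEW_DATA <= '1';
      end if;
    end if;
  end process;
end architecture rtl;
```

Only the single-bit qualifier goes through the anti-meta-stability stages; the bus itself is sampled once, when it is known to be stable.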
RESET methodology
The NG-MEDIUM internal Flip-Flops have a dedicated RESET input. The tile FFs can be reset synchronously or asynchronously, while the registers embedded in BRAM and DSP blocks support synchronous reset exclusively.
Global reset
The global reset can be synchronous or asynchronous for the tile FFs. However, in any case, to guarantee a safe startup, the reset signal must be properly resynchronized on the design master clock, in order to avoid any meta-stability condition during the first active clock cycle.
Remember that BRAM and DSP registers can be reset synchronously only (no asynchronous reset available). However, independently of the synchronous or asynchronous usage of the reset, the risk of meta-stability during the first active clock cycle exists if the reset signal is not synchronous to the clock.
When using PLL for internal clock(s) generation, remember that the generated clocks are not safe during the PLL locking process. For a safe design startup, ensure your design is reset at least until the PLL locked status (RDY) is set.
A simple and efficient mechanism consists of delaying the RDY output of the PLL by some clock periods, as in the following source code sample:
signal RESET_DELAY : std_logic_vector(7 downto 0);
signal INTERNAL_RESET : std_logic;
begin
  -- Shift register clocked by the PLL/WFG output: not(RDY) enters at bit 0
  process(CLK_generated_by_PLL_and_WFG) begin
    if rising_edge(CLK_generated_by_PLL_and_WFG) then
      RESET_DELAY <= RESET_DELAY(6 downto 0) & not(RDY);
    end if;
  end process;
  -- Active high reset, released 8 clock periods after RDY is sampled high
  INTERNAL_RESET <= RESET_DELAY(7);
RESET_DELAY is the delay line (8 steps of one clock period each). The last bit of the chain is used as INTERNAL_RESET. NanoXplore recommends using at least two levels of registers on the reset delay line. INTERNAL_RESET can be safely used as a synchronous or asynchronous reset of the design. In either case, the timing constraints will cover all timing paths from the INTERNAL_RESET source to any Flip-Flop, including BRAM, DSP block and IO FFs.
In NG-MEDIUM, the “nxmap” implementation tools will use the low skew network, if possible, for the internal reset routing, taking into account the high fanout of this signal.
Local reset
For a local RESET (to be applied only to a partial set of FFs), a synchronous reset should be preferred. This gives the implementation tools more control over the routing and logic delays.
Remember that an asynchronous reset is glitch sensitive.
Don’t apply both Asynchronous_SET and Asynchronous_RESET to the same Flip-Flop(s). The internal Flip-Flops only have a dedicated RESET input, which can be synchronous or asynchronous for the tile FFs, while the registers embedded in BRAM and DSP blocks support synchronous reset exclusively.
Things to avoid
Don’t use both clock rising and falling edges if not strictly necessary
Most designs can be implemented by using exclusively the clock(s) rising edges for the FPGA internal logic. This gives more flexibility and timing control to the implementation tools.
Don’t use internally generated clocks if not strictly necessary
Internally generated clocks (built with combinatorial or registered logic) create race conditions that lead to unpredictable or unstable behavior.
Remember that internal clocks can be easily and safely generated synchronously with the main clock by using the NG-MEDIUM PLL and Waveform Generators.
The first figure illustrates the schematics of a portion of a design, where an internal local clock is generated using a FF output. This creates a race condition, where the routing delays (of the data and of the local clock) impact the behavior, which becomes unstable over PVT variations. See the timing diagrams in the second figure (case 1 and case 2). The behavior is clearly routing dependent, and probably unstable over PVT.
Resynchronize the RESET signal on the clock domain
The tile Flip-Flops have a dedicated input for synchronous or asynchronous RESET.
Reset de-assertion is very critical
If the Reset signal (used as asynchronous or synchronous reset) is not synchronized on the FPGA clock, it can create setup violations on many Flip Flops during its de-assertion.
Risk of hazardous startup!!!
Timing constraints can’t help to avoid this problem
The RESET signal is propagated by using routing resources to the destination FFs. However, even if distributed by low skew lines, its de-assertion can be interpreted differently by the FFs, and can cause hazardous startup.
This issue can be easily overcome by resynchronizing the RESET input, using anti-metastability FFs.
The resynchronized RESET signal can be used as Asynchronous or Synchronous RESET.
If used as synchronous RESET, the implementation tools and the timing analyzer will control the propagation delays to ensure a predictable behavior at the specified frequency.
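The RESET resynchronization described in this section can be sketched as follows. This is a minimal illustration, assuming an active high external reset; the entity and signal names (reset_sync, RESET_IN, RESET_META, RESET_OUT) are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reset_sync is
  port (
    CLK       : in  std_logic;  -- clock of the domain that uses the reset
    RESET_IN  : in  std_logic;  -- external reset, asynchronous to CLK (assumed active high)
    RESET_OUT : out std_logic   -- reset resynchronized on CLK
  );
end entity reset_sync;

architecture rtl of reset_sync is
  signal RESET_META : std_logic;  -- first anti-meta-stability stage
begin
  process(CLK) begin
    if rising_edge(CLK) then
      RESET_META <= RESET_IN;     -- may become meta-stable when RESET_IN changes
      RESET_OUT  <= RESET_META;   -- asserted and de-asserted synchronously to CLK
    end if;
  end process;
end architecture rtl;
```

In this sketch both edges of RESET_OUT are synchronous to CLK, so the clock must be running for the reset to be asserted.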
Avoid using asynchronous RESET if possible
The tile Flip-Flops have a dedicated input for synchronous or asynchronous RESET.
Asynchronous reset is glitch sensitive, while synchronous reset is part of your synchronous design and is therefore covered by the period constraint, provided it is generated synchronously to the clock domain. The implementation tools and the timing analyzer will control the propagation delays to ensure a predictable behavior at the specified frequency.
Don’t use asynchronous SET
There is no dedicated asynchronous SET input on the tile FFs. However, synchronous SET can be easily and safely implemented by combining the LUT and the FF of the same FE (the NG-MEDIUM logic cell that includes one 4-input LUT and one D Flip-Flop).
The “nxmap” synthesis tools can still build the behavior of an asynchronous set, but at the cost of an FF, extra logic resources mapped to LUTs in another FE, and additional routing delays (more logic resources used, and poorer performance in terms of power consumption and working frequency).
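As an illustration, a synchronous SET can be described as in the following sketch. The entity and signal names (sync_set_ff, SET, D, Q) are hypothetical, and the exact mapping onto the LUT and FF of a single FE is left to the synthesis tools.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sync_set_ff is
  port (
    CLK : in  std_logic;
    SET : in  std_logic;  -- synchronous set, active high
    D   : in  std_logic;
    Q   : out std_logic
  );
end entity sync_set_ff;

architecture rtl of sync_set_ff is
begin
  process(CLK) begin      -- only the clock is in the sensitivity list
    if rising_edge(CLK) then
      if SET = '1' then
        Q <= '1';         -- the set is taken into account on the clock edge only
      else
        Q <= D;
      end if;
    end if;
  end process;
end architecture rtl;
```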
Don’t use asynchronous initialization from a given value (signal or constant)
Asynchronous initialization from a signal value will prevent the synthesis and implementation tools from using dedicated flip-flops. Combinatorial loops can be generated. The resulting behavior can be unpredictable.
Example of source code to be avoided:
process(CLK, INIT) begin
  if INIT = '1' then             -- Asynchronous initialization
    DATAR <= DATA_IN;            -- Assigned value is not a constant
  elsif rising_edge(CLK) then
    if ENA = '1' then
      DATAR <= CNT;
    end if;
  end if;
end process;
Instead, the following code is preferred:
process(CLK) begin
  if rising_edge(CLK) then
    if INIT = '1' then
      DATAR <= DATA_IN;          -- Synchronous initialization
    elsif ENA = '1' then
      DATAR <= CNT;
    end if;
  end if;
end process;
Writing efficient HDL source code
The quality of the source code is the most important factor to ensure an efficient, stable and predictable design.
Whenever possible, use a simple, compact and clear writing style.
HDL synthesis is the first step of the implementation process.
If for any reason, the synthesis results are not optimized enough for your design requirements, there will be no way to change this during the subsequent mapping, place and route processes.
The HDL source code is probably your main investment for maintainability, design density, power reduction and performance optimization
Source code must be optimized for the targeted architecture
As much as possible, it must be also flexible and portable (to other architectures or synthesis tools)
Readability is another very important factor
Write a direct, simple and clear source code
The more compact, the more readable
The synthesis tools can also make a better translation to take advantage of the silicon features when the source code is compact and clear
Avoid using combinatorial processes if not necessary
Combinatorial processes can generate latches and combinatorial loops. This can lead to unpredictable or unstable behavior, and have a negative impact on logic and routing resources utilization.
Be very careful if you have to write combinatorial processes.
Avoid declaring and using un-necessary combinatorial signals
In a synchronous design, the combinatorial signals are registered with FFs.
Generally, it’s simpler, faster and more efficient to describe the whole function (combinatorial logic and registers) in a single clocked process.
Don’t declare un-necessary signals if those signals must be registered
All NG-MEDIUM configurable elements have their own Flip-Flops (tile logic, BRAMs, DSP blocks, and IOs). The synthesis tool will automatically recognize that the function can be implemented in the same element (by packing the combinatorial logic and the FFs into the same logic element, such as Functional Elements, BRAMs or DSP blocks).
Reduced code size and improved readability
Apply this method also for state machines (you will avoid timing and implementation problems)
Consider the following VHDL source code that describes a pipelined adder-multiplier function.
signal A, B, C: std_logic_vector(15 downto 0); -- A, B and C inputs
signal A_REG, B_REG, C_REG: std_logic_vector(15 downto 0); -- registered inputs
signal A_PLUS_B: std_logic_vector(15 downto 0); -- combinatorial adder output
signal A_PLUS_B_REG: std_logic_vector(15 downto 0); -- registered adder output
signal MULT: std_logic_vector(31 downto 0); -- combinatorial multiplier output
signal MULT_REG: std_logic_vector(31 downto 0); -- registered multiplier output
begin
  process(CLK) begin
    if rising_edge(CLK) then
      A_REG <= A;
      B_REG <= B;
      C_REG <= C;
    end if;
  end process;
  A_PLUS_B <= A_REG + B_REG;
  process(CLK) begin
    if rising_edge(CLK) then
      A_PLUS_B_REG <= A_PLUS_B;
    end if;
  end process;
  MULT <= A_PLUS_B_REG * C_REG;
  process(CLK) begin
    if rising_edge(CLK) then
      MULT_REG <= MULT;
    end if;
  end process;
The same behavior can be written as follows. Readability is increased.
signal A, B, C: std_logic_vector(15 downto 0); -- A, B and C inputs
signal A_REG, B_REG, C_REG: std_logic_vector(15 downto 0); -- registered inputs
signal A_PLUS_B_REG: std_logic_vector(15 downto 0); -- registered adder output
signal MULT_REG: std_logic_vector(31 downto 0); -- registered multiplier output
begin
  process(CLK) begin
    if rising_edge(CLK) then
      A_REG <= A;
      B_REG <= B;
      C_REG <= C;
      A_PLUS_B_REG <= A_REG + B_REG;
      MULT_REG <= A_PLUS_B_REG * C_REG;
    end if;
  end process;
At this stage, we do not take into consideration the rules for signed or unsigned operations. This example just shows that there is an easy and compact way to describe the same functionality with very few lines.
For arithmetic and/or DSP functions, see the chapter DSP blocks.
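The same single clocked process style can be applied to state machines, as recommended earlier in this section. The following sketch is only an illustration: the entity, the state names and the ports (fsm_example, IDLE, RUN, FINISH, START, DONE, BUSY) are hypothetical, and a synchronous reset is assumed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity fsm_example is
  port (
    CLK   : in  std_logic;
    RST   : in  std_logic;  -- synchronous reset, active high
    START : in  std_logic;
    DONE  : in  std_logic;
    BUSY  : out std_logic   -- registered output
  );
end entity fsm_example;

architecture rtl of fsm_example is
  type STATE_TYPE is (IDLE, RUN, FINISH);
  signal STATE : STATE_TYPE;
begin
  -- State register, next-state logic and registered output in one clocked process
  process(CLK) begin
    if rising_edge(CLK) then
      if RST = '1' then
        STATE <= IDLE;
        BUSY  <= '0';
      else
        case STATE is
          when IDLE =>
            if START = '1' then
              STATE <= RUN;
              BUSY  <= '1';
            end if;
          when RUN =>
            if DONE = '1' then
              STATE <= FINISH;
            end if;
          when FINISH =>
            BUSY  <= '0';
            STATE <= IDLE;
        end case;
      end if;
    end if;
  end process;
end architecture rtl;
```

No separate combinatorial next-state signal is declared, and the output comes directly from a FF.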
Use appropriate sensitivity list
An inappropriate sensitivity list can lead to mismatches between synthesis and simulation. The simulation behavior will not necessarily match the implementation results.
For simple clocked processes (with NO asynchronous reset or initialization), only the clock signal must be in the sensitivity list
process(CLK) begin
if rising_edge(CLK) then
…… -- assignments
end if;
end process;
For clocked processes with asynchronous re-initialization of FFs (not recommended), clock and reset (or any other asynchronous signal) must be in the sensitivity list
process(CLK, RST) begin
  if RST = '1' then
    …… -- signal assignments to '0' or (others => '0')
  elsif rising_edge(CLK) then
    …… -- assignments
  end if;
end process;
For combinatorial processes (not recommended) all the involved signals in the assignments must be in the sensitivity list (does not prevent generation of latches).
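If a combinatorial process cannot be avoided, the following sketch shows a complete sensitivity list. It also uses a default assignment at the top of the process, which is a common way to avoid unintended latches. The entity and signal names (comb_mux, SEL, A, B, C, Y) are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity comb_mux is
  port (
    SEL     : in  std_logic_vector(1 downto 0);
    A, B, C : in  std_logic;
    Y       : out std_logic
  );
end entity comb_mux;

architecture rtl of comb_mux is
begin
  -- Every signal read in the process appears in the sensitivity list
  process(SEL, A, B, C) begin
    Y <= '0';               -- default value: Y is assigned in every case, so no latch
    if SEL = "00" then
      Y <= A;
    elsif SEL = "01" then
      Y <= B;
    elsif SEL = "10" then
      Y <= C;
    end if;
  end process;
end architecture rtl;
```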
Be careful when using relational operators
VHDL allows buses with a different number of bits to be compared.
The VHDL relational operators =, <, >, <= and >= work in a very surprising way when the operands do not have the same number of bits.
In addition, synthesis and simulation can have different interpretations, depending on the context.
Conclusion: Make sure both operands have the same bus length.
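One way to follow this rule is to extend the shorter operand explicitly before the comparison, as in the sketch below. It assumes that B is an unsigned value to be zero-extended; the entity and signal names (compare_example, A, B, B_EXT, EQ) are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity compare_example is
  port (
    A  : in  std_logic_vector(7 downto 0);
    B  : in  std_logic_vector(3 downto 0);
    EQ : out std_logic
  );
end entity compare_example;

architecture rtl of compare_example is
  signal B_EXT : std_logic_vector(7 downto 0);  -- B extended to the length of A
begin
  B_EXT <= "0000" & B;                   -- explicit zero extension (B assumed unsigned)
  EQ    <= '1' when A = B_EXT else '0';  -- both operands are 8 bits wide
end architecture rtl;
```

With operands of equal length, the result does not depend on which package definition of the operator is visible.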
Avoid using un-necessary variable in the synthesizable source code
If not strictly necessary, it’s safer and better to use signals instead of variables:
Most often, variables do not have a direct representation in the synthesized netlist or in simulation.
Very often, using variables makes the synthesizable source code more complex (for synthesis as well as simulation) while providing no benefit.
Conclusion: use signals instead of variables whenever possible.
Don’t declare ports and signals as integers – if not necessary
The packages std_logic_unsigned and std_logic_signed provide direct type conversion and additional functions. For example, you can simply write:
CNT <= CNT + 1; -- CNT is a std_logic_vector
Additional conversion functions are available when std_logic_arith is used together with std_logic_unsigned or std_logic_signed:
conv_integer(std_logic_vector_signal)
Converts the value of a signal declared as std_logic_vector to its equivalent integer value (according to the package in use, std_logic_unsigned or std_logic_signed).
conv_std_logic_vector(integer_value, number_of_bits)
Converts an integer value to its equivalent std_logic_vector value. The number of bits is specified as the second argument (according to the package in use, std_logic_unsigned or std_logic_signed).
Conclusion: use std_logic_vector unless there is a specific reason not to.
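The following sketch shows these functions in a small counter that keeps every port and signal as std_logic_vector. The entity name, the generic names and their default values (CNT_EXAMPLE, START_VALUE, TERMINAL_COUNT) are assumptions made only for this example.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

-- Hypothetical 8-bit counter, for illustration only.
entity CNT_EXAMPLE is
  generic (
    START_VALUE    : integer := 0;    -- assumed value
    TERMINAL_COUNT : integer := 199   -- assumed value
  );
  port (
    CLK     : in  std_logic;
    RST     : in  std_logic;          -- synchronous, active high
    CNT_OUT : out std_logic_vector(7 downto 0)
  );
end CNT_EXAMPLE;

architecture ARCHI of CNT_EXAMPLE is
  signal CNT : std_logic_vector(7 downto 0);
begin
  process(CLK) begin
    if rising_edge(CLK) then
      if RST = '1' then
        -- integer to std_logic_vector, sized from the signal itself
        CNT <= conv_std_logic_vector(START_VALUE, CNT'length);
      elsif conv_integer(CNT) = TERMINAL_COUNT then
        -- std_logic_vector to integer (unsigned interpretation)
        CNT <= conv_std_logic_vector(START_VALUE, CNT'length);
      else
        CNT <= CNT + 1;               -- arithmetic directly on std_logic_vector
      end if;
    end if;
  end process;

  CNT_OUT <= CNT;
end ARCHI;
```

Sizing the conversion with CNT'length keeps the code valid if the counter width is changed later.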
Don’t use inactive or redundant assignments
In VHDL, if no condition assigns a new value to a signal, the signal keeps its previous value.
In a combinatorial process, a condition that leaves a signal at its previous value will produce latches and/or combinatorial loops.
In a clocked process, if a condition enables some assignments, the dedicated FF load or clock enable input will be used.
Don’t use the null statement if not strictly necessary.
Keep your source code as compact as possible, taking advantage of the standard packages and the synthesis rules. In the following simple example, the commented-out lines (the else branch) add nothing. In more complex cases, such lines could even prevent the synthesis tools from optimizing the resulting netlist.
process(CLK) begin
  if rising_edge(CLK) then
    if ENA = '1' then
      SIG_R <= SIG_IN;
    -- else
    --   SIG_R <= SIG_R;
    end if;
  end if;
end process;
Other tricks and tips for a more compact and re-usable source code
Use (others => '0') when several bits are assigned to the same value.
Example for reset condition:
REG_DATA <= (others => '0');
Works for any bus length.
Example for high impedance output buffers:
if WRITE_EXT_MEM = '1' then
  DOUT <= REG_DATA;
else
  DOUT <= (others => 'Z');
end if;
Use generic parameters whenever your module can be re-used with different parameters. This is particularly useful for memory and DSP functions. Example :
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;
entity MY_MEMORY is
  generic (
    Number_of_address_bits : integer := 10;
    Number_of_data_bits : integer := 16
  );
  port (
    CLK : in std_logic;
    DIN : in std_logic_vector(Number_of_data_bits-1 downto 0);
    WE : in std_logic;
    ADR : in std_logic_vector(Number_of_address_bits-1 downto 0);
    DOUT : out std_logic_vector(Number_of_data_bits-1 downto 0)
  );
end MY_MEMORY;
architecture ARCHI of MY_MEMORY is
  -- 2**Number_of_address_bits words, each as wide as DIN
  type MEM_TYPE is array((2**Number_of_address_bits)-1 downto 0) of std_logic_vector(DIN'range);
  signal MEM : MEM_TYPE;
begin
  process(CLK) begin
    if rising_edge(CLK) then
      if WE = '1' then
        MEM(conv_integer(ADR)) <= DIN;
      else
        DOUT <= MEM(conv_integer(ADR)); -- DOUT doesn't change
      end if;                           -- during write
    end if;
  end process;
end ARCHI;
This same module can be instantiated several times in your design with different configuration for each instance, by assigning individual parameters sets to each instance (Number_of_address_bits and Number_of_data_bits assigned by generic map).
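A minimal sketch of such a multiple instantiation is given below. The top-level entity TWO_MEMORIES and its port names are hypothetical. The component declaration assumes that the ports of MY_MEMORY are declared with the modes in (CLK, DIN, WE, ADR) and out (DOUT), and that MY_MEMORY is compiled in the same library.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical top level with two differently sized memories.
entity TWO_MEMORIES is
  port (
    CLK    : in  std_logic;
    WE_A   : in  std_logic;
    ADR_A  : in  std_logic_vector(9 downto 0);
    DIN_A  : in  std_logic_vector(15 downto 0);
    DOUT_A : out std_logic_vector(15 downto 0);
    WE_B   : in  std_logic;
    ADR_B  : in  std_logic_vector(5 downto 0);
    DIN_B  : in  std_logic_vector(7 downto 0);
    DOUT_B : out std_logic_vector(7 downto 0)
  );
end TWO_MEMORIES;

architecture ARCHI of TWO_MEMORIES is
  component MY_MEMORY
    generic (
      Number_of_address_bits : integer := 10;
      Number_of_data_bits    : integer := 16
    );
    port (
      CLK  : in  std_logic;
      DIN  : in  std_logic_vector(Number_of_data_bits-1 downto 0);
      WE   : in  std_logic;
      ADR  : in  std_logic_vector(Number_of_address_bits-1 downto 0);
      DOUT : out std_logic_vector(Number_of_data_bits-1 downto 0)
    );
  end component;
begin
  -- Instance A: 1024 words of 16 bits
  MEM_A : MY_MEMORY
    generic map (Number_of_address_bits => 10, Number_of_data_bits => 16)
    port map (CLK => CLK, DIN => DIN_A, WE => WE_A, ADR => ADR_A, DOUT => DOUT_A);

  -- Instance B: 64 words of 8 bits
  MEM_B : MY_MEMORY
    generic map (Number_of_address_bits => 6, Number_of_data_bits => 8)
    port map (CLK => CLK, DIN => DIN_B, WE => WE_B, ADR => ADR_B, DOUT => DOUT_B);
end ARCHI;
```

The widths of the signals connected to each instance must match the generic values given in its generic map.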
Take advantage of the “std_logic_arith” and “std_logic_unsigned” or “std_logic_signed” package for a more compact and more readable source code, thanks to the implicit and explicit conversion functions.
Note that the package “numeric_std” is also supported by “nxmap”.
Design hierarchy
The design hierarchy must be organized in a logical way. Avoid mixing unrelated functions in the same hierarchical module. This makes it easier to assign synthesis options and directives.
Hierarchical modules can be synthesized separately to verify the quality of results, such as:
Used logic resources (tile logic, BRAM, DSP blocks…)
Timing performance of the considered module
Register all the outputs of the hierarchical modules if performance is required.
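As a small illustration of the last point, here is a sketch of a hierarchical module whose only output leaves the module through flip-flops. The entity and port names (ADD_REGOUT, A, B, SUM) are hypothetical.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

-- Hypothetical sub-module with a registered output, for illustration only.
entity ADD_REGOUT is
  port (
    CLK  : in  std_logic;
    A, B : in  std_logic_vector(7 downto 0);
    SUM  : out std_logic_vector(7 downto 0)
  );
end ADD_REGOUT;

architecture ARCHI of ADD_REGOUT is
begin
  process(CLK) begin
    if rising_edge(CLK) then
      SUM <= A + B;   -- the adder result is captured by FFs before leaving the module
    end if;
  end process;
end ARCHI;
```

With this style, the timing path of the adder ends inside the module, so the module can be evaluated separately without depending on the logic connected to its output.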
Naming rules of entities, component labels and signals:
Good and clear naming rules greatly improve readability.
Short but clear names ease recognition in debug and verification tools (timing analyzer, NXmap GUI, simulation).
However, names that are too short can lead to confusion or make some elements (signals, entities, components…) harder to recognize.
Be careful with reserved words and other commonly used words (WRITE, READ, BUS, COUNT…).
During the synthesis process, the FF output signals are automatically renamed by the tools, by adding “_reg” to the original name defined in the source code. As an example, a signal called “DATA_CHANNEL” in the source code will be renamed “DATA_CHANNEL_reg” after synthesis if it’s generated by a FF or group of FFs. Be careful not to name any other non-registered signal “DATA_CHANNEL_reg”, to avoid possible post-synthesis conflicts.
Signal names should reflect their polarity. For example:
RST is an active high signal (resets the FFs when it goes high), while RST_N is active low.
LOAD is active high (loading occurs when LOAD is high), while LOAD_N is active low.
Inference vs instantiation:
Inference advantages and limitations
Inference describes the behavior of the functions to be synthesized and implemented, using a standard HDL description. As a result, the source code is portable and can be used with other architectures or tools.
Synthesis and mapping options: most high performance NG-MEDIUM cells can be inferred from very simple source code and implemented as desired. However, by using some mapping options, the user can have better control over the synthesis and mapping processes, provided that the described functionality matches the behavior of the FPGA elements. This is particularly true for RAM inference, where a memory could be implemented with either RAM blocks or Register_file (RF). See “addMappingDirective” in the NanoXplore NXmap Python API documentation for more information.
However, some NG-MEDIUM built-in functions cannot be inferred. This is the case, for example, for the PLL, WaveformGenerators using patterns, some RAM and DSP block configurations, and high performance IOs using DDR or SERDES.
In such cases, it might be necessary to instantiate the primitives.
Instantiation of NG-MEDIUM primitives
NG-MEDIUM primitives can also be instantiated:
Register_file
RAM blocks
DSP blocks
PLL and WFG
Although memories (Register_file and BRAM) can be easily and efficiently inferred with simple and portable source code for most common functions, the user might prefer to instantiate the elements. The source code is then not portable, but the user can access some features that are not necessarily accessible by inference.
In addition, some primitives cannot be inferred. They can be used only by instantiation:
PLL most often combined with WaveForm Generator(s)
Dual port RAM with different bus sizes on both ports
Some DSP blocks configurations
Other primitives
See the “Library guide” for more information.
NG-MEDIUM architecture survey
Before starting your design, it’s important to acquire a proper understanding of the NG-MEDIUM architecture. From the user’s point of view, the architecture can be separated into three main blocks:
The user’s IO ring (organized in IO banks). It includes flexible single ended and/or differential IOs, DDR registers, SpaceWire compatible interface IO blocks, and many more features (input and output calibrated delay lines, output serializers, input de-serializers…).
There are also 4 clock generators, called CKG (PLL + WaveFormGenerators), one set in each corner of the die.
The FPGA core logic, which offers:
Tile logic for combinatorial or registered functions, arithmetic, register_files (64 x 16-bit synchronous simple dual port memories with EDAC)
Flexible 48Kbit synchronous true dual port memory blocks (including user-selectable EDAC)
DSP blocks for high performance complex DSP functions
The FPGA configuration logic and dedicated IO interface
Please refer to the next chapters as well as the NG-MEDIUM datasheet for detailed information.
NG-MEDIUM inputs and outputs (IOs)
The NG-MEDIUM user’s IOs are organized in 13 IO banks. Each IO bank has a single power supply (Vddio) shared by all the IOs of that bank.
The top and bottom banks are called “complex”. The complex IO banks provide more flexibility and performance than the left and right banks that are called “simple”.
All IOs can be configured as input, output or bi-directional IO.
The next figures show the IO banks location, and the numbering of the die IO blocks. For physical pin numbers on the selected package, please consult the NG-MEDIUM datasheet. Currently, NG-MEDIUM is available in three different packages:
LGA625 : Land-Grid array 625 pins
CGA625 : ceramic column-Grid array 625 pins
CQFP352 : ceramic Quad-flat package 352 pins
In addition, another bank located to the left of the FPGA die is used for FPGA configuration. This document doesn’t cover the configuration process. Please refer to the NG-MEDIUM datasheet for detailed information about the FPGA configuration modes and pins.
Available user’s IOs
LGA/CGA 625 packages:
All 374 user IOs of the die are available.
Bank | Type | I/Os | Location | Bank | Type | I/Os |
0 | Simple | 22 | Left | 1 | Simple | 22 |
2 | Complex | 30 | Bottom | 3 | Complex | 30 |
4 | Complex | 30 | Bottom | 5 | Complex | 30 |
6 | Simple | 30 | Right | 7 | Simple | 30 |
8 | Simple | 30 | Right | | | |
9 | Complex | 30 | Top | 10 | Complex | 30 |
11 | Complex | 30 | Top | 12 | Complex | 30 |
CQFP 352 package:
Only 192 of the 374 user IOs of the die are available.
Bank | Type | I/Os | Location | Bank | Type | I/Os |
0 | Simple | 14 | Left | 1 | Simple | 12 |
2 | Complex | 30 | Bottom | 3 | Complex | - |
4 | Complex | - | Bottom | 5 | Complex | 30 |
6 | Simple | 22 | Right | 7 | Simple | - |
8 | Simple | 24 | Right | | | |
9 | Complex | 30 | Top | 10 | Complex | - |
11 | Complex | - | Top | 12 | Complex | 30 |
Simple and complex banks IO features
Each IO is composed of two different and complementary elements:
The IO buffers (primitives NX_IOB_x) can be used as input, output or bi-directional buffers. Single ended I/Os have a default 10K to 40K PullUp. The IO buffers can be configured for a wide range of single ended and differential electrical standards (LVCMOS, SSTL, HSTL and LVDS). They can be instantiated or inferred, and they can also be parametrized in the “nxpython” script to adapt their electrical configuration to the board design requirements. The parameters include:
Output drive
Slew rate (for LVCMOS outputs)
Turbo mode (faster LVCMOS input)
Optional 2K to 6K PullUp
Optional adjustable termination (only for complex banks – SSTL, HSTL, LVDS)
The sequential input and output elements (input, output and tri-state control FFs), which include:
Single flip-flop on input, output and tri-state paths
Optional adjustable delay lines on the input, output and tri-state paths (0 to 63 x 160 ps delay)
The NG-MEDIUM IO ring is segmented into 13 IO banks. The left (B0 and B1) and right (B6, B7 and B8) IO banks offer flexible but limited features. They are called “simple” banks.
In contrast, the top (B9, B10, B11 and B12) and bottom (B2, B3, B4 and B5) banks are called “complex” banks and offer more electrical and functional features.
The following table summarizes the main IO features available in complex and simple IO banks.
Feature | Complex | Simple |
Number of IOs | 30 | 30/22 |
Power supply (Vddio) | 1.8, 2.5 or 3.3 V | 1.8, 2.5 or 3.3 V |
Supported IO standards | LVCMOS, SSTL, HSTL, LVDS | LVCMOS, SSTL, HSTL, LVDS |
Single DFF (in, out, tri-state) | Yes | Yes |
Differential SSTL/HSTL | Yes | No |
LVDS | Yes | Yes (no internal input termination) |
Resistive input termination | Yes | No |
Programmable input/output delay | Yes | Yes |
CDC (Clock Domain Changer) | Yes | No |
Shift Register | Yes | No |
DDR mode | Yes | No |
SpaceWire | Yes | No |
Simple banks: Electrical standards and supported electrical parameters
Standard | Type | Bank supply | Drive | Speed | Special considerations |
LVCMOS 3.3V | SE (*) | 3.3 V | 2–16 mA | 100 Mb/s | In NG-MEDIUM, single ended I/Os have an internal default 10K to 40K PullUp. In addition, the user can activate an optional lower value (2K to 6K) PullUp. Slew rate SLOW/FAST for outputs. Turbo mode for inputs (faster inputs at the cost of higher static power). These electrical parameters can be set by constraints in a script file. |
© NanoXplore 2025