Table of contents
Table of figures
Typical example of physical path between two FFs
Possible FF meta-stability when setup/hold time violation occurs
Simple anti-meta-stability method using an additional FF
Safely resampling an asynchronous bus
Example of internally generated local clock
Illustration of possible timing diagrams
RESET distribution and associated timing diagrams
IO banks distribution and user’s IOs availability on LGA/CGA625 packages
IO banks distribution and user’s IOs availability on CQFP352 package
Impedance adaptation resistor connected to VTO for single ended SSTL or HSTL inputs
Input impedance adaptation for differential SSTL/HSTL inputs
SSTL/HSTL differential output buffer pair
Bidirectional SSTL/HSTL input/output buffer pair
Input impedance adaptation for LVDS inputs
Basic IO configuration (simple and complex banks)
SERDES data path simplified diagram
SERDES delay lines control block simplified diagram
Writing and reading delay registers (note that DIG = ‘1’)
Simplified clock distribution diagram
Simplified ClocK Generator (CKG) diagram
Simplified WaveForm Generator diagram
Diagram for WFGs synchronization
Synchronized WFGs timing diagram example
PLL divided outputs timing diagram (1)
PLL divided outputs timing diagram (2)
Multiple clocks generation with basic WFG configurations
Multiple clocks generation using WFG input inverter
Optimized multiple clock generation
NG-MEDIUM clock distribution overview
NG-MEDIUM global clock distribution (FPGA fabric & IOs)
NG-MEDIUM alternate global clock distribution (FPGA fabric) or complex banks fast IO clocks
NG-MEDIUM very fast local IO clock distribution
NG-MEDIUM central clock switch
NG-MEDIUM Input and output paths with classic clock distribution
Output timing with classic clock distribution
Input timing with classic clock distribution
NG-MEDIUM input and output paths improvement using PLL
Output timing improvement using PLL with feedback by clock tree
Input timing improvement using PLL with feedback by clock tree
Zero delay clock generation with additional clocks
Timing diagram of the Zero delay clock generation
Functional Element (FE) simplified diagram
Distribution of the logic resources available in a tile
Carry logic directly connected to 4 of the 8 neighboring FEs
X-LUT combines the output of the 4 neighboring FE’s LUTs
Register_File includes 64x16 SDP RAM array + 32 associated FEs
Register_File simplified internal diagram
RAM block organization and physical/logical mapping without EDAC
RAM block organization and physical/logical mapping with EDAC
Chaining DSP blocks in a same CGB row
Chaining DSP blocks in a same CGB row is from right to left on the
High performance Pipelined Multiplier 24 x 30 with rounding
Sequential FIR filter implementation with a single MAC
Sequential symmetrical FIR filter with a single Pre-add / MAC
Direct form parallel FIR filter
Direct form parallel FIR filter with adder tree
Direct form parallel FIR filter with adder chain
Transpose structure FIR filter
Transpose structure FIR filter and associated DSP blocks configuration
Systolic structure FIR filter and associated DSP blocks configuration
Symmetric systolic structure FIR filter
Symmetric transpose structure FIR filter
Table of tables
LGA/CGA 625 packages – available IOs
CQFP 352 package – available IOs
Simple and complex banks IO features summary
Simple IO banks electrical parameters and performance
Complex IO banks electrical parameters and performance
Recommended “termination” parameter values for 50, 75 and 100 Ohms impedance
Digital design methodology
NG-FPGAs offer a very flexible architecture that allows the implementation of a wide range of applications. However, the user must understand that a safe and reproducible behavior can be guaranteed only if some simple but efficient and necessary design rules have been adopted during the design steps.
In an FPGA, just like in any other electronic component, the internal logic and routing delays vary with Process, Voltage and Temperature (PVT).
However, NG-MEDIUM and the “nxmap” synthesis and implementation tools provide a robust architecture as well as implementation procedures that ensure a safe and reproducible behavior against those delay variations, provided that some very simple design rules are followed.
To guarantee correct behavior, the clock(s) must be distributed using dedicated routing resources that ensure very low skew across the FPGA die. The NG-MEDIUM FPGA fabric is split into two zones, or clock regions. Each zone can use up to 12 low skew signals.
Clock(s) and other low skew routing resources: “nxmap” automatically assigns low skew routing resources to the clocks of your design, to ensure that the delay at the clock input of all FFs will be controlled, and the skew (maximum delay difference at the destinations) will be low enough and predictable. The maximum clock skew across the FPGA fabric is then controlled by construction, and limited to some tens of picoseconds.
Note that the low skew network can optionally be used for other high fanout signals. This can be the case for heavily loaded signals like RESET and LOAD_ENABLE, if they are applied to a large number of FFs across the FPGA. In this case the “low skew” feature is less important than the maximum delay allowed to reach a large number of destinations. This maximum delay can also be guaranteed by the use of the low skew network.
Applying timing constraints to your design: The user can constrain the design to support the required clock(s) frequency. The timing constraints are specified in the “nxpython” script file.
By applying a “Period” constraint to the clock, the user specifies the maximum delay allowed from any FF output to any FF input (using the same clock edge) after implementation. The “Timing Analyzer” is embedded into the synthesis and implementation tools. It interacts with each of the synthesis and implementation processes in order to find, if possible, a solution that meets the user-defined timing requirements.
The user can generate timing reports to check that the timing constraints were met during the implementation process.
The synthesis and implementation tools manage two different kinds of timing:
Silicon process related timing: The maximum values are fixed by the foundry silicon process (always the same for a specified device and speed-grade). Most timing parameters are documented in the datasheet.
Clock skew: difference between the clock delays at the destination FF and at the source FF clock inputs. This difference can be positive or negative, but is always limited to some tens of picoseconds by the silicon process. The timing analyzer gives the exact clock skew value for each analyzed path.
Clock skew = Clock_delay@FF_dest – Clock_delay@FF_src
Tco: FF clock to output delay. Defines the delay from the clock edge at the FF clock input until the Q output is stable and valid.
Tsu: FF setup time. Defines how long the data must be stable at the D input of the target FF before the clock edge in order to be sampled safely.
Tcomb: combinatorial logic and routing delays (LUTs, carry logic and other combinatorial elements).
Implementation related timing: The routing delays on the connections (nets) between sources and destinations. Those delays are defined by the “Route” process (depending on the routing resources used). Note that the routing depends on the placement.
“nxmap” synthesis and implementation algorithms try to find a solution that meets the user’s timing requirements specified by constraints. The user can generate timing reports to analyze the implemented results.
To guarantee a stable behavior over PVT variations, the following condition must be met:
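Period >= Tco(source) + ∑Tnets + ∑Tcomb + Tsu(destination) – Clock skew
With the clock skew defined as Clock_delay@FF_dest – Clock_delay@FF_src, a positive skew delays the capturing clock edge and relaxes the condition, while a negative skew tightens it.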
Among the analysis tools:
Timing Analyzer: allows the user to generate timing reports. The timing analyzer commands are documented in the “NanoXplore nxmap Python API” documentation. Among the timing report information:
Identification of the clock domains detected in the design
Slack (timing margin) to meet the specified constraint(s). If the timing constraints are met, the slack is positive. Otherwise, the slack is negative.
Detailed delays on selected path(s)
“nxmap” GUI: Graphical User’s Interface to analyze placement and routing results. See /wiki/spaces/~814749387/pages/48660481 documentation for detailed information. The user can observe the location of each IO port, tile logic elements, BRAMs, DSP blocks, PLL and WaveForm Generators and have a detailed or simplified view of the used routing resources.
Synchronous design methodology
Did you ever face a “haunted” design? The kind of design that surprisingly works “most of the time”… but fails “sometimes”, particularly during demonstrations or inaugurations?
Most of the time, if not due to PCB or power supplies issues, those random problems are due to:
Inadequate clock distribution (e.g.: local clock distributed by general routing resources), generally related to internal clock generation with logic.
Inadequate reset methodology
Local asynchronous SET and/or RESET conditions in the source code.
Lack of anti-meta-stability and resynchronization stages when crossing asynchronous clock domains.
Missing timing constraints. “nxmap” ignores the timing on unconstrained paths, and therefore doesn’t warn about possible timing errors on them.
Recommended clocking schemes
Single clock rising edge:
Whenever possible, this is the recommended, simplest and safest clock distribution scheme. It provides the easiest way for both user and tools to implement a safe design working at any frequency from DC to a maximum frequency determined by the timing analyzer (period limited by the longest path from FF to FF).
The internal timing can be constrained with a single timing constraint: “Period”. However, inputs and outputs of the design might require additional constraints to specify the timing required on the inputs (setInputDelay) and the outputs (setOutputDelay), according to the datasheet of the external components. See the timing constraints chapter for more information.
In order to reduce the FPGA internal clock propagation delay, the PLL + WFG (WaveFormGenerator) can be used to generate a ZERO Delay clock distribution. Cancelling the clock distribution delay reduces the FPGA clock-to-output delay at the pads, and reduces or eliminates potential hold time problems on the FPGA inputs.
Multiple synchronous clocks: Using dedicated clock management resources
NG-MEDIUM provides 4 sets of clock management resources (ClocK Generators or CKG), located at the corners of the die. Each CKG includes one PLL and 8 WaveForm Generators.
The PLL reference input frequency can come from single ended or differential semi-dedicated clock input pins. At its outputs, the PLL can generate a wide range of clock outputs by applying frequency multiplication and/or division factors on the incoming input.
In order to reduce the FPGA internal clock propagation delay, the PLL + WFG can be used to generate a ZERO Delay clock distribution. Cancelling the clock distribution delay reduces the FPGA clock-to-output delay at the pads, and reduces or eliminates potential hold time problems on the FPGA inputs.
The WaveForm Generators (WFG) can be used as clock buffers. They provide direct routing to the low skew network. The WFG can also be used to generate clock dividers and user programmable patterns. See NG-MEDIUM datasheet for more information.
By combining PLL and WaveFormGenerators, the user can generate internally synchronous clocks, like for example:
Main_clock: same phase and frequency as the input clock pad
Higher_frequency_clock: a multiple of the input frequency (ex: Fin x 2)
Lower_frequency_clock: divided input clock frequency (ex: Fin / 2)
Those 3 clocks are synchronous with each other. Clock domain changes are easily managed by the synthesis and implementation tools, and there are no meta-stability or resynchronization issues as long as the timing constraints are met (more details on meta-stability issues in chapter 1.2.3).
Multiple asynchronous clocks: Meta-stability issues and resynchronization
When a signal synchronous to one clock is resampled by another, asynchronous clock, it must be resynchronized to avoid unstable behavior due to meta-stability.
Meta-stability is an invalid logic level at a FF output, caused by a transition on the D input of the FF during its setup/hold window. This invalid logic level can cause incorrect behavior of your design.
When registering an asynchronous signal, the meta-stability phenomenon can’t be avoided, but simple design rules make it possible to cancel its negative and unreproducible effects.
Fortunately, meta-stability doesn’t propagate from one FF to the next, provided that the connection delay to the second FF is limited to a small fraction of the clock period.
Two cases must be considered.
Case 1: Resynchronizing a single signal (one bit):
In this example, the first FF will be subject to meta-stability. However, this invalid logic level will not be propagated to the next FF, particularly if the propagation delay between both FFs is short.
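As an illustration of Case 1, here is a minimal sketch of a two-FF resynchronization stage. The entity, port and signal names (SYNC_BIT, ASYNC_IN, META_FF, SYNC_FF) are illustrative assumptions and do not come from the NanoXplore documentation.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Two-stage resynchronizer for a single-bit asynchronous signal.
-- All names are hypothetical and chosen for this example only.
entity SYNC_BIT is
  port (
    CLK      : in  std_logic;  -- destination clock
    ASYNC_IN : in  std_logic;  -- signal coming from another clock domain
    SYNC_OUT : out std_logic   -- version usable in the CLK domain
  );
end SYNC_BIT;

architecture RTL of SYNC_BIT is
  signal META_FF : std_logic;  -- first stage, may become metastable
  signal SYNC_FF : std_logic;  -- second stage, samples the settled value
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      META_FF <= ASYNC_IN;  -- no logic between the two FFs
      SYNC_FF <= META_FF;
    end if;
  end process;

  SYNC_OUT <= SYNC_FF;
end RTL;
```

Only SYNC_OUT should be used by the rest of the design. Keeping the two FFs back to back, with no logic in between, helps keep the connection delay between them short.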
Case 2: Resynchronizing a multibit bus (two or more bits):
Multibit buses cannot be safely resynchronized just by applying the same technique to each bit of the bus, because the bits may not all be captured on the same clock edge. Fortunately, a bus is generally qualified by an additional signal such as “DATA_VALID” or any other signal that indicates when the bus has a stable value. Thus, the user can safely resynchronize this control signal, and use the resynchronized version to sample the bus value.
The user must make sure that the clock frequency is high enough to sample the bus value while its value is still stable.
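The bus case can be sketched as follows, assuming that DATA_IN stays stable for at least three CLK periods after DATA_VALID goes high. The entity name, the WIDTH generic, the DATA_STB output and the internal signal names are assumptions made for this example.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Resynchronize DATA_VALID, then use its rising edge to sample the bus.
-- Names and the 3-stage depth are illustrative assumptions.
entity SYNC_BUS is
  generic (
    WIDTH : integer := 8
  );
  port (
    CLK        : in  std_logic;                           -- destination clock
    DATA_IN    : in  std_logic_vector(WIDTH-1 downto 0);  -- asynchronous bus
    DATA_VALID : in  std_logic;                           -- qualifies DATA_IN
    DATA_OUT   : out std_logic_vector(WIDTH-1 downto 0);  -- bus in CLK domain
    DATA_STB   : out std_logic                            -- one-cycle pulse on update
  );
end SYNC_BUS;

architecture RTL of SYNC_BUS is
  -- bit 0: first stage (may be metastable), bit 1: resynchronized value,
  -- bit 2: previous resynchronized value, used for edge detection
  signal VALID_SYNC : std_logic_vector(2 downto 0);
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      VALID_SYNC <= VALID_SYNC(1 downto 0) & DATA_VALID;
      DATA_STB   <= '0';
      -- rising edge of the resynchronized DATA_VALID
      if VALID_SYNC(2) = '0' and VALID_SYNC(1) = '1' then
        DATA_OUT <= DATA_IN;  -- bus is assumed stable at this point
        DATA_STB <= '1';
      end if;
    end if;
  end process;
end RTL;
```

Only the single-bit control signal goes through the anti-meta-stability stages. The bus itself is sampled once, at a time when it is expected to be stable.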
RESET methodology
The NG-MEDIUM internal Flip-Flops have a dedicated RESET input. The tile FFs can be reset synchronously or asynchronously, while the registers embedded in BRAM and DSP blocks support synchronous reset exclusively.
Global reset
The global reset can be synchronous or asynchronous for the tile FFs. However, in any case, to guarantee a safe startup, the reset signal must be properly resynchronized on the design master clock, in order to avoid any meta-stability condition during the first active clock cycle.
Remember that BRAM and DSP registers can be reset synchronously only (no asynchronous reset available). However, independently of the synchronous or asynchronous usage of the reset, the risk of meta-stability during the first active clock cycle exists if the reset signal is not synchronous to the clock.
When using PLL for internal clock(s) generation, remember that the generated clocks are not safe during the PLL locking process. For a safe design startup, ensure your design is reset at least until the PLL locked status (RDY) is set.
A simple and efficient mechanism consists in delaying the RDY output of the PLL by some clock periods, as in the following source code sample:
    signal RESET_DELAY    : std_logic_vector(7 downto 0);
    signal INTERNAL_RESET : std_logic;

    begin

    process(CLK_generated_by_PLL_and_WFG)
    begin
        if rising_edge(CLK_generated_by_PLL_and_WFG) then
            -- Shift register: bit 0 samples not(RDY), bit 7 is the delayed reset
            RESET_DELAY <= RESET_DELAY(6 downto 0) & not(RDY);
        end if;
    end process;

    INTERNAL_RESET <= RESET_DELAY(7);
RESET_DELAY is the delay line (8 steps of one clock period each). The last bit of the chain is used as INTERNAL_RESET, which can be safely used as synchronous or asynchronous reset of the design. NanoXplore recommends using at least two levels of registers on the reset delay line. In any case, the timing constraints will cover all timing paths from the INTERNAL_RESET source to any Flip-Flop, including BRAM, DSP blocks and IO FFs.
In NG-MEDIUM, the “nxmap” implementation tools will use the low skew network, if possible, for the internal reset routing, taking into account the high fanout of this signal.
Local reset
For a local RESET (to be applied only to a partial set of FFs), the synchronous way should be preferred. This gives more control of the routing and logic delays to the implementation tools.
Remember that an asynchronous reset is glitch sensitive.
Don’t apply both Asynchronous_SET and Asynchronous_RESET to the same Flip-Flop(s). The internal Flip-Flops only have a dedicated RESET input (synchronous or asynchronous). The tile FFs reset can be synchronous or asynchronous, while the registers embedded in BRAM and DSP blocks support synchronous reset exclusively.
Things to avoid
Don’t use both clock rising and falling edges if not strictly necessary
Most designs can be implemented by using exclusively the clock(s) rising edges for the FPGA internal logic. This gives more flexibility and timing control to the implementation tools.
For DDR inputs and outputs (for example when using DDR SDRAM or some ADCs or DACs) NG-MEDIUM provides dedicated input and output DDR Registers into complex IO Banks.
Don’t use internally generated clocks if not strictly necessary
Internally generated clocks (by using combinatorial or registered logic) create race conditions that lead to unpredictable or unstable behavior.
Remember that internal clocks can be easily and safely generated synchronously with the main clock by using the NG-MEDIUM PLL and Waveform Generators.
The first figure illustrates the schematic of a portion of a design, where an internal local clock is generated using a FF output. This creates a race condition, where the routing delays (of the data and the local clock) impact the behavior, which will be unstable over PVT variations. See the timing diagrams on the second figure (case 1 and case 2). We can clearly see that the behavior is routing dependent, and probably unstable over PVT.
Resynchronize the RESET signal on the clock domain
The tile Flip-Flops have a dedicated input for synchronous or asynchronous RESET.
Reset de-assertion is very critical
If the Reset signal (used as asynchronous or synchronous reset) is not synchronized on the FPGA clock, it can create setup violations on many Flip Flops during its de-assertion.
Risk of hazardous startup!!!
Timing constraints can’t help to avoid this problem
The RESET signal is propagated by using routing resources to the destination FFs. However, even if distributed by low skew lines, its de-assertion can be interpreted differently by the FFs, and can cause hazardous startup.
This issue can be easily overcome by resynchronizing the RESET input, using anti-metastability FFs.
The resynchronized RESET signal can be used as Asynchronous or Synchronous RESET.
If used as synchronous RESET, the implementation tools and the timing analyzer will control the propagation delays to ensure a predictable behavior at the specified frequency.
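The resynchronization of an external RESET input described above can be sketched as follows. The entity and signal names are hypothetical, and the two-stage depth is an assumption based on the usual anti-meta-stability structure.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Resynchronize an external reset on the design clock.
-- Names are illustrative; RESET_IN is assumed active high.
entity RESET_SYNC is
  port (
    CLK       : in  std_logic;
    RESET_IN  : in  std_logic;  -- asynchronous to CLK
    RESET_OUT : out std_logic   -- asserted and de-asserted synchronously to CLK
  );
end RESET_SYNC;

architecture RTL of RESET_SYNC is
  signal RESET_FF : std_logic_vector(1 downto 0);  -- anti-meta-stability FFs
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      RESET_FF <= RESET_FF(0) & RESET_IN;
    end if;
  end process;

  RESET_OUT <= RESET_FF(1);
end RTL;
```

Because RESET_OUT changes only on a CLK rising edge, its de-assertion is seen by the destination FFs as an ordinary synchronous signal, which the period constraint can cover when it is used as a synchronous reset.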
Avoid using asynchronous RESET if possible
The tile Flip-Flops have a dedicated input for synchronous or asynchronous RESET.
Asynchronous reset is glitch sensitive, while synchronous reset is part of your synchronous design, and is therefore covered by the period constraint if it is generated synchronously to the clock domain. The implementation tools and the timing analyzer will control the propagation delays to ensure a predictable behavior at the specified frequency.
Don’t use asynchronous SET
There is no dedicated asynchronous SET input on the tile FFs. However, synchronous SET can be easily and safely implemented by combining the LUT + FF of the same FE (NG-MEDIUM logic cell that includes one 4-input LUT and one D Flip-Flop).
“nxmap” synthesis tools can still build the behavior of an asynchronous set, but at the cost of an FF, extra logic resources mapped to LUTs in another FE, and additional routing delays (more logic resources used, and poorer performance in terms of power consumption and working frequency).
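A synchronous SET can be described as in the following minimal sketch. The entity and port names are hypothetical. The next-state function is a simple 2-input function of SYNC_SET and D, so it should fit in the LUT that precedes the FF, although the actual mapping is decided by the tools.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Flip-flop with synchronous, active high SET.
-- Names are illustrative only.
entity FF_SYNC_SET is
  port (
    CLK      : in  std_logic;
    SYNC_SET : in  std_logic;  -- sampled on the rising edge of CLK
    D        : in  std_logic;
    Q        : out std_logic
  );
end FF_SYNC_SET;

architecture RTL of FF_SYNC_SET is
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      if SYNC_SET = '1' then
        Q <= '1';  -- SET has priority over the data input
      else
        Q <= D;
      end if;
    end if;
  end process;
end RTL;
```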
Don’t use asynchronous initialization from a given value (signal or constant)
Asynchronous initialization from a signal value will prevent the synthesis and implementation tools from using dedicated flip-flops. Combinatorial loops can be generated. The resulting behavior can be unpredictable.
Example of source code to be avoided:
    process(CLK, INIT)
    begin
        if INIT = '1' then              -- Asynchronous initialization
            DATAR <= DATA_IN;           -- Assigned value is not a constant
        elsif rising_edge(CLK) then
            if ENA = '1' then
                DATAR <= CNT;
            end if;
        end if;
    end process;

Instead, the following code is preferred:

    process(CLK)
    begin
        if rising_edge(CLK) then
            if INIT = '1' then
                DATAR <= DATA_IN;       -- Synchronous initialization
            elsif ENA = '1' then
                DATAR <= CNT;
            end if;
        end if;
    end process;
Writing efficient HDL source code
The quality of the source code is the most important factor to ensure an efficient, stable and predictable design.
Whenever possible, use a simple, compact and clear writing style.
HDL synthesis is the first step of the implementation process.
If for any reason, the synthesis results are not optimized enough for your design requirements, there will be no way to change this during the subsequent mapping, place and route processes.
The HDL source code is probably your main investment for maintainability, design density, power reduction and performance optimization
Source code must be optimized for the targeted architecture
As much as possible, it must be also flexible and portable (to other architectures or synthesis tools)
Readability is another very important factor
Write a direct, simple and clear source code
The more compact, the more readable
The synthesis tools can also make a better translation to take advantage of the silicon features when the source code is compact and clear
Avoid using combinatorial processes if not necessary
Combinatorial processes can generate latches and combinatorial loops. This can lead to unpredictable or unstable behavior, and have a negative impact on logic and routing resource utilization.
Be very careful if you have to write combinatorial processes.
Avoid declaring and using un-necessary combinatorial signals
In a synchronous design, the combinatorial signals are registered with FFs.
Generally, it’s simpler, faster and more efficient to describe the whole function (combinatorial logic and registers) in a single clocked process.
Don’t declare un-necessary signals if those signals must be registered
All NG-MEDIUM configurable elements have their own Flip-Flops (tile logic, BRAMs, DSP blocks, and IOs). The synthesis tool will automatically recognize that the function can be implemented in the same element (by packing the combinatorial logic and the FFs into the same logic element such as Functional Elements, BRAMs or DSP blocks).
Reduced code size and improved readability
Apply this method also for state machines (you will avoid timing and implementation problems)
Have a look at the following VHDL source code that describes a pipelined adder-multiplier function.

    signal A, B, C: std_logic_vector(15 downto 0);                 -- A, B and C inputs
    signal A_REG, B_REG, C_REG: std_logic_vector(15 downto 0);     -- registered inputs
    signal A_PLUS_B: std_logic_vector(15 downto 0);                -- combinatorial adder output
    signal A_PLUS_B_REG: std_logic_vector(15 downto 0);            -- registered adder output
    signal MULT: std_logic_vector(31 downto 0);                    -- combinatorial multiplier output
    signal MULT_REG: std_logic_vector(31 downto 0);                -- registered multiplier output

    begin

    process(CLK)
    begin
        if rising_edge(CLK) then
            A_REG <= A;
            B_REG <= B;
            C_REG <= C;
        end if;
    end process;

    A_PLUS_B <= A_REG + B_REG;

    process(CLK)
    begin
        if rising_edge(CLK) then
            A_PLUS_B_REG <= A_PLUS_B;
        end if;
    end process;

    MULT <= A_PLUS_B_REG * C_REG;

    process(CLK)
    begin
        if rising_edge(CLK) then
            MULT_REG <= MULT;
        end if;
    end process;
The same behavior can be written as follows. Readability is increased.
    signal A, B, C: std_logic_vector(15 downto 0);                 -- A, B and C inputs
    signal A_REG, B_REG, C_REG: std_logic_vector(15 downto 0);     -- registered inputs
    signal A_PLUS_B_REG: std_logic_vector(15 downto 0);            -- registered adder output
    signal MULT_REG: std_logic_vector(31 downto 0);                -- registered multiplier output

    begin

    process(CLK)
    begin
        if rising_edge(CLK) then
            A_REG        <= A;
            B_REG        <= B;
            C_REG        <= C;
            A_PLUS_B_REG <= A_REG + B_REG;
            MULT_REG     <= A_PLUS_B_REG * C_REG;
        end if;
    end process;

At this stage, we do not take into consideration the rules for signed or unsigned operations. This example just shows that there is an easy and compact way to describe the same functionality with very few lines.
For arithmetic and/or DSP functions, see the chapter DSP blocks.
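The single clocked process style recommended above also applies to state machines. The following is a minimal sketch in which the state names, the ports and the transitions are purely illustrative assumptions.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- State machine written as one clocked process: the next-state logic
-- and the state register are described together.
-- All names and transitions are hypothetical.
entity SIMPLE_FSM is
  port (
    CLK   : in  std_logic;
    RST   : in  std_logic;  -- synchronous reset, active high
    START : in  std_logic;
    DONE  : in  std_logic;
    BUSY  : out std_logic   -- registered output
  );
end SIMPLE_FSM;

architecture RTL of SIMPLE_FSM is
  type STATE_TYPE is (IDLE, RUN, FINISH);
  signal STATE : STATE_TYPE;
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      if RST = '1' then
        STATE <= IDLE;
        BUSY  <= '0';
      else
        case STATE is
          when IDLE =>
            if START = '1' then
              STATE <= RUN;
              BUSY  <= '1';
            end if;
          when RUN =>
            if DONE = '1' then
              STATE <= FINISH;
            end if;
          when FINISH =>
            STATE <= IDLE;
            BUSY  <= '0';
        end case;
      end if;
    end if;
  end process;
end RTL;
```

There is no separate combinatorial next-state process, so there is no sensitivity list to maintain and no risk of inferring latches, and the BUSY output comes directly from a FF.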
Use appropriate sensitivity list
An inappropriate sensitivity list can lead to synthesis/simulation mismatches. The simulation behavior will not necessarily match the implementation results.
For simple clocked processes (with NO asynchronous reset or initialization), only the clock signal must be in the sensitivity list
    process(CLK)
    begin
        if rising_edge(CLK) then
            ……  -- assignments
        end if;
    end process;
For clocked processes with asynchronous re-initialization of FFs (not recommended), clock and reset (or any other asynchronous signal) must be in the sensitivity list
    process(CLK, RST)
    begin
        if RST = '1' then
            ……  -- signal assignments to '0' or (others => '0')
        elsif rising_edge(CLK) then
            ……  -- assignments
        end if;
    end process;
For combinatorial processes (not recommended) all the involved signals in the assignments must be in the sensitivity list (does not prevent generation of latches).
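When a combinatorial process cannot be avoided, the sketch below shows one common way to write it: every signal that is read appears in the sensitivity list, and the output receives a default value at the top of the process so that no latch is inferred. The entity and port names are hypothetical.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Combinatorial 3-to-1 selector.
-- Names are illustrative only.
entity MUX_COMB is
  port (
    SEL     : in  std_logic_vector(1 downto 0);
    A, B, C : in  std_logic;
    Y       : out std_logic
  );
end MUX_COMB;

architecture RTL of MUX_COMB is
begin
  -- All signals read in the process are in the sensitivity list
  process(SEL, A, B, C)
  begin
    Y <= '0';  -- default value: Y is assigned on every path, so no latch
    if SEL = "00" then
      Y <= A;
    elsif SEL = "01" then
      Y <= B;
    elsif SEL = "10" then
      Y <= C;
    end if;
  end process;
end RTL;
```

The complete sensitivity list keeps simulation and synthesis consistent, and the default assignment is what prevents the latch.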
Be careful when using relational operators
VHDL allows comparing buses with different numbers of bits.
The VHDL relational operators =, <, >, <= and >= can give very surprising results when the operands do not have the same number of bits.
In addition, synthesis and simulation can have different interpretations, depending on the context.
Conclusion: Make sure both operands have the same bus length.
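One way to make the operand lengths match explicitly is sketched below. It uses the resize function of the numeric_std package and treats both buses as unsigned values, which is an assumption of this example, as are the entity and port names.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Compare an 8-bit bus with a 12-bit bus as unsigned numbers.
-- Names and widths are illustrative only.
entity CMP_BUS is
  port (
    CLK    : in  std_logic;
    A      : in  std_logic_vector(7 downto 0);
    B      : in  std_logic_vector(11 downto 0);
    A_LT_B : out std_logic   -- registered comparison result
  );
end CMP_BUS;

architecture RTL of CMP_BUS is
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      -- A is zero-extended to the length of B before the comparison
      if resize(unsigned(A), B'length) < unsigned(B) then
        A_LT_B <= '1';
      else
        A_LT_B <= '0';
      end if;
    end if;
  end process;
end RTL;
```

Making the extension explicit removes any dependency on how a given package or tool handles operands of different lengths.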
Avoid using un-necessary variable in the synthesizable source code
If not strictly necessary, it’s safer and better to use signals instead of variables:
In most cases, variables have no logical representation in the synthesized netlist or in simulation.
Very often, variables make the synthesizable source code more complex (for synthesis as well as simulation) while providing no benefit.
Conclusion: use signals instead of variables whenever possible.
Don’t declare ports and signals as integers – if not necessary
The packages std_logic_unsigned and std_logic_signed provide direct type conversion, and additional functions. For example you can simply write :
CNT <= CNT + 1; -- CNT is a std_logic_vector
Additional conversion functions are provided within the std_logic_arith package
Conv_integer(std_logic_vector_signal)
Converts the value of a signal declared as a std_logic_vector to its equivalent integer value (interpreted according to the std_logic_unsigned or std_logic_signed package in use).
Conv_std_logic_vector(integer_value, std_logic_vector_number of bits)
Converts an integer value to its equivalent std_logic_vector value. The number of bits is specified as the second argument (according to the std_logic_unsigned or std_logic_signed package in use).
Conclusion: use std_logic_vector unless there is a specific reason not to.
Don’t use inactive or redundant assignments
In VHDL, if no condition is found to assign a new value to a signal, the signal will keep its previous value
In a combinatorial process, a branch that only keeps the previous value of a signal will produce latches and/or combinatorial loops.
In a clocked process if a condition enables some assignments, the dedicated FFs load or clock_enable will be used
Don’t use the null statement if not strictly necessary
Keep your source code as compact as possible, taking advantage of the standard packages and the synthesis rules. In the following simple example, the commented-out else branch adds nothing. Worse, in some more complex cases, such lines could prevent the synthesis tools from optimizing the resulting netlist.

    process(CLK)
    begin
        if rising_edge(CLK) then
            if ENA = '1' then
                SIG_R <= SIG_IN;
            -- else
            --     SIG_R <= SIG_R;
            end if;
        end if;
    end process;
Other tricks and tips for a more compact and re-usable source code
Use (others => ‘0’) when several bits are assigned to the same value.
Example for reset condition :
    REG_DATA <= (others => '0');
Works for any bus length
Example for High impedance output buffers :
    if WRITE_EXT_MEM = '1' then
        DOUT <= REG_DATA;
    else
        DOUT <= (others => 'Z');
    end if;
Use generic parameters whenever your module can be re-used with different parameters. This is particularly useful for memory and DSP functions. Example :
    library IEEE;
    use IEEE.std_logic_1164.all;
    use IEEE.std_logic_unsigned.all;
    use IEEE.std_logic_arith.all;

    entity MY_MEMORY is
        generic (
            Number_of_address_bits : integer := 10;
            Number_of_data_bits    : integer := 16
        );
        port (
            CLK  : in  std_logic;
            DIN  : in  std_logic_vector(Number_of_data_bits-1 downto 0);
            WE   : in  std_logic;
            ADR  : in  std_logic_vector(Number_of_address_bits-1 downto 0);
            DOUT : out std_logic_vector(Number_of_data_bits-1 downto 0)
        );
    end MY_MEMORY;

    architecture ARCHI of MY_MEMORY is
        type MEM_TYPE is array((2**Number_of_address_bits)-1 downto 0) of std_logic_vector(DIN'range);
        signal MEM : MEM_TYPE;
    begin
        process(CLK)
        begin
            if rising_edge(CLK) then
                if WE = '1' then
                    MEM(conv_integer(ADR)) <= DIN;
                else
                    DOUT <= MEM(conv_integer(ADR));  -- DOUT doesn't change during write
                end if;
            end if;
        end process;
    end ARCHI;
This same module can be instantiated several times in your design with different configuration for each instance, by assigning individual parameters sets to each instance (Number_of_address_bits and Number_of_data_bits assigned by generic map).
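As a sketch of such a reuse, the following wrapper instantiates MY_MEMORY twice with different parameter sets. It assumes that MY_MEMORY has been compiled into the work library with CLK, DIN, WE and ADR as inputs and DOUT as an output. The wrapper entity, its ports and the instance labels are hypothetical.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Two instances of MY_MEMORY with different sizes.
-- Wrapper and port names are illustrative only.
entity TWO_MEMORIES is
  port (
    CLK    : in  std_logic;
    WE_A   : in  std_logic;
    ADR_A  : in  std_logic_vector(9 downto 0);
    DIN_A  : in  std_logic_vector(15 downto 0);
    DOUT_A : out std_logic_vector(15 downto 0);
    WE_B   : in  std_logic;
    ADR_B  : in  std_logic_vector(5 downto 0);
    DIN_B  : in  std_logic_vector(7 downto 0);
    DOUT_B : out std_logic_vector(7 downto 0)
  );
end TWO_MEMORIES;

architecture RTL of TWO_MEMORIES is
begin
  -- 1024 x 16 memory: the default generic values are used
  MEM_A : entity work.MY_MEMORY
    port map (
      CLK  => CLK,
      DIN  => DIN_A,
      WE   => WE_A,
      ADR  => ADR_A,
      DOUT => DOUT_A
    );

  -- 64 x 8 memory: both generics are overridden
  MEM_B : entity work.MY_MEMORY
    generic map (
      Number_of_address_bits => 6,
      Number_of_data_bits    => 8
    )
    port map (
      CLK  => CLK,
      DIN  => DIN_B,
      WE   => WE_B,
      ADR  => ADR_B,
      DOUT => DOUT_B
    );
end RTL;
```

The port widths of each instance follow directly from its generic map, so the same source file serves both memories.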
Take advantage of the “std_logic_arith” and “std_logic_unsigned” or “std_logic_signed” package for a more compact and more readable source code, thanks to the implicit and explicit conversion functions.
Note that the package “numeric_std” is also supported by “nxmap”.
Design hierarchy
The design hierarchy must be organized in a logical way. Avoid mixing unrelated functions into the same hierarchical module. It will be easier to assign synthesis options and directives.
Hierarchical modules can be synthesized separately to verify the quality of results such as :
Used logic resources (tile logic, BRAM, DSP blocks…)
Timing performance for the considered module.
Register all the outputs of the hierarchical modules – if performance is required
Naming rules of entities, component labels and signals:
Good and clear naming rules greatly improve readability
Short but clear names ease recognition in debug and verification tools (timing analyzer, NXmap GUI, simulation)
However, names that are too short can lead to confusion or make it harder to recognize some elements (signals, entities, components…)
Be careful with reserved words or other words commonly used (WRITE, READ, BUS, COUNT…).
During the synthesis process, the FF output signals are automatically renamed by the tools, by adding “_reg” to the original name defined in the source code. As an example, a signal called “DATA_CHANNEL” in the source code will be renamed “DATA_CHANNEL_reg” after synthesis if it’s generated by a FF or group of FFs. Be careful not to name any other non-registered signal “DATA_CHANNEL_reg”, to avoid possible post-synthesis conflicts.
Signal names should reflect their polarity. For example:
RST is an active high signal (resets the FFs when it goes high), while RST_N is active low.
LOAD is active high (loading occurs when LOAD is high), while LOAD_N is active low.
Inference vs instantiation:
Inference advantages and limitations
The inference describes the behavior of the functions to be synthesized and implemented, with standard HDL description. As a result, the source code is portable and can be used with other architecture or tools.
Synthesis and mapping options: Most high performance NG-MEDIUM cells can be inferred with very simple source code and implemented as desired. However, by using some mapping options, the user can have better control over the synthesis and mapping processes, provided that the described functionality matches the behavior of the FPGA elements. This is particularly true for RAM inference, which can be implemented either with RAM blocks or with Register_file (RF). See the “addMappingDirective” in the NanoXplore NXmap Python API documentation for more information.
However, some NG-MEDIUM built-in functions cannot be inferred. This is the case for example of the PLL, WaveformGenerators using patterns, some RAM and DSP blocks configurations, high performance IOs using DDR or SERDES.
In such cases, it might be necessary to instantiate the primitives.
Instantiation of NG-MEDIUM primitives
NG-MEDIUM primitives can also be instantiated:
Register_file
RAM blocks
DSP blocks
PLL and WFG
Although memories (Register_file and BRAM) can be easily and efficiently inferred with simple and portable source code for most common functions, the user might prefer to instantiate the elements. The source code is then not portable, but the user gets access to some features that are not necessarily accessible by inference.
In addition, some primitives cannot be inferred. They can be used only by instantiation.
PLL most often combined with WaveForm Generator(s)
Dual port RAM with different bus sizes on both ports
Some DSP blocks configurations
Other primitives
See the “Library guide” for more information.
NG-MEDIUM architecture survey
Before starting your design, it’s important to acquire a proper understanding of the NG-MEDIUM architecture. From the user’s point of view, the architecture can be separated into three main blocks:
The user’s IO ring (organized in IO banks). Includes flexible single ended and/or differential IOs, DDR registers, SpaceWire compatible interfaces IO blocks, and many more features (input and output calibrated delay lines, output serializers, input de-serializers…).
The IO ring also includes 4 clock generators called CKG (PLL + WaveForm Generators), one in each corner of the die.
The FPGA core logic offers
Tile logic for combinatorial or registered functions, arithmetic, register_files (64 x 16-bit synchronous simple dual port memories – with EDAC)
Flexible 48Kbit synchronous true dual port memory blocks (including user-selectable EDAC)
DSP blocks for high performance complex DSP functions
The FPGA configuration logic and dedicated IO interface
Please, refer to the next chapters as well as NG-MEDIUM data sheet for detailed information.
NG-MEDIUM inputs and outputs (IOs)
The NG-MEDIUM user’s IOs are organized in 13 IO banks. Each IO bank has a single power supply (Vddio) shared by all the IOs of the bank.
The top and bottom banks are called “complex”. The complex IO banks provide more flexibility and performance than the left and right banks that are called “simple”.
All IOs can be configured as input, output or bi-directional IO.
The next figures show the IO banks location, and the numbering of the die IO blocks. For physical pin numbers on the selected package, please consult the NG-MEDIUM datasheet. Currently, NG-MEDIUM is available in three different packages:
LGA625 : Land-Grid array 625 pins
CGA625 : ceramic column-Grid array 625 pins
CQFP352 : ceramic Quad-flat package 352 pins
In addition another bank located to the left of the FPGA die is used for FPGA configuration. This document doesn’t cover the configuration process. Please, refer to the NG-MEDIUM datasheet for detailed information about the FPGA configuration modes and pins.
Available user’s IOs
LGA/CGA 625 packages:
All the 374 die user’s IOs are available
Bank | Type | I/Os | Location | Bank | Type | I/Os |
0 | Simple | 22 | Left | 1 | Simple | 22 |
2 | Complex | 30 | Bottom | 3 | Complex | 30 |
4 | Complex | 30 | Bottom | 5 | Complex | 30 |
6 | Simple | 30 | Right | 7 | Simple | 30 |
8 | Simple | 30 | Right | |||
9 | Complex | 30 | Top | 10 | Complex | 30 |
11 | Complex | 30 | Top | 12 | Complex | 30 |
LGA/CGA 625 packages – available IOs
CQFP 352 package:
Only 192 of the 374 die user’s IOs are available
Bank | Type | I/Os | Location | Bank | Type | I/Os |
0 | Simple | 14 | Left | 1 | Simple | 12 |
2 | Complex | 30 | Bottom | 3 | Complex | - |
4 | Complex | - | Bottom | 5 | Complex | 30 |
6 | Simple | 22 | Right | 7 | Simple | - |
8 | Simple | 24 | Right | |||
9 | Complex | 30 | Top | 10 | Complex | - |
11 | Complex | - | Top | 12 | Complex | 30 |
CQFP 352 package – available IOs
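When planning a pinout, it can help to keep the two tables above in a small data structure. The following Python sketch transcribes the per-bank IO counts from those tables; the names AVAILABLE_IOS, total_ios and usable_banks are hypothetical helpers introduced only for illustration.

```python
# Per-bank user IO counts, transcribed from the two tables above
# (a "-" entry in the CQFP352 table is stored as 0).
AVAILABLE_IOS = {
    "LGA/CGA625": {0: 22, 1: 22, 2: 30, 3: 30, 4: 30, 5: 30, 6: 30,
                   7: 30, 8: 30, 9: 30, 10: 30, 11: 30, 12: 30},
    "CQFP352":    {0: 14, 1: 12, 2: 30, 3: 0, 4: 0, 5: 30, 6: 22,
                   7: 0, 8: 24, 9: 30, 10: 0, 11: 0, 12: 30},
}
COMPLEX_BANKS = {2, 3, 4, 5, 9, 10, 11, 12}  # bottom and top banks


def total_ios(package):
    """Total number of user IOs bonded out on the given package."""
    return sum(AVAILABLE_IOS[package].values())


def usable_banks(package, complex_only=False):
    """Banks with at least one available IO, optionally complex banks only."""
    return [bank for bank, count in sorted(AVAILABLE_IOS[package].items())
            if count > 0 and (not complex_only or bank in COMPLEX_BANKS)]


print(total_ios("LGA/CGA625"), total_ios("CQFP352"))
print(usable_banks("CQFP352", complex_only=True))
```

The totals reproduce the 374 and 192 figures quoted above, and the last call shows which complex banks remain usable on the CQFP352 package.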
Simple and complex banks IO features
Each IO is composed of two different and complementary elements:
The IO buffers (primitives NX_IOB_x): can be used as input, output or bi-directional. Single ended I/Os have a default 10K to 40K PullUp. The IO buffers can be configured to work in a wide range of single ended and differential electrical standards (LVCMOS, SSTL, HSTL, and LVDS). The IO buffers can be instantiated or inferred. They can also be parametrized in the “nxpython” script to adapt their electrical configuration to the board design requirements. The parameters include:
Output drive
Slew rate (for LVCMOS outputs)
Turbo mode (faster LVCMOS input)
Optional 2K to 6K PullUp
Optional adjustable termination (only for complex banks – SSTL, HSTL, LVDS)
The sequential input and output elements (input, output and tri-state control FF). They include the following elements :
Single flip-flop on input, output and tri-state paths
Optional adjustable delay lines on the input, output and tri-state paths (0 to 63 x 160 ps delay)
The NG-MEDIUM IO ring is segmented into 13 IO banks. The left (B0 and B1) and the right (B6, B7 and B8) IO banks offer flexible but limited features. They are called “simple” banks.
In contrast, the top (B9, B10, B11 and B12) and bottom banks (B2, B3, B4 and B5) are called “complex” banks and offer more electrical and functional features.
The following table summarizes the main IO features available in complex and simple IO banks.
Feature | Complex | Simple |
Number of IOs | 30 | 30/22 |
Power supply (Vddio) | 1.8, 2.5 or 3.3v | 1.8, 2.5 or 3.3v |
Supported IO standards | LVCMOS, SSTL, HSTL, LVDS | LVCMOS, LVDS |
Single DFF (in, out, tri-state) | Yes | Yes |
Differential SSTL/HSTL | Yes | No |
LVDS | Yes | Yes (no internal input termination) |
Resistive input termination | Yes | No |
Programmable input/output delay | Yes | Yes |
CDC (Clock Domain Changer) | Yes | No |
Shift Register | Yes | No |
DDR mode | Yes | No |
SpaceWire | Yes | No |
Simple and complex banks IO features summary
Simple banks: Electrical standards and supported electrical parameters
Standard | Type | Bank supply | Drive | Speed | Special considerations |
LVCMOS 3.3V | SE (*) | 3.3 V | 2–16 mA | 100Mb/s | In NG-MEDIUM, single ended I/Os have an internal default 10K to 40K PullUp. In addition, the user can activate an optional lower value (2K to 6K) PullUp. Slew rate SLOW/FAST for outputs. Turbo mode for inputs (faster inputs at the cost of higher static power). (Those electrical parameters can be set by constraints in a script file.) |
LVCMOS 2.5V | SE (*) | 2.5 V | 2–16 mA | 300Mb/s | |
LVCMOS 1.8V | SE (*) | 1.8 V | 2–16 mA | 300Mb/s | |
LVDS 2.5V | DIF(*) | 2.5 V | 3.5mA | 800Mb/s | No internal termination available |
Simple IO banks electrical parameters and performance
(*) SE = single ended, DIF = differential
Complex banks: Electrical standards and supported electrical parameters
Standard | Type | Supply | Drive | Speed | Special considerations | Notes |
LVCMOS 3.3V | SE | 3.3 V | 2–16 mA | 100MHz | In NG-MEDIUM, single ended I/Os have an internal default 10K to 40K PullUp. In addition, the user can activate an optional lower value (2K to 6K) PullUp. Slew rate SLOW/MEDIUM/FAST for outputs. Turbo mode for inputs (faster inputs at the cost of higher static power). (Those electrical parameters can be set by constraints in a script file.) | |
LVCMOS 2.5V | SE | 2.5 V | 2–16 mA | 300MHz | ||
LVCMOS 1.8V | SE | 1.8 V | 2–16 mA | 300MHz | ||
SSTL_2.5V_I | SE(*) | 2.5 V | 8 mA | 600Mb/s | Controlled source impedance SSTL and HSTL standards require using dedicated VTO pins (~VDDIO/2) in respective IO banks SSTL and HSTL support differential mode | DDR SDRAM |
SSTL_2.5V_II | SE(*) | 2.5 V | 13 mA | 600Mb/s | ||
SSTL_1.8V_I | SE(*) | 1.8 V | 8 mA | 600Mb/s | DDR2 SDRAM | |
SSTL_1.8V_II | SE(*) | 1.8 V | 13 mA | 600Mb/s | ||
HSTL_1.8V_I | SE(*) | 1.8 V | 8 mA | 800 Mb/s | DDR2 SDRAM | |
HSTL_1.8V_II | SE(*) | 1.8 V | 16 mA | 800 Mb/s | ||
LVDS 2.5V | DIF(*) | 2.5 V | 3.5mA | 800Mb/s | Embedded optional impedance adaptation |
Complex IO banks electrical parameters and performance
IO Standards usage
LVCMOS
Among electrical settings for LVCMOS standards (simple and complex banks):
Drive current : can be set to 2, 4, 8 or 16 mA
Slow or fast slew rate
Turbo mode : for faster input buffers (at the cost of higher power consumption)
Optional 2K to 6K PullUp.
Those electrical parameters must be defined in a “nxpython” script file.
SSTL and HSTL
SSTL and HSTL IO standards are supported only on complex banks
Electrical settings for SSTL and HSTL:
Complex banks IOs can be configured to SSTL or HSTL standards
The output buffers are current drivers and do not require external adaptation. However, the receiver side must provide the appropriate impedance adaptation.
The IOs using SSTL or HSTL standards must use a reference voltage for the input comparators. Vref (typically VDDIO/2) is generated internally – on complex banks exclusively. (Not available on simple banks).
Internal input impedance adaptation: SSTL and HSTL standards require using the VTO pins of the IO banks to make the internal input impedance adaptation possible. The VTO pins must be connected to an external voltage (usually VDDIO/2), able to sink/source up to 16 mA per IO configured for internal impedance adaptation.
See the next table for VTO recommended voltage values :
VDDIO | VTO nom (+/- 5%) |
1.8 V | 0.9 V |
2.5 V | 1.25 V |
3.3 V | 1.65 V |
VTO range vs VDDIO
For SSTL_2.5V, SSTL_1.8V and HSTL_1.8V, VDDIO 2.5 V and/or 1.8 V, on-chip termination can be activated for all pads in a given bank.
In the case of SSTL_2.5V (VDDIO 2.5 V), on-chip termination can be activated on a maximum of 11 pads, due to the limitation of the VTO power supply rails.
The value of the input impedance adaptation resistors can be adjusted in the “nxpython” script file by assigning an integer value from 0 to 15 to the “termination” parameter (see figure 12 for the required value), or it can be specified as a value in Ohms. See the NanoXplore “nxmap” Python API documentation for syntax details.
The graphs in figure 12 show the expected resistor values as a function of the parameter assignment and VDDIO.
VDDIO / Impedance | 1.8 V | 2.5 V | 3.3 V |
50 Ohms | 5 | 4 | 5 |
75 Ohms | 10 | 9 | 11 |
100 Ohms | 15 | 15 | 15 |
Recommended “termination” parameter values for 50, 75 and 100 Ohms impedance
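The two tables above (VTO versus VDDIO, and the recommended “termination” values) can be combined into a small lookup when preparing the pad constraints. The Python sketch below is a minimal illustration: the helper names are hypothetical, the ':N' string format follows the ‘termination’ examples given in this document, and the 16 mA per IO figure is the VTO sink/source requirement stated above.

```python
# Recommended "termination" codes, keyed by (VDDIO in volts, impedance in Ohms),
# transcribed from the table above.
TERMINATION_CODE = {
    ("1.8", 50): 5,   ("2.5", 50): 4,   ("3.3", 50): 5,
    ("1.8", 75): 10,  ("2.5", 75): 9,   ("3.3", 75): 11,
    ("1.8", 100): 15, ("2.5", 100): 15, ("3.3", 100): 15,
}
VTO_MA_PER_IO = 16  # worst-case VTO current per internally terminated IO


def termination_setting(vddio, impedance_ohms):
    """Return the integer-style 'termination' string, e.g. ':4'."""
    return ":%d" % TERMINATION_CODE[(vddio, impedance_ohms)]


def vto_window(vddio, tolerance=0.05):
    """Return (min, nominal, max) VTO in volts for VTO = VDDIO/2 +/- 5%."""
    nominal = float(vddio) / 2
    return nominal * (1 - tolerance), nominal, nominal * (1 + tolerance)


def vto_current_budget_ma(terminated_ios):
    """Current the external VTO source must be able to sink or source."""
    return terminated_ios * VTO_MA_PER_IO


print(termination_setting("2.5", 50))
print(vto_window("1.8"))
print(vto_current_budget_ma(11))
```

The resulting string can be used as the ‘termination’ value of the corresponding pad, and the current budget gives a first estimate for sizing the external VTO supply.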
SSTL and HSTL differential modes:
Input: The differential SSTL and HSTL modes use a differential comparator. The input impedance adaptation resistors are connected together to VTO.
Note: VREF is not used in SSTL and HSTL differential modes
Output: The differential mode uses both output buffers of the same IO pair.
Bidirectional SSTL/HSTL
LVDS:
LVDS electrical standard is allowed on both simple and complex IO banks
However, only complex IO banks can have internal input termination.
For LVDS, input impedance adaptation does not require the VTO pins, because the impedance adaptation resistor is connected between the two complementary pads.
As for the single ended standards, the input impedance is adjusted by setting the parameter ‘termination’: ‘50’ (value in Ohms) in the script file. Note that if, for example, the resistor value is 50 Ohms, the total value in differential mode will be 50 + 50 Ohms = 100 Ohms.
As LVDS output buffers are current drivers, output termination is not required. Simple and complex bank IOs can implement LVDS output buffers with no other requirement than setting the “LVDS_2.5V” standard in a script file.
Basic IO logical structure (simple and complex banks)
All IOs provide a common set of useful features, in simple and complex IO banks.
The input path can be combinatorial or registered
The output path including the tri-state control can also be combinatorial or registered.
On simple banks, all IO’s internal FFs have dedicated RST (synchronous or asynchronous reset), and LD (LoaD enable) input pins.
On the complex banks, the DFFs have a dedicated synchronous or asynchronous SET or RST pin.
Any one of those logical configurations of the IOs can be inferred by “nxmap”. No instantiation is needed.
All electrical parameters can be specified in a script file :
‘location’: IO location identification (ex : ‘IO_B10D09P’ or ‘IOB10_D09P’)
‘type’: IO standard and output drive (ex : ‘LVCMOS_3.3V_4mA’)
‘weakTermination’: none, pullup
‘slewRate’: ‘slow’, ‘medium’ or ‘fast’
‘termination’: Value in Ohms. (ex : ’50’), or integer 0 to 15 (ex : ‘:13’)
‘turbo’ mode for input buffers : ‘true’ or ‘false’
‘inputDelayLine’: integer. Number of delay steps
‘outputDelayLine’: integer. Number of delay steps
Simple banks advanced IO configuration
Additional functional features for all (simple and complex) IO banks
Input delay line: delays the incoming signal by means of a 64-step delay line (steps of 160 ps).
Output delay line: delays the pad output signal by means of a 64-step delay line (steps of 160 ps).
Tri-state control delay line: delays the tri-state control signal by means of a 64-step delay line (steps of 160 ps).
Note: The input, output and tri-state control delay lines can be defined in the “nxpython” script file.
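Since the delay lines are programmed as a number of 160 ps steps, a target delay has to be converted to the nearest step count. The following Python sketch shows that conversion; the function names are hypothetical, and only the 160 ps step and the 0 to 63 range come from this document.

```python
STEP_PS = 160   # delay of one step, in picoseconds
MAX_STEPS = 63  # highest programmable step count


def delay_ps(steps):
    """Delay in ps obtained for a given step count (0 to 63)."""
    if not 0 <= steps <= MAX_STEPS:
        raise ValueError("step count must be in the range 0 to 63")
    return steps * STEP_PS


def nearest_steps(target_ps):
    """Step count closest to the requested delay, clamped to 0..63."""
    steps = (target_ps + STEP_PS // 2) // STEP_PS  # integer round-half-up
    return max(0, min(MAX_STEPS, steps))


steps = nearest_steps(1000)      # request 1 ns
print(steps, delay_ps(steps))    # step count and the delay actually obtained
print(delay_ps(MAX_STEPS))       # largest delay available
```

The returned step count is the integer expected by the ‘inputDelayLine’ and ‘outputDelayLine’ parameters.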
SERializers and DESerializers on complex banks
Introduction
The NG-MEDIUM complex I/O banks provide serializer and deserializer features.
Each serializer or deserializer can use a serialization factor of 2, 3, 4 or 5.
In addition, in each I/O pair, the serializer/deserializer associated with the “_P” pad can be chained with its neighbor (associated with the “_N” pad), thus allowing serialization/deserialization factors of 6, 7, 8, 9 and 10.
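The rule above (factors 2 to 5 with a single serializer/deserializer, factors 6 to 10 by chaining the “_P” and “_N” elements of a pair) can be captured in a short check. This Python sketch is illustrative only; the function name and the returned dictionary layout are assumptions.

```python
def serdes_plan(factor):
    """Tell whether a serialization factor needs the _P/_N chaining."""
    if 2 <= factor <= 5:
        return {"factor": factor, "chained": False}
    if 6 <= factor <= 10:
        # Both elements of the I/O pair are used, so the _N pad of the
        # pair is no longer available for an independent SERDES.
        return {"factor": factor, "chained": True}
    raise ValueError("serialization factor must be in the range 2 to 10")


print(serdes_plan(4))
print(serdes_plan(8))
```

A chained configuration consumes the serializer/deserializer resources of both pads of the pair, which should be taken into account when assigning pad locations.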
The serializer/deserializer data path requires two clocks: bit clock (Fast clock – FCK) and word clock (Slow clock – SCK).
Serializers include optional output delay lines for both output and tri-state command. Although output delay lines can be dynamically controlled, serializer delays are usually configured in static mode – most often no delay.
Deserializers require a proper data/clock alignment mechanism for safe sampling, as well as word alignment to recover the original words. Data/clock alignment requires a dynamic control of the delay lines to adjust the phase relationship of the sampled data and the fast clock. This procedure is called Dynamic Phase Adjustment (DPA). It requires a training sequence.
NG-MEDIUM complex IO banks provide hardware support for DPA. The dynamic control of the delay lines requires an additional clock to read or write into the delay registers. This clock is called DCK. It can be synchronous or asynchronous with the data path clocks SCK and FCK.
All I/O related delay lines have 0 to 63 x 160 ps steps delays.
SERDES architecture overview
The serializers/deserializers architecture contains two main blocks :
Data path
Delay control path
DPA : Dynamic Phase Adjustment
NG-MEDIUM architecture provides hardware support for Dynamic Phase Adjustment on NX_DES. The following describes how to implement the adjustment procedure.
In the complex banks, all I/Os include three user-selectable and adjustable delay lines, whose delay registers are selected with the “DS(1:0)” sub-address input. Those registers can be read and written with a simple microprocessor-like interface.
Each NX_DES or NX_SER includes three delay lines, respectively for the output (and tri-state control), input path and DPA path delays. The delay value on each one of those three paths (number of 160 ps delay taps) is defined by the value written into the corresponding delay register.
The output (and tri-state control) delay register controls the delay inserted on the output data path.
The input delay register controls the delay inserted between the input pad and the input register.
The DPA delay register controls the delay inserted between the input pad and the DPA input register (NX_DES only).
The DPA logic generates flags (FLD and FLG) that report the relative phase of the data input and the fast clock. Note that those flags are available only if the DIG input is high (DIG is the active low multicast input that allows writing simultaneously to the DS(1:0) selected registers of all I/Os in the same complex bank):
FLD : this flag goes high when a transition on the data line at the output of the DPA delay line, occurs between the falling and the rising edge of the sampling clock (FCK, fast clock).
FLG : this flag goes high when a transition on the data line at the output of the DPA delay line, occurs between the rising and the falling edge of the sampling clock (FCK, fast clock).
FZ : Active low flags reset
By modifying the value of the DPA delay line and monitoring the FLD and FLG flags, the data/clock relationship can be analyzed, and the optimal delay value to be written to the input delay register of NX_DES can then be deduced.
Important note : Particularly for NX_DES, the delay calibration process is possible only if the inputDelayLine is assigned an empty string (no character or space between the quotes) and the DIG input remains high.
Example : inputDelayLine => "",
The following figures show the FLD and FLG flags behavior versus the transition detection on the DPA delay line output.
Write and read accesses to the delay registers can be easily managed with the following signals :
DCK : delay registers clock (can be asynchronous with SCK/FCK). Usually 2 to 20 MHz. Write operations occur on DCK rising edge.
DID(4:0) : address identifier of the considered I/O in the complex bank (0 to 29).
DRA(4:0) : address of the I/O in the considered complex bank (0 to 29). Note that when DRA = DID, the DRO outputs, as well as the FLD and FLG flag outputs of the considered I/O, go to low impedance (so that they can be read by the fabric).
DS(1:0) : selects the destination register in the I/O selected by DRA. See next table for details.
DS value | Selected delay register |
00 | Output (and tri-state control) delay register |
01 | Input delay register |
10 | DPA delay register |
11 | Reserved |
DRI(5:0) : value to be written into the selected register.
DRL : active high load (write enable)
DIG : active low multicast write. Must remain high for register by register access, and corresponding FLD / FLG activation.
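To make the register access concrete, the following Python sketch computes the values that would be presented on DRA, DS, DRI, DRL and DIG for a single (non-multicast) write, to be captured on the next DCK rising edge. It is a bookkeeping aid, not a hardware driver: the function name and the dictionary representation are assumptions, while the field widths, the DS encoding and the signal polarities come from the descriptions above.

```python
# DS(1:0) encoding, from the table above ('11' is reserved).
DS_CODES = {"output": 0b00, "input": 0b01, "dpa": 0b10}


def delay_register_write(io_index, register, taps):
    """Bit patterns for one register-by-register delay write."""
    if not 0 <= io_index <= 29:
        raise ValueError("I/O address must be in the range 0 to 29")
    if not 0 <= taps <= 63:
        raise ValueError("delay value must be in the range 0 to 63")
    return {
        "DRA": format(io_index, "05b"),            # 5-bit I/O address
        "DS": format(DS_CODES[register], "02b"),   # register selection
        "DRI": format(taps, "06b"),                # 6-bit delay value
        "DRL": "1",                                # active high load
        "DIG": "1",                                # high: no multicast
    }


print(delay_register_write(12, "dpa", 17))
```

Sweeping the “dpa” register with successive values while reading FLD and FLG is the basis of the adjustment procedure described above.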
For deserialization, NanoXplore provides an IP Core that automatically adjusts the delay lines in order to properly align the sampled data with the fast clock. It also provides word alignment.
NanoXplore recommends using the NXcore IP Core generator for SER/DES application development.
IO features inference and instantiation:
Inference:
The most common IO configurations can be inferred, provided that the electrical parameters (IO standard, drive current and other electrical parameters) are specified together with the pad locations in a script file.
Direct input, output and tri-state control for single ended and differential IOs (not registered)
Registered input, output and/or tri-state control (by setting the “MergeRegisterToPad” constraint in the nxpython “setOptions” command)
The electrical standard (LVCMOS, SSTL, HSTL, LVDS…) and other electrical parameters are specified by the user in a Python script file, and can also be specified in the VHDL source code, using the “NX_PORT” attribute (see the chapter 8.2 for specific information about supported synthesis attributes).
Instantiation:
NanoXplore provides several primitives for direct I/O resources instantiation :
NX_IOB : bi-directional I/O
NX_IOB_I : Input only (pad to fabric)
NX_IOB_O : output only (fabric to pad)
NX_SER : complex bank SERializer I/O
NX_DES : complex bank DESerializer I/O
Please refer to the NanoXplore_Library_Guide for detailed information.
IO blocks assignments with “nxpython” script
Location and electrical parameters: addPad(s)
“nxpython” provides a way to define the IO electrical standard in a script file, as well as the pin location and additional electrical parameters.
In addition, the pads assignment can also be done in the VHDL source code by using the “NX_PORT” attribute. Please, refer to the chapter “Synthesis attributes” for more information and usage examples.
The script commands can be written in two ways: simplified and detailed. Single pads can be assigned with the “project.addPad” command, while “project.addPads” allows assigning multiple pads.
Simplified pads assignment: This command defines the IO location as well as the I/O standard, including the drive current (the drive current applies only to outputs).
Pad_Name : Pad_Location, IOStandard
Example:
project.addPads({
    'CNT_R_160[0]' : ('IOB1_D01P', 'LVCMOS_2.5V_2mA'),
    'CNT_R_160[1]' : ('IOB1_D02P', 'LVCMOS_2.5V_2mA'),
    'CNT_R_160[2]' : ('IOB1_D03P', 'LVCMOS_2.5V_2mA'),
    'CNT_R_160[3]' : ('IOB1_D04P', 'LVCMOS_2.5V_2mA')
})
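For wide buses, the dictionary passed to “project.addPads” can be generated instead of typed by hand. The Python sketch below rebuilds the dictionary of the example above; bus_pads is a hypothetical helper, and it assumes that consecutive bits are placed on consecutive “P” pads of one bank, which is a choice of this illustration and not a tool requirement.

```python
def bus_pads(signal, bank, first_pad, width, io_standard):
    """Build a {pad name: (location, standard)} dictionary for a bus."""
    return {
        "%s[%d]" % (signal, bit): (
            "IOB%d_D%02dP" % (bank, first_pad + bit),  # e.g. 'IOB1_D01P'
            io_standard,
        )
        for bit in range(width)
    }


pads = bus_pads("CNT_R_160", bank=1, first_pad=1, width=4,
                io_standard="LVCMOS_2.5V_2mA")
for name, (location, standard) in pads.items():
    print(name, location, standard)

# Inside an nxpython script, the dictionary would then be passed to the
# command shown in the example above:
# project.addPads(pads)
```

Generating the dictionary keeps the bit order and the pad numbering consistent when the bus width changes.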
To specify the slew rate of the output IOs, and additional electrical parameters, the detailed version of this command must be used.
Detailed pads assignment :
Pad_Name : (Pad_Location, IOStandard & drive, weakTermination, Termination, slew_rate, turbo, differential, input_delay_line, output_delay_line…)
Example:
project.addPad('padName',{
    'location' : 'IOB0_D01P',
    'type' : 'LVCMOS_2.5V_2mA',          # String
    'weakTermination' : 'None',          # String : 'PullUp' or 'None'
                                         # Only for NG-LARGE and NG-ULTRA
    'slewRate' : 'Medium',               # String : 'Slow', 'Fast'
    'termination' : ':0',                # String : ':0' = No Termination
                                         # ':1' to ':15' see Fig 12
                                         # or '50' for value in Ohms
    'inputDelayLine' : 0,                # Integer : 0 to 63
    'outputDelayLine' : 0,               # Integer : 0 to 63
    'differential' : False,              # Boolean : True
    'terminationReference': 'Floating',  # String : Floating or VT
    'turbo' : False,                     # Boolean : True
    'inputSignalSlope' : 2,              # Integer 1 to 20 (V/ns)
    'outputCapacity' : 25                # Integer 0 to 20 (pF)
})
“nxmap” assigns the same value to “tristateDelayLine” and “outputDelayLine”.
“inputSignalSlope” and “outputCapacity” do not modify any IO parameter, but are taken into account for timing analysis.
IO flip-flops: setOptions (‘MergeRegisterToPad’)
The flip-flops directly connected to the ports of the design top (input, output or bidirectional ports) can be mapped onto the IOs, upon user’s constraint in a “nxpython” script file, by using the setOptions 'MergeRegisterToPad' constraint.
project.setOptions({
    'UseNxLibrary': 'Yes',
    'MergeRegisterToPad': 'Always',   # 'Never', 'Input', 'Output'
    'MultiplierToDSPMapThreshold': '1',
    'ManageUnconnectedOutputs': 'Ground',
    'ManageUnconnectedSignals': 'Ground',
    'AdderToDSPMapThreshold': '1',
    'DefaultRAMMapping': 'RAM',       # 'RAM_ECC', 'RF'
    'MappingEffort': 'High'})
NG-MEDIUM clocks
Semi-dedicated clock inputs
The next figure shows the die location of the clock management resources and the pin location of the specialized clock inputs. The clock inputs can be single ended or differential on simple as well as in complex banks. The internal resistive termination is available only in the complex banks.
When using a single ended clock input (not differential), the P pin must be used. Example in bank 12: Use IOB12_D08P for a single ended clock input, and the pair IOB12_D08P + IOB12_D08N for a differential input.
The clocks are distributed by using a low skew network to guarantee homogeneous clock distribution over the FPGA die. There are two clock regions (left and right) into the FPGA fabric. Up to 12 low skew signals can be distributed in each clock region.
Using semi-dedicated clock inputs guarantees optimized access to the global low skew network, keeping the FPGA internal clock distribution delay as low as possible while offering very low skew.
There are two methods to efficiently use the clock distribution network
Inference: when the clock input pins are not specified by the user in the synthesis/implementation directives, the “nxmap” implementation tools automatically assign the clock inputs to available semi-dedicated pins, and distribute them by using the low skew network.
If the user specifies the pin number for the clock input(s), it must correspond to one of the semi-dedicated pins.
A WaveForm Generator (WFG) is automatically inferred for each clock. In this case, the WFGs are used as clock buffers.
Instantiation: If the usage of PLL is required to generate internal clocks, or simply to reduce the internal clock propagation delay, the user must instantiate the PLL(s) and WaveForm Generator(s). See “Using clock management resources” and/or “NX Library Guide.pdf” for detailed information.
Clock management and distribution
NG-MEDIUM architecture provides a robust and flexible clock management and distribution scheme.
Low skew routing network overview
Up to 12 low skew networks are available in each one of the two clock regions.
The “nxmap” implementation tools use primarily the low skew network for clock distribution, to build the clock(s) tree(s).
Alternately, some other high fanout signals can also be distributed by using the low skew network. This is the case for example of RESET or LOAD_ENABLE signals. The RST and LE pins of the FEs, BRAMs and DSP blocks have a direct connection to the low skew network.
The low skew network is distributed inside the NG-MEDIUM fabric by means of a clock switch, located in the center of the FPGA, to ensure homogeneous distribution over the die.
The clock switch is fed by the clock sources and other high fanout signals. Then those signals are distributed to the FPGA core and I/O banks, using the low skew network.
Up to 12 low skew branches are available in each clock region.
CKG blocks (ClocK Generators)
There are 4 Clock Generator blocks in NG-MEDIUM, one in each corner. Each CKG includes one PLL and 8 WaveForm Generators (WFG).
The PLL can be used to generate internal clocks at frequencies that are based on the REF clock input (with multiplication and/or division factors).
The WaveForm Generators are primarily used as buffers to reach the low skew network, but offer additional flexibility to change the clock polarity or generate up to 16-tap clock patterns. A user-selectable and adjustable calibrated delay is also available in the WFGs.
The WFG are organized as three groups with specific connectivity on their outputs.
“nxmap” synthesis and implementation tools automatically map the appropriate WFGs and associated low skew routing resources, depending on the design requirements and user’s constraints. See chapter Clocks distribution details.
WFG (WaveForm Generator)
The WaveForm Generators are primarily used as buffers to reach the low skew network. There are eight WFGs (and one PLL) in each corner of the FPGA.
The main input (called ZI) of the WFG can be fed by:
Semi-dedicated clock inputs
PLL outputs
The WFG main output (ZO) is directly connected to the low skew routing clock distribution resources. They are organized as three groups with specific connectivity on their outputs:
Three WFG_C#: connection of the WFG outputs to the NG-MEDIUM core (fabric and IO banks). Typically used for global clocks.
Three WFG_M# : connection of the WFG outputs to the core (fabric and IO banks) via the central clock switch, or direct connection to the 4 complex IO banks FFs in the same half of FPGA (top or bottom). Can be used for additional global clocks or for fast clocks on complex IO banks
Two WFG_R# : fast connection to the 2 neighboring complex IO banks FFs in the same FPGA corner (Top-left, top-right, bottom-left or bottom-right)
“nxmap” synthesis and implementation tools automatically map the appropriate WFGs and associated low skew routing resources, depending on the design requirements and user’s constraints.
The WFG can also be used to generate user-defined patterns and/or provide a programmable phase shift (delay line).
Selectable and programmable 2 to 16-tap pattern generator: by defining a pattern length and a 16-bit pattern value, the WFG can be used as clock divider or user’s defined pattern generator (up to 16 steps length).
The patterns of multiple WFGs in the same corner can be synchronized together, by connecting the SI input (Synchronization Input) of the slave WFGs to the master SO output. Note that all the WFGs (master and slaves) must have a pattern of the same length. They also need to use the same clock polarity.
Additionally, a selectable and programmable delay chain can be inserted on the output path (0 to 63 taps of 160 ps).
See the NX Library Guide for more detailed information.
The next figure shows a simplified diagram of the WFG. When the pattern generator is not used, the clock can still be forwarded to the low skew network (clock trees), after an optional polarity change and/or a user-selectable and adjustable delay (0 to 63 delays of 160 ps).
The WFG pattern generator is built around a 4-bit counter, a 4-bit comparator and a 16-bit ROM to store the user’s defined pattern.
The counter is usually reset by the RDY pin (while the PLL is not yet locked). Note that the RDY output pin of the PLL is generated synchronously with the PLL output clocks.
When RDY is asserted (the PLL is locked), the internal counter starts counting until it reaches the “pattern_end” value, and the SO (Synchronization Output) goes high to restart the counter. In the case of a WFG generating the clock feedback of a PLL, the RDY pin must be left unconnected to allow the clock propagation before the PLL is locked.
The SO (Synchronization Output) is directly connected to the SI (Synchronization Input).
When using several WFG with patterns, it can be necessary to synchronize all the pattern generators. One of the WFG is chosen as “master” to synchronize all the “slaves”, by connecting the master SO output to all slaves (and master) SI inputs.
To make the WFG synchronization possible, all of them must use a pattern with the same length. Each pattern configuration will generate its own waveform.
Any one of the WFG of a same group can be chosen as “master”. Its SO output must feed the SI input of all the synchronized WFG, master included.
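The synchronization conditions stated in this chapter (same pattern length and same clock polarity for the master and all slaves) are easy to verify on the intended generic values before writing the VHDL. The following Python sketch assumes the WFG settings are kept in plain dictionaries using the generic names “pattern_end” and “wfg_edge”; the function name and the example WFG names are hypothetical.

```python
def wfg_sync_problems(wfgs):
    """List the reasons why a group of WFGs cannot be synchronized."""
    problems = []
    if len({cfg["pattern_end"] for cfg in wfgs.values()}) > 1:
        problems.append("pattern lengths differ")
    if len({cfg["wfg_edge"] for cfg in wfgs.values()}) > 1:
        problems.append("clock polarities differ")
    return problems


group = {
    "WFG_MASTER": {"pattern_end": 9, "wfg_edge": 0},
    "WFG_SLAVE1": {"pattern_end": 9, "wfg_edge": 0},
    "WFG_SLAVE2": {"pattern_end": 7, "wfg_edge": 0},
}
print(wfg_sync_problems(group))
```

An empty list means that the group satisfies both conditions; the SO output of the chosen master can then feed the SI inputs of all the WFGs of the group, master included.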
PLL
Four PLLs (one in each corner) allow efficient and flexible clock management. They have direct connections to the adjacent WFG (in the same corner).
The PLL can be used for two main purposes:
Implementation of a ZERO_DELAY clock distribution mechanism (improves FPGA clock to out and reduces the risk of hold time violation on inputs)
Phase controlled internal clocks generation (for example CLK_X2 and CLK_DIV2)
PLL features summary:
Input frequency range : 20 to 200 MHz
FeedBack frequency range : 20 to 200 MHz
VCO frequency range : 200 to 1200 MHz
Selectable dividers by 2 on reference clock and feedback path
Programmable delay line on feedback path (can be used for fine tuned phase adjustment).
Configurable divider (ratio 1 to 31) on internal feedback path
FeedBack can be done internally (no phase control) or externally, using the signal coming from a clock tree for phase controlled clocks generation.
Internal 100 MHz oscillator
PLL inputs and outputs:
REF: reference clock input (must be in the range 20 to 200 MHz, depending on PLL settings). REF input can be fed by semi-dedicated clock input pins or by low skew network. An optional (user’s selected) divider by 2 can be used on the REF input.
FBK: input pin for external feedback, if used (feedback can also be internal). FBK can be fed by semi-dedicated input pins or by the low skew network. An optional (user-selected) divider by 2 can be used on the FBK input.
VCO output : direct output of the PLL’s VCO
D1 to D3: Three additional outputs with individual programmable dividers by: 1, 2, 4, 8, 16, 32, 64 and 128. Those outputs are generated by dividing the frequency of the VCO output. The outputs divided by any value other than 1 are reset until 128 VCO cycles before RDY is asserted.
The D1 to D3 outputs use dividers (by power of two) to provide frequencies such as
F(D#) = F(VCO) / (2**clk_outdiv#) where clk_outdiv# is in the range 0 to 7
OSC: Internal 200 MHz oscillator output. Used for delays calibration on the PLL feedback path, WFG internal delays and input/output delays. Osc output can also be used as auxiliary clock.
RDY status: this status pin goes high when the PLL is locked. NanoXplore recommends using a delayed version of RDY to generate the internal RESET signal, by using a flip-flop chain to ensure a safe startup behavior on the corresponding clock domains (see VHDL example on figures 38a, 38b and 38c).
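The relation F(D#) = F(VCO) / (2**clk_outdiv#) can be evaluated quickly when choosing the divider settings. The Python sketch below applies it together with the 200 to 1200 MHz VCO range given in the features summary; the function name is hypothetical and the 650 MHz VCO frequency is only an illustrative value.

```python
def pll_output_mhz(vco_mhz, clk_outdiv):
    """Frequency of a D1..D3 output for a given clk_outdiv# value (0 to 7)."""
    if not 200 <= vco_mhz <= 1200:
        raise ValueError("VCO frequency must be in the range 200 to 1200 MHz")
    if clk_outdiv not in range(8):
        raise ValueError("clk_outdiv must be in the range 0 to 7")
    return vco_mhz / (2 ** clk_outdiv)


vco = 650.0
for name, outdiv in (("D1", 0), ("D2", 2), ("D3", 3)):
    print(name, pll_output_mhz(vco, outdiv), "MHz")
```

With clk_outdiv values of 0, 2 and 3, the three outputs run at the VCO frequency, one quarter of it and one eighth of it respectively.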
Generating multiple synchronous clocks with the divided PLL outputs
Each PLL has 3 clock divider outputs. Each divider divides the VCO frequency by 1, 2, 4, 8, 16, 32, 64 or 128, according to the value assigned to the corresponding “clk_outdiv#” generic parameter (clk_outdiv# = 0 to 7 for divisions by 2**clk_outdiv#).
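The divider rule above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `divided_output_mhz` is hypothetical, and the 650 MHz VCO value is taken from the example discussed later in this section.

```python
def divided_output_mhz(f_vco_mhz: float, clk_outdiv: int) -> float:
    """Frequency of a D1..D3 output: F(VCO) / 2**clk_outdiv."""
    if not 0 <= clk_outdiv <= 7:
        raise ValueError("clk_outdiv must be in the range 0 to 7")
    return f_vco_mhz / (2 ** clk_outdiv)


if __name__ == "__main__":
    # Print the eight possible divided frequencies for a 650 MHz VCO.
    for outdiv in range(8):
        print(f"clk_outdiv = {outdiv}: {divided_output_mhz(650.0, outdiv)} MHz")
```

With a 650 MHz VCO, clk_outdiv values of 0, 2 and 3 give the 650 MHz, 162.5 MHz and 81.25 MHz frequencies used for D1, D2 and D3 in the example that follows.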
The following simulation screenshot shows the RDY, D1, D2 and D3 PLL outputs:
We can observe that only the falling edges of D2 and D3 occur simultaneously.
Now, let’s analyse the following case where VCO frequency = 650 MHz.
Three clocks are generated: SCK (65 MHz), DCK (16.25 MHz) and ECK (8.125 MHz).
SCK (65 MHz) is generated by a WFG driven by D1 (650 MHz)
DCK (16.25 MHz) is generated by a WFG driven by D2 (162.5 MHz)
ECK (8.125 MHz) is generated by a WFG driven by D3 (81.25 MHz)
The three WFGs use the same configuration:
wfg_edge    => '0',                      -- 0: no invert -- 1: invert
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "1111100000" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
The following is a simulation screenshot of the 3 generated clocks :
We can observe that the rising edges of DCK occur simultaneously with SCK rising edges, but ECK rising edges show a delay of 3.076 ns.
This makes the timing constraints on the “SCK_to_ECK” and “DCK_to_ECK” data paths very hard to meet.
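The frequencies quoted for SCK, DCK and ECK follow from the 10-step pattern applied to each divided PLL output. The sketch below reproduces that arithmetic. It assumes that the WFG advances by one pattern step per input clock cycle (which is consistent with the figures of this example) and the helper name `wfg_output_mhz` is hypothetical.

```python
def wfg_output_mhz(f_in_mhz: float, pattern: str, pattern_end: int) -> float:
    """Output frequency of a WFG in pattern mode.

    Assumes one pattern step per input clock cycle; the pattern repeats
    after (pattern_end + 1) steps.
    """
    steps = pattern_end + 1
    active = pattern[:steps]
    # Count 0 -> 1 transitions, including the wrap from the last step to p0.
    rising = sum(
        1 for i in range(steps) if active[i - 1] == "0" and active[i] == "1"
    )
    return f_in_mhz * rising / steps


if __name__ == "__main__":
    pattern = "1111100000" + "000000"  # same pattern for the three WFGs
    for name, f_in in (("SCK", 650.0), ("DCK", 162.5), ("ECK", 81.25)):
        print(name, wfg_output_mhz(f_in, pattern, pattern_end=9), "MHz")
```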
To overcome this potential problem, we must invert the clock polarity of the WFGs that generate the DCK and ECK clocks (using the D2 and D3 PLL outputs, which divide the VCO clock by 4 and 8 respectively).
WFG_SCK:
wfg_edge    => '0',                      -- 0: no invert -- 1: invert
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "1111100000" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
WFG_DCK:
wfg_edge    => '1',                      -- Will use D2 falling edge
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "1111100000" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
WFG_ECK:
wfg_edge    => '1',                      -- Will use D3 falling edge
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "1111100000" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
The phase relationship between the SCK, DCK and ECK clocks is not yet as expected.
However, the patterns of the three WFGs can then be adjusted to guarantee the optimum phase relationship between SCK, DCK and ECK, which relaxes the clock domain crossing constraints.
WFG_SCK:
wfg_edge    => '0',                      -- 0: no invert -- 1: invert
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "1000001111" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
WFG_DCK:
wfg_edge    => '1',                      -- Will use D2 falling edge
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "0111110000" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
WFG_ECK:
wfg_edge    => '1',                      -- Will use D3 falling edge
mode        => '1',                      -- 0: no pattern -- 1: pattern
pattern_end => 9,                        -- 0 to 15 (1 to 16 steps)
pattern     => "1111100000" & "000000",  -- pattern p0 ... p15
delay_on    => '0',                      -- 0: no delay -- 1: delay
Note: in order to guarantee the optimum phase relationship between synchronous clocks, NanoXplore recommends using pre-synthesis simulation.
Notes on delays calibration (PLL, WFG, IOs)
The PLL has a user-selectable, adjustable delay chain on its feedback path (no delay, or 0 to 63 delay taps of 159 ps +/- 5% each). A similar delay chain is available in each WFG. Finally, the IO banks have 64-tap delay chains on the input, output and tri-state command paths.
All the delay chain taps are calibrated with the same procedure and hardware resources.
The calibration procedure is automatic and transparent to the user.
The delays calibration system uses the PLL 100 MHz oscillator output as reference clock to calibrate all delays: feedback path in the PLL itself, WFG delays in same CKG, and IO delays in the two neighboring complex and simple IO banks:
CKG1 oscillator calibrates the delays in CKG1 (PLL + WFGs) and IO banks 12, 11, 10 and 9
CKG2 oscillator calibrates the delays in CKG2 (PLL + WFGs) and IO banks 0 and 1
CKG3 oscillator calibrates the delays in CKG3 (PLL + WFGs) and IO banks 2, 3, 4 and 5
CKG4 oscillator calibrates the delays in CKG4 (PLL + WFGs) and IO banks 6, 7 and 8
The calibration procedure takes about 10 µs at startup. No status is available on NG-MEDIUM.
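As a rough aid for choosing a tap count, the calibrated delay range can be estimated as below. This is a sketch only: it assumes that the +/- 5% tolerance applies uniformly to every 159 ps tap, and the function name `delay_range_ps` is hypothetical.

```python
TAP_PS = 159.0     # nominal delay of one calibrated tap
TOLERANCE = 0.05   # +/- 5 %, assumed to apply uniformly to every tap


def delay_range_ps(taps: int) -> tuple:
    """Return (min, nominal, max) delay in ps for a given number of taps."""
    if not 0 <= taps <= 63:
        raise ValueError("taps must be in the range 0 to 63")
    nominal = taps * TAP_PS
    return nominal * (1 - TOLERANCE), nominal, nominal * (1 + TOLERANCE)


if __name__ == "__main__":
    for taps in (1, 11, 63):
        low, nominal, high = delay_range_ps(taps)
        print(f"{taps:2d} taps: {low:.1f} / {nominal:.1f} / {high:.1f} ps")
```

Under these assumptions, the full 63-tap chain gives a nominal delay of about 10 ns.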
Clocks distribution
Overview
NG-MEDIUM provides an efficient and flexible clock distribution architecture.
The global clocks are distributed from their respective CKG source to a clock switch located on the center of the die. Each CKG is able to generate 3 x 2 = 6 global clocks. In total there are up to 6 clocks x 4 corners = 24 possible global clocks.
On each side of the die (left or right), up to 12 global clocks can be distributed.
The fabric (FPGA core logic) is split vertically in two halves (left and right), also called “zones” or “clock regions”. All the FFs in a clock region can be clocked by one of the 12 possible global clocks (3 global clocks can be generated by each CKG). Note that the single FFs of the simple and complex bank IOs can use the same clocks.
Primary global clocks
Each CKG can generate up to 3 global clocks to be distributed via a central clock switch to the fabric and the IOs.
The total number of primary global clocks is 12 (3 per CKG).
Each CKG can generate and distribute up to 3 global clocks, able to reach all FPGA fabric and IO FFs with a very low skew.
Alternate global clocks or fast IO clocks
Each CKG can also generate alternate clocks able to reach all FFs in the FPGA fabric (including simple and complex banks IOs FFs), but they can also be redirected to the four complex banks of the same top or bottom half of the die, for local fast IO clock.
The distribution of those clocks is exclusive to the FPGA fabric or complex IO banks fast clocks.
Local fast IO clocks
Additionally, each CKG can also generate (via WFG) very fast clocks able to reach the FFs of the two neighboring complex banks. This can be useful for example for the DQS signal when capturing DDR SDRAM data, or when using IO serdes fast clock.
Clocks distribution summary
The next figure summarizes the primary and alternate clocks distribution (via the central clock switch), as well as the fast IO clocks distribution.
The central global switch receives up to 6 clocks from each one of the 4 clock generators (CKG), and distributes them to their destinations.
The central clock switch also receives up to 4 signals from each clock region (or zone). Those signals can then be distributed on the low skew network:
as clock (possible but not recommended, due to the skew with other clocks)
as reset or load_enable signal for FFs dedicated inputs.
Depending on available resources and design requirements, “nxmap” makes the decision to use or not the low skew routing resources for internally generated signals.
Application examples
In many applications, the communications between the FPGA and external components (ADC or DAC for example) can be timing critical. If, for example, the board frequency is 100 MHz, the external components will also be synchronous to this same clock (period = 10 ns).
Analyzing delays for FPGA outputs:
Taking the example of an external DAC, and assuming that the DAC setup requirement is 3.5 ns, the FPGA must deliver valid and stable data at:
Period – Ext_Setup = 10 ns – 3.5 ns = 6.5 ns
Assuming that the FPGA data source is delivered by a tile FF, the output delay will be the sum of the following timings.
Tclock_distribution_delay that includes (timing estimations to illustrate the example)
Routing from pad to WFG input : ~0.6 ns
WFG propagation delay : 0.7 ns
Clock tree routing delay : 3.2 ns
Total from clock input pad to FFs clock input pin: 4.5 ns
T(FF to PAD) that includes (timing estimations)
Tile_FF clock to out : ~0.7 ns
Routing delay : ~1 to 5 ns (implementation dependent)
Output buffer delay : ~1 to 2.5 ns (varies upon electrical parameters)
Total FF to PAD: 2.7 ns to 8.2 ns (depending on internal routing and electrical parameters of the IOs)
Total from clock input pin to output pads: 7.2 ns (best case) to 12.7 ns (while 6.5 ns was required for this example).
We can see that, even in the best case we are missing the setup time requirement of the destination (7.2 ns – 6.5 ns = 0.7 ns timing error).
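The budget above can be reproduced with a short script, which makes it easy to replay the analysis with other estimates. The values are the illustrative estimates of this example, not characterized device timings, and the names used are hypothetical.

```python
PERIOD_NS = 10.0       # 100 MHz board clock
EXT_SETUP_NS = 3.5     # setup requirement of the external DAC

# Clock distribution delay, from clock input pad to FF clock pin (ns)
CLOCK_PATH_NS = {
    "pad to WFG routing": 0.6,
    "WFG propagation": 0.7,
    "clock tree routing": 3.2,
}

# FF to pad delay as (best case, worst case) pairs (ns)
FF_TO_PAD_NS = {
    "tile FF clock to out": (0.7, 0.7),
    "routing": (1.0, 5.0),
    "output buffer": (1.0, 2.5),
}


def clock_to_pad_ns() -> tuple:
    """Return (best, worst) delay from clock input pad to output pad."""
    clock = sum(CLOCK_PATH_NS.values())
    best = clock + sum(low for low, _ in FF_TO_PAD_NS.values())
    worst = clock + sum(high for _, high in FF_TO_PAD_NS.values())
    return round(best, 3), round(worst, 3)


if __name__ == "__main__":
    required = PERIOD_NS - EXT_SETUP_NS
    best, worst = clock_to_pad_ns()
    print(f"required: data valid within {required} ns")
    print(f"best case : {best} ns, slack {round(required - best, 3)} ns")
    print(f"worst case: {worst} ns, slack {round(required - worst, 3)} ns")
```

A negative slack means that the setup requirement of the destination is not met.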
However, the delays vary upon PVT. For example, the worst case 6.3 ns clock delay, could become shorter at lower temperature and higher voltage. It could easily be temporarily reduced to 4 ns or less.
The following figures illustrate the timing diagrams for output and inputs from/to the FPGA.
In the input timing diagram, the clock distribution delay is called “Min clock delay”.
We can clearly observe that the output data is not stable during the setup/hold window of the destination. The timing requirement is not met (3.4 ns error).
However, due to PVT variations, the FPGA internal delays can be reduced. In certain conditions, we could temporarily observe a correct – but unstable – behavior.
Analyzing delays for FPGA inputs:
To analyze the FPGA input timings, we should consider the timing variations over PVT.
The clock distribution delay (6.3 ns max in our example) could be easily reduced to 4 ns or less at lower temperature and/or higher Vcc.
Assuming for example 4 ns clock delay and 1.5 ns FF setup requirement for the FPGA input FF, we can see again that the input data will not be stable during the setup/hold time window.
The external connections of both input and output can NOT meet those apparently simple requirements. Some improvements must be found.
Possible timing improvements:
First, we should adjust the IO electrical parameters to use the “FAST” slew rate and drive current of the output buffer. This can improve the timing at the cost of additional noise due to the faster transitions on the data bus. However, the improvement will only be of the order of 1 ns.
The IOs electrical parameters can be defined in a script file, using the “project.addPad” (single pad) or “project.addPads” (multiple pads) commands. For more information, see chapter 3.1.5 and NanoXplore “nxmap” Python API documentation.
Another complementary way to reduce the FPGA clock to out delay is to map the flip-flops into the IO blocks. This will eliminate the routing delay between the FF and the output buffer. We improve the timing by 1 to 3 ns, and make this timing independent of the implementation (no general routing delay).
This can be done by setting a project option in the script file:
project.setOption({'MergeRegisterToPad' : 'Always'}) # other possible values : 'Never', 'Input' or 'Output'
However, we can observe that the clock distribution is the longest delay in the chain. This delay can be cancelled by using a PLL + WFG and clock tree. By connecting the clock tree to the feedback input of the PLL, the clock distribution delay can be cancelled (near 0 ns).
Input and output timing improvements, taking advantage of the PLL and low skew network:
The PLL and WFG can be used to generate internally one or several clocks. All the clocks can be phase aligned with the incoming reference clock (pad of semi-dedicated clock input pin).
For this, the PLL feedback must be external to the PLL, and must come from a clock tree, so the phase alignment can be done by comparing the relative phases of the clock input pin and the clock tree.
configuration (timing estimation):
Tclock_distribution : 0.0 ns
T(FF to PAD) that includes (FF merged into IOB)
IOB_FF clock to out : ~2 to 4 ns (varies upon electrical parameters)
Total FF to PAD: 2 ns to 4 ns
Total from clock input pad to output pads: 2 ns to 4 ns (while 6.5 ns was required for this example).
We can see that the setup time requirement of the destination is now met, independently of the implementation (no general routing involved in this timing). There are no transitions on the data during the setup/hold window.
Let’s analyze the input path in this configuration:
We can see that the setup time requirement of the FPGA input FF is now met, independently of the implementation (no general routing involved in this timing). There are no transitions on the data during the setup/hold window.
Zero delay clock distribution using PLL + WFG and additional synchronous clocks
As explained previously, the PLL can be used to generate internal clocks phase aligned with the input reference clocks edges.
In this case, the PLL output clock periods must equal the input period multiplied or divided by powers of 2 (1, 2, 4, 8….128)
In this example, we will show how to generate three internal clocks, phase aligned with the reference input clock pad:
Reference input clock : 80 MHz (CLKIN_80)
Internal phase aligned 80 MHz clock (CLK_80)
Internal phase aligned 40 MHz clock (CLK_40)
Internal phase aligned 160 MHz clock (CLK_160)
RST generation (synchronously with the internally generated 80 MHz clock)
For a correct behavior of the PLL, the VCO frequency must be higher than 200 MHz and lower than 1200 MHz. We will set it to 320 MHz (80 MHz x 4). For convenient PLL setup (generic assignments), consult the NX Library guide.
The following is the source code of clock generation module:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
library NX;
use NX.package.all;

entity CKG_MODULE_320 is
  generic (
    simu : boolean := false
  );
  port (
    REF     : in  STD_LOGIC;
    CLK_160 : out STD_LOGIC;
    CLK_80  : out STD_LOGIC;
    CLK_40  : out STD_LOGIC;
    RST     : out STD_LOGIC;
    OSC     : out STD_LOGIC
  );
end CKG_MODULE_320;

architecture Behavioral of CKG_MODULE_320 is
  signal VCO_320   : std_logic;
  signal RDY       : std_logic;
  signal FBK       : std_logic;
  signal SO        : std_logic;
  signal RST_DELAY : std_logic_vector(3 downto 0);
begin
Entity and signals declarations of Zero delay clock generation example
  INST_PLL : NX_PLL
    generic map (
      vco_range    => 0,
      ref_div_on   => '0',
      fbk_div_on   => '0',
      ext_fbk_on   => '1',
      fbk_intdiv   => 4,
      fbk_delay_on => '0',
      fbk_delay    => 11,
      clk_outdiv1  => 1,
      clk_outdiv2  => 2,
      clk_outdiv3  => 4
    )
    port map (
      REF => REF,
      FBK => FBK,
      VCO => VCO_320,
      D1  => open,
      D2  => open,
      D3  => open,
      OSC => OSC,
      RDY => RDY
    );

  INST_WFG_80 : NX_WFG
    generic map (
      mode        => '1',
      wfg_edge    => '0',
      pattern_end => 7,
      pattern     => "0110011000000000",
      delay       => 0,
      delay_on    => '0'
    )
    port map (
      RDY => open,
      SI  => SO,
      ZI  => VCO_320,
      SO  => SO,
      ZO  => FBK
    );

  CLK_80 <= FBK;
Instantiation of PLL and first WFG for Zero delay clock example
  INST_WFG_160 : NX_WFG
    generic map (
      mode        => '1',
      wfg_edge    => '0',
      pattern_end => 7,
      pattern     => "0101010100000000",
      delay       => 0,
      delay_on    => '0'
    )
    port map (
      RDY => RDY,
      SI  => SO,
      ZI  => VCO_320,
      SO  => open,
      ZO  => CLK_160
    );

  INST_WFG_40 : NX_WFG
    generic map (
      mode        => '1',
      wfg_edge    => '0',
      pattern_end => 7,
      pattern     => "0111100000000000",
      delay       => 0,
      delay_on    => '0'
    )
    port map (
      RDY => RDY,
      SI  => SO,
      ZI  => VCO_320,
      SO  => open,
      ZO  => CLK_40
    );

  -- RST is held active until RDY has been sampled high on four
  -- consecutive falling edges of the 80 MHz feedback clock
  process (FBK)
  begin
    if falling_edge(FBK) then
      RST_DELAY <= RST_DELAY(2 downto 0) & not(RDY);
    end if;
  end process;

  RST <= RST_DELAY(3);

end architecture;
Additional WFG instantiation and RST generation
This figure shows the timing diagram of this design example. The rising edges of the internal clocks (CLK_80, CLK_160 and CLK_40) are phase aligned with the rising edges of the reference clock input pin (CLKIN_80).
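The alignment of the rising edges can also be read directly from the three WFG patterns of this example. The sketch below lists, for each pattern, the steps at which a rising edge occurs. It assumes that the three WFGs are synchronized so that they all start their pattern on the same VCO cycle; the helper name `rising_edge_steps` is hypothetical.

```python
def rising_edge_steps(pattern: str, pattern_end: int) -> list:
    """Steps (0 = p0) at which the repeating pattern goes from 0 to 1."""
    steps = pattern_end + 1
    active = pattern[:steps]
    # active[i - 1] wraps to the last step when i == 0 (repeating pattern)
    return [i for i in range(steps) if active[i - 1] == "0" and active[i] == "1"]


# Patterns of the three WFGs of this example (8 steps of the 320 MHz VCO)
PATTERNS = {
    "CLK_160": "0101010100000000",
    "CLK_80": "0110011000000000",
    "CLK_40": "0111100000000000",
}

EDGES = {name: rising_edge_steps(p, 7) for name, p in PATTERNS.items()}

if __name__ == "__main__":
    for name, steps in EDGES.items():
        print(name, "rising edges at steps", steps)
```

Every rising edge of a slower clock falls on a step where the faster clocks also rise, which is the property that keeps the clock domain crossings easy to constrain.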
Input and output timing requirement can be easily met. The “period” constraints applied to the internal clocks will cover all internal paths, while the input and output timing requirement will be respectively covered by the setInputDelay and setOutputDelay constraints. In addition the clock domain changes between the three internal clocks can be easily managed by the implementation and timing analyzer tools.
“nxmap” does not forward the constraint applied to the clock input to the PLL and WFG outputs. The user is required to specify the period and phase constraints, using the “createClock” script command, to each one of the PLL/WFG generated internal clocks.
Multiple clocks generation using PLL internal feedback
When using the PLL internal feedback, the user can take advantage of the nDivider to generate frequencies with more complex ratios.
Note that in this case, the generated clock won’t be phase aligned with the reference clock input pad.
For internal feedback, we can use two or three additional dividers on the VCO output :
The VCO output is first divided by 2, before entering the programmable nDivider. This divider is not optional. It can’t be bypassed.
nDivider : division factor programmable by integer values (1 to 31, defined with the “fbk_intdiv” generic value)
Divider by 2 on the feedback path : this divider can be selected or bypassed by setting the “fbk_div_on” generic to ‘1’ (selected) or ‘0’ (bypass)
The global division factor on the internal feedback path is in the range 4 to 62, by steps of 2 when “fbk_div_on” is set to ‘0’. However, the VCO multiplication factor can range from 2 to 31 if the “ref_div_on” is set to ‘1’ (divider by 2 in the reference input clock). See NG-MEDIUM datasheet and Library Guide for more information.
In the following example, the 80 MHz REF clock frequency is multiplied by 5 to generate a 400 MHz at the VCO output.
Then, three WFGs are used to generate the three internal clocks of 200 MHz, 100 MHz and 50 MHz respectively.
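The VCO frequency obtained with internal feedback can be estimated from the divider chain described above. The sketch below assumes the usual locked-loop relation (the reference after its optional divider equals the VCO after the feedback dividers); the function names are hypothetical and the generics are modelled as booleans for readability.

```python
VCO_MIN_MHZ = 200.0
VCO_MAX_MHZ = 1200.0


def vco_mhz(ref_mhz: float, fbk_intdiv: int,
            ref_div_on: bool = False, fbk_div_on: bool = False) -> float:
    """VCO frequency with internal feedback, assuming the PLL is locked."""
    if not 1 <= fbk_intdiv <= 31:
        raise ValueError("fbk_intdiv must be in the range 1 to 31")
    ref_div = 2 if ref_div_on else 1
    # Fixed divider by 2, then nDivider, then the optional divider by 2
    fbk_div = 2 * fbk_intdiv * (2 if fbk_div_on else 1)
    return ref_mhz / ref_div * fbk_div


def vco_in_range(f_mhz: float) -> bool:
    """True when the frequency lies in the 200 to 1200 MHz VCO range."""
    return VCO_MIN_MHZ <= f_mhz <= VCO_MAX_MHZ


if __name__ == "__main__":
    f = vco_mhz(80.0, fbk_intdiv=5, ref_div_on=True)
    print(f, "MHz, in range:", vco_in_range(f))
```

For the example of this section (80 MHz reference, reference divider enabled, fbk_intdiv = 5), the sketch returns 400 MHz, which lies inside the VCO range.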
  INST_PLL : NX_PLL
    generic map (
      vco_range    => 0,
      ref_div_on   => '1',
      fbk_div_on   => '0',
      ext_fbk_on   => '0',  -- internal feedback (external feedback not used)
      fbk_intdiv   => 5,    -- Int div
      fbk_delay_on => '0',
      fbk_delay    => 0,
      clk_outdiv1  => 0,
      clk_outdiv2  => 0,
      clk_outdiv3  => 0
    )
    port map (
      REF => REF,       -- 80 MHz input reference clock
      FBK => FBK_open,  -- No external feedback
      VCO => CLK_400,   -- VCO output at 400 MHz: (80 MHz x 5 x 2) / 2
      D1  => open,
      D2  => open,
      D3  => open,
      OSC => OSC,
      RDY => RDY
    );
Instantiation of PLL internal feedback (no phase control)
  INST_WFG_100 : NX_WFG
    generic map (
      mode        => '1',
      wfg_edge    => '0',
      pattern_end => 7,
      pattern     => "0110011000000000",
      delay       => 0,
      delay_on    => '0'
    )
    port map (
      RDY => RDY,
      SI  => SO,
      ZI  => CLK_400,
      SO  => SO,
      ZO  => CLK_100
    );

  INST_WFG_200 : NX_WFG
    generic map (
      mode        => '1',
      wfg_edge    => '0',
      pattern_end => 7,
      pattern     => "0101010100000000",
      delay       => 0,
      delay_on    => '0'
    )
    port map (
      RDY => RDY,
      SI  => SO,
      ZI  => CLK_400,
      SO  => open,
      ZO  => CLK_200_INT
    );
Instantiation of WFG for 100 MHz and 200 MHz
  INST_WFG_50 : NX_WFG
    generic map (
      mode        => '1',
      wfg_edge    => '0',
      pattern_end => 7,
      pattern     => "0111100000000000",
      delay       => 0,
      delay_on    => '0'
    )
    port map (
      RDY => RDY,
      SI  => SO,
      ZI  => CLK_400,
      SO  => open,
      ZO  => CLK_50
    );

  CLK_200 <= CLK_200_INT;

  -- RST is released once RDY has been high for eight CLK_200_INT cycles
  process (CLK_200_INT, RDY)
  begin
    if RDY = '0' then
      RDY_DELAY <= (others => '0');
    elsif rising_edge(CLK_200_INT) then
      RDY_DELAY <= RDY_DELAY(6 downto 0) & RDY;
    end if;
  end process;

  RST <= not(RDY_DELAY(7));

end architecture;
Instantiation of WFG for 50 MHz and RST generation
FPGA core logic
For a detailed description of the NG-MEDIUM architecture, please refer to the NG-MEDIUM datasheet.
However, some additional information and usage guidelines are given here.
The NG-MEDIUM core logic is organized as 5 rows of complementary logic resources.
3 rows of Tile logic: combinatorial and arithmetic logic, registers and simple dual port 1-Kbit RAM.
2 rows of Coarse Grain Blocks (CGB): contain high performance 48K-bit synchronous true dual port RAM blocks, and cascadable DSP blocks.
The NG-MEDIUM available user’s logic is composed of 3 rows of 28 tiles. The total amount of the available tile logic resources is the following:
34272 x LUT4 (2016 of them are X-LUT, extended LUTs for wide combinatorial functions; more details later)
32256 x DFFs (total 32256 FE)
8064 bits of carry logic
168 register file (1-Kbit simple dual port RAM, organized as 64 x 16-bit)
336 CKS (glitch-free ClocK Switches)
Routing resources overview
The NG-MEDIUM architecture provides 4 kinds of internal routing resources:
Low skew network: balanced routing to provide homogeneous distribution of the global signals (clocks, but also RST and Load_Enable inputs of the FFs).
Direct interconnections between adjacent logic resources (FEs to/from X-LUT, FEs from/to Carry logic, DSP chained inputs and outputs). The direct interconnections are very fast (near 0 ps delay).
Tile logic internal routing: Into the same tile, the router can use short routing resources. The tile internal connections are faster than inter-tile or other general connections.
General routing resources: used for tile from/to tile, tile from/to BRAM, tile from/to DSP block and BRAM from/to DSP block connections. The general routing resources have a longer propagation delay than the tile internal routing resources.
Routing resources impact on performance and power consumption:
The user must understand that the FPGA performance and dynamic power consumption are strongly impacted by the amount of general routing resources used in the design. Using more general routing resources implies lower design performance and higher power consumption.
For this reason, it’s very important to take advantage of the architectural features and direct or short interconnects (Carry logic, X-LUT, register files, DSP chains…), by following simple but efficient HDL coding techniques, so that the “nxmap” synthesis and implementation tools can use the best routing resources.
Tile logic
The tile logic will be used to implement most combinatorial and/or registered logic of your design. The tile is mainly composed of Functional Elements (FE). Each FE contains a 4-input Look Up Table (LUT-4) and a D-type Flip-Flop (FF or DFF):
The LUT-4 can implement any 4-input combinatorial function. The flip-flop has sync/async reset and LoaD_enable inputs. It can be used or bypassed.
The functional elements (FE) as well as additional logic resources (to be described later) are grouped into tiles.
Each tile contains 384 FE and additional resources:
384 x 4-input Look-Up tables (LUT4)
384 x D-type Flip-Flops (DFF or FF). Can be initialized by bitstream
24 x X-LUT, for up to 16-input combinatorial functions. The X-LUT is also a 4-input LUT that combines the outputs of up to 4 neighboring 4-input LUTs. The connectivity between the 4 x 4-input LUTs and the X-LUT uses dedicated direct routing. The X-LUT output can be registered by using one of the 4 neighboring FE flip-flops.
96-bit carry logic for arithmetic functions (can be segmented by groups of 4 bits)
2 x Register File (RF): single port or simple dual port synchronous 64 x 16 RAM, with embedded EDAC. The RF contents at startup can be initialized by bitstream, if the default values are specified in the source code.
4 x CKS ClocK Switch
Functional Element (FE): Each FE contains one LUT-4 associated to a D Flip-Flop with programmable clock polarity, RESET and LOAD input. The DFF can be used or bypassed. A tile contains 384 FE.
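The device totals quoted at the beginning of this chapter follow directly from the per-tile resources listed above and from the 3 rows of 28 tiles. The following sketch simply redoes that arithmetic; the dictionary keys are illustrative labels, not tool identifiers.

```python
ROWS_OF_TILES = 3
TILES_PER_ROW = 28

# Resources of one tile, as listed above
PER_TILE = {
    "FE": 384,             # one LUT4 + one DFF each
    "X-LUT": 24,
    "carry bits": 96,
    "register files": 2,
    "CKS": 4,
}

TILES = ROWS_OF_TILES * TILES_PER_ROW
TOTALS = {name: count * TILES for name, count in PER_TILE.items()}
# The LUT4 total counts the FE LUTs plus the X-LUTs
LUT4_TOTAL = TOTALS["FE"] + TOTALS["X-LUT"]

if __name__ == "__main__":
    print("tiles:", TILES)
    for name, total in TOTALS.items():
        print(f"{name}: {total}")
    print("LUT4 (including X-LUT):", LUT4_TOTAL)
```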
Each tile is organized as follows:
The carry logic (CY) is associated with the neighboring FEs in the same stripe, using direct interconnects. Carry chains can be linked with direct interconnections from stripe 1 to stripe 2, and from stripe 2 to stripe 1 (brown arrows on left and right sides). The carry chain can be up to 96 bits long within a single tile.
Each X-LUT is associated with the four LUT4 of the FEs located in the same column and stripe, using direct interconnects (blue arrows in stripe 3).
REG_FILE usage requires using the LUTs of the 32 neighboring FEs to route addresses, data and control signal to the REG_FILE inputs.
For a more detailed description of the NG-MEDIUM architecture, please refer to the NG-MEDIUM datasheet.
Functional element (FE)
In each tile, there are 4 stripes (1 to 4) of 96 Functional Elements (FE)
Each functional element includes one 4-input Look-Up Table and a D-type Flip-Flop (DFF or FF).
The 4-input LUT can implement any combinatorial function of 4 inputs. It can also be used as a 16x1 ROM. The LUT contents are initialized by bitstream.
The Flip-Flop (FF) can be used or bypassed – for example to allow combining several LUTs, for combinatorial functions of more than 4 inputs.
Programmable clock polarity on each FE flip-flop
Dedicated active high RST input (can be synchronous or asynchronous)
Dedicated active high LD (load input)
Clock comes from low skew network. Load Enable and RST can come from Low Skew network or local routing.
In addition, the FEs have direct connections to/from extra logic for extended functions such as :
Carry logic (hardware accelerator for fast, dense and predictable arithmetic functions - stripe 1 and 2)
X-LUT for up to 16-input combinatorial functions (stripe 3)
Register Files (64 x 16 simple dual port RAM) and CKS (ClocK Switches) (stripe 4).
The output signal of the extra logic (companion logic such as Carry Logic, X-LUT and Register File) can enter to the D input of the FE flip-flop via a static multiplexer configured by bitstream, to register its outputs.
“nxmap” synthesis and implementation tools will map the required Functional Elements (FE) to implement the functions described in the source code. Depending on the required functionality, the logic will be implemented in the following stripes:
Stripe 1 or stripe 2 for arithmetic functions
Stripe 3 for wide combinatorial functions
Stripe 4 for small single port or simple dual port memories
Any stripe for regular logic (combinatorial/registered)
Carry logic:
Carry logic is located in stripes 1 and 2 of each tile
In each tile, the two upper stripes (stripe 1 and stripe 2) provide additional logic resources (carry logic) to implement very fast, compact and predictable arithmetic functions such as adder, subtractor, adder/subtractor or magnitude comparators. Any carry logic + associated FEs can implement 1 to 4-bit wide arithmetic operators.
When combined with the FE flip-flops, registered arithmetic functions can fit into the same stripe of FEs (like for example accumulators).
Inside a single tile, the maximum number of bits for arithmetic function is 96 bits. The carry chain uses direct routing (no routing delay). It can be segmented by sets of 4 bits. For example, we could implement in a single tile up to 8 x 12-bit arithmetic functions.
All the other FEs are free to implement additional logic functions.
“nxmap” synthesis tools recognize the arithmetic functions of the source code by detecting the following VHDL operators: “+”, “-”, “<”, “>”, “<=” or “>=”.
However, for arithmetic functions of 4 bits or less, the functions will be implemented with LUTs exclusively (no carry logic is used).
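To estimate how many arithmetic operators of a given width fit on the carry chain of one tile, the 4-bit segmentation can be taken into account as follows. This is a sketch under the assumption that each operator occupies whole 4-bit segments; the function name `operators_per_tile` is hypothetical.

```python
import math

CARRY_BITS_PER_TILE = 96
SEGMENT_BITS = 4


def operators_per_tile(width_bits: int) -> int:
    """Number of width_bits-wide operators that fit on one tile's carry chain.

    Assumes each operator occupies an integer number of 4-bit segments.
    """
    if width_bits <= 0:
        raise ValueError("width_bits must be positive")
    segments = math.ceil(width_bits / SEGMENT_BITS)
    return CARRY_BITS_PER_TILE // (segments * SEGMENT_BITS)


if __name__ == "__main__":
    for width in (8, 10, 12, 16, 32, 96):
        print(f"{width:2d}-bit operators per tile: {operators_per_tile(width)}")
```

Operators of 4 bits or less are not relevant here since, as stated above, they are implemented with LUTs only.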
X-LUT:
X-LUTs are located in stripe 3 of each tile
The X-LUT is an additional 4-input LUT that combines the outputs of the four neighboring LUT4 in the same FE column of stripe 3. Direct interconnects allow timing optimized combinatorial functions of up to 16 inputs to be implemented. The X-LUT output can be directed to the associated flip-flops of the related FEs.
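The following behavioural sketch illustrates the structure: four LUT4 each reduce 4 inputs to 1 bit, and the X-LUT combines the four results. It models the logic only (no timing), uses a 16-input parity function as an example of a function that decomposes this way, and all names are hypothetical.

```python
def lut4(truth_table: int, a: int, b: int, c: int, d: int) -> int:
    """Evaluate a 4-input LUT whose contents are a 16-bit truth table."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (truth_table >> index) & 1


# Truth table of a 4-input XOR: bit i is the parity of the index i
XOR4 = sum((bin(i).count("1") & 1) << i for i in range(16))


def xlut16(inputs: list, leaf_table: int, xlut_table: int) -> int:
    """Four LUT4 (same contents here) feeding one X-LUT."""
    if len(inputs) != 16:
        raise ValueError("exactly 16 inputs are expected")
    leaves = [lut4(leaf_table, *inputs[4 * k:4 * k + 4]) for k in range(4)]
    return lut4(xlut_table, *leaves)


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
    print("parity:", xlut16(bits, XOR4, XOR4))
```

Functions such as wide AND, OR or parity map naturally onto this two-level structure; an arbitrary function of 16 inputs does not necessarily fit into a single X-LUT group.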
Register file (synchronous simple dual port RAM):
Register files are located in stripe 4 of each tile. There are two Register_File in each tile.
The Register File are 1-Kbit synchronous Simple Dual Port (SDP) memory blocks. One port is dedicated to writing, the other one to reading.
Organization : 64 x 16-bit
Separate READ_clock and WRITE_clock
Optional pipeline output register for faster Tco (clock to out)
Embedded SECDED EDAC (transparent to the user). Able to automatically correct any single bit error and to detect double bit errors.
The Register_file is a set of combined resources:
One Simple Dual Port 64 x 16-bit RAM array with embedded EDAC
32 x LUT of the same tile stripe
FE Flip-Flops for optionally registered outputs
The register file can be instantiated (see Library Guide), or inferred from VHDL source code.
The next figure shows a simplified internal diagram of a Register_File:
ClocK Switch (CKS):
Just like the Register_Files, the Clock Switches (CKS) are located in stripe 4 of each tile. There are four CKS in each tile.
The CKS generates a glitch-free clock that can be enabled or disabled. This can, for example, help reduce the power consumption of logic that does not require a permanent clock.
Note that using a CKS element implies using an associated neighboring LUT in the same tile stripe.
See the detailed description in the NX_Library_Guide for more information.
RAM blocks (48Kb True Dual Port RAM)
Description
NG-MEDIUM has two rows of Coarse Grain Blocks (CGB). Each CGB is composed of 1 RAM block (48 K-bit each) and 2 DSP blocks.
In a single row, there are 28 RAM blocks + 56 DSP blocks, for a total of 56 RAM blocks + 112 DSP blocks.
The memory block is a True Dual Port synchronous 48K-bits SRAM. The memory is configurable and supports various modes of operation. Each port can perform read and/or write operations. The data can be protected by a hardware SECDED EDAC. This EDAC function can be bypassed. The ECC signature is computed during the write cycle and checked during the read cycle.
An optional feature is the Read Repair mode. When this mode is enabled and a correctable error is detected during the read cycle, then the memory array is updated with the corrected data/ECC value.
Here is a summary of the NX_RAM main features and possible configurations.
Without EDAC: 49152 x 1-bit, 24576 x 2-bit, 12288 x 4-bit, 6144 x 8-bit, 4096 x 12-bit, 2048 x 24-bit
With EDAC: 2048 x 18-bit
Programmable positive / negative clock edge
Optional pipeline input and output registers
Memory content can be optionally initialized by bitstream
Embedded EDAC
Automatic Read Repair Mode
The next figure shows a simplified internal diagram of a RAM block.
The internal memory core is organized as 2Kx24. For user access, it can be organized in many ways on each port. Each port has separate data inputs, addresses, control signals and clocks, as well as data and flag outputs. The two ports can be configured independently.
In order to provide higher performance, pipeline registers can be used on the two input ports and the two output ports. Note that, in addition, using the output pipeline registers also reduces the memory output delay. The design performance is increased, at the cost of one or two latency cycles.
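All the configurations listed in the feature summary map onto the same 2Kx24 array. The sketch below checks the capacity of each organization and derives the minimum number of address bits it needs. The address widths are plain arithmetic, not pin counts taken from the primitive, and the names are hypothetical.

```python
ARRAY_BITS = 2048 * 24  # physical 2Kx24 memory array

# (depth, width) organizations without EDAC
NO_EDAC_CONFIGS = [
    (49152, 1), (24576, 2), (12288, 4), (6144, 8), (4096, 12), (2048, 24),
]
# With EDAC: 18 user bits per word, the rest of each 24-bit word holds the ECC
EDAC_CONFIG = (2048, 18)


def address_bits(depth: int) -> int:
    """Minimum number of address bits needed to reach 'depth' words."""
    return (depth - 1).bit_length()


if __name__ == "__main__":
    for depth, width in NO_EDAC_CONFIGS:
        print(f"{depth} x {width}: {depth * width} bits, "
              f"{address_bits(depth)} address bits")
    depth, width = EDAC_CONFIG
    print(f"{depth} x {width} with EDAC: {depth * width} user bits, "
          f"{ARRAY_BITS - depth * width} bits left for the ECC")
```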
Supported RAM block configurations without EDAC
The configurations marked NOECC in the table below are directly supported by “nxmap”, simply by assigning a generic parameter when instantiating the NX_RAM primitive. However, the user can define any of the following block RAM configurations by instantiating the NX_RAM primitive and properly assigning all related generic parameters (see Library_Guide.pdf for more details).
Port0 (A) / Port1 (B) | 2Kx24 | 4Kx12 | 6Kx8 | 12Kx4 | 24Kx2 | 48Kx1
2Kx24 | NOECC 2Kx24 | Yes (user’s) | Yes (user’s) | Yes (user’s) | Yes (user’s) | Yes (user’s)
4Kx12 | Yes (user’s) | NOECC 4Kx12 | Yes (user’s) | Yes (user’s) | Yes (user’s) | Yes (user’s)
6Kx8 | Yes (user’s) | Yes (user’s) | NOECC 6Kx8 | Yes (user’s) | Yes (user’s) | Yes (user’s)
12Kx4 | Yes (user’s) | Yes (user’s) | Yes (user’s) | NOECC 12Kx4 | Yes (user’s) | Yes (user’s)
24Kx2 | Yes (user’s) | Yes (user’s) | Yes (user’s) | Yes (user’s) | NOECC 24Kx2 | Yes (user’s)
48Kx1 | Yes (user’s) | Yes (user’s) | Yes (user’s) | Yes (user’s) | Yes (user’s) | NOECC 48Kx1
Data input pins and RAM block configuration:
When the RAM block is used without EDAC, the data inputs (AI or BI) must be connected as follows, depending on the data bus width (24-bit, 12-bit, 8-bit, 4-bit, 2-bit or 1-bit):
24-bit input width: data_in(23:0) connected to AI24 to AI1 (or BI24 to BI1)
12-bit input width: data_in(11:0) replicated 2 times, on AI12 to AI1 and AI24 to AI13 (or BI12 to BI1 and BI24 to BI13)
8-bit input width: data_in(7:0) replicated 3 times, on AI8 to AI1, AI16 to AI9 and AI24 to AI17 (or equivalent pins on port B)
4-bit input width: data_in(3:0) replicated 6 times, on AI4 to AI1, AI8 to AI5, AI12 to AI9, AI16 to AI13, AI20 to AI17 and AI24 to AI21 (or equivalent pins on port B)
2-bit input width: data_in(1:0) replicated 12 times across AI24 to AI1 (or equivalent pins on port B)
1-bit input width: data_in replicated 24 times across AI24 to AI1 (or equivalent pins on port B)
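The replication rule above can be sketched as a small wrapper placed in front of the RAM data inputs. This is a minimal sketch, not the vendor's implementation: the entity name RAM_DIN_REPLICATE, the generic DATA_WIDTH and the port names are hypothetical, and the AI output is simply meant to be wired to the AI24 to AI1 pins of an NX_RAM instance.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: replicates a narrow data bus over the 24 RAM data inputs.
entity RAM_DIN_REPLICATE is
  generic (
    DATA_WIDTH : positive := 8  -- must divide 24: 24, 12, 8, 4, 2 or 1
  );
  port (
    DATA_IN : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    AI      : out std_logic_vector(24 downto 1)  -- to AI24..AI1 (or BI24..BI1)
  );
end RAM_DIN_REPLICATE;

architecture RTL of RAM_DIN_REPLICATE is
begin
  -- One copy of DATA_IN per DATA_WIDTH-wide slice of the 24-bit input bus
  GEN_COPY : for i in 0 to (24 / DATA_WIDTH) - 1 generate
    AI((i + 1) * DATA_WIDTH downto i * DATA_WIDTH + 1) <= DATA_IN;
  end generate;
end RTL;
```

With DATA_WIDTH = 8, the generate loop produces the three copies on AI8 to AI1, AI16 to AI9 and AI24 to AI17 listed above.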
Data output pins:
Read data appears on the AO24 to AO1 pins when the port output width is configured to 24-bit.
For other output width configurations, the output data is presented at the following output pins:
12-bit output width : AO12 to AO1 (or BO12 to BO1)
8-bit output width : AO8 to AO1 (or BO8 to BO1)
4-bit output width : AO4 to AO1 (or BO4 to BO1)
2-bit output width : AO2 to AO1 (or BO2 to BO1)
1-bit output width : AO1 (or BO1)
Unused output pins can be left unconnected.
Physical and logical memory organization:
The memory is physically organized as a 2Kx24 array. However, the user can define different logical organizations.
Supported RAM block configurations with EDAC:
With EDAC, the maximum memory data width is 18-bit (2Kx18). For each 18-bit word, a 6-bit ECC signature is computed and stored into the memory array during the write cycles, for a total of 24-bit at each memory location.
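As a plausibility check of the 6-bit signature width, the number of check bits can be estimated under the assumption of an extended Hamming code (single error correction, double error detection). This section does not state which code is actually used, so the code family, and the symbols k (data bits) and r (check bits), are assumptions made for illustration.

```latex
2^{r} \ge k + r + 1, \quad k = 18
\;\Rightarrow\; r = 5 \quad \left(2^{5} = 32 \ge 24,\ \text{while } 2^{4} = 16 < 23\right)
\\
r_{\text{SEC-DED}} = r + 1 = 6, \qquad k + r_{\text{SEC-DED}} = 18 + 6 = 24 \ \text{bits per memory location}
```

Under that assumption, 18 data bits plus 6 check bits exactly fill the 24-bit physical word.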
Data input pins :
AI18 to AI1 must be used for port A for 18-bit input width. (BI18 to BI1 for port B)
9-bit input width: data_in(8:0) replicated 2 times to AI9 to AI1 and AI18 to AI10 (or equivalent pins on port B)
6-bit input width: data_in(5:0) replicated 3 times to AI6 to AI1, AI12 to AI7 and AI18 to AI13 (or equivalent pins on port B)
For 2-bit input width, the 2-bit data_in bus must be replicated 9 times
For 1-bit input width the data_in must be replicated 18 times.
Data output pins :
18-bit output width: read data appears on the AO18 to AO1 pins or BO18 to BO1. Unused output pins can be left unconnected.
9-bit output width : AO9 to AO1 (or BO9 to BO1)
6-bit output width : AO6 to AO1 (or BO6 to BO1)
2-bit output width : AO2 to AO1 (or BO2 to BO1)
1-bit output width : AO1 (or BO1)
User’s data and related ECC signature are read and checked during the read cycles. If a single error is detected, the user data is corrected on the fly, and the flag “xCOR” is activated. If a double error is detected, it can’t be corrected, and the flag “xERR” is activated.
Optional ECC_Read_Repair feature: the memory contents can be updated with the corrected value by using the “ECC_Read_Repair” feature. In this mode, called “SLOW”, a special mechanism repairs the memory contents during read cycles by performing a Read_Modify_Write cycle within a single user clock cycle. This mechanism requires an internal clock at twice the frequency of the user’s clock. The fast clock is generated internally from an additional clock input (ACKD or BCKD) that has the same frequency as the user’s clock and is shifted by 90°.
Using the RAM blocks with EDAC in FAST mode:
This mode provides the highest performance (working frequency). The RAM block configuration is 2Kx18. In this mode, single errors are corrected at the RAM block outputs, and reported on the “xCORR” output flag. However, the memory content is not corrected. Double errors can’t be corrected, and the flag “xERR” is asserted.
Using the RAM blocks with EDAC with “ECC_Read_Repair” feature, in SLOW mode:
This mode provides the safest usage of the RAM blocks, at the cost of reduced performance.
Each single error detected during a read cycle is automatically corrected by updating the memory content with the corrected value. The flag “xCORR” is asserted during one clock cycle each time a single bit error is found.
The optional input and output registers can be used to reduce the performance loss.
Allowed RAM block configurations with “ECC_Read_Repair” (SLOW mode):
2Kx18
Read Cycle:
When the memory is enabled in a memory read cycle (xCS = 1 and xWE = 0), the address is stored on the rising edge of the memory clock (xCK), and data appears at the output bus after the access time. The optional output pipeline registers are available in all memory configurations. These registers are clocked by the xCKR signals, which may be different from the main memory clock signals xCK. The memory pipeline register may be forced to zero by asserting the synchronous xR signal. Both memory clocks and register clocks may have individually configured polarity. The presence of pipeline registers is determined independently for the inputs and outputs of each port.
Write Cycle:
When the memory is enabled in a memory write cycle (xCS = 1 and xWE = 1), the address is stored and data is written to the memory on the rising edge of the memory clock (xCK).
During any write cycle, the memory output (DOUT) retains the value previously generated by a read operation. This behavior is often called “NO_CHANGE”, as the data outputs do not change during a write.
Simultaneous writes on both ports to the same memory location, or simultaneous read/write, are not allowed.
For detailed information about the NX_RAM ports, attributes, associated parameters and usage examples, see Library_Guide.pdf.
RAM blocks inference:
“nxmap” supports memory blocks inference. However the current version of the synthesis tools doesn’t take advantage of the optional input and output registers. Input and/or output registers will be implemented in tile FEs.
The following VHDL source code is an example of RAM block inference. Note that while writing, the output port retains its former value.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;

entity MY_MEMORY is
  generic (
    Number_of_address_bits : integer := 10;
    Number_of_data_bits    : integer := 16
  );
  port (
    CLK  : in  std_logic;
    DIN  : in  std_logic_vector(Number_of_data_bits-1 downto 0);
    WE   : in  std_logic;
    ADR  : in  std_logic_vector(Number_of_address_bits-1 downto 0);
    DOUT : out std_logic_vector(Number_of_data_bits-1 downto 0)
  );
end MY_MEMORY;

architecture ARCHI of MY_MEMORY is
  type MEM_TYPE is array((2**Number_of_address_bits)-1 downto 0) of std_logic_vector(DIN'range);
  signal MEM : MEM_TYPE;
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      if WE = '1' then
        MEM(conv_integer(ADR)) <= DIN;   -- write cycle
      else
        DOUT <= MEM(conv_integer(ADR));  -- DOUT doesn't change
      end if;                            -- during write
    end if;
  end process;
end ARCHI;
About RAM blocks inference:
The optional input and output registers are not yet supported. If specified in the source code, they will be implemented with FE flip-flops.
The RAM block can still use ECC in FAST mode, if the “NX_ECC” primitive is instantiated with the CHK input pin connected to the bit 0 of the RAM block output port.
DSP blocks
NG-MEDIUM provides high level DSP functions. A single NG-MEDIUM FPGA can implement more than 35 Giga multiplications and additions per second.
Architecture overview
Just like the RAM blocks, the DSP blocks are part of the Coarse Grain Blocks (CGB). There are 2 rows of 56 DSP blocks each. All the DSP blocks can be chained via direct routing (no routing delay) for the implementation of fast and complex DSP functions, such as filters.
The main DSP internal operators are:
18-bit + 18-bit pre-adder/subtractor (19-bit result)
Signed/unsigned 19-bit x 24-bit multiplier (43-bit result)
Output post-adder/subtractor or accumulator (56-bit result)
In a single clock cycle, each DSP block can simultaneously perform the three operations (signed/unsigned pre-add/sub, multiplication, and post-add/sub or accumulation).
By using the available pipeline registers and the dedicated direct routing to chain the 56 DSP blocks in the same row, the user can ensure very high performance and predictable results after implementation (performance > 250 MHz).
Any one of the available pipeline registers can be used or bypassed. They use a common clock (CLK).
The input pin called “WE” is used as “clock enable” for all the registers. WE is active high. If WE is low, the DSP registers are frozen.
The output register can be synchronously reset by activating the “RSTZ” input (active high). RSTZ can be enabled or disabled by using a generic (default is “disable”).
All other registers can be synchronously reset by activating the “R” input (active high). The reset of each internal register can be enabled or disabled by using an individual generic (default is “disable”). For the A and B inputs, the reset applies to the 3 levels of pipeline registers.
Main DSP block inputs:
Control inputs
CK : clock (rising edge sensitive)
R : Optional reset for internal registers
RZ : Optional reset for Z register (DSP block output)
Data inputs
“A” input is 24-bit wide. An input multiplexer selects either the 24 bits coming from the fabric or the 18 bits from the “CAO” output of the previous (left) DSP block. “A” can go to one of the multiplier inputs through 0, 1, 2 or 3 pipeline registers, and/or forward its 18 LSBs to the “CAI” input of the neighboring (right) DSP block by using its CAO cascade output.
“A” input is most often used as an input to the multiplier.
“CAI” input: 18-bit input that can be used when the “A” input must receive the signal from the previous (on the right) DSP block via its “CAO” chaining connection (direct routing, 0 ns delay), instead of the fabric.
Note that the DSP blocks can be chained from the right to the left.
“B” input is 18-bit wide. An input multiplexer selects either the 18-bit signal coming from the fabric or the 18 bits from the “CBO” output of the previous (left) DSP block. “B” can be directed to one of the pre-adder/subtractor inputs, and/or to one of the multiplier inputs through 0, 1, 2 or 3 pipeline registers, and/or forward its 18 bits to the “CBI” input of the neighboring (right) DSP block by using its CBO cascade output.
“B” input is often used as a second input to the multiplier. It can also be used as an operand of the pre-adder/subtractor.
“CBI” input: 18-bit input that can be used when the “B” input must receive the signal from the previous (left) DSP block via its “CBO” chaining connection (direct routing, 0 ns delay), instead of the fabric.
“C” input is 36-bit wide. It’s directed to the ALU as a second operand (if required). “C” input can use 0 or 1 level of pipeline.
“C” input is often used as constant value input required for rounding operation.
“D” input is 18-bit wide. It’s directed to the pre-adder/subtracter through 0 or 1 level of pipeline registers.
“D” input is often used as an input of the pre-adder/subtractor. However, for ALU dynamic opcode, the bits D(5:0) are used to dynamically select the operation to be performed.
“CZI” input is 56-bit wide. It comes from the neighboring (left) DSP block. It’s directed to the ALU (usually configured as post-adder/subtractor) through 0 or 1 level of pipeline registers.
“CZI” input is used when various DSP blocks have to be chained, for example in FIR filters and other DSP functions implementation.
Main DSP block outputs:
“CAO” is 18-bit wide. It’s used to forward the “A” input to the next (right) DSP block through 0, 1, 2, or 3 level of pipeline registers.
“CAO” output is used when two or more DSP blocks have to be chained to forward data.
“CBO” is 18-bit wide. It’s used to forward the “B” input to the next (right) DSP block through 0, 1, 2, or 3 levels of pipeline registers.
“CBO” output is used when two or more DSP blocks have to be chained to forward data.
“Z” output is 56-bit wide. It’s the main DSP block output to the fabric via general routing. It can be registered or not. The value available at the “Z” output is fed back to the ALU via the X-MUX (for example to implement an accumulator).
“CZO” output is also 56-bit wide. It forwards the registered or unregistered ALU output to the next (right) DSP block, via direct routing (no delay).
“CZO” is particularly useful to chain DSP blocks for FIR filters and other DSP functions requiring more than one DSP block.
DSP block features and operators:
Programmable pipeline registers
“A” and “B” inputs: 0, 1, 2, or 3 levels (user’s selectable)
“C” and “D” inputs: 0 or 1 level (user’s selectable)
Internal : Pre-adder, Multiplier, ALU and other
Pre-adder or subtracter 18b + 18b, result on 19b
Multiplier 19b x 24b, result on 43b. Can be signed or unsigned. “A” (24b) is one of the two operands. The second operand can be the “B” input (sign extended to 19b) or the output of the 19b pre-adder. The multiplier output can be directed to the ALU first operand or to the “Z” output
On signed/unsigned operations: each DSP block can be set for unsigned or signed operations. When “signed” is selected, all operators like multiplier and sign extensions are signed. When unsigned is selected, all operators are unsigned.
ALU: Can be configured for adder, subtracter or logical operations on two 56b operands. One of the two operands is the output of the multiplier after sign extension to 56b. The second operand is the output of the X-MUX;
X-MUX : selects the second operand for the ALU input
“Z” output feedback with or without 12b left shift (used for increased precision operations)
“CZI”: 56b wide, coming from the “CZO” output of the neighboring (left) DSP block.
“C”: 36b input, sign extended to 56b
Carry input and output, with carry propagation (in and out)
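The operator widths listed above follow from two's-complement bit growth. As a sketch for the signed case, with a the 24-bit operand, b and d the 18-bit pre-adder operands, p_i the successive products and N the number of accumulated products (all symbols introduced here for illustration):

```latex
\begin{aligned}
\text{pre-adder:}\quad & |b + d| \le 2^{17} + 2^{17} = 2^{18} && \Rightarrow\ 19 \ \text{bits} \\
\text{multiplier:}\quad & |a \cdot (b + d)| \le 2^{23} \cdot 2^{18} = 2^{41} && \Rightarrow\ 24 + 19 = 43 \ \text{bits} \\
\text{accumulator:}\quad & \Bigl|\sum_{i=1}^{N} p_i\Bigr| \le N \cdot 2^{41} < 2^{55} && \Leftrightarrow\ N < 2^{14}
\end{aligned}
```

Under these worst-case bounds, the 56-bit post-adder/accumulator cannot overflow as long as fewer than 2^14 = 16384 products are accumulated.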
Chaining DSP blocks for very high speed complex processing:
In order to implement more complex and fast DSP functions requiring more than a single DSP block, all 56 DSP blocks of the same CGB column can be chained with direct routing connections in the same row. The absence of routing delay on those optimized resources guarantees very high performance and lower dynamic power consumption.
Note that in the same CGB column, several groups of DSP blocks can be chained. For example, we could implement a design using 3 different chains of DSP blocks: two chains of 20 and one chain of 16 DSP blocks (for a total of 56 DSP blocks in the same CGB column).
No general routing is involved in the chain, provided that the synthesis recognizes that the chain can be used.
Physically, the DSP blocks are chained from right to left (unlike in figure 50). This can be very important for some DSP-intensive applications where, for routing optimization, the inputs of the multi-DSP-block functions are oriented to the right, while the outputs are oriented to the left. This point should be taken into account when defining the FPGA pinout.
There are no direct routing resources between the two CGB columns. However, both columns can be chained by using general routing (performance penalty).
Frequently used DSP functions
Among the most common DSP functions, we find:
Multiplier / accumulator (with output sync reset)
Pre-adder / multiplier / accumulator (with output sync reset)
Multiplier / adder
Pre-adder / multiplier / adder
The architecture of the NG-MEDIUM DSP blocks is well suited to implementing those functions and many others very efficiently, while offering very high performance and predictable results.
Let’s see some examples frequently used in DSP applications:
The FIR filter is very common in FPGA designs. Depending on the sample frequency and the number of taps, different strategies can be adopted.
Multiplier 24 x 30 (with two DSP blocks) 250 MHz, 4 clock latency
Each DSP block includes a 24x19-bit multiplier. For a higher number of bits, we can combine two or more DSP blocks and some additional tile FFs. Let’s have a look at the example of a 24x30 signed multiplier. It can support a working frequency of 250 MHz with a latency of 4 clock cycles.
It provides user’s selectable and user’s adjustable rounding mechanism if all the 54 output bits are not required. Note that the rounding doesn’t require using additional resources.
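One common way to build a wider signed multiplier from narrower ones, shown here only as a sketch, is to split the 30-bit operand B into a signed upper part and an unsigned lower part. The split position k and the symbols B_H and B_L are assumptions introduced for illustration; the split actually used in the example files is not stated in this section.

```latex
B = B_H \cdot 2^{k} + B_L, \qquad 0 \le B_L < 2^{k}, \qquad B_H = \left\lfloor B / 2^{k} \right\rfloor
\\
A \cdot B = \left(A \cdot B_H\right) \cdot 2^{k} + A \cdot B_L
```

With a suitable k, each partial product fits a single DSP block multiplier, and the two partial products are combined by an adder after a k-bit shift of the upper one.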
Source code examples
About the 24x30 multiplier, two different versions of equivalent source code are available. Please, contact support@nanoxplore.com if you want to receive those VHDL examples and associated script files.
Single file source code: SMULT_24x30_NG_MEDIUM_FLAT.vhd
Hierarchical source code: SMUL_24x30_NG_MEDIUM_HIER.vhd
Sequential FIR filter, based on a single multiplier / accumulator (MAC):
Consider the following specification:
Input : up to 24-bit data @100 kHz
900-tap filter, requires 900 coefficients (up to 18-bit)
Working @100 MHz allows up to 1000 processing clock cycles between two data cycles. Using a PLL can be required to generate internal high speed clock. Note that when fully pipelined, the DSP blocks can support frequencies greater than 250 MHz.
As an example, a 900-tap FIR filter can be implemented with a single multiplier / accumulator and additional RAM for data storage, ROM for coefficients and a simple sequencer logic (counter and additional simple logic).
The data RAM and the coefficient ROM can be implemented with a RAM block (with or without EDAC), or Register_File for up to 64 x 16-bit depth. The sequencer logic is implemented with tile logic resources.
Computing one filtered output sample of the specified FIR filter example takes at least 900 clock cycles.
The filtered output is available at the “Z” output during the last processing clock cycle. It can be captured into an external register (implemented with tile logic) activated during this same clock cycle.
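The cycle budget and the accumulator headroom of this example can be checked with a short calculation. The symbols f_clk, f_sample, a_i (24-bit data) and c_i (18-bit coefficients) are introduced here for illustration.

```latex
\frac{f_{\text{clk}}}{f_{\text{sample}}} = \frac{100\ \text{MHz}}{100\ \text{kHz}} = 1000 \ \text{cycles per sample} \;\ge\; 900 \ \text{taps}
\\
\Bigl|\sum_{i=1}^{900} a_i c_i\Bigr| \le 900 \cdot 2^{23} \cdot 2^{17} < 2^{10} \cdot 2^{40} = 2^{50} < 2^{55}
```

The 900 multiply-accumulate operations therefore fit between two input samples, and the worst-case sum stays inside the range of a signed 56-bit accumulator.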
Let’s have a look at the multiplier/accumulator VHDL sample source code:
We must first declare the packages “ieee.std_logic_arith” and “ieee.std_logic_signed”. This allows arithmetic operations to be applied directly to std_logic_vector signals. In addition, the packages provide automatic sign extension for addition/subtraction if the two operands do not have the same number of bits.
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;   -- ARITH and SIGNED packages allow
use ieee.std_logic_signed.all;  -- direct arithmetic operations
                                -- on std_logic_vector(s)
entity MULT_ACC is
  port (
    CK   : in  std_logic;
    RZ   : in  std_logic;
    WE   : in  std_logic;
    DIN  : in  std_logic_vector(23 downto 0);
    COEF : in  std_logic_vector(17 downto 0);
    Z    : out std_logic_vector(55 downto 0)
  );
end MULT_ACC;

architecture ARCHI of MULT_ACC is
  signal AR   : std_logic_vector(23 downto 0);
  signal BR   : std_logic_vector(17 downto 0);
  signal MULT : std_logic_vector(41 downto 0);
  signal ACC  : std_logic_vector(55 downto 0);
begin
  process(CK)
  begin
    if rising_edge(CK) then
      if RZ = '1' then
        ACC <= (others => '0');  -- synchronous reset of the accumulator (drives Z)
      elsif WE = '1' then
        AR   <= DIN;         -- A input register
        BR   <= COEF;        -- B input register
        MULT <= AR * BR;     -- Registered MULT
        ACC  <= ACC + MULT;  -- MULT is automatically sign extended
                             -- to 56-bit by the ARITH and SIGNED packages
      end if;
    end if;
  end process;
  Z <= ACC;
end ARCHI;
Sequential symmetric FIR filter, based on a single Pre-adder / MAC:
For symmetrical FIR filters, the number of clock cycles can be reduced to half the number of taps.
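The halving comes from factoring the convolution sum. As a sketch for an even number of taps N with symmetric coefficients (the symbols x, y and c_k are introduced here for illustration):

```latex
c_k = c_{N-1-k}
\;\Rightarrow\;
y[n] = \sum_{k=0}^{N-1} c_k \, x[n-k]
     = \sum_{k=0}^{N/2-1} c_k \left( x[n-k] + x[n-(N-1-k)] \right)
```

Each term now needs one pre-addition and one multiplication, so an N-tap symmetric filter needs only N/2 multiply-accumulate cycles.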
Consider the following specification:
Input : 18-bit data @100 kHz
1800-tap filter, still requires 900 coefficients
Working @100 MHz allows up to 1000 processing clock cycles between two data cycles. Note that when fully pipelined, the DSP blocks can support frequencies greater than 250 MHz.
This 1800-tap FIR filter can be implemented with a single pre-adder / multiplier / accumulator implemented into a DSP block as follows:
The filtered output is available at the DSP block output during the last processing clock cycle. It can be captured into an external register activated during this same clock cycle.
The following is a sample VHDL code for this application. Again, the packages “ieee.std_logic_arith” and “ieee.std_logic_signed” must be declared.
-- Fragment: DIN_LEFT and DIN_RIGHT are 18-bit inputs, COEF is a 24-bit input
signal AR1, AR2       : std_logic_vector(23 downto 0);
signal BR_EXT, DR_EXT : std_logic_vector(18 downto 0);
signal SUMR           : std_logic_vector(18 downto 0);
signal MULT           : std_logic_vector(42 downto 0);
signal ACC            : std_logic_vector(51 downto 0);
signal Z              : std_logic_vector(51 downto 0);
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      if RZ = '1' then
        ACC <= (others => '0');             -- synchronous reset of the accumulator (drives Z)
      elsif WE = '1' then
        BR_EXT(18)          <= DIN_RIGHT(17);  -- Sign extension to 19-bit
        BR_EXT(17 downto 0) <= DIN_RIGHT;
        DR_EXT(18)          <= DIN_LEFT(17);   -- Sign extension to 19-bit
        DR_EXT(17 downto 0) <= DIN_LEFT;
        SUMR <= BR_EXT + DR_EXT;            -- 19-bit addition
        AR1  <= COEF;                       -- First pipe register
        AR2  <= AR1;                        -- Second pipe register
        MULT <= AR2 * SUMR;                 -- Registered multiplier (43-bit result)
        ACC  <= ACC + MULT;                 -- MULT is automatically sign extended
                                            -- to 52-bit by the ARITH and SIGNED packages
      end if;
    end if;
  end process;
  Z <= ACC;
end ARCHI;
Parallel FIR filter implementation considerations :
For high data rates, it can be necessary to increase the number of DSP blocks to make the filtering process as fast as required for the data rate.
For example, for input data rates of 200 MHz or more, the sequential methodology cannot implement more than one tap in each DSP block. Typically, the implementation of an N-tap FIR filter might require N DSP blocks.
However, parallel FIR filters require some important considerations for an efficient implementation. Let’s have a look at common ways to implement a parallel FIR filter:
Direct form parallel FIR filters:
The next figure shows the structure of the direct form FIR filter implementation.
This is the diagram of a 5-tap FIR filter in the well-known “direct form”. There are 5 coefficients (C0 to C4). C0 is the leftmost coefficient, and C4 the rightmost. Depending on the filter specification, the number of coefficients can be odd or even.
In the “direct form” FIR filter, the data to be filtered is forwarded from the input to each tap multiplier via a register set (one clock cycle delay). For 5 taps, there will be 5 register sets. Each register set forwards the data to a tap multiplier and to the register set of the next tap.
In most cases, for a specified filter, the coefficients are constant. No logic resources other than constant ‘1’ and ‘0’ values are needed to implement the complete set of coefficient values.
Each multiplication can be efficiently implemented in a DSP block. For higher performance, the multiplier stages can be pipelined into the internal DSP block registers.
All multiplication results must be summed to get the filter result. Two options can be considered:
Adder tree structure: For an N-tap filter, the summation of the N multiplication results requires N-1 two-operand adders if the adder tree structure is adopted. About half of the required adders could be implemented into DSP blocks.
The maximum frequency of the filter will be limited by the longest combinatorial path that includes:
Register output delay (~0.5 ns)
Multiplier propagation delay (~5 ns)
Multiplier output routing delay (~1 ns)
Adder tree propagation delay, including logic delay (several stages of two-operand combinatorial adders) and routing delay: (1.5 ns logic delay + 1 ns optimistic routing delay) x 3 (number of stages = 3 in our simple example)
Output register setup time (~0.5 ns)
Total estimated delay (optimistic) = about 14.5 ns (while the required period is 5 ns for 200 MHz)
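Summing the individual contributions listed above gives the estimate explicitly. The symbols t_path and f_max are introduced here for illustration, and all values are the approximate figures from the list.

```latex
t_{\text{path}} \approx 0.5 + 5 + 1 + 3 \times (1.5 + 1) + 0.5 = 14.5\ \text{ns}
\quad\Rightarrow\quad
f_{\max} \approx \frac{1}{14.5\ \text{ns}} \approx 69\ \text{MHz}
```

This is roughly a third of the 200 MHz target, which is why the un-pipelined adder tree cannot be used as is.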
Even for a very small 5-tap FIR filter, we will have to insert pipeline registers to shorten the combinatorial delays in order to support higher working frequencies (at the cost of clock cycles latency).
Pipelining the summation process (see the registers in red in the diagram) is not easy, and it varies with the number of taps. Only some partial adders can be implemented into DSP blocks. Additional tile logic and general routing will be required. Let’s summarize the characteristics of this implementation:
Density : N-tap FIR filter implementation requires N DSP blocks + N/2 external adders (implemented with tile logic).
Performance : Limited by the external routing and the additional tile logic.
Portability, maintainability : The adder tree pipelining source code must be adjusted to the exact number of taps. Any change in the number of taps implies deep modifications.
Adder chain structure: Instead, we could decide to use an adder chain to take advantage of the DSP blocks cascade feature, and avoid the need of external routing and additional tile logic resources.
The N-tap FIR filter implementation now requires N DSP blocks exclusively. No additional logic or general routing is required. However, the adder chain cannot be pipelined without breaking the FIR filter functionality.
Let’s summarize the characteristics of this implementation:
Density : N-tap FIR filter implementation requires exclusively the usage of N x DSP blocks. No external adders.
Performance : Due to the long chain of N un-pipelined adders, the longest combinatorial delay is the adder delay multiplied by the number of taps. However, this delay is 100% predictable. It’s not implementation dependent, because it uses only dedicated direct routing.
Portability, maintainability : This structure can be very simply described in any HDL. N identical cells can be instantiated in a “generate” statement.
Conclusion: The direct form structure can be efficiently mapped on the DSP blocks chain (in terms of resources), but it’s not suitable for high performance implementation of FIR filters.
Transpose structure for parallel FIR filters:
Fortunately, different strategies can be adopted to implement parallel FIR filters. Those strategies take full advantage of the NG-MEDIUM DSP blocks architecture:
The next figure shows the structure of the “Transpose structure”:
The coefficients order is the opposite of the one of the “direct form structure”.
Each cell can be implemented with a fully pipelined DSP block (can work at very high frequency). No external logic or routing resources are required – except for the input data that must be connected to the N x DSP blocks.
Rounding the results of a DSP function: In most cases, DSP functions require the result to be rounded before truncation, in order to keep only the significant bits for subsequent operations or storage. The rounding process consists in adding a constant to the final result before truncating.
In the case of the transpose filter structure, the adder of the first tap DSP block can be used to add the “rounding value” to the filter result at no resource cost. A simple truncation that drops the unwanted LSBs then yields a result rounded to the nearest value. Note that the “rounding value” depends on the number of LSBs to be dropped. Just to give some examples:
If the 8 LSBs must be dropped (truncated), the “rounding value” must be b”1000_0000” (equivalent to 0.5 LSB of the remaining bits)
If the 10 LSBs must be dropped (truncated), the “rounding value” must be b”10_0000_0000”.
If the 17 LSBs must be truncated, the “rounding value” must be b”1_0000_0000_0000_0000”.
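The arithmetic behind these rounding values can be sketched as a small standalone block. This is only an illustration of the add-then-truncate principle, not a DSP block configuration: in the filter itself the constant is added by the first tap, as described above. The entity name ROUND_TRUNC and its generics are hypothetical, and numeric_std is used here instead of the std_logic_arith packages of the other listings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: rounds to nearest by adding 2**(DROP_BITS-1),
-- then drops the DROP_BITS least significant bits.
entity ROUND_TRUNC is
  generic (
    IN_WIDTH  : positive := 56;
    DROP_BITS : positive := 8   -- number of LSBs to drop
  );
  port (
    X : in  std_logic_vector(IN_WIDTH-1 downto 0);
    Y : out std_logic_vector(IN_WIDTH-DROP_BITS-1 downto 0)
  );
end ROUND_TRUNC;

architecture RTL of ROUND_TRUNC is
  -- Rounding value: a single '1' at bit position DROP_BITS-1 (0.5 LSB of the kept bits)
  constant ROUNDING_VALUE : signed(IN_WIDTH-1 downto 0) :=
    shift_left(to_signed(1, IN_WIDTH), DROP_BITS-1);
  signal SUM : signed(IN_WIDTH-1 downto 0);
begin
  SUM <= signed(X) + ROUNDING_VALUE;  -- wraps if X is within 0.5 LSB of the maximum
  Y   <= std_logic_vector(SUM(IN_WIDTH-1 downto DROP_BITS));
end RTL;
```

With DROP_BITS = 8, ROUNDING_VALUE equals b"1000_0000", the first value given in the list above.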
We can analyze a practical example. Consider the following specification:
Input : 18-bit data @150 MHz
20-tap filter, requires 20 coefficients
Working @150 MHz, each DSP block can implement one tap, so 20 DSP blocks are required. For an efficient implementation, we can take advantage of the cascade feature, which allows higher performance at lower dynamic power consumption (an important part of the power consumption is due to the amount of capacitive general routing resources). In addition, using dedicated direct routing between DSP blocks makes the performance of the complete filter function fully predictable.
The coefficients order is the opposite of the one of the “direct form structure”
N-tap FIR filter (transpose structure) implemented using N x DSP blocks
Each DSP implements a pipelined multiplication and addition (one tap)
The first cell uses C inputs for « ZERO » or « ROUNDING_VALUE » and sends its result to the CZO cascade output (CZI and Z are not used)
The last cell receives the CZO output of the previous cell on its CZI input, and gives the FIR filter result on its Z output to the fabric
All intermediate cells use the CZI input and send the intermediate result to their CZO output to the next cell.
Note that in most cases, the coefficients are constant. In this case, it’s not necessary to register them.
In addition we can see that in the transpose structure, all DSP blocks receive the same data value (large fanout and long nets on the input path – this will be the limiting factor for the filter performance).
Let’s summarize the characteristics of this implementation:
Density : N-tap FIR filter implementation requires exclusively the usage of N x DSP blocks. No external adders.
Performance : The adder chain is now fully pipelined. It can support very high working frequencies. However, the high fanout and possibly long nets on the input data bus can be the limiting factor for performance. Some techniques allow additional groups of tile registers to be inserted to reduce the fanout and net delays.
Portability, maintainability : This structure can be very simply described in any HDL. N identical cells can be instantiated in a “generate” statement.
Since the release 2.9, the “nxpython” synthesis is able to infer DSP blocks with the provided source code.
Examples of VHDL source code can be provided for evaluation. Most examples are focused on DSP block inference for extended precision multipliers and FIR filters design. Please, contact support@nanoxplore.com if you want to receive those examples.
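As an illustration of the transpose structure described above, the following behavioral sketch chains identical cells in a “generate” statement. It is not one of the vendor examples: the entity name FIR_TRANSPOSE, the 4 coefficient values and the use of numeric_std are assumptions, each cell uses a single register (the real DSP block offers more pipeline levels), and nothing here guarantees DSP block inference.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity FIR_TRANSPOSE is
  port (
    CLK  : in  std_logic;
    DIN  : in  signed(17 downto 0);
    DOUT : out signed(55 downto 0)
  );
end FIR_TRANSPOSE;

architecture RTL of FIR_TRANSPOSE is
  constant NUM_TAPS : positive := 4;
  type COEF_ARRAY is array (0 to NUM_TAPS-1) of signed(23 downto 0);
  -- Hypothetical coefficients C0..C3
  constant COEF : COEF_ARRAY :=
    (to_signed(1, 24), to_signed(2, 24), to_signed(3, 24), to_signed(4, 24));
  type ACC_ARRAY is array (0 to NUM_TAPS) of signed(55 downto 0);
  signal CZ : ACC_ARRAY := (others => (others => '0'));  -- cascade chain (CZI/CZO)
begin
  CZ(0) <= (others => '0');  -- first cell input: ZERO (or a rounding value)

  GEN_TAP : for i in 0 to NUM_TAPS-1 generate
    process(CLK)
    begin
      if rising_edge(CLK) then
        -- Every cell sees the same DIN; coefficients are in reverse order,
        -- so the last cell of the chain uses C0.
        CZ(i+1) <= CZ(i) + resize(COEF(NUM_TAPS-1-i) * DIN, 56);
      end if;
    end process;
  end generate;

  DOUT <= CZ(NUM_TAPS);  -- filter result from the last cell
end RTL;
```

Feeding a single sample of value 1 followed by zeros produces C0, C1, C2, C3 on successive clock cycles, which is the expected impulse response.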
Systolic structure for parallel FIR filters:
The systolic structure reduces the fanout on the input data bus. Just like the transpose structure, it gives very high performance independently of the number of taps, and it eliminates the potential timing limitation on the input data bus, at the expense of a longer latency.
The coefficients order is the same as in the “direct form structure” (opposite of the one of the “transpose structure”)
The systolic structure takes full advantage of the DSP chaining features. It provides predictable, very high performance (>200 MHz clock). The input data has a single load (the first DSP tap), so there is no need to reduce the fanout as in the transpose FIR structure. However, the latency is increased by N clock cycles for an N-tap filter.
The first DSP tap receives the data to be filtered, and forwards it to the next DSP block, after 2 clock cycles. It also receives the “ZERO” or “ROUNDING VALUE” depending on a possible truncation at the output.
Just as in the transpose structure, the adder chain is fully pipelined, and provides the best performance (no routing dependent delays).
The last DSP tap provides the filter output on its Z output.
Let’s summarize the characteristics of this implementation:
Density : N-tap FIR filter implementation requires exclusively the usage of N x DSP blocks. No external adders.
Performance : The adder chain is fully pipelined. It can support very high working frequencies. No fanout problems on the input data bus (one single load). However, the latency is higher than with the transpose structure.
Portability, maintainability : This structure can be very simply described in any HDL. N identical cells can be instantiated in a “generate” statement.
Since the 2.9 release, the “nxpython” synthesis is able to infer DSP blocks with the provided source code.
Examples of VHDL source code can be provided for evaluation. Most examples are focused on DSP block inference for extended precision multipliers and FIR filters design. Please, contact support@nanoxplore.com if you want to receive those examples.
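As an illustration of the systolic structure described above, the following behavioral sketch forwards the data through two registers per tap and the partial sum through one register per tap. It is not one of the vendor examples: the entity name FIR_SYSTOLIC, the 4 coefficient values and the use of numeric_std are assumptions, and nothing here guarantees DSP block inference.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity FIR_SYSTOLIC is
  port (
    CLK  : in  std_logic;
    DIN  : in  signed(17 downto 0);
    DOUT : out signed(55 downto 0)
  );
end FIR_SYSTOLIC;

architecture RTL of FIR_SYSTOLIC is
  constant NUM_TAPS : positive := 4;
  type COEF_ARRAY is array (0 to NUM_TAPS-1) of signed(23 downto 0);
  -- Hypothetical coefficients C0..C3 (same order as the direct form)
  constant COEF : COEF_ARRAY :=
    (to_signed(1, 24), to_signed(2, 24), to_signed(3, 24), to_signed(4, 24));
  type DATA_ARRAY is array (0 to NUM_TAPS) of signed(17 downto 0);
  type ACC_ARRAY  is array (0 to NUM_TAPS) of signed(55 downto 0);
  signal XD : DATA_ARRAY := (others => (others => '0'));  -- data entering each tap
  signal XR : DATA_ARRAY := (others => (others => '0'));  -- first data register of each tap
  signal CZ : ACC_ARRAY  := (others => (others => '0'));  -- cascade chain (CZI/CZO)
begin
  XD(0) <= DIN;              -- single load on the input data bus
  CZ(0) <= (others => '0');  -- first tap input: ZERO (or a rounding value)

  GEN_TAP : for i in 0 to NUM_TAPS-1 generate
    process(CLK)
    begin
      if rising_edge(CLK) then
        XR(i)   <= XD(i);    -- first data register
        XD(i+1) <= XR(i);    -- second data register: 2 cycles to the next tap
        CZ(i+1) <= CZ(i) + resize(COEF(i) * XD(i), 56);  -- 1 cycle on the sum
      end if;
    end process;
  end generate;

  DOUT <= CZ(NUM_TAPS);  -- filter result from the last tap
end RTL;
```

In this sketch, a single input sample of value 1 produces C0 at the output after NUM_TAPS clock edges, followed by C1, C2 and C3, which shows the N-cycle latency mentioned above.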
Symmetric systolic structure for parallel FIR filters:
When the coefficients are symmetric, the amount of required logic can be reduced to half the number of DSP blocks (by using the pre-adder), plus an extra external delay line.
The next figure shows the diagram of a 6-tap symmetric systolic structure FIR filter.
Let’s summarize the characteristics of this implementation:
Density : N-tap FIR filter implementation requires exclusively the usage of N/2 x DSP blocks. No external adders.
Performance : The adder chain is fully pipelined. It can support very high working frequencies, but the fanout on the input data bus is now N/2 loads. The latency is reduced by a factor of 2 compared with the non-symmetric systolic structure.
Portability, maintainability : This structure can be very simply described in any HDL. N/2 identical cells can be instantiated in a “generate” statement.
Symmetric transpose structure for parallel FIR filters:
Just as with the symmetric systolic structure, the symmetric transpose structure halves the number of DSP blocks required for the implementation. It also reduces the latency and does not require an external data delay line.
It uses exactly the same DSP block configuration, but the order of the coefficients is reversed compared with the symmetric systolic structure.
The next figure shows the diagram of a 6-tap symmetric transpose structure FIR filter, and the associated DSP configuration.
Let’s summarize the characteristics of this implementation:
Density : N-tap FIR filter implementation requires exclusively the usage of N/2 x DSP blocks. No additional logic is required.
Performance : The adder chain is fully pipelined. It can support very high working frequencies. The fanout of the data source is reduced by a factor of 2 compared with the non-symmetric transpose structure.
Portability, maintainability : This structure can be described very simply in any HDL. N/2 identical cells can be instantiated in a “generate” statement.
Since release 2.9, the “nxpython” synthesis is able to infer DSP blocks from the provided source code. The filters are then implemented with the DSP blocks' specific resources and direct connections, which gives a very efficient result in terms of logic resources, performance and power consumption.
Examples of VHDL source code can be provided for evaluation. Most examples are focused on DSP block inference for extended precision multipliers and FIR filters design. Please, contact support@nanoxplore.com if you want to receive those examples.
DSP blocks inference
“nxmap” synthesis 2.9.0 supports DSP block inference for most common configurations. The DSP block features currently supported for inference are:
Input registers (single or multiple pipe levels)
Internal registers
Output register
Pre-adder
Post-adder
Main output chaining (CZO to CZI of the neighboring DSP block)
Some cases of data chains (CBO to CBI)
Not supported yet for inference:
Saturation
Overflow
Several VHDL examples for DSP inference (multipliers, filters…) can be provided on request. Please, contact support@nanoxplore.com
NG-MEDIUM configuration and interface pins
Purpose of NX1H35S configuration
NX1H35S chips are SRAM-based FPGAs. To achieve user-defined functionality their configuration bitstream must be downloaded first.
NX1H35S chips are always accessible through JTAG. They also support several configuration modes, selected by the state of the MODE[3:0] pins sampled at power-up. RST_N is a dedicated input pin that resets the configuration engine; the configuration process starts after RST_N is released. (RST_N cannot be used to reset the FPGA user’s logic.)
Configuration errata:
Some changes have been made to the configuration process documentation.
In all configuration modes, the configuration clock must be provided to the FPGA on the dedicated CLK input pin. Its frequency can range from 20 MHz to 50 MHz. If JTAG is used, the configuration clock frequency must also be strictly greater than twice the JTAG (TCK) frequency.
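The two conditions on the configuration clock can be written as a small check. This is an illustrative Python sketch only; the function name `config_clock_ok` and the constant names are hypothetical and are not part of any NanoXplore tool.

```python
CLK_MIN_HZ = 20_000_000  # lower bound of the configuration clock range
CLK_MAX_HZ = 50_000_000  # upper bound of the configuration clock range


def config_clock_ok(clk_hz, tck_hz=None):
    """Return True if the configuration clock meets both documented rules.

    clk_hz : frequency applied to the dedicated CLK pin, in Hz
    tck_hz : JTAG TCK frequency in Hz, or None when JTAG is not used
    """
    if not CLK_MIN_HZ <= clk_hz <= CLK_MAX_HZ:
        return False
    # When JTAG is used, CLK must be strictly greater than twice TCK.
    if tck_hz is not None and not clk_hz > 2 * tck_hz:
        return False
    return True


print(config_clock_ok(25_000_000, tck_hz=10_000_000))
print(config_clock_ok(20_000_000, tck_hz=10_000_000))
```

The second call fails the check because 20 MHz is equal to, not strictly greater than, twice a 10 MHz TCK.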
MODE[3:0] | Hex | Configuration mode |
1000 | 0x8 | RESERVED |
1001 | 0x9 | RESERVED |
1010 | 0xA | Master Serial SPI |
1011 | 0xB | Master Serial SPI with Vcc control |
1100 | 0xC | Slave SpaceWire |
1101 | 0xD | RESERVED |
1110 | 0xE | Slave Parallel 8 |
1111 | 0xF | RESERVED |
Table 8: NG-MEDIUM configuration modes
The Slave Parallel configuration mode is 8-bit only. Slave Parallel 16 is not supported on NG-MEDIUM.
Please refer to the NG-MEDIUM Configuration guide for detailed information.
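For board bring-up scripts or test benches, Table 8 can be captured as a lookup table. The sketch below is plain Python written for illustration; the names `CONFIG_MODES` and `decode_mode` are hypothetical. Only the codes listed in Table 8 are included, and any other code is reported as not listed rather than guessed.

```python
# MODE[3:0] value -> configuration mode, copied from Table 8.
CONFIG_MODES = {
    0x8: "RESERVED",
    0x9: "RESERVED",
    0xA: "Master Serial SPI",
    0xB: "Master Serial SPI with Vcc control",
    0xC: "Slave SpaceWire",
    0xD: "RESERVED",
    0xE: "Slave Parallel 8",
    0xF: "RESERVED",
}


def decode_mode(mode_bits):
    """Decode a 4-character binary string such as '1010' (MODE[3] first)."""
    if len(mode_bits) != 4 or any(bit not in "01" for bit in mode_bits):
        raise ValueError("MODE[3:0] must be given as 4 binary digits")
    value = int(mode_bits, 2)
    return CONFIG_MODES.get(value, "not listed in Table 8")


for pins in ("1010", "1100", "1110"):
    print(pins, "->", decode_mode(pins))
```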
“nxpython“ synthesis and implementation tools
Introduction
“nxpython” is the NanoXplore Python script based tool.
“nxpython” is a wrapper around the Python executable that allows the user to control the “nxmap” software. As a wrapper, it fully supports Python syntax, structures and external modules.
“nxpython” allows the user to:
Define a project and its parameters
Specify the source files and their respective path
Define the IO pads constraints (location and electrical parameters)
Assign mapping directives – if required
Assign timing constraints
Launch the synthesis and implementation process
Generate a report of the used logic resources
Launch the timing analyzer and generate timing reports
Generate the bitstream
During the synthesis process, the tools automatically rename FF output signals by adding “_reg” to the original name defined in the source code. For example, a signal called “DATA_CHANNEL” in the source code is renamed “DATA_CHANNEL_reg” after synthesis if it is generated by a FF or a group of FFs. To avoid possible post-synthesis conflicts, do not give any combinatorial signal in your source code a name such as “DATA_CHANNEL_reg”.
Note that a FF output called “DATA_CHANNEL_REG” in the source code is renamed “DATA_CHANNEL_REG_reg” by the synthesis.
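The renaming rule can be used to look for risky names before synthesis. The following Python sketch is an illustration only: the helper names are hypothetical, the signal lists would have to be extracted from the design by other means, and the comparison is case-sensitive, which is an assumption about how the tool compares names.

```python
def post_synthesis_name(ff_output_name):
    """Name given to a FF output after synthesis: the original name plus '_reg'."""
    return ff_output_name + "_reg"


def find_reg_conflicts(ff_outputs, combinatorial_signals):
    """Return the combinatorial signal names that collide with a renamed FF output."""
    renamed = {post_synthesis_name(name) for name in ff_outputs}
    return sorted(name for name in combinatorial_signals if name in renamed)


ff_outputs = ["DATA_CHANNEL", "DATA_CHANNEL_REG"]
combinatorial = ["DATA_CHANNEL_reg", "SUM", "CARRY"]

print(post_synthesis_name("DATA_CHANNEL"))
print(post_synthesis_name("DATA_CHANNEL_REG"))
print(find_reg_conflicts(ff_outputs, combinatorial))
```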
A project is created and compiled by launching “nxpython” on the script file, for example :
nxpython script.py (enter)
The “nxmap” GUI gives a graphical representation of the internal view of the design. The internal design topology and routing can be analyzed. Note that in “nxmap” GUI you can also launch the timing analyzer. To launch the “nxmap” GUI, type in a terminal :
nxmap (enter)
Synthesis attributes
“nxmap” supports some VHDL attributes to prevent unwanted optimizations:
syn_noprune & syn_keep:
Apply to a signal.
Used to prevent the removal of unused logic. The unused logic will still appear in the synthesized netlist. It will be available for post-synthesis netlist analysis.
syn_preserve:
Typically applies to a register output signal, to prevent it from being “optimized” away when the register input is a constant value, or when two registers have exactly the same behavior. Signals “protected” by the “syn_preserve” attribute are preserved for design analysis and post-synthesis simulation.
The following VHDL source code shows examples of use for the syn_noprune and syn_preserve attributes.
-- File: attributes.vhd
library ieee;
use ieee.std_logic_1164.all;

entity attributes is
  port (
    I1  : in  std_logic;
    I2  : in  std_logic;
    I3  : in  std_logic;
    CK1 : in  std_logic;
    CK2 : in  std_logic;
    O   : out std_logic_vector(1 downto 0)
  );
end attributes;

architecture behavioral of attributes is

  signal temp1, temp2, temp3, temp4 : std_logic;
  signal Ok : std_logic;

  attribute syn_noprune : boolean;
  attribute syn_noprune of Ok : signal is true;

  attribute syn_preserve : boolean;
  attribute syn_preserve of temp3 : signal is true;
  attribute syn_preserve of temp4 : signal is true;

begin

  temp4 <= temp1 or temp2;

  process(CK1)
  begin
    if (rising_edge(CK1)) then
      temp1 <= I1;
      temp2 <= I2;
      O(0)  <= temp1 and temp2 and temp3;
    end if;
  end process;

  process(CK1)
  begin
    if (falling_edge(CK1)) then
      O(1) <= temp4 or temp3;
    end if;
  end process;

  process(CK2)
  begin
    if (rising_edge(CK2)) then
      temp3 <= '1';
      Ok    <= temp1 xor temp2 xor temp3;
    end if;
  end process;

end behavioral;
NX_USE:
This attribute is used as mapping directive when the user wants to target specific architecture elements such as: RAM, RAM_ECC, DSP...
This attribute applies to a signal. The synthesis process will use the targeted element, and map all other signals and logic functions that can be mapped into the same targeted resources.
Example of use:
type DINR_TYPE is array(0 to 3) of std_logic_vector(11 downto 0);
signal DINR, DINR1, DIN_CHAIN : DINR_TYPE;

type PREAD_TYPE is array(0 to 3) of std_logic_vector(12 downto 0);
signal PREAD : PREAD_TYPE;

type MULT_TYPE is array(0 to 3) of std_logic_vector(24 downto 0);
signal MULT : MULT_TYPE;

type ADD_TYPE is array(0 to 3) of std_logic_vector(29 downto 0);
signal ADD : ADD_TYPE;

attribute NX_USE : string;
attribute NX_USE of MULT : signal is "NX_DSP";
The attribute applies to the “MULT” signal. However, the synthesis process will check if other signals and associated logic functions and registers can also be mapped into DSP blocks – in order to optimize the DSP blocks usage. In this example, all the signals and associated logic/registers will be mapped into DSP blocks.
NX_USE applies also to “RAM” and “RAM_ECC”. Example:
type MEM_TYPE is array(1023 downto 0) of std_logic_vector(17 downto 0);
signal MEM : MEM_TYPE;

attribute NX_USE : string;
attribute NX_USE of MEM : signal is "RAM";   -- or "RAM_ECC"
NX_PORT:
This attribute defines the pin assignment and the electrical I/O parameters in the VHDL source code.
port (
  CLK : in std_logic;
  RST : in std_logic;
  IA  : in std_logic_vector (24 downto 0);
  ...
);

-- The attribute must be declared before it is used
attribute NX_PORT : string;

attribute NX_PORT of CLK : signal is "(location= IOB12_D08P)";
attribute NX_PORT of RST : signal is "(location= IOB12_D09P, turbo= True, inputSignalSlope=20)";
attribute NX_PORT of IA  : signal is
    "[ all :( turbo=True, type= LVCMOS_1.8V_2mA, inputSignalSlope= 20)"
  & "; 0 : ( location= IOB1_D01P, type= LVCMOS_1.8V_2mA )"
  & "; 1 : ( location= IOB1_D01N, type= LVCMOS_1.8V_4mA )"
  & "; 2 : ( location= IOB1_D02P, type= LVCMOS_1.8V_8mA )"
  & "; 3 : ( location= IOB1_D02N, type= LVCMOS_1.8V_16mA )"
  & "; 4 : ( location= IOB2_D01P, type= LVCMOS_1.8V_2mA )"
  & "; 5 : ( location= IOB2_D01N, type= LVCMOS_1.8V_4mA )"
  & "; 6 : ( location= IOB2_D02P, type= LVCMOS_1.8V_8mA )"
  & "; 7 : ( location= IOB2_D02N, type= LVCMOS_1.8V_16mA )"
  & "; 8 : ( location= IOB3_D01P, type= LVCMOS_2.5V_2mA )"
  & "; 9 : ( location= IOB3_D01N, type= LVCMOS_2.5V_4mA )"
  & "; 10 : ( location= IOB3_D02P, type= LVCMOS_2.5V_8mA )"
  & "; 11 : ( location= IOB3_D02N, type= LVCMOS_2.5V_16mA )"
  & "; 12 : ( location= IOB4_D01P, type= LVCMOS_3.3V_2mA )"
  & "; 13 : ( location= IOB4_D01N, type= LVCMOS_3.3V_4mA )"
  & "; 14 : ( location= IOB4_D02P, type= LVCMOS_3.3V_8mA )"
  & "; 15 : ( location= IOB4_D02N, type= LVCMOS_3.3V_16mA )"
  & "; 16 : ( location= IOB3_D03P, type= LVDS_2.5V, differential=True)"
  & "; 17 : ( location= IOB2_D03P, type= SSTL_1.8V_I )"
  & "; 18 : ( location= IOB2_D03N, type= SSTL_1.8V_II )"
  & "; 19 : ( location= IOB3_D04P, type= SSTL_2.5V_I )"
  & "; 20 : ( location= IOB3_D04N, type= SSTL_2.5V_II )"
  & "; 21 : ( location= IOB2_D04P, type= HSTL_1.8V_I )"
  & "; 22 : ( location= IOB2_D04N, type= HSTL_1.8V_II )"
  & "; 23 : ( location= IOB1_D05P, type= HSTL_1.8V_I )"
  & "; 24 : ( location= IOB1_D05N, type= HSTL_1.8V_II)]";
NX_INIT:
The initial contents of a memory can be defined in an ASCII file:
attribute NX_INIT : string;
attribute NX_INIT of MY_ROM : signal is "../src/init.txt";
The format of the ASCII file is described in the nxmap_User_Manual, in the “addMemoryInitialization” method.
“nxpython” features
Synthesising and implementing your design
Creating a project and its main parameters
Environment settings
Creating a project
Project name
Variant name (Target FPGA)
Top level name
Defining the design source files
Assigning constraints
Pads location and electrical configuration
Mapping directives
Timing constraints
Setting project options for synthesis and implementation
Saving project steps (synthesized, placed, routed)
Generating reports : detailed report of used resources, pads report, timing analyzer reports
Generating your own IP Core
In addition, “nxpython” can generate a synthesized netlist of a user’s IP Core that can be instantiated as many times as required in a design. The resulting netlist can be encrypted or not (for confidentiality).
The user defines the name of the VHDL netlist of the IP Core, and can also define the entity name and the encryption options.
For detailed information, see the description of the following command in the “NanoXplore_nxmap_User_Manual”:
Project.exportAsIPCore(file, options)
“nxpython” script example
#==============================================================================
# Main project settings
#==============================================================================
import os
import sys
from nanoxmap import *

dir = os.path.dirname(os.path.realpath(__file__))

#==============================================================================
# Creating the project
#==============================================================================
project = createProject(dir)

#==============================================================================
# Selecting the FPGA target
#==============================================================================
project.setVariantName('NG-MEDIUM')

#==============================================================================
# Choosing the top cell name
#==============================================================================
project.setTopCellName('TOP_CKG_320')

#==============================================================================
# Adding files to the project
# Could be set in one single command by using project.addFiles(['…', '…'])
#==============================================================================
project.addFile('source/TOP_CKG_320.vhd')        # Adding a single file
project.addFiles([                               # Adding multiple files
    'source/COUNTLOGIC.vhd',
    'source/CKG_MODULE_320.vhd'
])

#==============================================================================
# Synthesis and implementation options
#==============================================================================
project.setOptions({
    #--------------------------------------------------------------------------
    # Option name                 | Default value | Other possible values
    #--------------------------------------------------------------------------
    'UseNxLibrary'                : 'No',         # 'Yes', if elements of the nxLibrary were instantiated
    'Unusedpads'                  : 'Floating',   # 'WeakPullUp'
                                                  # For the unused pads, the input buffer is deactivated
                                                  # Note that for NG-MEDIUM only 'WeakPullUp' is available
    'TimingDriven'                : 'Yes',        # 'No'
                                                  # Synthesis, mapping, place and route are timing driven
    'MappingEffort'               : 'Low',        # 'Medium', 'High'
    'MergeRegisterToPad'          : 'Never',      # 'Input', 'output', 'Always'
    'DefaultRAMMapping'           : 'RF',         # 'RAM', 'RAM_ECC'
    'DefaultROMMapping'           : 'LUT',        # 'RF', 'RAM'
    'AdderToDSPMapThreshold'      : '0',          # '1' to '100' ('0' deactivates the option)
    'MultiplierToDSPMapThreshold' : '0',          # '1' to '100' ('0' deactivates the option)
    'LessThanToDSPMapThreshold'   : '0',          # '1' to '100' ('0' deactivates the option)
    'DefaultFSMEncoding'          : 'OneHot',     # 'Binary'
    'ManageUnconnectedOutputs'    : 'Error',      # 'Ground', 'Power'
                                                  # Undriven output signals
    'ManageUnconnectedSignals'    : 'Error',      # 'Ground', 'Power'
                                                  # Undriven signals
    'ManageUninitializedLoops'    : 'Never',      # 'Ground', 'Power'
                                                  # Manages the uninitialized loops such as FSM or other loops
})

#==============================================================================
# Adding mapping directives to design elements :
# addMappingDirective('getModels(object_name)', 'type', 'mapping_target')
#==============================================================================
# 'object_name'                     | 'type' | 'mapping_target'
#------------------------------------------------------------------------------
# ".*" can be used to replace any   | 'ADD'  | 'CY', 'DSP'
# character or chain of characters  | 'LTN'  | 'LUT', 'CY', 'DSP'
#                                   | 'MUL'  | 'CY', 'DSP'
#                                   | 'RAM'  | 'RF', 'RAM', 'RAM_ECC', 'DFF'
#                                   | 'RAM'  | 'LUT', 'RF', 'RAM', 'RAM_ECC'
#------------------------------------------------------------------------------
project.addMappingDirective('getModels(.*Test_myFifo.*)', 'RAM', 'RF')

#==============================================================================
# Specifying memory initialization at power-up. The initial contents of the
# ROM/RAM is defined in an ASCII file. For more details, see the
# Nxmap_User_Manual
#==============================================================================
# addMemoryInitialization(instance_name, type, file)

#==============================================================================
# Specifying pads constraints (IO electrical standard and parameters, location…)
# Can be specified in a separate script
#==============================================================================
#if os.path.exists(dir + '/pads.py'):
#    from pads import pads
#    project.addPads(pads)

#==============================================================================
# Syntax example for LVDS
# For LVDS pads, specify only the signal_P IO standard and location.
# The associated signal_N will automatically be assigned to the
# complementary pad.
#==============================================================================
# project.addPads({
#     'CNT_R_160_P[0]' : ('IOB1_D01P', 'LVDS_2.5V'),
#     'CNT_R_160_P[1]' : ('IOB1_D02P', 'LVDS_2.5V'),
#     'CNT_R_160_P[2]' : ('IOB1_D03P', 'LVDS_2.5V'),
#     'CNT_R_160_P[3]' : ('IOB1_D04P', 'LVDS_2.5V'),
#     'CNT_R_160_P[4]' : ('IOB1_D05P', 'LVDS_2.5V'),
#     'CNT_R_160_P[5]' : ('IOB1_D06P', 'LVDS_2.5V')
# })

project.addPads({
    'CNT_R_160[0]' : {'location': 'IOB09_D01P', 'type': 'LVCMOS_2.5V_4mA',
                      'outputDelayLine': 13, 'weakTermination': 'PullUp', 'slewRate': 'Slow'},
    'CNT_R_160[1]' : {'location': 'IOB09_D01N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_160[2]' : {'location': 'IOB09_D02P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_160[3]' : {'location': 'IOB09_D02N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_160[4]' : {'location': 'IOB09_D03P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_160[5]' : {'location': 'IOB09_D03N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_160[6]' : {'location': 'IOB09_D04P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_160[7]' : {'location': 'IOB09_D04N', 'type': 'LVCMOS_2.5V_4mA'},
    'TCOUT_R_160'  : {'location': 'IOB09_D05P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[0]'  : {'location': 'IOB09_D06P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[1]'  : {'location': 'IOB09_D06N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[2]'  : {'location': 'IOB09_D07P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[3]'  : {'location': 'IOB09_D07N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[4]'  : {'location': 'IOB09_D08P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[5]'  : {'location': 'IOB09_D08N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[6]'  : {'location': 'IOB09_D09P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_80[7]'  : {'location': 'IOB09_D09N', 'type': 'LVCMOS_2.5V_4mA'},
    'TCOUT_R_80'   : {'location': 'IOB09_D10P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_40[0]'  : {'location': 'IOB09_D11P', 'type': 'LVCMOS_2.5V_4mA',
                      'outputDelayLine': 13, 'weakTermination': 'PullUp', 'slewRate': 'Fast'},
    'CNT_R_40[1]'  : {'location': 'IOB09_D11N', 'type': 'LVCMOS_2.5V_4mA',
                      'outputDelayLine': 7, 'weakTermination': 'PullUp', 'slewRate': 'Slow'},
    # Note that for NG-MEDIUM weakTermination only PullUp is available
    'CNT_R_40[2]'  : {'location': 'IOB09_D12P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_40[3]'  : {'location': 'IOB09_D12N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_40[4]'  : {'location': 'IOB09_D13P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_40[5]'  : {'location': 'IOB09_D13N', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_40[6]'  : {'location': 'IOB09_D14P', 'type': 'LVCMOS_2.5V_4mA'},
    'CNT_R_40[7]'  : {'location': 'IOB09_D14N', 'type': 'LVCMOS_2.5V_4mA'},
    'TCOUT_R_40'   : {'location': 'IOB09_D15P', 'type': 'LVCMOS_2.5V_4mA'},
    'CLKIN_80'     : {'location': 'IOB08_D01P', 'type': 'LVCMOS_2.5V_4mA'},
    'ENA'          : {'location': 'IOB08_D01N', 'type': 'LVCMOS_2.5V_4mA'}
})

#==============================================================================
# Specifying additional pads constraints using pins of the "PROG" bank
#==============================================================================
interfaces = { 'LED0': 'USER_D0', 'LED1': 'USER_D1' }
project.addInterfaces(interfaces)

#==============================================================================
# Assigning timing constraints
#==============================================================================
# Defining the clock periods
project.createClock('getClockNet(CLKIN_80)', 'CLKIN_80', 12500, 0, 6250)   # Period = 12500 ps,
                                                                           # first rising edge at 6250 ps
project.createClock('getClockNet(CLK_40)', 'CLK_40', 25000, 0, 12500)      # Period = 25000 ps,
                                                                           # first rising edge at 12500 ps
project.createClock('getClockNet(CLK_80)', 'CLK_80', 12500, 0, 6250)       # Period = 12500 ps,
                                                                           # first rising edge at 6250 ps

# Assigning timing constraints to outputs : note that the CNT_R_160(*) are
# clocked by CLKIN_80 in this example.
# setOutputDelay must specify a valid clock, rising or falling edge,
# destination_hold_time, destination_setup_time and clocked outputs (pads_name)
project.setOutputDelay('getClock(CLKIN_80)', 'rise', 1000, 4501, "getPorts(CNT_R_160\[[0-7]\])")
project.setOutputDelay('getClock(CLK_40)', 'rise', 1000, 3500, 'getPort(CNT_R_40[0])')
project.setOutputDelay('getClock(CLK_80)', 'rise', 1000, 2500, 'getPort(CNT_R_80[0])')

# SetInputDelay : similar to setOutputDelay but for inputs

#==============================================================================
# setClockGroup : 'asynchronous' or 'exclusive'. Defines the relationship
# between clocks. Useful to avoid possible inferred constraints between
# clock domains
#==============================================================================
project.setClockGroup('getClock(CLK_80)', 'getClock(CLK_40)', 'exclusive')     # or 'asynchronous'
project.setClockGroup('getClock(CLK_80)', 'getClock(CLKIN_80)', 'exclusive')   # or 'asynchronous'

# addFalsePath : to ignore timing on specified paths
# addMultiCyclePath(from_list, to_list) : multiplies the clock period by the
# specified number when the allowed delay is greater than the clock period
# (for example when using the LD pin of the FFs to sample the signals every
# 2 or more clock cycles)

#==============================================================================
# Saving project settings, launching synthesis, place and route.
# Saving the intermediate VHDL netlists
#==============================================================================
project.save('native.nxm')                    # Saving the project settings

# The user can generate their own synthesized IP cores by using the
# "exportAsIPCore" method.
# The resulting IP core can be encrypted or not, depending on the options settings
# Note that the IP Core VHDL netlist will be generated during the synthesis process
# Running the place and route processes is not required for IP Core generation
IP_options = {'coreName': 'MAC_FIR_IP_Core', 'encrypt': 'None' }
project.exportAsIPCore('IP_Cores/MAC_FIR_IP_Core.vhd', IP_options)

if not project.synthesize():                  # Launching the synthesis
    sys.exit(1)
project.save('_synthese.vhd')                 # Saving the VHDL synthesized netlist
project.save('synthesized.nxm')               # Saving the synthesized project

if not project.place():                       # Launching the placement process
    sys.exit(1)
project.save('placed.nxm')                    # Saving the placed project

if not os.path.exists(dir + '/pads.py'):      # Saving PADS constraints
    project.savePorts('pads.py')

if not project.route():                       # Launching the routing process
    sys.exit(1)
project.save('routed.nxm')                    # Saving the routed design
project.save('routed.vhd')                    # Saving the post-routing netlist
project.save('routed.sdf')                    # Saving the post-routing SDF file (can be used for
                                              # back-annotated simulation)

#==============================================================================
# Launching the Timing Analyzer and generating timing reports
#==============================================================================
#STA
STA_parameters = { 'searchPathsLimit': 50, 'temperature': 125, 'voltage': 1.1 }
analyzer = project.createAnalyzer()
analyzer.launch(STA_parameters)
# Launches the Timing Analyzer and generates the timing reports for the
# specified number of longest paths on each domain to domain path
# (50 in this example) for T = 125° and Vcc = 1.1V

print 'Errors: ', getErrorCount()
print 'Warnings: ', getWarningCount()

#==============================================================================
# Generating the bitstream
#==============================================================================
project.generateBitstream('TOP_CKG_320.nxb')
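Much of the script above is repetitive: bus pads follow a regular P/N location pattern, and clock periods are expressed in picoseconds. The sketch below shows how such values could be computed in plain Python before being passed to the tool. The helpers `bus_pads` and `clock_period_ps` are hypothetical and are not part of the nanoxmap module; the location naming pattern (bank, two-digit pair number, P or N) is inferred from the pad names used in the script. The resulting dictionary has the same form as the argument given to `project.addPads` in the script.

```python
def bus_pads(bus_name, width, bank, first_pair, io_type):
    """Build addPads-style entries for a bus placed on consecutive P/N pads."""
    pads = {}
    for bit in range(width):
        pair = first_pair + bit // 2             # two bus bits per P/N pad pair
        polarity = "P" if bit % 2 == 0 else "N"  # even bits on P, odd bits on N
        location = "%s_D%02d%s" % (bank, pair, polarity)
        pads["%s[%d]" % (bus_name, bit)] = {"location": location, "type": io_type}
    return pads


def clock_period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return int(round(1000000.0 / freq_mhz))


# Same placement as the three counter buses of the script example.
pads = {}
pads.update(bus_pads("CNT_R_160", 8, "IOB09", 1, "LVCMOS_2.5V_4mA"))
pads.update(bus_pads("CNT_R_80", 8, "IOB09", 6, "LVCMOS_2.5V_4mA"))
pads.update(bus_pads("CNT_R_40", 8, "IOB09", 11, "LVCMOS_2.5V_4mA"))

period_80 = clock_period_ps(80)       # period of an 80 MHz clock
half_period_80 = period_80 // 2       # half period, as used in createClock

print(pads["CNT_R_80[3]"])
print(period_80, half_period_80)
```

Per-pad options such as 'outputDelayLine' or 'slewRate' can then be added to individual entries of the dictionary before it is used.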
“nxpython” main reports:
“nxmap” delivers a set of synthesis, implementation, bitstream generation and timing analysis reports. Among the main reports:
General.log: global implementation report. Includes :
Errors and warning messages
Summary of resources utilization (LUTs, FFs, Carry, FE, Register_file, RAM, DSP, WFG, PLL …)
List of IO pads and parameters (user’s assigned or not)
Timing analysis summary on the recognized clock domains
Summary.timing:
Timing analysis summary on the recognized clock domains
Timing_Constraints_Reports.timing
Summary of timing analysis results based on timing constraints
Violation_summary.timing:
Lists the timing violations against the timing constraints (createClock, setInputDelay, setOutputDelay, addMultiCyclePath, addFalsePath…)
DOMAIN[clock_domain_X to clock_domain_Y].timing
One report for each clock domain (clock_domain_X to clock_domain_X)
One report for each pair of clock domains (clock domain crossing from : clock_domain_X to clock_domain_Y)
Note : The “setClockGroup” directive can be used to avoid constraining irrelevant clock domain crossings :
“exclusive” if clock_domain_X and clock_domain_Y cannot be active simultaneously
“asynchronous” if there is no known relationship between both clocks.
Simulation
Behavioral simulation
The “nxmap” software supports ModelSim for behavioral simulation. The “nxLibrary.vhdp” file contains models of most NG-MEDIUM primitives, such as:
LUTs, FFs, FEs, Carry, Register_file…
PLL
WFG
RAM
DSP
IOs
…
Back-annotated simulation
“nxmap” V2.9.0 also provides back-annotated simulation models (preliminary version) based on post-routing VHDL netlist and associated .SDF file.
Future releases are planned for a better coverage of back-annotated simulation.
In order to launch the post-routing simulation, the user must first generate both the netlist and the .SDF files. This can be done by using the following commands in the “nxpython” script file :
project.save('routed.nxm')   # Saves the routed netlist (NXmap format)
project.save('routed.vhd')   # Saves the post-routing VHDL netlist
project.save('routed.sdf')   # Saves the associated .SDF file
The following is a sample of the ModelSim .DO command script.
Note that the back-annotated simulation library “nxLibrary.vhdp” has been copied into the “D:/NX_SIM_LIB/TIMING” directory (while the behavioral library has been copied into the “D:/NX_SIM_LIB/BEHAVIORAL” directory). Please modify the corresponding line of the .do example to match the path of your libraries.
vlib nx
# Modify the "nxLibrary.vhdp" path to match your own path
vcom -work nx -2008 -vopt D:/NX_SIM_LIB/TIMING/nxLibrary.vhdp
vlib work
vcom -work work -vopt -explicit -93 routed.vhd
vcom -work work -vopt -explicit -93 src/testbench.vhd
vsim -voptargs="+acc" -t 1ps -lib work work.testbench -sdfmax FIR_UT=routed.sdf
NanoXplore can provide a set of project examples that includes:
VHDL source files for DSP inference and other examples
“nxpython” script files
ModelSim .do files for both behavioral and back-annotated simulation
Please contact support@nanoxplore.com to request access to those design examples.
Introduction to NXcore
NanoXplore provides a complementary IP Core generator tool called “NXcore”.
Among the supported IP Cores :
User-defined parallel FIR filters (working at high sample rates)
Transpose structure
Systolic structure
Transpose symmetric
Transpose Dual-Channel
Semi-parallel FIR filters (lower sample rate – uses internally high speed clock to take advantage of the DSP blocks performance)
NXscope : parametrizable embedded Logic Analyzer (up to 240 signals and 48K samples, with flexible trigger conditions)
SER_DES IP Core : allows the development of multi-channel high speed transmission and reception. The generated IP Core includes :
Internal clocks generation
Automatic data/clock alignment and word alignment
Control, handshake and status signals
Additional IP Cores will be included in next versions.
In addition, NanoXplore provides access to third-party IP Cores.