Lab 5.2: FIR RTL Mapping
Goal
Map the fixed-point FIR from Block 4 into an RTL architecture and define its ports, latency, accumulator width and testbench strategy.
This lab is the first step from algorithmic FIR filtering to a synthesizable streaming FPGA block.
Executable HDL package
| File | Purpose |
|---|---|
| blocks/block_05_fpga_hdl_flow/rtl/fir_iq_4tap.v | executable educational 4-tap IQ FIR RTL block |
| blocks/block_05_fpga_hdl_flow/tb/tb_fir_iq_4tap.v | self-checking Verilog testbench |
| blocks/block_05_fpga_hdl_flow/python/generate_fir_iq_4tap_vectors.py | deterministic reference-vector generator |
| blocks/block_05_fpga_hdl_flow/tb/fir_iq_4tap_input_vectors.txt | generated input vectors |
| blocks/block_05_fpga_hdl_flow/tb/fir_iq_4tap_expected_vectors.txt | generated expected output vectors |
Run from the repository root:
python blocks/block_05_fpga_hdl_flow/python/generate_fir_iq_4tap_vectors.py
iverilog -g2012 \
-o blocks/block_05_fpga_hdl_flow/tb/tb_fir_iq_4tap.out \
blocks/block_05_fpga_hdl_flow/rtl/fir_iq_4tap.v \
blocks/block_05_fpga_hdl_flow/tb/tb_fir_iq_4tap.v
vvp blocks/block_05_fpga_hdl_flow/tb/tb_fir_iq_4tap.out
Expected result:
PASS: fir_iq_4tap test completed without errors
The GitHub Actions workflow .github/workflows/block5_hdl.yml generates vectors and runs this simulation automatically.
Engineering question
How does a Q1.15 FIR model become a clocked RTL datapath with explicit multipliers, accumulator growth, rounding and saturation?
Inputs from Block 4
| Item | Example value | RTL consequence |
|---|---|---|
| Input format | Q1.15 | signed 16-bit input ports |
| Coefficient format | Q1.15 | signed 16-bit coefficient ROM |
| Number of taps | 4 in executable lab, 129 in full design | shift register length and multiplier count |
| Product format | Q2.30 | 32-bit products |
| Guard bits | ceil(log2(N)) | accumulator must be wider than product |
| Output format | Q1.15 | rounding/saturation before output |
FIR datapath
flowchart LR
IN[Input sample] --> SR[Shift register]
SR --> MUL[Tap multipliers]
COEF[Coefficient ROM] --> MUL
MUL --> SUM[Adder tree / accumulator]
SUM --> ROUND[Round to Q1.15]
ROUND --> SAT[Saturate]
SAT --> OUT[Output sample]
Direct-form FIR equation
y[n] = sum_{k=0}^{N-1} h[k] * x[n-k]
For complex IQ data, the same real-coefficient FIR is applied independently to the I and Q channels:
y_i[n] = sum h[k] * x_i[n-k]
y_q[n] = sum h[k] * x_q[n-k]
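
The two sums above can be sketched as a small integer reference model. This is only an illustration under stated assumptions: the function names `fir_direct` and `fir_iq` are hypothetical, samples before n = 0 are taken as zero, and the small coefficient set `[1, 3, 3, 1]` is chosen for readability rather than taken from the lab.

```python
def fir_direct(h, x):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:              # skip samples before the start of the stream
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_iq(h, x_i, x_q):
    """Apply the same real coefficients to the I and Q streams independently."""
    return fir_direct(h, x_i), fir_direct(h, x_q)


h = [1, 3, 3, 1]                        # illustrative integer coefficients
y_i, y_q = fir_iq(h, [1, 0, 0, 0, 0], [0, 2, 0, 0, 0])
print(y_i)                              # [1, 3, 3, 1, 0]
print(y_q)                              # [0, 2, 6, 6, 2]
```

Because the coefficients are real, no cross terms between I and Q appear, so the RTL can instantiate two identical datapaths that share one coefficient ROM.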
Executable 4-tap example
The executable lab uses this Q1.15 coefficient set:
h = [0.125, 0.375, 0.375, 0.125]
h_q15 = [4096, 12288, 12288, 4096]
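
The mapping from `h` to `h_q15` can be checked with a short sketch. The helper name `to_q15` is hypothetical, and the use of Python's built-in `round` plus clamping to the signed 16-bit range is an assumption about how the coefficients were quantized; for these four values the scaling is exact, so the rounding mode does not matter.

```python
def to_q15(value: float) -> int:
    """Quantize a real value to a signed 16-bit Q1.15 integer (assumed scheme)."""
    scaled = round(value * (1 << 15))        # scale by 2**15, round to nearest
    return max(-32768, min(32767, scaled))   # clamp to the signed 16-bit range


h = [0.125, 0.375, 0.375, 0.125]
h_q15 = [to_q15(c) for c in h]
print(h_q15)         # [4096, 12288, 12288, 4096]
print(sum(h_q15))    # 32768, a DC gain of exactly 1.0 in Q1.15
```

Note that +1.0 itself is not representable in Q1.15, which is why the clamp maps it to 32767.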
The 4-tap filter is intentionally small enough to follow in a waveform viewer, yet it still demonstrates the complete fixed-point FIR pattern:
- shift register;
- coefficient multiplication;
- accumulator growth;
- rounding back to Q1.15;
- saturation;
- output valid alignment;
- reference-vector comparison.
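
The pattern listed above can be sketched as a bit-accurate software model of one channel. This is not the code of `fir_iq_4tap.v`: the class name `Fir4Tap`, the round-half-up scheme (add half an LSB, then arithmetic shift right by 15) and the zero-latency `step` interface are assumptions made for illustration, and the actual RTL may round differently and adds pipeline latency.

```python
H_Q15 = [4096, 12288, 12288, 4096]   # Q1.15 coefficients from the lab
FRAC_BITS = 15


def round_sat_q15(acc: int) -> int:
    """Q2.30 accumulator to Q1.15: round half up (assumed), then saturate."""
    rounded = (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS   # >> is arithmetic in Python
    return max(-32768, min(32767, rounded))


class Fir4Tap:
    """Software model of one FIR channel; use two instances for I and Q."""

    def __init__(self, coeffs=H_Q15):
        self.coeffs = list(coeffs)
        self.taps = [0] * len(self.coeffs)    # shift register, cleared as after reset

    def step(self, sample: int) -> int:
        self.taps = [sample] + self.taps[:-1]                      # shift in the new sample
        acc = sum(c * t for c, t in zip(self.coeffs, self.taps))   # full-precision sum
        return round_sat_q15(acc)


fir = Fir4Tap()
impulse = [16384, 0, 0, 0, 0]                 # amplitude 0.5 in Q1.15
print([fir.step(s) for s in impulse])         # [2048, 6144, 6144, 2048, 0]
```

A half-scale impulse returns the coefficient sequence scaled by 0.5, which is a convenient first check in a waveform viewer.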
Architecture options
| Architecture | DSP usage | Throughput | Latency | When to use |
|---|---|---|---|---|
| Fully parallel | high | one sample/clock | low/medium | high-rate streaming |
| Time-multiplexed MAC | low | one sample per N/M clocks (N taps, M MAC units) | high | low-rate or resource-limited design |
| Symmetric FIR | medium | one sample/clock | medium | linear-phase symmetric coefficients |
| Polyphase FIR | medium/high | efficient rate change | medium | decimator/interpolator |
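
The symmetric option in the table relies on h[k] = h[N-1-k], which lets the datapath add the two mirrored samples before multiplying. A minimal sketch of that folding follows; the function names and the sample window are hypothetical, and the lab's 4-tap coefficient set is used because it happens to be symmetric.

```python
def fir_full(h, window):
    """One output sample with N multiplications."""
    return sum(c * s for c, s in zip(h, window))


def fir_folded(h, window):
    """Same output using a pre-adder and about N/2 multiplications.

    Valid only when h[k] == h[N-1-k] for every k.
    """
    n = len(h)
    half = n // 2
    acc = sum(h[k] * (window[k] + window[n - 1 - k]) for k in range(half))
    if n % 2:                                  # odd length: centre tap has no partner
        acc += h[half] * window[half]
    return acc


h = [4096, 12288, 12288, 4096]
window = [100, -200, 300, 400]                 # illustrative shift-register contents
print(fir_full(h, window))                     # 3276800
print(fir_folded(h, window))                   # 3276800
```

The pre-adder output needs one bit more than the input samples, so a 16-bit input feeds a 17-bit multiplier operand in this arrangement.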
Accumulator sizing
For signed Q1.15 input and Q1.15 coefficients:
product width = 16 + 16 = 32 bits
product fractional bits = 15 + 15 = 30 bits
guard bits = ceil(log2(Ntaps))
accumulator width >= 32 + guard_bits
For the executable 4-tap lab:
guard_bits = 2
accumulator width >= 34 bits
For a 129-tap practical FIR:
guard_bits = 8
accumulator width >= 40 bits
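
The sizing rule above can be reproduced with integer arithmetic. The function names are hypothetical. The third helper is an addition for comparison only: it assumes the coefficient set is known and bounds the accumulator magnitude by 2^15 * sum(|h[k]|), which can give a narrower width than the coefficient-independent rule.

```python
def guard_bits(ntaps: int) -> int:
    """ceil(log2(ntaps)) computed without floating point."""
    return (ntaps - 1).bit_length()


def accumulator_width(in_w: int, coef_w: int, ntaps: int) -> int:
    """Coefficient-independent bound: product width plus guard bits."""
    return in_w + coef_w + guard_bits(ntaps)


def tight_accumulator_width(h_q, in_w: int = 16) -> int:
    """Signed width that holds +/- 2**(in_w-1) * sum(|h|) for a known coefficient set."""
    max_mag = (1 << (in_w - 1)) * sum(abs(c) for c in h_q)
    return max_mag.bit_length() + 1            # magnitude bits plus sign bit


print(guard_bits(4), accumulator_width(16, 16, 4))            # 2 34
print(guard_bits(129), accumulator_width(16, 16, 129))        # 8 40
print(tight_accumulator_width([4096, 12288, 12288, 4096]))    # 32
```

For the lab coefficients the absolute sum is exactly 1.0, so the accumulator never exceeds the range of a single product. The 34-bit figure remains a safe choice that does not depend on the coefficient values.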
RTL skeleton
module fir_iq_stream #(
    parameter integer W     = 16,
    parameter integer NTAPS = 129,
    parameter integer ACC_W = 40
)(
    input  wire                clk,
    input  wire                rst,
    input  wire                in_valid,
    input  wire signed [W-1:0] in_i,
    input  wire signed [W-1:0] in_q,
    output reg                 out_valid,
    output reg  signed [W-1:0] out_i,
    output reg  signed [W-1:0] out_q
);
    // Educational skeleton: actual coefficient ROM, shift register,
    // multiplier array, adder tree, rounding and saturation are added
    // step-by-step in implementation labs.
endmodule
Testbench strategy
Use reference vectors generated by the Python/MATLAB model from Lab 4.1 or by the local FIR vector generator.
Recommended tests:
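- impulse input -> output reproduces the coefficient sequence scaled by the impulse amplitude;
- constant input -> output settles to the input multiplied by the DC gain sum(h[k]), which is 1.0 for the 4-tap coefficient set;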
- two-tone input -> interferer suppression matches reference;
- random IQ vector -> sample-by-sample comparison after latency compensation;
- saturation stress vector -> output clips deterministically.
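
Several of the tests listed above can be sketched as a stimulus builder with a software reference. This does not reproduce `generate_fir_iq_4tap_vectors.py`, whose vector file format is not shown here. The names `fir_q15` and `make_stimuli`, the stimulus length, the seed and the round-half-up scheme are all assumptions. The two-tone test is omitted for brevity.

```python
import random

H_Q15 = [4096, 12288, 12288, 4096]


def fir_q15(h, x):
    """Reference model: direct form, round half up (assumed), saturate to 16 bits."""
    out = []
    for n in range(len(x)):
        acc = sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        out.append(max(-32768, min(32767, (acc + (1 << 14)) >> 15)))
    return out


def make_stimuli(ntaps, seed=1):
    """Build deterministic single-channel test vectors (hypothetical layout)."""
    rng = random.Random(seed)                  # fixed seed keeps the vectors repeatable
    length = 4 * ntaps
    return {
        "impulse": [32767] + [0] * (length - 1),
        "constant": [16384] * length,
        "random": [rng.randint(-32768, 32767) for _ in range(length)],
        "full_scale": [-32768] * length,
    }


stimuli = make_stimuli(len(H_Q15))
expected = {name: fir_q15(H_Q15, x) for name, x in stimuli.items()}
print(expected["impulse"][:4])      # [4096, 12288, 12288, 4096]
print(expected["constant"][3])      # 16384
print(expected["full_scale"][3])    # -32768

# Hypothetical high-gain coefficients, used only to exercise the clipping path.
print(fir_q15([32767] * 4, [32767] * 4)[3])   # 32767
```

With the lab coefficients the absolute coefficient sum is 1.0, so even a full-scale constant input stays within the Q1.15 range. A saturation stress test therefore needs a coefficient set with a larger gain, as in the last line.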
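Latency documentation
State the latency of every pipeline stage explicitly. The total latency is the sum of the stages, and out_valid must be delayed by the same number of clocks so that it stays aligned with the output data: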
| Stage | Latency, clocks |
|---|---|
| Input register | 1 |
| Shift register update | 0/1 |
| Multiplier pipeline | 1–3 |
| Adder tree | depends on tree depth |
| Rounding/saturation | 1 |
| Output register | 1 |
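
The stage latencies in the table can be summed and then used to align DUT output with the reference vector. The sketch below assumes a binary adder tree with one register per level, a one-clock multiplier and no extra shift-register delay; these defaults, the function names and the `tol_lsb` parameter are illustrative and should be replaced by the values of the actual design.

```python
def adder_tree_depth(ntaps: int) -> int:
    """Levels in a binary adder tree, assuming one register per level."""
    return (ntaps - 1).bit_length()


def total_latency(ntaps, input_reg=1, shift_reg=0, mult_pipe=1,
                  round_sat=1, output_reg=1):
    """Sum of the per-stage latencies, in clocks."""
    return (input_reg + shift_reg + mult_pipe
            + adder_tree_depth(ntaps) + round_sat + output_reg)


def mismatches(dut_out, expected, latency, tol_lsb=0):
    """Indices where the DUT differs from the reference after latency compensation."""
    aligned = dut_out[latency:]                # drop samples produced before valid data
    return [i for i, (d, e) in enumerate(zip(aligned, expected))
            if abs(d - e) > tol_lsb]


latency = total_latency(4)
expected = [2048, 6144, 6144, 2048]
dut_out = [0] * latency + expected             # stand-in for captured simulation output
print(latency)                                 # 6
print(total_latency(129))                      # 12
print(mismatches(dut_out, expected, latency))  # []
```

A bit-accurate reference allows `tol_lsb=0`; a nonzero tolerance is only needed when the reference uses a different rounding scheme than the RTL.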
Vivado resource report template
| Resource | Estimated | Synthesized | Comment |
|---|---|---|---|
| LUT | control + adders | | |
| FF | registers + valid pipeline | | |
| DSP | multipliers | | |
| BRAM | coefficient storage if used | | |
| Latency | clocks | | |
| Fmax | MHz | | |
Report checklist
- [ ] State FIR format and number of taps.
- [ ] Compute product and accumulator widths.
- [ ] Select architecture: parallel, time-multiplexed, symmetric or polyphase.
- [ ] Draw datapath diagram.
- [ ] Define streaming ports.
- [ ] Define latency.
- [ ] Define test vectors.
- [ ] Define pass/fail error tolerance.
- [ ] Add resource estimate table.
Engineering conclusion template
The FIR RTL mapping uses ____ taps with Q1.15 input and coefficients.
The product width is ____ bits and the accumulator width is ____ bits.
The selected architecture is ____ because ______.
The expected latency is ____ clocks and the main FPGA cost is ______.