## 20 Jun / Issues on fixed-point IIR filter implementation: quantization effects

Digital filters are a great topic, and I could dedicate a whole post to the advantages they have over their analog counterparts (despite some drawbacks). However, designing and implementing these fine pieces can be quite tricky, and if proper care is not taken the effects can be catastrophic. Since I am an engineer, and praising them is not going to take me anywhere, I would rather talk about their seemingly minor but actually very relevant issues.

As the title implies, I assume in this post that you are somewhat familiar with fixed-point arithmetic and IIR filters. If you want to refresh your memory, allow me to suggest [1-4] as great references on those topics.

First, the choice of a fixed-point IIR filter provides us with a fair amount of information about what the designer has in mind. For instance, we can assume that execution time is a major concern, which is reasonable for most real-time applications, and that the designer is willing to trade some level of precision for speed. Lower cost, smaller silicon area, and lower power consumption could also factor in favor of fixed point. The purpose here, then, is to address the drawback – loss of precision – and prevent it from compromising the final result. Ideally, this should be accomplished without taking a toll on execution time or any other advantage, but relatively small trade-offs are often inevitable.

Before we go into details, let us recall how microcontrollers and DSPs – at least the majority of them – interpret binary numbers. Considering the numbers as signed,

- The most significant bit (MSB) in two's complement notation becomes the sign bit:
  - 0 if the number is positive;
  - 1 if the number is negative;
- Positive values are represented as plain binary;
- Negative values are represented by inverting the bits of the positive value with the same magnitude and adding one.

This representation is called two’s complement, and I will provide an example:

Suppose we want to represent the number three in eight bits. If positive (+3), it is quite straightforward: 00000011_{2}.

Now, to represent -3, all the bits above must be inverted (11111100_{2}) and then one is added: 11111101_{2}.

Note that the addition of the two binary numbers representing -3 and +3 equals 0 (modulo 2^{8}), as expected.

Consider now that we have a microcontroller with an 8-bit word length, that is, each register contains a bit for the sign and seven bits to store the number. Then, the greatest storable value is 01111111_{2}, which corresponds to +127, and the lowest is 10000000_{2}, or -128. In sum, the numbers we can represent are constrained to the integers in the interval [-128, 128).

Recall that a fixed-point architecture imposes a fixed binary point, but not its location. Furthermore, the binary point is virtual and does not really matter to the computer. For example, we can assume that three bits of our microcontroller words represent the integer part and four the fraction. In other words, we cannot use different scales concurrently, but we can choose the scale. To avoid misinterpretation, we can use the Q format notation to indicate the position of the binary point. As there are conflicting conventions, I opted for the Texas Instruments version

*Qm.n*

where Q – echoing the mathematical symbol for rational numbers – denotes the Q format, m stands for the number of bits in the integer part, and n for the number of fractional bits. Thus, the last example can be expressed as

*Q3.4*

Notice that the sign bit is not included. For a number of reasons, binary numbers are usually interpreted as fractions in embedded systems, with the binary point virtually placed immediately after the sign bit. For a 16-bit word, the range is reduced to [-1, 1) but the quantization step is now 2^{-15}, or 0.000030517578125. The Q format for such fractional numbers is *Q0.15*.

With the fractional interpretation established, let us quickly review how calculations are performed by computers and the inherent ramifications. First, if we multiply 0.50 by 0.75, the result (0.375) requires two additional bits to be stored. Let us examine this example on an imaginary 3-bit register, where 0.50 corresponds to 0.10_{2}, 0.75 to 0.11_{2}, and the product totals 0.0110_{2}. Truncating the result to 3 bits we get 0.01_{2} (0.25), while rounding leads to 0.10_{2} (0.50), so the errors happen to be equal in magnitude in this case. Regardless of the method, the error introduced by quantization is called roundoff noise. In practice, the variance is the same for both types of noise, but rounding has zero mean, whereas truncation produces a biased result.

Furthermore, if we add 0.50 and 0.75, the result (1.25) does not fit the range. In two's complement, it actually wraps around to produce a negative value: 0.10_{2} + 0.11_{2} = 1.01_{2}, meaning the sign bit is flipped and the result is wrong (-0.75).

If overflow cannot be completely avoided, the designer can implement saturation, which can be done via hardware in many DSPs and microcontrollers. Truncation, rounding, overflow and saturation all introduce nonlinearities, which are in fact quantization errors.

As first stated by Leland Jackson, the father of digital filters, back in 1970, there are three major sources of quantization error in a filter implementation:

- Signal Conversion
- Arithmetic Quantization
- Coefficient Quantization

The first is related to the signal conversion from analog to digital and vice versa. As we have previously seen, the digital version is not 100% accurate due to constraints in range and quantization step. Arithmetic quantization concerns the roundoff noise and overflow implications on the multipliers, adders, and storage elements in the filter. These two sources of error are "online", acting continuously as the signal is being processed. Conversely, coefficient quantization occurs "offline", and we should address it in more detail.

In the design stage, digital filters are usually handled in double-precision floating-point arithmetic, which provides negligible loss of precision, so all the filter coefficients are exact up to 64 bits. Nevertheless, the word length in fixed-point hardware may be 32, 16, or even 8 bits, and quantization strikes again. Coefficient quantization can be destructive because it changes the filter transfer function itself, that is, the transfer function implemented in hardware may not correspond to the one designed. In other words, since all poles and zeros depend on all coefficients, they might move considerably after quantization.

Wilkinson (1965) provided an enlightening example. Suppose we have the polynomial

*W(x) = (x-1)(x-2)…(x-20)*

or

*W(x) = x^{20} – 210x^{19} + ⋯ + 2.43·10^{18}*

The roots – the integers from 1 to 20 – can be checked in the graph.

Interestingly, if we simply multiply the coefficient *–210x^{19}* by 1.000001, resulting in

*–210.00021x^{19}*, the new roots are located as the figure shows.

Luckily, this can be predicted before implementation, and since the problem worsens with system order, a very common workaround is to implement the transfer function as cascaded second-order sections (SOS), with each complex-conjugate pair of poles and zeros grouped so that the coefficients remain real. This allows for easy monitoring of the roots and makes the structure modular, but the primary reason for its use is robustness under coefficient quantization: a coefficient change does not shift every root, only a single complex-conjugate pair.

Each SOS can be realized as represented in the block diagram

This realization structure is called cascade direct form I and is widely used when a fixed-point architecture is chosen. However, as I mentioned earlier, trade-offs are often required to maximize filter performance, and we face one here. Theory states that every system can be realized with a number of delays (storage elements) as low as the system order, which would render the realization *canonic*. You may object that the order of the SOS above is trivially 2, and yet there are four delays. That is correct; we can modify the structure and make it canonic as follows.

This structure is called cascade direct form II.

Although the number of delay elements was cut in half, we now have two adders, and overflow may occur at the first node. Mind that overflows can be devastating for filters and should be prevented at all costs. Thus, this realization requires a laborious scaling process when implemented in fixed-point arithmetic, which considerably reduces the signal-to-noise ratio (SNR) – the ratio of signal power to noise power at the output. The trade-off is between precision and memory requirements.

Naturally, my intent in this post is only to provide a glimpse into the subject, and there is a lot more to it. Consider, for example, the possibilities of pole/zero pairings to form the SOSs or how we could order them. We could introduce dither, for example, to randomize quantization errors. In some cases, we could do loop unrolling, processing several samples at once to reduce loop overhead. The trade-offs go on and on and could be covered in a future post.

Author: Luiz Benicio Degli Esposte Rosa

#### References

[1] https://courses.cs.washington.edu/courses/cse467/08au/labs/l5/fp.pdf

[2] https://www.dsprelated.com/showarticle/139.php

[3] http://academic.evergreen.edu/projects/biophysics/technotes/program/2s_comp.htm

[4] https://www.cs.cornell.edu/~tomf/notes/cps104/twoscomp.html

JACKSON, L. B. On the Interaction of Roundoff Noise and Dynamic Range in Digital Filters. *The Bell System Technical Journal*, [S. l.], v. 49, n. 2, p. 159-184, feb. 1970.

WILKINSON, J. H. *The Algebraic Eigenvalue Problem*. Oxford: Clarendon Press, 1965.