Q format numbers are notionally fixed-point numbers; that is, they are stored and operated upon as regular binary signed integers, which allows standard integer hardware (the ALU) to perform rational number calculations. The number of integer bits, the number of fractional bits and the underlying word size are chosen by the programmer on an application-specific basis; the choices depend on the range and resolution needed for the numbers. Some DSP architectures offer native support for common formats, such as Q1.15. In this case, the processor can support arithmetic in one step, offering saturation and renormalization in a single instruction. Most standard CPUs do not. If the architecture does not directly support the chosen fixed-point format, the programmer must handle saturation and renormalization explicitly with bounds checking and bit shifting.
There are two conflicting notations for fixed point. Both notations are written as Qm.n, where:
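Q designates that the number is in the Q format notation, the Texas Instruments representation for signed fixed-point numbers.
m is the number of bits used to designate the integer portion of the number, i.e. the number of bits to the left of the binary point; depending on the convention, it does or does not include the sign bit.
n is the number of bits used to designate the fractional portion of the number, i.e. the number of bits to the right of the binary point.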
One convention includes the sign bit in the value of m, and the other convention does not. The convention in use can be determined by summing m and n: if the sum is equal to the register size, the sign bit is included in the value of m; if it is one less than the register size, the sign bit is not included in the value of m. In addition, the letter U can be prefixed to the Q to indicate an unsigned value, such as UQ1.15, indicating values from 0.0 to +1.999969482421875.
Signed Q values are stored in two's complement format, just like signed integer values on most processors. In two's complement, the sign bit is extended to the register size.
For a given Qm.n format, using an m+n bit signed integer container with n fractional bits (that is, with the sign bit counted in m):
its range is $[-2^{m-1},\ 2^{m-1} - 2^{-n}]$
its resolution is $2^{-n}$
For a given UQm.n format, using an m+n bit unsigned integer container with n fractional bits:
its range is $[0,\ 2^{m} - 2^{-n}]$
its resolution is $2^{-n}$
For example, a Q15.1 format number:
requires 15+1 = 16 bits
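its range is $[-2^{14},\ 2^{14} - 2^{-1}]$ = [−16384.0, +16383.5] = [0x8000, 0x7FFF] (as raw 16-bit two's complement values)
its resolution is $2^{-1}$ = 0.5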
Unlike floating point numbers, the resolution of Q numbers will remain constant over the entire range.
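As a rough illustration of the range and resolution formulas, the following C sketch evaluates them for given m and n. It assumes the convention in which m includes the sign bit; the function names (two_to_the, q_resolution, q_min, q_max, uq_max) are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* 2^k as a double, valid for 0 <= k <= 62. */
static double two_to_the(int k)
{
    return (double)((uint64_t)1 << k);
}

/* Resolution of Qm.n or UQm.n: 2^-n. */
double q_resolution(int n)
{
    return 1.0 / two_to_the(n);
}

/* Smallest value of signed Qm.n (m counts the sign bit): -2^(m-1). */
double q_min(int m)
{
    return -two_to_the(m - 1);
}

/* Largest value of signed Qm.n: 2^(m-1) - 2^-n. */
double q_max(int m, int n)
{
    return two_to_the(m - 1) - q_resolution(n);
}

/* Largest value of unsigned UQm.n: 2^m - 2^-n (the smallest is 0). */
double uq_max(int m, int n)
{
    return two_to_the(m) - q_resolution(n);
}
```

For Q15.1, q_min(15) and q_max(15, 1) give −16384.0 and 16383.5, matching the example above; uq_max(1, 15) gives the UQ1.15 upper bound of 1.999969482421875.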
To convert a number from Qm.n format to floating point:
Convert the number to floating point as if it were an integer, in other words remove the binary point.
Multiply by $2^{-n}$.
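The two steps above can be sketched in C as follows. This is a minimal example that assumes a 16-bit signed container and 0 ≤ n ≤ 30; the function name q_to_double is hypothetical.

```c
#include <stdint.h>

/* Convert a 16-bit signed Q value with n fractional bits to double.
   Step 1: treat the stored bits as a plain integer (the cast to double).
   Step 2: multiply by 2^-n, written here as a division by 2^n. */
double q_to_double(int16_t q, int n)
{
    return (double)q / (double)((int32_t)1 << n);
}
```

For instance, the stored integer 384 with n = 8 converts to 1.5.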
Math operations
Q numbers are a ratio of two integers: the numerator is kept in storage, the denominator is equal to $2^n$. Consider the following example:
The Q8 denominator equals $2^8$ = 256
1.5 equals 384/256
384 is stored, 256 is inferred because it is a Q8 number.
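The Q8 example can be reproduced with a short C sketch. The macro and function names are hypothetical, and the conversion simply truncates toward zero without rounding or range checks.

```c
#include <stdint.h>

#define Q8_FRACTIONAL_BITS 8                      /* n for a Q8 number */
#define Q8_DENOMINATOR (1 << Q8_FRACTIONAL_BITS)  /* 2^8 = 256, inferred, never stored */

/* Numerator stored for the real value x in Q8.
   Truncates toward zero; no rounding and no overflow check. */
int16_t q8_from_double(double x)
{
    return (int16_t)(x * Q8_DENOMINATOR);
}
```

Calling q8_from_double(1.5) yields the stored numerator 384.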
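If the Q number's base is to be maintained (n remains constant), the Q number math operations must keep the denominator constant. The following formulas show math operations on the general Q numbers $N_1/d$ and $N_2/d$, where $d = 2^n$ is the common denominator:
$\frac{N_1}{d} + \frac{N_2}{d} = \frac{N_1 + N_2}{d}$
$\frac{N_1}{d} - \frac{N_2}{d} = \frac{N_1 - N_2}{d}$
$\frac{N_1}{d} \times \frac{N_2}{d} = \frac{(N_1 \times N_2)/d}{d}$
$\frac{N_1}{d} \div \frac{N_2}{d} = \frac{(N_1 \times d)/N_2}{d}$
Because the denominator is a power of two, the multiplication by d can be implemented as an arithmetic shift to the left and the division by d as an arithmetic shift to the right; on many processors shifts are faster than multiplication and division. To maintain accuracy, the intermediate multiplication and division results must be double precision (twice the width of the operands), and care must be taken in rounding the intermediate result before converting back to the desired Q number. Using C, the operations are: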
Addition
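int16_t q_add adds two Q numbers of the same format by adding their stored integers; it does not check for overflow.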
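A minimal sketch of q_add, assuming both operands are 16-bit Q numbers with the same number of fractional bits; the parameter names a and b are assumptions.

```c
#include <stdint.h>

/* Add two Q numbers of the same format. The numerators add directly and
   the implied denominator is unchanged. There is no overflow handling:
   a sum outside the int16_t range is not representable. */
int16_t q_add(int16_t a, int16_t b)
{
    return (int16_t)(a + b);
}
```

In Q8, for example, q_add(384, 128) returns 512, which is 1.5 + 0.5 = 2.0.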
With saturation, int16_t q_add_sat clamps the sum to the range of int16_t.
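One way to write a saturating addition is to compute the sum in a wider type and clamp it, as in the sketch below. The parameter names and the use of INT16_MAX and INT16_MIN from stdint.h are choices made for this example.

```c
#include <stdint.h>

/* Saturating addition of two 16-bit Q numbers of the same format. */
int16_t q_add_sat(int16_t a, int16_t b)
{
    /* The sum of two int16_t values always fits in 32 bits. */
    int32_t tmp = (int32_t)a + (int32_t)b;

    if (tmp > INT16_MAX)
        tmp = INT16_MAX;   /* clamp to 0x7FFF */
    if (tmp < INT16_MIN)
        tmp = INT16_MIN;   /* clamp to -0x8000 */

    return (int16_t)tmp;
}
```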
Unlike floating point ±Inf, saturated results are not sticky: in an implementation such as q_add_sat, a positive saturated value will unsaturate when a negative value is added to it, and vice versa. In assembly language, the signed overflow flag can be used to avoid the typecasts needed for the C implementation.
Subtraction
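int16_t q_sub subtracts the stored integers of two Q numbers of the same format; like q_add, it does not check for overflow.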
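A minimal sketch of q_sub under the same assumptions as the plain addition (same format for both operands, parameter names a and b assumed, no overflow handling):

```c
#include <stdint.h>

/* Subtract two Q numbers of the same format. The numerators subtract
   directly and the implied denominator is unchanged. */
int16_t q_sub(int16_t a, int16_t b)
{
    return (int16_t)(a - b);
}
```

In Q8, q_sub(384, 128) returns 256, which is 1.5 − 0.5 = 1.0.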
Multiplication
The multiplication routine, int16_t q_mul, relies on a precomputed constant K and on a helper, int16_t sat16, that saturates a value to the range of int16_t.
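The multiplication can be sketched as follows. Several details here are assumptions rather than part of the text: the macro Q (the number of fractional bits, set to 15 for a Q1.15-style format), the value of K as half the denominator so that adding it rounds to nearest, and the parameter names. The right shift of a negative value is implementation-defined in C; the sketch assumes an arithmetic shift, which is what common compilers provide.

```c
#include <stdint.h>

#define Q 15               /* assumed number of fractional bits */

/* precomputed value: half of 2^Q, added before the shift for rounding */
#define K (1 << (Q - 1))

/* saturate to range of int16_t */
int16_t sat16(int32_t x)
{
    if (x > 0x7FFF)
        return 0x7FFF;
    if (x < -0x8000)
        return -0x8000;
    return (int16_t)x;
}

/* Multiply two Q numbers that both have Q fractional bits. */
int16_t q_mul(int16_t a, int16_t b)
{
    /* Double-width intermediate: its denominator is 2^(2Q). */
    int32_t temp = (int32_t)a * (int32_t)b;

    /* Rounding: mid values are rounded up. */
    temp += K;

    /* Divide by the base 2^Q to restore the denominator, then saturate. */
    return sat16(temp >> Q);
}
```

With Q = 15, q_mul(16384, 16384) returns 8192, that is 0.5 × 0.5 = 0.25, while q_mul(-32768, -32768) saturates to 32767 because +1.0 is not representable.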