Modified discrete cosine transform


The modified discrete cosine transform is a lapped transform based on the type-IV discrete cosine transform, with the additional property of being lapped: it is designed to be performed on consecutive blocks of a larger dataset, where subsequent blocks are overlapped so that the last half of one block coincides with the first half of the next block. This overlapping, in addition to the energy-compaction qualities of the DCT, makes the MDCT especially attractive for signal compression applications, since it helps to avoid artifacts stemming from the block boundaries. As a result of these advantages, the MDCT is the most widely used lossy compression technique in audio data compression. It is employed in most modern audio coding standards, including MP3, Dolby Digital, Vorbis, Windows Media Audio, ATRAC, Cook, Advanced Audio Coding, High-Definition Coding, LDAC, Dolby AC-4, and MPEG-H 3D Audio, as well as speech coding standards such as AAC-LD, G.722.1, G.729.1, CELT, and Opus.
The discrete cosine transform was first proposed by Nasir Ahmed in 1972, and demonstrated by Ahmed with T. Natarajan and K. R. Rao in 1974. The MDCT was later proposed by John P. Princen, A.W. Johnson and Alan B. Bradley at the University of Surrey in 1987, following earlier work by Princen and Bradley to develop the MDCT's underlying principle of time-domain aliasing cancellation, described below.
In MP3, the MDCT is not applied to the audio signal directly, but rather to the output of a 32-band polyphase quadrature filter bank. The output of this MDCT is postprocessed by an alias reduction formula to reduce the typical aliasing of the PQF filter bank. Such a combination of a filter bank with an MDCT is called a hybrid filter bank or a subband MDCT. AAC, on the other hand, normally uses a pure MDCT; only the MPEG-4 AAC-SSR variant uses a four-band PQF bank followed by an MDCT. Similar to MP3, ATRAC uses stacked quadrature mirror filters followed by an MDCT.

Definition

As a lapped transform, the MDCT is a bit unusual compared to other Fourier-related transforms in that it has half as many outputs as inputs. In particular, it is a linear function . The 2N real numbers x0,..., x2N-1 are transformed into the N real numbers X0,..., XN-1 according to the formula:

Inverse transform

The inverse MDCT is known as the IMDCT. Because there are different numbers of inputs and outputs, at first glance it might seem that the MDCT should not be invertible. However, perfect invertibility is achieved by adding the overlapped IMDCTs of subsequent overlapping blocks, causing the errors to cancel and the original data to be retrieved; this technique is known as time-domain aliasing cancellation.
The IMDCT transforms N real numbers X0,..., XN-1 into 2N real numbers y0,..., y2N-1 according to the formula:
In the case of a windowed MDCT with the usual window normalization, the normalization coefficient in front of the IMDCT should be multiplied by 2.

Computation

Although the direct application of the MDCT formula would require O operations, it is possible to compute the same thing with only O complexity by recursively factorizing the computation, as in the fast Fourier transform. One can also compute MDCTs via other transforms, typically a DFT or a DCT, combined with O pre- and post-processing steps. Also, as described below, any algorithm for the DCT-IV immediately provides a method to compute the MDCT and IMDCT of even size.

Window functions

In typical signal-compression applications, the transform properties are further improved by using a window function wn that is multiplied with xn and yn in the MDCT and IMDCT formulas, above, in order to avoid discontinuities at the n = 0 and 2N boundaries by making the function go smoothly to zero at those points. In principle, x and y could have different window functions, and the window function could also change from one block to the next, but for simplicity we consider the common case of identical window functions for equal-sized blocks.
The transform remains invertible, for a symmetric window wn = w2N−1−n, as long as w satisfies the Princen-Bradley condition:
Various window functions are used. A window that produces a form known as a modulated lapped transform is given by
and is used for MP3 and MPEG-2 AAC, and
for Vorbis. AC-3 uses a Kaiser-Bessel derived window, and MPEG-4 AAC can also use a KBD window.
Note that windows applied to the MDCT are different from windows used for some other types of signal analysis, since they must fulfill the Princen-Bradley condition. One of the reasons for this difference is that MDCT windows are applied twice, for both the MDCT and the IMDCT.

Relationship to DCT-IV and Origin of TDAC

As can be seen by inspection of the definitions, for even N the MDCT is essentially equivalent to a DCT-IV, where the input is shifted by N/2 and two N-blocks of data are transformed at once. By examining this equivalence more carefully, important properties like TDAC can be easily derived.
In order to define the precise relationship to the DCT-IV, one must realize that the DCT-IV corresponds to alternating even/odd boundary conditions: even at its left boundary, odd at its right boundary, and so on. This follows from the identities and. Thus, if its inputs are an array x of length N, we can imagine extending this array to and so on, where xR denotes x in reverse order.
Consider an MDCT with 2N inputs and N outputs, where we divide the inputs into four blocks each of size N/2. If we shift these to the right by N/2, then extend past the end of the N DCT-IV inputs, so we must "fold" them back according to the boundary conditions described above.
Similarly, the IMDCT formula above is precisely 1/2 of the DCT-IV, where the output is extended to a length 2N and shifted back to the left by N/2. The inverse DCT-IV would simply give back the inputs from above. When this is extended via the boundary conditions and shifted, one obtains:
Half of the IMDCT outputs are thus redundant, as baR = −R, and likewise for the last two terms. If we group the input into bigger blocks A,B of size N, where A= and B=, we can write this result in a simpler way:
One can now understand how TDAC works. Suppose that one computes the MDCT of the subsequent, 50% overlapped, 2N block. The IMDCT will then yield, analogous to the above: / 2. When this is added with the previous IMDCT result in the overlapping half, the reversed terms cancel and one obtains simply B, recovering the original data.

Origin of TDAC

The origin of the term "time-domain aliasing cancellation" is now clear. The use of input data that extend beyond the boundaries of the logical DCT-IV causes the data to be aliased in the same way that frequencies beyond the Nyquist frequency are aliased to lower frequencies, except that this aliasing occurs in the time domain instead of the frequency domain: we cannot distinguish the contributions of
a and of bR to the MDCT of, or equivalently, to
the result of
IMDCT = / 2.
The combinations cdR and so on, have precisely the right signs for the combinations to cancel when they are added.
For odd N, N/2 is not an integer so the MDCT is not simply a shift permutation of a DCT-IV. In this case, the additional shift by half a sample means that the MDCT/IMDCT becomes equivalent to the DCT-III/II, and the analysis is analogous to the above.

Smoothness and discontinuities

We have seen above that the MDCT of 2N inputs is equivalent to a DCT-IV of the N inputs
.
The DCT-IV is designed for the
case where the function at the right boundary is odd, and therefore
the values near the right boundary are close to 0. If the input signal is smooth,
this is the case: the rightmost components of a and bR are
consecutive in the input sequence, and
therefore their difference is small.
Let us look at the middle of the interval:
if we rewrite the above expression as
= −R,
the second term, R, gives a smooth
transition in the middle.
However, in the first term,, there is a
potential discontinuity where the right end of
d meets the left end of a.
This is the reason for using a window function that reduces the components
near the boundaries of the input sequence towards 0.

TDAC for the windowed MDCT

Above, the TDAC property was proved for the ordinary MDCT, showing that adding IMDCTs of subsequent blocks in their overlapping half recovers the original data. The derivation of this inverse property for the windowed MDCT is only slightly more complicated.
Consider to overlapping consecutive sets of 2N inputs and, for blocks A,B,C of size N.
Recall from above that when and are MDCTed, IMDCTed, and added in their overlapping half, we obtain, the original data.
Now we suppose that we multiply both the MDCT inputs and the IMDCT outputs by a window function of length 2N. As above, we assume a symmetric window function, which is therefore of the form where W is a length-N vector and R denotes reversal as before. Then the Princen-Bradley condition can be written as, with the squares and additions performed elementwise.
Therefore, instead of MDCTing, we now MDCT . When this is IMDCTed and multiplied again by the window function, the last-N half becomes:
Similarly, the windowed MDCT and IMDCT of yields, in its first-N half:
When we add these two halves together, we obtain:
recovering the original data.