
class nnAudio.Spectrogram.CQT1992v2(sr=22050, hop_length=512, fmin=32.7, fmax=None, n_bins=84, bins_per_octave=12, filter_scale=1, norm=1, window='hann', center=True, pad_mode='reflect', trainable=False, output_format='Magnitude', verbose=True)

Bases: torch.nn.modules.module.Module

This class computes the constant-Q transform (CQT) of the input signal. The input signal should have one of the following shapes.

  1. (len_audio)

  2. (num_audio, len_audio)

  3. (num_audio, 1, len_audio)

The correct shape will be inferred automatically if the input follows one of these three shapes. Most of the arguments follow the librosa conventions. This class inherits from torch.nn.Module, so it is used in the same way as any other torch.nn.Module.
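As an illustration of the shape inference described above, the sketch below maps each accepted shape to the batched layout (num_audio, 1, len_audio). It works on plain shape tuples rather than tensors, and `to_batched_shape` is a hypothetical helper named here for illustration only; the library's internal handling may differ.

```python
def to_batched_shape(shape):
    """Map an accepted input shape to (num_audio, 1, len_audio).

    Hypothetical helper for illustration; not part of nnAudio.
    """
    shape = tuple(shape)
    if len(shape) == 1:                      # (len_audio)
        return (1, 1, shape[0])
    if len(shape) == 2:                      # (num_audio, len_audio)
        return (shape[0], 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:    # (num_audio, 1, len_audio)
        return shape
    raise ValueError(f"unsupported input shape: {shape}")


for s in [(22050,), (4, 22050), (4, 1, 22050)]:
    print(s, "->", to_batched_shape(s))
```

Any other layout, for example a stereo tensor of shape (num_audio, 2, len_audio), is rejected in this sketch, so such input would presumably need to be reduced to a single channel first.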

This algorithm uses the method proposed in [1]. I slightly modified it so that it runs faster than the original 1992 algorithm, which is why I call it version 2.

[1] Brown, Judith C. and Miller Puckette. “An efficient algorithm for the calculation of a constant Q transform.” (1992).

  • sr (int) – The sampling rate of the input audio. It is used to calculate the correct fmin and fmax. Setting the correct sampling rate is very important for calculating the correct frequencies.

  • hop_length (int) – The hop (or stride) size. Default value is 512.

  • fmin (float) – The frequency of the lowest CQT bin. Default is 32.70 Hz, which corresponds to the note C1.

  • fmax (float) – The frequency of the highest CQT bin. Default is None, in which case the highest CQT bin is inferred from n_bins and bins_per_octave. If fmax is not None, the argument n_bins is ignored and n_bins is calculated automatically.

  • n_bins (int) – The total number of CQT bins. Default is 84. Ignored if fmax is not None.

  • bins_per_octave (int) – Number of bins per octave. Default is 12.

  • filter_scale (float > 0) – Filter scale factor. Values of filter_scale smaller than 1 can be used to improve the time resolution at the cost of degrading the frequency resolution. Note that setting, for example, filter_scale = 0.5 and bins_per_octave = 48 leads to essentially the same time-frequency resolution trade-off as setting filter_scale = 1 and bins_per_octave = 24, but the former contains twice as many frequency bins per octave. In this sense, values of filter_scale < 1 can be seen as oversampling the frequency axis, analogous to the use of zero padding when calculating the DFT.

  • norm (int) – Normalization for the CQT kernels. 1 means L1 normalization, and 2 means L2 normalization. Default is 1, which is the same as the normalization used in librosa.

  • window (string, float, or tuple) – The windowing function for the CQT. If it is a string, scipy.signal.get_window is used. If it is a tuple, only the Gaussian window guarantees a constant Q factor. A Gaussian window should be given as a tuple (‘gaussian’, att), where att is the attenuation at the border in dB. Please refer to the scipy documentation for the available windowing functions. The default value is ‘hann’.

  • center (bool) – Whether to place the CQT kernel at the center of the time step. If False, the time index is the beginning of the CQT kernel; if True, the time index is the center of the CQT kernel. Default value is True.

  • pad_mode (str) – The padding method. Default value is ‘reflect’.

  • trainable (bool) – Determines whether the CQT kernels are trainable. If True, the gradients for the CQT kernels will also be calculated and the CQT kernels will be updated during model training. Default value is False.

  • output_format (str) – Determines the return type. ‘Magnitude’ returns the magnitude of the CQT result, shape = (num_samples, freq_bins, time_steps); ‘Complex’ returns the CQT result as complex numbers, shape = (num_samples, freq_bins, time_steps, 2); ‘Phase’ returns the phase of the CQT result, shape = (num_samples, freq_bins, time_steps, 2). Complex numbers are stored as (real, imag) in the last axis. Default value is ‘Magnitude’.

  • verbose (bool) – If True, layer information is printed. If False, all prints are suppressed.
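
To make the relationship between fmin, fmax, n_bins, bins_per_octave and filter_scale more concrete, here is a minimal sketch in plain Python. It assumes the usual CQT layout, in which bin k is centered at fmin * 2^(k / bins_per_octave), and the common definition Q = filter_scale / (2^(1 / bins_per_octave) - 1). The helper names are hypothetical, and the rounding used to derive n_bins from fmax is an assumption that may not match the library exactly.

```python
import math


def cqt_frequencies(fmin=32.7, n_bins=84, bins_per_octave=12):
    """Geometrically spaced bin center frequencies (assumed layout)."""
    return [fmin * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def n_bins_from_fmax(fmin, fmax, bins_per_octave=12):
    """Number of bins needed to reach fmax (assumed rounding: ceil)."""
    return math.ceil(bins_per_octave * math.log2(fmax / fmin))


def q_factor(filter_scale=1.0, bins_per_octave=12):
    """Quality factor of the filters (assumed definition)."""
    return filter_scale / (2.0 ** (1.0 / bins_per_octave) - 1.0)


freqs = cqt_frequencies()
# With the defaults, the 84 bins span seven octaves starting at 32.7 Hz,
# so the highest bin sits a little below 4 kHz.
print(len(freqs), freqs[0], freqs[-1])

# The highest bin must stay below the Nyquist frequency sr / 2.
sr = 22050
print(freqs[-1] < sr / 2)

# The two filter_scale / bins_per_octave settings compared above give
# nearly the same Q under the assumed definition.
print(q_factor(0.5, 48), q_factor(1.0, 24))
```

Under these assumptions, halving filter_scale while doubling bins_per_octave leaves Q almost unchanged (within about one percent), which matches the oversampling interpretation given for filter_scale.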


  • spectrogram (torch.tensor) – A tensor of spectrograms, with shape = (num_samples, freq_bins, time_steps) if output_format='Magnitude', or shape = (num_samples, freq_bins, time_steps, 2) if output_format='Complex' or 'Phase'.
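
The output shapes listed above can be estimated ahead of time. The sketch below is an assumption-laden estimate, not the library's code: it supposes that freq_bins equals n_bins and that, with center=True, the signal is padded by half the kernel width on each side and strided by hop_length, which would give len_audio // hop_length + 1 time steps. `expected_output_shape` is a hypothetical helper, and its result should be checked against the actual tensor returned by the layer.

```python
def expected_output_shape(num_audio, len_audio, n_bins=84, hop_length=512,
                          output_format="Magnitude"):
    """Estimate the output shape for center=True (assumed frame count)."""
    time_steps = len_audio // hop_length + 1  # assumption, see text above
    shape = (num_audio, n_bins, time_steps)
    if output_format in ("Complex", "Phase"):
        shape += (2,)  # extra axis of size 2, as documented above
    return shape


# Four one-second clips at sr=22050 with the default settings.
print(expected_output_shape(4, 22050))             # (4, 84, 44)
print(expected_output_shape(4, 22050, output_format="Complex"))  # (4, 84, 44, 2)
```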


>>> from nnAudio import Spectrogram
>>> spec_layer = Spectrogram.CQT1992v2()
>>> specs = spec_layer(x)  # x: waveform tensor, e.g. shape (num_audio, len_audio)



Method for debugging

forward(x, output_format=None, normalization_type='librosa')

Convert a batch of waveforms to CQT spectrograms.

  • x (torch tensor) –

    The input signal should have one of the following shapes.

    1. (len_audio)

    2. (num_audio, len_audio)

    3. (num_audio, 1, len_audio)

    It will be automatically broadcast to the right shape.

  • normalization_type (str) –

    Type of the normalisation. The possible options are:

    ’librosa’ : the output matches that of librosa

    ’convolutional’ : the output conserves the convolutional inequalities of the wavelet transform:

    for all p ∈ [1, inf]

    • || CQT ||_p <= || f ||_p || g ||_1

    • || CQT ||_p <= || f ||_1 || g ||_p

    • || CQT ||_2 = || f ||_2 || g ||_2

    ’wrap’ : wraps positive and negative frequencies into positive frequencies. This means that the CQT of a sine (or cosine) with a constant amplitude equal to 1 will have the value 1 in the bin corresponding to its frequency.
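
The first two inequalities listed for the ’convolutional’ option are instances of Young's convolution inequality. The sketch below checks them numerically on two short toy sequences using a plain discrete convolution. It is a standalone illustration with made-up values for f and g, and it does not call the library or reproduce its normalization.

```python
import math


def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


def p_norm(x, p):
    """l^p norm of a sequence; p may be math.inf."""
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


f = [0.5, -1.0, 2.0, 0.25, -0.75]   # toy "signal" (arbitrary values)
g = [0.2, 0.5, 0.2, 0.1]            # toy "kernel" (arbitrary values)
h = convolve(f, g)

tol = 1e-12  # slack for floating-point rounding
checks = []
for p in (1, 2, math.inf):
    lhs = p_norm(h, p)
    checks.append(lhs <= p_norm(f, p) * p_norm(g, 1) + tol)
    checks.append(lhs <= p_norm(f, 1) * p_norm(g, p) + tol)

print(all(checks))
```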


