nnAudio.Spectrogram.CQT2010v2
- class nnAudio.Spectrogram.CQT2010v2(sr=22050, hop_length=512, fmin=32.7, fmax=None, n_bins=84, bins_per_octave=12, norm=True, basis_norm=1, window='hann', pad_mode='reflect', earlydownsample=True, trainable=False, output_format='Magnitude', verbose=True)
Bases: torch.nn.modules.module.Module
This class computes the constant-Q transform (CQT) of the input signal. The input signal should have one of the following shapes:
(len_audio)
(num_audio, len_audio)
(num_audio, 1, len_audio)
The correct shape is inferred automatically if the input follows one of these three shapes. Most of the arguments follow the librosa convention. This class inherits from torch.nn.Module, so it is used in the same way as any other torch.nn.Module.
This algorithm uses the resampling method proposed in [1]. Instead of convolving the STFT results with a gigantic CQT kernel covering the full frequency spectrum, we make a small CQT kernel covering only the top octave. We then repeatedly downsample the input audio by a factor of 2 and convolve it with the small CQT kernel. Each time the input audio is downsampled, the CQT relative to the downsampled input corresponds to the next lower octave. The kernel creation process is still the same as in the 1992 algorithm, so the code from the 1992 algorithm [2] can be reused.
[1] Schörkhuber, Christian. “CONSTANT-Q TRANSFORM TOOLBOX FOR MUSIC PROCESSING.” (2010).
[2] Brown, Judith C. and Miller Puckette. “An efficient algorithm for the calculation of a constant Q transform.” (1992).
The early downsampling factor downsamples the input audio to reduce the CQT kernel size. The results with and without early downsampling are nearly the same, except in the very low frequency region below 40 Hz.
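The octave-by-octave scheme described above can be sketched in plain Python. This is only an illustration of the idea, not nnAudio's implementation: the function name `octave_plan` and the returned dictionary keys are hypothetical, and filtering, early downsampling, and hop-length handling are deliberately left out.

```python
import math

def octave_plan(sr=22050, fmin=32.7, n_bins=84, bins_per_octave=12):
    """List, per octave, the sample rate after repeated halving and the band start.

    Hypothetical helper for illustration only; it does no signal processing.
    """
    n_octaves = math.ceil(n_bins / bins_per_octave)
    # Lowest frequency of the top octave: the single small kernel targets this band.
    top_fmin = fmin * 2 ** (n_octaves - 1)
    plan = []
    for k in range(n_octaves):
        plan.append({
            "octave": k,                         # 0 is the top octave
            "sample_rate": sr / 2 ** k,          # audio downsampled by 2, k times
            "band_start_hz": top_fmin / 2 ** k,  # band covered at this step
        })
    return plan

plan = octave_plan()
# The band start divided by the sample rate is the same for every octave,
# which is why one kernel built for the top octave can serve all of them.
ratios = [p["band_start_hz"] / p["sample_rate"] for p in plan]
```

With the default arguments the plan has seven entries, and the last one starts at `fmin`.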
- Parameters
sr (int) – The sampling rate for the input audio. It is used to calculate the correct fmin and fmax. Setting the correct sampling rate is very important for calculating the correct frequency.
hop_length (int) – The hop (or stride) size. Default value is 512.
fmin (float) – The frequency for the lowest CQT bin. Default is 32.70 Hz, which corresponds to the note C1 in scientific pitch notation.
fmax (float) – The frequency for the highest CQT bin. Default is None, in which case the highest CQT bin is inferred from n_bins and bins_per_octave. If fmax is not None, the argument n_bins is ignored and n_bins is calculated automatically.
n_bins (int) – The total number of CQT bins. Default is 84. Ignored if fmax is not None.
bins_per_octave (int) – Number of bins per octave. Default is 12.
norm (bool) – Normalization for the CQT result.
basis_norm (int) – Normalization for the CQT kernels. 1 means L1 normalization, and 2 means L2 normalization. Default is 1, which is the same as the normalization used in librosa.
window (str) – The windowing function for CQT. It uses scipy.signal.get_window; please refer to the scipy documentation for possible windowing functions. Default value is ‘hann’.
pad_mode (str) – The padding method. Default value is ‘reflect’.
trainable (bool) – Determine if the CQT kernels are trainable or not. If True, the gradients for the CQT kernels will also be calculated and the CQT kernels will be updated during model training. Default value is False.
output_format (str) – Determine the return type. ‘Magnitude’ will return the magnitude of the CQT result, shape = (num_samples, freq_bins, time_steps); ‘Complex’ will return the CQT result as complex numbers, shape = (num_samples, freq_bins, time_steps, 2); ‘Phase’ will return the phase of the CQT result, shape = (num_samples, freq_bins, time_steps, 2). The complex number is stored as (real, imag) in the last axis. Default value is ‘Magnitude’.
verbose (bool) – If True, it shows layer information. If False, it suppresses all prints.
device (str) – Choose which device to initialize this layer. Default value is ‘cpu’.
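To make the relationship between fmin, fmax, n_bins, and bins_per_octave concrete, here is a small sketch in plain Python. It assumes the usual geometric bin spacing, with bin k centered at fmin * 2^(k / bins_per_octave). The rounding rule used to derive n_bins from fmax is an assumption and may differ from what the library does. Both function names are hypothetical.

```python
import math

def cqt_frequencies(fmin=32.7, n_bins=84, bins_per_octave=12):
    """Center frequency of each CQT bin under geometric spacing (illustrative)."""
    return [fmin * 2 ** (k / bins_per_octave) for k in range(n_bins)]

def n_bins_from_fmax(fmin, fmax, bins_per_octave=12):
    """Assumed rule: enough bins to span log2(fmax / fmin) octaves, rounded up."""
    return math.ceil(bins_per_octave * math.log2(fmax / fmin))

freqs = cqt_frequencies()
# With the defaults, the 84 bins span seven octaves starting at fmin,
# and every 12th bin doubles the frequency.
bins_for_seven_octaves = n_bins_from_fmax(32.7, 32.7 * 2 ** 7)
```

With the default sr of 22050 Hz, the highest default bin sits well below the Nyquist frequency of 11025 Hz.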
- Returns
spectrogram (torch.tensor) – It returns a tensor of spectrograms. shape = (num_samples, freq_bins, time_steps) if output_format='Magnitude'; shape = (num_samples, freq_bins, time_steps, 2) if output_format='Complex' or 'Phase'.
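The link between the ‘Complex’ and ‘Magnitude’ layouts can be illustrated with nested lists instead of tensors. This is a sketch of the mathematical relationship only, assuming the (real, imag) pairs sit in the last axis as stated above. The function name `magnitude_from_complex` is hypothetical, and the library's own numerical details are not reproduced here.

```python
import math

def magnitude_from_complex(spec):
    """Collapse a (freq_bins, time_steps, 2) nested list to (freq_bins, time_steps)."""
    # Each innermost pair is (real, imag); its magnitude is sqrt(real^2 + imag^2).
    return [[math.hypot(real, imag) for real, imag in row] for row in spec]

# Two frequency bins, two time steps, (real, imag) in the last axis.
complex_spec = [
    [(3.0, 4.0), (0.0, 1.0)],
    [(1.0, 0.0), (0.0, 0.0)],
]
magnitude_spec = magnitude_from_complex(complex_spec)
```

The trailing axis of size 2 disappears, which matches the difference between the two documented output shapes.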
Examples
>>> spec_layer = Spectrogram.CQT2010v2()
>>> specs = spec_layer(x)
Methods
__init__
Initializes internal Module state, shared by both nn.Module and ScriptModule.
forward
Convert a batch of waveforms to CQT spectrograms.
- forward(x, output_format=None)
Convert a batch of waveforms to CQT spectrograms.
- Parameters
x (torch tensor) – Input signal should have one of the following shapes:
1. (len_audio)
2. (num_audio, len_audio)
3. (num_audio, 1, len_audio)
It will be automatically broadcast to the right shape.
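The shape handling for x can be pictured with a small helper that works on shape tuples, so it runs without PyTorch. It assumes the target layout is (num_audio, 1, len_audio), the third accepted shape. The name `normalize_shape` and the error message are hypothetical, and the library's own logic may differ in detail.

```python
def normalize_shape(shape):
    """Map an accepted input shape to the assumed (num_audio, 1, len_audio) layout."""
    if len(shape) == 1:
        # A single waveform: add batch and channel axes.
        return (1, 1, shape[0])
    if len(shape) == 2:
        # A batch of waveforms: add the channel axis.
        return (shape[0], 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:
        # Already in the target layout.
        return tuple(shape)
    raise ValueError(f"Unsupported input shape: {shape}")

single = normalize_shape((22050,))
batch = normalize_shape((8, 22050))
already_3d = normalize_shape((8, 1, 22050))
```

On real tensors the same effect is typically obtained by inserting singleton dimensions, for example with `x[None, None, :]` or `x[:, None, :]`.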