nnAudio.Spectrogram.CQT1992v2
- class nnAudio.Spectrogram.CQT1992v2(sr=22050, hop_length=512, fmin=32.7, fmax=None, n_bins=84, bins_per_octave=12, filter_scale=1, norm=1, window='hann', center=True, pad_mode='reflect', trainable=False, output_format='Magnitude', verbose=True)
Bases: torch.nn.modules.module.Module
This class calculates the constant-Q transform (CQT) of the input signal. The input signal should have one of the following shapes:
(len_audio)
(num_audio, len_audio)
(num_audio, 1, len_audio)
The correct shape will be inferred automatically if the input follows one of these three shapes. Most of the arguments follow the conventions of librosa. This class inherits from torch.nn.Module, so it is used in the same way as any other torch.nn.Module.
This algorithm uses the method proposed in [1]. I slightly modified it so that it runs faster than the original 1992 algorithm, which is why I call it version 2.
[1] Brown, Judith C. and Miller Puckette. “An efficient algorithm for the calculation of a constant Q transform.” (1992).
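The shape inference described above can be sketched without any dependencies. The helper `infer_batch_shape` is a hypothetical name used only for illustration and is not part of nnAudio; it assumes that every accepted input ends up as `(num_audio, 1, len_audio)` before the transform is applied.

```python
def infer_batch_shape(shape):
    """Map an accepted input shape to (num_audio, 1, len_audio).

    Hypothetical helper for illustration; assumes the layer works on 3-D input.
    """
    if len(shape) == 1:                    # (len_audio,)
        return (1, 1, shape[0])
    if len(shape) == 2:                    # (num_audio, len_audio)
        return (shape[0], 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:  # (num_audio, 1, len_audio)
        return tuple(shape)
    raise ValueError(f"unsupported input shape: {shape}")


print(infer_batch_shape((22050,)))       # (1, 1, 22050)
print(infer_batch_shape((4, 22050)))     # (4, 1, 22050)
print(infer_batch_shape((4, 1, 22050)))  # (4, 1, 22050)
```

Any other shape, such as multi-channel input with a middle axis larger than 1, is rejected in this sketch.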
- Parameters
sr (int) – The sampling rate of the input audio. It is used to calculate the correct fmin and fmax. Setting the correct sampling rate is very important for calculating the correct frequency.
hop_length (int) – The hop (or stride) size. Default value is 512.
fmin (float) – The frequency for the lowest CQT bin. Default is 32.70 Hz, which corresponds to the note C1.
fmax (float) – The frequency for the highest CQT bin. Default is None, in which case the highest CQT bin is inferred from n_bins and bins_per_octave. If fmax is not None, then the argument n_bins will be ignored and n_bins will be calculated automatically.
n_bins (int) – The total number of CQT bins. Default is 84. Will be ignored if fmax is not None.
bins_per_octave (int) – Number of bins per octave. Default is 12.
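How fmin, fmax, n_bins and bins_per_octave relate can be illustrated with a small sketch. It assumes geometrically spaced center frequencies, `fmin * 2**(k / bins_per_octave)`, and a ceiling rule for deriving the bin count from fmax; both are assumptions of this example, and the function names are hypothetical rather than part of nnAudio.

```python
import math


def cqt_frequencies(fmin=32.7, n_bins=84, bins_per_octave=12):
    """Center frequencies, assuming geometric spacing between bins."""
    return [fmin * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def n_bins_from_fmax(fmin, fmax, bins_per_octave=12):
    """Bin count needed to span fmin..fmax (assumed ceiling rule)."""
    return math.ceil(bins_per_octave * math.log2(fmax / fmin))


freqs = cqt_frequencies()
print(len(freqs))                      # 84 bins = 7 octaves of 12 bins
print(round(freqs[12], 1))             # one octave above fmin: 65.4
print(n_bins_from_fmax(32.7, 4000.0))  # 84
```

With the defaults, the highest bin sits just below 4 kHz, which is why an fmax of 4000 Hz maps back to 84 bins in this sketch.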
filter_scale (float > 0) – Filter scale factor. Values of filter_scale smaller than 1 can be used to improve the time resolution at the cost of degrading the frequency resolution. Note that setting, for example, filter_scale = 0.5 and bins_per_octave = 48 leads to exactly the same time-frequency resolution trade-off as setting filter_scale = 1 and bins_per_octave = 24, but the former contains twice as many frequency bins per octave. In this sense, values of filter_scale < 1 can be seen as oversampling the frequency axis, analogous to the use of zero padding when calculating the DFT.
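The trade-off between filter_scale and bins_per_octave can be checked numerically. The sketch below assumes the common constant-Q definition `Q = filter_scale / (2**(1 / bins_per_octave) - 1)`; that formula and the function name `q_factor` are assumptions of this example, not something stated on this page.

```python
def q_factor(filter_scale=1.0, bins_per_octave=12):
    """Quality factor under the assumed definition used in this sketch."""
    return filter_scale / (2.0 ** (1.0 / bins_per_octave) - 1.0)


q_a = q_factor(filter_scale=0.5, bins_per_octave=48)
q_b = q_factor(filter_scale=1.0, bins_per_octave=24)

# Relative difference between the two settings compared in the text.
print(abs(q_a - q_b) / q_b)
```

Under this definition the two settings give Q values that agree to within about one percent, while the first one places twice as many bins in each octave.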
norm (int) – Normalization for the CQT kernels. 1 means L1 normalization, and 2 means L2 normalization. Default is 1, which is the same normalization used in librosa.
window (string, float, or tuple) – The windowing function for the CQT. If it is a string, scipy.signal.get_window is used. If it is a tuple, only the Gaussian window guarantees a constant Q factor. A Gaussian window should be given as a tuple (‘gaussian’, att), where att is the attenuation at the border in dB. Please refer to the scipy documentation for possible windowing functions. The default value is ‘hann’.
center (bool) – Whether to place the CQT kernel at the center of the time step. If False, the time index is the beginning of the CQT kernel; if True, the time index is the center of the CQT kernel. Default value is True.
pad_mode (str) – The padding method. Default value is ‘reflect’.
trainable (bool) – Determines whether the CQT kernels are trainable. If True, the gradients for the CQT kernels will also be calculated and the CQT kernels will be updated during model training. Default value is False.
output_format (str) – Determines the return type. Magnitude will return the magnitude of the CQT result, shape = (num_samples, freq_bins, time_steps); Complex will return the CQT result as complex numbers, shape = (num_samples, freq_bins, time_steps, 2); Phase will return the phase of the CQT result, shape = (num_samples, freq_bins, time_steps, 2). The complex number is stored as (real, imag) in the last axis. Default value is ‘Magnitude’.
verbose (bool) – If True, it shows layer information. If False, it suppresses all prints.
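Because the ‘Complex’ format stores each value as a (real, imag) pair in the last axis, magnitude and phase angle can be recovered from a pair with ordinary trigonometry. The sketch below works on plain Python tuples rather than tensors, and the helper names are hypothetical; how the ‘Phase’ format packs its own two channels is not described here, so the example does not try to reproduce it.

```python
import math


def magnitude(pair):
    """Magnitude of one (real, imag) pair taken from the last axis."""
    real, imag = pair
    return math.hypot(real, imag)


def phase_angle(pair):
    """Phase angle in radians of one (real, imag) pair."""
    real, imag = pair
    return math.atan2(imag, real)


# Toy 'Complex' output for a single bin over three time steps: (time_steps, 2).
frames = [(3.0, 4.0), (0.0, 1.0), (-1.0, 0.0)]
print([magnitude(p) for p in frames])  # [5.0, 1.0, 1.0]
print([phase_angle(p) for p in frames])
```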
- Returns
spectrogram (torch.tensor) – It returns a tensor of spectrograms.
shape = (num_samples, freq_bins, time_steps) if output_format='Magnitude';
shape = (num_samples, freq_bins, time_steps, 2) if output_format='Complex' or 'Phase'.
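The return shapes listed above can be estimated ahead of time with a short sketch. Two things here are assumptions rather than facts from this page: that freq_bins equals n_bins, and that with center=True the number of time steps follows the librosa-style framing `len_audio // hop_length + 1`. The function name is hypothetical.

```python
def expected_output_shape(num_samples, len_audio, n_bins=84, hop_length=512,
                          output_format="Magnitude"):
    """Estimate the output shape, assuming center=True and librosa-style framing."""
    time_steps = len_audio // hop_length + 1  # assumption, not stated on this page
    shape = (num_samples, n_bins, time_steps)
    if output_format in ("Complex", "Phase"):
        return shape + (2,)  # trailing axis of size 2, as in the Returns section
    if output_format != "Magnitude":
        raise ValueError(f"unknown output_format: {output_format}")
    return shape


print(expected_output_shape(4, 22050))                           # (4, 84, 44)
print(expected_output_shape(4, 22050, output_format="Complex"))  # (4, 84, 44, 2)
```

Comparing this estimate with the shape of an actual output is a quick way to confirm whether the framing assumption holds for a given configuration.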
Examples
>>> from nnAudio import Spectrogram
>>> spec_layer = Spectrogram.CQT1992v2()
>>> specs = spec_layer(x)
Methods
__init__ – Initializes internal Module state, shared by both nn.Module and ScriptModule.
forward – Convert a batch of waveforms to CQT spectrograms.
forward_manual – Method for debugging.
- forward(x, output_format=None, normalization_type='librosa')
Convert a batch of waveforms to CQT spectrograms.
- Parameters
x (torch tensor) –
Input signal should have one of the following shapes:
1. (len_audio)
2. (num_audio, len_audio)
3. (num_audio, 1, len_audio)
It will be automatically broadcast to the right shape.
normalization_type (str) –
Type of the normalisation. The possible options are:
’librosa’ : the output matches the output of librosa
’convolutional’ : the output conserves the convolutional inequalities of the wavelet transform:
for all p ϵ [1, inf]
|| CQT ||_p <= || f ||_p || g ||_1
|| CQT ||_p <= || f ||_1 || g ||_p
|| CQT ||_2 = || f ||_2 || g ||_2
’wrap’ : wraps positive and negative frequencies into positive frequencies. This means that the CQT of a sine (or cosine) with a constant amplitude of 1 will have the value 1 in the bin corresponding to its frequency.
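The two inequalities listed for the ’convolutional’ option are instances of Young's convolution inequality, which can be checked numerically on a toy example. The sketch below uses a plain discrete convolution of two short sequences; it does not call nnAudio and does not reproduce the CQT itself, and the sequences f and g are made-up values for illustration.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out


def p_norm(x, p):
    """l^p norm of a sequence; p = float('inf') gives the max norm."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


f = [0.5, -1.0, 2.0, 0.25]  # toy signal
g = [0.25, 0.5, 0.25]       # toy kernel
h = convolve(f, g)

for p in (1, 2, float("inf")):
    # ||f * g||_p <= ||f||_p ||g||_1  and  ||f * g||_p <= ||f||_1 ||g||_p
    assert p_norm(h, p) <= p_norm(f, p) * p_norm(g, 1) + 1e-12
    assert p_norm(h, p) <= p_norm(f, 1) * p_norm(g, p) + 1e-12
```

The small tolerance only guards against floating-point rounding; the inequalities themselves hold exactly.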
- forward_manual(x)
Method for debugging