nnAudio.Spectrogram.MelSpectrogram
- class nnAudio.Spectrogram.MelSpectrogram(sr=22050, n_fft=2048, win_length=None, n_mels=128, hop_length=512, window='hann', center=True, pad_mode='reflect', power=2.0, htk=False, fmin=0.0, fmax=None, norm=1, trainable_mel=False, trainable_STFT=False, verbose=True, **kwargs)
Bases: torch.nn.modules.module.Module
This class calculates the Mel spectrogram of the input signal. The input signal should be in one of the following shapes.
(len_audio)
(num_audio, len_audio)
(num_audio, 1, len_audio)
The correct shape is inferred automatically if the input follows one of these three shapes. Most of the arguments follow the librosa convention. This class inherits from torch.nn.Module, so it is used in the same way as any other torch.nn.Module.
- Parameters
sr (int) – The sampling rate of the input audio. It is used to calculate the correct fmin and fmax. Setting the correct sampling rate is essential for calculating the correct frequencies.
n_fft (int) – The window size for the STFT. Default value is 2048.
win_length (int) – The size of the window frame and STFT filter. Default: None (treated as equal to n_fft).
n_mels (int) – The number of Mel filter banks. The filter banks map the STFT frequency bins to Mel bins. Default value is 128.
hop_length (int) – The hop (or stride) size. Default value is 512.
window (str) – The windowing function for the STFT. It uses scipy.signal.get_window; please refer to the scipy documentation for the available windowing functions. Default value is ‘hann’.
center (bool) – Whether to place the STFT kernel at the center of the time step. If False, the time index is the beginning of the STFT kernel; if True, the time index is the center of the STFT kernel. Default value is True.
pad_mode (str) – The padding method. Default value is ‘reflect’.
htk (bool) – When False, the Mel scale is quasi-logarithmic. When True, the Mel scale is logarithmic. Default value is False.
fmin (int) – The starting frequency for the lowest Mel filter bank.
fmax (int) – The ending frequency for the highest Mel filter bank.
norm – If 1, divide the triangular Mel weights by the width of the Mel band (area normalization, also known as ‘slaney’, the default in librosa). Otherwise, leave all the triangles with a peak value of 1.0.
trainable_mel (bool) – Determines whether the Mel filter banks are trainable. If True, the gradients for the Mel filter banks are also calculated and the Mel filter banks are updated during model training. Default value is False.
trainable_STFT (bool) – Determines whether the STFT kernels are trainable. If True, the gradients for the STFT kernels are also calculated and the STFT kernels are updated during model training. Default value is False.
verbose (bool) – If True, layer information is printed. If False, all prints are suppressed.
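To illustrate the difference that htk makes, the following is a minimal sketch of the two Hz-to-Mel conversions as they are defined in the librosa convention that this class says it follows. The function name hz_to_mel and its constants are written out here for illustration only; this is not the nnAudio source code, and it uses only the Python standard library.

```python
import math


def hz_to_mel(freq_hz, htk=False):
    """Convert a frequency in Hz to Mels (librosa-style convention)."""
    if htk:
        # HTK formula: logarithmic over the whole frequency range.
        return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

    # Slaney formula: linear below 1 kHz, logarithmic above it,
    # which is why the scale is described as quasi-logarithmic.
    f_sp = 200.0 / 3.0                 # Hz per Mel in the linear region
    min_log_hz = 1000.0                # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp    # Mel value at 1 kHz (about 15)
    logstep = math.log(6.4) / 27.0     # step size in the logarithmic region

    if freq_hz >= min_log_hz:
        return min_log_mel + math.log(freq_hz / min_log_hz) / logstep
    return freq_hz / f_sp


if __name__ == "__main__":
    for f in (250.0, 1000.0, 4000.0):
        print(f, hz_to_mel(f, htk=False), hz_to_mel(f, htk=True))
```

With htk=False, doubling a frequency below 1 kHz doubles its Mel value; with htk=True the mapping is logarithmic everywhere, so the two settings place the filter bank centers differently between fmin and fmax.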
- Returns
spectrogram – A tensor of spectrograms with shape (num_samples, freq_bins, time_steps).
- Return type
torch.tensor
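As a rough guide to the returned shape, the sketch below estimates (num_samples, freq_bins, time_steps) from the constructor arguments. It assumes that freq_bins equals n_mels, that frames are taken every hop_length samples with a kernel of n_fft samples, and that center=True pads n_fft // 2 samples on each side (the librosa convention). The helper name mel_output_shape is hypothetical and is not part of nnAudio.

```python
def mel_output_shape(num_audio, len_audio, n_fft=2048, hop_length=512,
                     n_mels=128, center=True):
    """Estimate the output shape of a Mel spectrogram layer.

    Assumption: framing behaves like a strided 1-D convolution with
    kernel size n_fft and stride hop_length.
    """
    padded_len = len_audio
    if center:
        # Centering pads n_fft // 2 samples at both ends of the signal.
        padded_len += 2 * (n_fft // 2)
    if padded_len < n_fft:
        raise ValueError("signal is shorter than one STFT window")
    time_steps = (padded_len - n_fft) // hop_length + 1
    return (num_audio, n_mels, time_steps)


if __name__ == "__main__":
    # Four one-second clips at the default sampling rate of 22050 Hz.
    print(mel_output_shape(4, 22050))
    print(mel_output_shape(4, 22050, center=False))
```

Under these assumptions, centering adds a few frames at the edges, so the same audio yields more time steps with center=True than with center=False.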
Examples
>>> from nnAudio import Spectrogram
>>> spec_layer = Spectrogram.MelSpectrogram()
>>> specs = spec_layer(x)
Methods
__init__ – Initializes internal Module state, shared by both nn.Module and ScriptModule.
extra_repr – Set the extra representation of the module.
forward – Convert a batch of waveforms to Mel spectrograms.
- extra_repr() → str
Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(x)
Convert a batch of waveforms to Mel spectrograms.
- Parameters
x (torch tensor) – The input signal should be in one of the following shapes.
(len_audio)
(num_audio, len_audio)
(num_audio, 1, len_audio)
It is automatically broadcast to the right shape.
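The shape handling described above can be sketched on plain shape tuples. This is an illustration of the rule, assuming that every accepted input is brought to the three-dimensional form (num_audio, 1, len_audio) before the STFT is applied; the helper name normalize_input_shape is hypothetical and is not an nnAudio function.

```python
def normalize_input_shape(shape):
    """Map an accepted input shape to (num_audio, 1, len_audio)."""
    shape = tuple(shape)
    if len(shape) == 1:
        # (len_audio) -> a single clip with one channel
        return (1, 1, shape[0])
    if len(shape) == 2:
        # (num_audio, len_audio) -> insert the channel axis
        return (shape[0], 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:
        # (num_audio, 1, len_audio) is already in the target form
        return shape
    raise ValueError(
        "expected (len_audio), (num_audio, len_audio) "
        "or (num_audio, 1, len_audio), got %r" % (shape,)
    )


if __name__ == "__main__":
    for s in [(22050,), (8, 22050), (8, 1, 22050)]:
        print(s, "->", normalize_input_shape(s))
```

In this sketch, any other layout, such as stereo audio with two channels, is rejected instead of being guessed at, so such input would need to be converted to mono or split per channel first.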