nnAudio.Spectrogram.STFT

class nnAudio.Spectrogram.STFT(n_fft=2048, win_length=None, freq_bins=None, hop_length=None, window='hann', freq_scale='no', center=True, pad_mode='reflect', fmin=50, fmax=6000, sr=22050, trainable=False, output_format='Complex', verbose=True, device='cpu')

Bases: torch.nn.modules.module.Module

This class calculates the short-time Fourier transform (STFT) of the input signal. The input signal should have one of the following shapes.

  1. (len_audio)

  2. (num_audio, len_audio)

  3. (num_audio, 1, len_audio)

The correct shape is inferred automatically if the input has one of these three shapes. Most of the arguments follow the librosa convention. This class inherits from torch.nn.Module, so it is used in the same way as any other torch.nn.Module.
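The shape inference described above can be sketched on plain shape tuples. This is only an illustration: infer_batched_shape is a hypothetical helper, not part of nnAudio, and the assumption that all inputs end up as (num_audio, 1, len_audio) is not confirmed by this page.

```python
def infer_batched_shape(shape):
    """Map an accepted input shape to (num_audio, 1, len_audio).

    Hypothetical helper for illustration; it is not part of nnAudio.
    """
    shape = tuple(shape)
    if len(shape) == 1:                    # (len_audio)
        return (1, 1, shape[0])
    if len(shape) == 2:                    # (num_audio, len_audio)
        return (shape[0], 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:  # (num_audio, 1, len_audio)
        return shape
    raise ValueError(f"unsupported input shape: {shape}")


print(infer_batched_shape((22050,)))       # (1, 1, 22050)
print(infer_batched_shape((4, 22050)))     # (4, 1, 22050)
print(infer_batched_shape((4, 1, 22050)))  # (4, 1, 22050)
```

Any other shape, such as a stereo batch of shape (4, 2, 22050), is rejected in this sketch rather than silently reshaped.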

Parameters
  • n_fft (int) – The window size. Default value is 2048.

  • freq_bins (int) – Number of frequency bins. Default is None, which means n_fft//2+1 bins.

  • hop_length (int) – The hop (or stride) size. Default value is None, which is equivalent to n_fft//4.

  • window (str) – The windowing function for STFT. It is passed to scipy.signal.get_window; refer to the scipy documentation for the available windowing functions. Default value is ‘hann’.

  • freq_scale ('linear', 'log', or 'no') – Determines the spacing between frequency bins. When ‘linear’ or ‘log’ is used, the bin spacing can be controlled by fmin and fmax. If ‘no’ is used, the bins start at 0 Hz and end at the Nyquist frequency with linear spacing.

  • center (bool) – Whether to place the STFT kernel at the center of the time step. If False, the time index is the beginning of the STFT kernel; if True, the time index is the center of the STFT kernel. Default value is True.

  • pad_mode (str) – The padding method. Default value is ‘reflect’.

  • fmin (int) – The starting frequency for the lowest frequency bin. If freq_scale is ‘no’, this argument has no effect.

  • fmax (int) – The ending frequency for the highest frequency bin. If freq_scale is ‘no’, this argument has no effect.

  • sr (int) – The sampling rate of the input audio. It is used to calculate the correct fmin and fmax. Setting the correct sampling rate is essential for obtaining the correct frequencies.

  • trainable (bool) – Determines whether the STFT kernels are trainable. If True, the gradients for the STFT kernels are also calculated and the STFT kernels are updated during model training. Default value is False.

  • verbose (bool) – If True, layer information is printed. If False, all printing is suppressed.

  • device (str) – The device on which to initialize this layer. Default value is ‘cpu’.
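As a rough guide to how n_fft, hop_length, freq_bins, center and sr interact, the sketch below computes the expected output size and the bin frequencies for freq_scale='no'. Both functions are hypothetical helpers, not part of nnAudio. The frame-count formula assumes the usual strided-window convention (padding of n_fft//2 on each side when center=True), which this page does not state explicitly.

```python
def stft_output_size(len_audio, n_fft=2048, hop_length=None,
                     freq_bins=None, center=True):
    """Return (freq_bins, time_steps) under the assumptions named above."""
    if hop_length is None:
        hop_length = n_fft // 4        # documented default
    if freq_bins is None:
        freq_bins = n_fft // 2 + 1     # documented default
    # Assumption: center=True pads n_fft//2 samples on both sides.
    padded = len_audio + 2 * (n_fft // 2) if center else len_audio
    time_steps = 1 + (padded - n_fft) // hop_length
    return freq_bins, time_steps


def bin_frequencies(n_fft=2048, sr=22050):
    """Linearly spaced bin frequencies from 0 Hz to Nyquist (freq_scale='no')."""
    return [k * sr / n_fft for k in range(n_fft // 2 + 1)]


# One second of audio at the default sampling rate.
print(stft_output_size(22050))                # (1025, 44)
print(stft_output_size(22050, center=False))  # (1025, 40)

freqs = bin_frequencies()
print(freqs[0], freqs[-1])                    # 0.0 11025.0
```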

Returns

spectrogram – A tensor of spectrograms. shape = (num_samples, freq_bins, time_steps) if output_format='Magnitude'; shape = (num_samples, freq_bins, time_steps, 2) if output_format='Complex' or 'Phase'.

Return type

torch.tensor
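The shape rule for the returned tensor can be written out as a small lookup. expected_output_shape is a hypothetical helper used only to restate the rule in code; it does not call nnAudio.

```python
def expected_output_shape(num_samples, freq_bins, time_steps,
                          output_format="Complex"):
    """Return the documented output shape for a given output_format."""
    if output_format == "Magnitude":
        return (num_samples, freq_bins, time_steps)
    if output_format in ("Complex", "Phase"):
        # The trailing axis of size 2 is documented for both formats.
        return (num_samples, freq_bins, time_steps, 2)
    raise ValueError(f"unknown output_format: {output_format!r}")


print(expected_output_shape(4, 1025, 44, "Magnitude"))  # (4, 1025, 44)
print(expected_output_shape(4, 1025, 44, "Complex"))    # (4, 1025, 44, 2)
```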

Examples

>>> from nnAudio import Spectrogram
>>> spec_layer = Spectrogram.STFT()
>>> specs = spec_layer(x)  # x: waveform tensor in one of the accepted input shapes

Methods

__init__

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward

Convert a batch of waveforms to spectrograms.

inverse

Same as the iSTFT() class: converts spectrograms back to waveforms.

forward(x, output_format='Complex')

Convert a batch of waveforms to spectrograms.

Parameters
  • x (torch tensor) –

    The input signal should have one of the following shapes.

    1. (len_audio)

    2. (num_audio, len_audio)

    3. (num_audio, 1, len_audio)

    It will be automatically broadcast to the right shape.

  • output_format (str) – Controls the type of spectrogram to be returned. Can be ‘Magnitude’, ‘Complex’, or ‘Phase’. Default value is ‘Complex’.
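To relate the output formats to each other, the sketch below converts a single (real, imaginary) pair into a magnitude and a phase angle. It assumes that the trailing axis of size 2 in the ‘Complex’ output holds the real and imaginary parts in that order. This page does not spell out that layout or the sign convention of the phase, so treat both as assumptions. magnitude_and_angle is a hypothetical helper.

```python
import math


def magnitude_and_angle(real, imag):
    """Return (magnitude, phase angle in radians) of one complex STFT value."""
    return math.hypot(real, imag), math.atan2(imag, real)


mag, angle = magnitude_and_angle(3.0, 4.0)
print(mag)  # 5.0
```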

inverse(X, onesided=True, length=None, refresh_win=True)

Same as the iSTFT() class: converts spectrograms back to waveforms. It only works for complex-valued spectrograms. If you have magnitude spectrograms, use Griffin_Lim() instead.

Parameters
  • onesided (bool) – If your spectrograms have only n_fft//2+1 frequency bins, use onesided=True; otherwise use onesided=False.

  • length (int) – To make sure the inverse STFT output has the same length as the original waveform, set length to your intended waveform length. By default, length=None, which removes n_fft//2 samples from the start and the end of the output.

  • refresh_win (bool) – Whether to recalculate the window sum square. If your input has a fixed number of time steps, you can increase the speed by setting refresh_win=False. Otherwise, keep refresh_win=True.
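The role of length can be seen from the output-length arithmetic. The sketch below assumes a standard overlap-add reconstruction, where time_steps frames of size n_fft spaced hop_length apart span n_fft + hop_length * (time_steps - 1) samples before trimming. istft_output_length is a hypothetical helper, not part of nnAudio.

```python
def istft_output_length(time_steps, n_fft=2048, hop_length=None, length=None):
    """Estimate the waveform length returned by an inverse STFT."""
    if hop_length is None:
        hop_length = n_fft // 4
    if length is not None:
        return length                  # caller pins the output length
    # Assumption: overlap-add over all frames, then n_fft//2 samples
    # are removed from the start and from the end (the documented default).
    full = n_fft + hop_length * (time_steps - 1)
    return full - 2 * (n_fft // 2)


# A 22050-sample clip gives 44 frames with the default centered settings.
print(istft_output_length(44))                # 22016
print(istft_output_length(44, length=22050))  # 22050
```

Under these assumptions the default trimming returns 22016 samples for a clip that originally had 22050, which is why passing length is recommended when the output must match the original waveform.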