Structured Sparsity for Audio Signals
Introduction
This webpage provides supplementary audio examples, visualizations, and source code for research results on structured sparsity applied to audio restoration and denoising. The basic idea of this work is to exploit the dependencies between time-frequency coefficients to obtain better-regularized and more reliable sparse representations of audio signals. The work has been carried out in the framework of the AudioMiner project at the Numerical Harmonic Analysis Group (University of Vienna) and the Austrian Research Institute for Artificial Intelligence, in collaboration with the Laboratoire des Signaux et Systèmes at University Paris Sud.
Download the corresponding MATLAB toolbox for structured sparse estimation in WMDCT bases and Gabor frames here: StrucAudioToolbox_V02. We warmly welcome any suggestions and comments concerning the toolbox! Note that the toolbox requires Peter Søndergaard's linear time-frequency analysis toolbox (LTFAT), a fast general-purpose MATLAB toolbox for time-frequency analysis, to be installed.
Examples of structured shrinkage applied to audio restoration are given further down this page as well as in the following publications. Examples of audio declipping are given in the section "Audio Declipping with Social Sparsity" below.
References
Kai Siedenburg, Matthieu Kowalski, and Monika Dörfler: "Audio Declipping with Social Sparsity". Submitted.
Kai Siedenburg and P. Depalle: "Modulation Filtering for Structured Time-Frequency Estimation of Audio Signals". Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 20-23, 2013, New Paltz, NY.
Kai Siedenburg and Monika Dörfler: "Persistent Time-Frequency Shrinkage for Audio Denoising". Journal of the Audio Engineering Society (AES), No. 61 (1/2), Jan/Feb 2013.
Matthieu Kowalski, Kai Siedenburg, and Monika Dörfler: "Social Sparsity! Neighborhood Systems Enrich Structured Shrinkage Operators". IEEE Transactions on Signal Processing, 61(10), p. 2498-2511, May 2013.
Kai Siedenburg: "Persistent Empirical Wiener Estimation with Adaptive Threshold Selection for Audio Denoising". Proceedings of the 9th Sound and Music Computing Conference, July 11-14, 2012, Copenhagen.
Kai Siedenburg and Monika Dörfler: "Audio Denoising by Generalized Time-Frequency Thresholding". Proceedings of the AES 45th Conference on Applications of Time-Frequency Processing in Audio, March 1-4, 2012, Helsinki, Finland.
Kai Siedenburg and Monika Dörfler: "Structured Sparsity for Audio Signals". Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), September 19-23, 2011, Paris, France.
Questions? Please contact us if you have any comments or questions, or if you wish to collaborate.
Audio Declipping with Social Sparsity
This section demonstrates how social sparsity algorithms can be applied to the audio declipping problem, i.e. the recovery of truncated samples in audio files. For details and explanations, please see the above-mentioned reference on declipping. All examples on this page use a clipping level of 0.2, while the clean signal has a peak absolute value of 1. For the music signals we use a neighborhood extending 7 coefficients in time.
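To make the clipping setup concrete, here is a minimal Python sketch of hard clipping at level 0.2 for a signal with peak absolute value 1. The function names, the 440 Hz test tone and the 8 kHz sampling rate are illustrative assumptions, not part of the toolbox. The mask marks the samples that a declipping method can treat as reliable, while the remaining ones have to be recovered.

```python
import math

def hard_clip(signal, level):
    """Truncate every sample to the interval [-level, level]."""
    return [max(-level, min(level, s)) for s in signal]

def reliable_mask(clipped, level):
    """True where a sample lies strictly inside the clipping range (untouched by clipping)."""
    return [abs(s) < level for s in clipped]

# Toy signal with peak absolute value 1 (tone frequency and sampling rate are assumptions).
clean = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(400)]

clipped = hard_clip(clean, 0.2)
mask = reliable_mask(clipped, 0.2)

# Fraction of samples that were truncated and must be estimated by the declipper.
clipped_fraction = 1 - sum(mask) / len(mask)
```

At this clipping level most samples of a full-scale tone are truncated, which is what makes the problem hard.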
Audio Examples
Double Bass Solo:
Clean,
Clipped
Comparing Algorithms for Declipping:
More Examples
An acoustic guitar signal declipped by social sparsity.
Acoustic Guitar:
Clean,
Clipped
Declipping with Social Sparsity:
Speech
Speech is treated very similarly; only the neighborhood should be chosen differently. Here we use a neighborhood extending 3 coefficients in time. See the paper for details.
Declipping with Social Sparsity:
Structured Sparse Denoising
Bird Calls
This section shows how structured sparsity can be applied to denoising audio signals and gives an
overview of the landscape of structured shrinkage operators.
As an introductory example, we start with an audio sample containing a bird call perturbed
by environmental noise. We use a tight Gabor dictionary with a Hann window of length 1024 samples and a hop size of 256 samples.
In this example, the sparsity levels were optimized heuristically. The fast iterative shrinkage
algorithms were terminated after 50 steps. The presented denoising operators are
Lasso, WGL, GL (group label=time), EL (group label=time), and PEL (group label=time)
with neighborhood size [1,1,1,1].
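As a rough illustration of how two of these operators act on a time-frequency coefficient matrix, consider the following Python sketch. It rests on assumptions of ours: coefficients are stored as coeffs[k][t] with k the frequency index and t the time index, "group label=time" is read as one group per time frame, and the function names are hypothetical. It is not the toolbox implementation.

```python
import math

def lasso_shrink(coeffs, lam):
    """Soft-threshold every coefficient on its own: c * max(0, 1 - lam/|c|)."""
    return [[c * max(0.0, 1.0 - lam / abs(c)) if c != 0 else c for c in row]
            for row in coeffs]

def group_lasso_time(coeffs, lam):
    """Group Lasso with one group per time frame (assumed reading of 'group label=time').

    All frequencies of a frame share one gain that depends on the frame's l2 norm.
    """
    n_freq, n_time = len(coeffs), len(coeffs[0])
    out = [[0.0] * n_time for _ in range(n_freq)]
    for t in range(n_time):
        norm = math.sqrt(sum(abs(coeffs[k][t]) ** 2 for k in range(n_freq)))
        gain = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for k in range(n_freq):
            out[k][t] = coeffs[k][t] * gain
    return out

# Two frequency bins, two time frames: a strong frame followed by a weak one.
coeffs = [[3.0, 0.5],
          [4.0, 0.5]]
lasso = lasso_shrink(coeffs, 1.0)
grouped = group_lasso_time(coeffs, 1.0)
```

In this toy case both operators discard the weak second frame. The Lasso shrinks the two strong coefficients individually, whereas the group Lasso scales the whole first frame by the common factor 0.8.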
Original signal:
Original
Sparse Reconstructions:
Lasso ,
WGL ,
GL ,
EL ,
PEL .
Piano+Voice
In this experiment we consider the performance of Lasso- and WGL-based denoising with respect to different
time-frequency transforms. We use a windowed modified discrete cosine transform (WMDCT) with window
length 512, a Gabor frame as above, an onset-based non-stationary Gabor frame (NSGT-t) and a constant-Q
non-stationary Gabor frame (NSGT-f). We denoise a music signal of 5 s length, containing piano and voice
perturbed by additive Gaussian white noise. The neighborhood size is [0,4,0,4], i.e. each neighborhood
extends four coefficients in both directions in time.
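The effect of a neighborhood that extends four coefficients in both time directions can be sketched as follows. This is a simplified windowed group Lasso under our own assumptions: uniform neighborhood weights normalized to sum to one, neighborhoods truncated at the signal borders, and hypothetical parameter names `before` and `after` for the extent in time. How these map onto the four-element neighborhood vector is not specified here.

```python
import math

def wgl_shrink(coeffs, lam, before, after):
    """Windowed group Lasso along time; coeffs[k][t] with k frequency, t time.

    Each coefficient is scaled by max(0, 1 - lam / r), where r is the
    root-mean-square magnitude over its time neighborhood in the same frequency bin.
    """
    out = []
    for row in coeffs:
        new_row = []
        for t, c in enumerate(row):
            neigh = row[max(0, t - before): t + after + 1]
            rms = math.sqrt(sum(abs(v) ** 2 for v in neigh) / len(neigh))
            gain = max(0.0, 1.0 - lam / rms) if rms > 0 else 0.0
            new_row.append(c * gain)
        out.append(new_row)
    return out

sustained = [[2.0] * 9]                                     # a partial that persists in time
spike = [[0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]]     # an isolated coefficient

sustained_out = wgl_shrink(sustained, 1.0, 4, 4)
spike_out = wgl_shrink(spike, 1.0, 4, 4)
spike_lasso = wgl_shrink(spike, 1.0, 0, 0)   # empty neighborhood: plain Lasso
```

With the neighborhood, the sustained partial is kept while the isolated coefficient of equal magnitude is removed. The plain Lasso keeps that isolated coefficient, which is the kind of residue heard as musical noise.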
Original signals:
Clean,
Noisy
Sparse WMDCT Reconstruction:
WMDCT-Lasso ,
WMDCT-WGL
Sparse Gabor Reconstruction:
Stat-Gabor-Lasso ,
Stat-Gabor-WGL
Sparse non-stationary Gabor Reconstruction:
NSGT-t-Lasso ,
NSGT-f-Lasso ,
NSGT-f-WGL .
One can clearly hear how the neighborhood smoothing of WGL reduces "musical noise".
Finally, we show one denoised time-frequency representation for each transform type.
Vinyl de-noising
In this last denoising example, we show how WGL shrinkage can be used for decrackling a (quite famous) piano recording to which vinyl crackles were added by hand.
By choosing a WMDCT basis with 4096 frequency channels and the neighborhood size [0,0,0,4], we
aim at extracting the tonal parts of the recording. Of course, this also suppresses
the onsets (which one can hear nicely in the residual); this has to be accepted when using
this method.
Original signals:
Clean,
Noisy
Sparse WMDCT Reconstruction:
Denoised,
Residual.
Structured Sparse Multilayer Decomposition
Jazz-Trio Recording
We use the structured shrinkage operators to decompose an excerpt of a jazz-trio recording
into a tonal and a transient layer.
The tonal layer is sparsely represented in a WMDCT dictionary with long support (4096).
Transients are sparsely represented in the same dictionary with short support (64). Using the BCR algorithm,
we support the usual decomposition scheme with structured shrinkage operators. For the tonal layer, we use WGL
with neighborhood [0,4,0,4], and for the transients we use the group Lasso with group labels in time,
so that transients are preserved across the whole frequency band.
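The alternating structure of such a two-layer decomposition can be sketched in a few lines of Python. This is a toy version under strong assumptions of ours: an orthonormal DCT stands in for the long-window WMDCT dictionary, the sample basis stands in for the short-window one, and plain soft thresholding is used by default. The structured operators described above would be passed in through the hypothetical `shrink_tonal` and `shrink_transient` arguments.

```python
import math

N = 64
LAM = 0.5

# Orthonormal DCT-II basis, row k is the k-th cosine atom (toy "tonal" dictionary).
DCT = [[math.sqrt((1 if k == 0 else 2) / N) * math.cos(math.pi * (n + 0.5) * k / N)
        for n in range(N)] for k in range(N)]

def analyze(x):
    """Coefficients of x in the DCT basis."""
    return [sum(a * v for a, v in zip(atom, x)) for atom in DCT]

def synthesize(c):
    """Signal synthesized from DCT coefficients c."""
    return [sum(DCT[k][n] * c[k] for k in range(N)) for n in range(N)]

def soft(values, lam):
    """Plain soft thresholding (Lasso shrinkage) of real values."""
    return [math.copysign(max(0.0, abs(v) - lam), v) for v in values]

def bcr(y, shrink_tonal=soft, shrink_transient=soft, lam=LAM, n_iter=20):
    """Block coordinate relaxation: update one layer while the other is held fixed."""
    tonal_coeffs = [0.0] * N
    transient = [0.0] * N   # the transient dictionary is the sample basis here
    for _ in range(n_iter):
        tonal_coeffs = shrink_tonal(analyze([a - b for a, b in zip(y, transient)]), lam)
        tonal = synthesize(tonal_coeffs)
        transient = shrink_transient([a - b for a, b in zip(y, tonal)], lam)
    return tonal_coeffs, transient

# Toy mixture: one cosine atom (tonal) plus one click (transient).
y = [4.0 * DCT[3][n] + (3.0 if n == 5 else 0.0) for n in range(N)]
tonal_coeffs, transient = bcr(y)
tonal = synthesize(tonal_coeffs)
residual = [a - b - c for a, b, c in zip(y, tonal, transient)]
```

In this toy mixture the cosine atom ends up in the tonal layer and the click in the transient layer. The residual is what neither layer represents sparsely at the chosen threshold.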
Original signal:
Original
Reconstructed Layers:
Tonal ,
Transient ,
Residual ,
Tonal+Transient .
The Glockenspiel
In a setup of "informed analysis", we process the famous glockenspiel signal using the group Lasso
and decompose it into a tonal and a transient layer.
The informed part of the analysis is introduced by an onset detection, which helps to define
tonal groups of coefficients. Groups are in this case defined as frequency indices between the onsets.
In this way we hope to extract only the relevant partials over their whole length using the group Lasso. For the
transient layer we also use a group Lasso, with group labels as frequency indices. We employ Gabor frames
with overlap 4 and window length 4096 for the tonal layer and 256 for the transient layer.
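The onset-informed grouping for the tonal layer can be sketched as follows: each group consists of the coefficients of one frequency bin between two consecutive onsets, and the group Lasso keeps or discards such a segment as a whole. The layout coeffs[k][t], the representation of onsets as sorted frame indices, and the function names are assumptions made for this illustration. The onset detection itself is not shown.

```python
import math
from bisect import bisect_right

def segment_index(t, onsets):
    """Index of the inter-onset segment containing frame t (onsets: sorted frame indices)."""
    return bisect_right(onsets, t)

def onset_group_lasso(coeffs, onsets, lam):
    """Group Lasso with one group per (frequency bin, inter-onset segment)."""
    out = []
    for row in coeffs:
        # Accumulate the energy of every segment in this frequency bin.
        energy = {}
        for t, c in enumerate(row):
            s = segment_index(t, onsets)
            energy[s] = energy.get(s, 0.0) + abs(c) ** 2
        new_row = []
        for t, c in enumerate(row):
            norm = math.sqrt(energy[segment_index(t, onsets)])
            gain = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
            new_row.append(c * gain)
        out.append(new_row)
    return out

# One frequency bin: a partial sounding during frames 0-3, near silence after the onset at frame 4.
coeffs = [[1.0, 1.0, 1.0, 1.0, 0.1, 0.1, 0.1, 0.1]]
shrunk = onset_group_lasso(coeffs, onsets=[4], lam=1.0)
```

The partial before the onset is retained over its whole length with a common gain, and the low-energy segment after the onset is set to zero.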
Original signal:
Original
Reconstructed Layers:
Tonal ,
Transient ,
Residual ,
Tonal+Transient .
Structured Sparse Neighborhood-Shapes
The Glockenspiel revisited
We can define a variety of shapes for the employed time-frequency neighborhoods to support the sparsity of the structures we are "looking" for. Here, we show preliminary results on the influence
of the neighborhood shape on the persistent group Lasso (PGL). We use it to extract the transient layer of the glockenspiel signal (also processed in the previous section). Consider the following figures, which show an onset of the glockenspiel signal with the neighborhoods [0,0,0,0] (plain GL), [0,3,0,3] (symmetric PGL), [0,0,0,3] (asymmetric PGL), and [0,3,0,0] (asymmetric PGL).
Transient estimations
GL ,
PGL-[0,3,0,3] ,
PGL-[0,3,0,0] ,
PGL-[0,0,0,3] .
Audio Denoising by Generalized Time-Frequency Thresholding
The Test Signals
In the research paper, 6 different test signals, 3 noise levels and 4 different operators were used for the perceptually oriented evaluation. Here, we present the results with preselected (rather well-balanced) sparsity levels. WGL refers to Windowed Group Lasso thresholding, WGLS to the combination of WGL with the James-Stein shrinkage operator (see the paper for details). The letter "h" refers to the horizontally oriented WGL operator, "v" to the rather vertically oriented one. "Block-T" refers to the Block-Thresholding algorithm by Yu, Mallat and Bacry.
Clean Test Signals:
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
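For readers who want to produce noisy versions at the noise levels used below (0, 10 and 20 dB SNR), a minimal Python sketch of mixing noise at a prescribed SNR follows. White Gaussian noise, the test tone and the function names are assumptions of this sketch. The noise actually used for the test signals is described in the paper.

```python
import math
import random

def mean_power(x):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)

def add_white_noise(clean, snr_db, rng):
    """Add Gaussian white noise scaled so that the result has the requested SNR in dB."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    scale = math.sqrt(mean_power(clean) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * v for s, v in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean reference."""
    error = [n - s for s, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(error))

rng = random.Random(0)  # fixed seed so the example is reproducible
clean = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
noisy = {level: add_white_noise(clean, level, rng) for level in (0, 10, 20)}
```

At 0 dB SNR the noise carries as much power as the signal, and at 20 dB it carries one hundredth of it.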
Noise Level: 0 dB SNR
Noisy Test Signals
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGL-h
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGLS-h
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGLS-v
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, Block-T
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Noise Level: 10 dB SNR
Noisy Test Signals
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGL-h
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGLS-h
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGLS-v
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, Block-T
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Noise Level: 20 dB SNR
Noisy Test Signals
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGL-h
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGLS-h
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, WGLS-v
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Denoised, Block-T
Strings ,
Piano ,
Perc. ,
VoiceMale ,
VoiceFemale ,
Jazz .
Algorithms in the declipping audio examples, in order of appearance:
Lasso, Empirical Wiener, Windowed Group Lasso, Persistent Empirical Wiener, Hard Thresholding (Kitic et al. 2013), Orthogonal Matching Pursuit (Adler et al. 2011)
Lasso, Windowed Group Lasso, Empirical Wiener, Persistent Empirical Wiener
Lasso, Empirical Wiener, Persistent Empirical Wiener
Whereas the symmetric neighborhoods naturally capture parts before and after the attacks (or rather the time points of maximum energy), the asymmetric ones retain components either before or after the attacks. The orientation of the neighborhood therefore systematically promotes
the preservation of different temporal segments of the signal. For the estimation of transients, choosing the right neighborhood can significantly decrease artifacts such as pre-echo.
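The role of the neighborhood orientation can be illustrated with a small Python sketch that lists which frames around an attack survive the thresholding. It uses a simple neighborhood-energy rule (root-mean-square magnitude over the time neighborhood compared with the threshold) as a stand-in for the persistent group Lasso. The parameter names `before` and `after` and the toy magnitudes are assumptions of this sketch.

```python
import math

def kept_frames(row, lam, before, after):
    """Frames whose time neighborhood has a root-mean-square magnitude above lam.

    The neighborhood of frame t covers frames t-before .. t+after,
    truncated at the borders of the signal.
    """
    kept = []
    for t in range(len(row)):
        neigh = row[max(0, t - before): t + after + 1]
        rms = math.sqrt(sum(abs(v) ** 2 for v in neigh) / len(neigh))
        if rms > lam:
            kept.append(t)
    return kept

# Coefficient magnitudes of one frequency bin with an attack at frame 5.
row = [0.2] * 5 + [3.0] + [0.2] * 5
LAM = 0.8

plain = kept_frames(row, LAM, 0, 0)        # no neighborhood
symmetric = kept_frames(row, LAM, 3, 3)    # extends into past and future
past_only = kept_frames(row, LAM, 3, 0)    # extends into the past only
future_only = kept_frames(row, LAM, 0, 3)  # extends into the future only
```

A neighborhood that looks into the past keeps the frames following the attack, whereas one that looks into the future keeps the frames preceding it, which is where pre-echo would arise.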