Structured Sparsity for Audio Signals


Introduction

This webpage provides supplementary audio examples, visualizations, and source code for research results on structured sparsity applied to audio restoration and denoising. The basic idea of this work is to exploit the dependencies between time-frequency coefficients to obtain more regular and reliable sparse representations of audio signals. The work was carried out within the AudioMiner project at the Numerical Harmonic Analysis Group (University of Vienna) and the Austrian Research Institute for Artificial Intelligence, in collaboration with the Laboratoire des Signaux et Systèmes at University Paris Sud.

Download the corresponding MATLAB toolbox for structured sparse estimation in WMDCT bases and Gabor frames here: StrucAudioToolbox_V02. We warmly welcome any suggestions and comments concerning the toolbox! Note that the toolbox requires Peter Søndergaard's linear time-frequency analysis toolbox (LTFAT), a fast general-purpose MATLAB toolbox for time-frequency analysis, to be installed.

Examples of structured shrinkage applied to audio restoration are given further down this page and in the publications listed below. Examples of audio declipping are given in the section "Audio Declipping with Social Sparsity".

References

Kai Siedenburg, Matthieu Kowalski, and Monika Dörfler: "Audio Declipping with Social Sparsity". Submitted.

Kai Siedenburg and P. Depalle: "Modulation Filtering for Structured Time-Frequency Estimation of Audio Signals". Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 20-23, 2013, New Paltz, NY.

Kai Siedenburg and Monika Dörfler: "Persistent Time-Frequency Shrinkage for Audio Denoising". Journal of the Audio Engineering Society (AES), No. 61 (1/2), Jan/Feb 2013.

Matthieu Kowalski, Kai Siedenburg, and Monika Dörfler: "Social Sparsity! Neighborhood Systems Enrich Structured Shrinkage Operators". IEEE Transactions on Signal Processing, 61(10), pp. 2498-2511, May 2013.

Kai Siedenburg: "Persistent Empirical Wiener Estimation with Adaptive Threshold Selection for Audio Denoising". Proceedings of the 9th Sound and Music Computing Conference, July 11-14, 2012, Copenhagen.

Kai Siedenburg and Monika Dörfler: "Audio Denoising by Generalized Time-Frequency Thresholding". Proceedings of the AES 45th Conference on Applications of Time-Frequency Processing in Audio, March 1-4, 2012, Helsinki, Finland.

Kai Siedenburg and Monika Dörfler: "Structured Sparsity for Audio Signals". Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), September 19-23, 2011, Paris, France.

Questions? Please contact us if you have any comments or questions, or if you wish to collaborate.

Audio Declipping with Social Sparsity

This section demonstrates how social sparse algorithms can be applied to the audio declipping problem, i.e. the recovery of truncated samples of audio files. For details and explanations, please see the above-mentioned reference on declipping. The examples given on this page all use a clipping level of 0.2, while the clean signal has a peak absolute value of 1. For the music signals we use a neighborhood extending 7 coefficients in time.
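
The clipping model described above can be sketched in a few lines of plain Python. This is only an illustration of the problem setup, not of the declipping algorithms themselves. The function names are hypothetical, and the consistency check reflects the usual formulation of declipping (reliable samples are kept, clipped samples may only move outward), which is an assumption here rather than something stated on this page.

```python
def hard_clip(signal, level=0.2):
    """Truncate every sample to the interval [-level, level]."""
    return [max(-level, min(level, s)) for s in signal]


def reliable_mask(clipped, level=0.2):
    """True where a sample lies strictly inside the clipping range,
    i.e. where the observed value was not truncated."""
    return [abs(s) < level for s in clipped]


def is_consistent(estimate, clipped, level=0.2, tol=1e-9):
    """Check an estimate against the clipped observation (assumed constraint):
    reliable samples must match, clipped samples must lie at or beyond the level."""
    for e, c in zip(estimate, clipped):
        if c >= level:
            if e < level - tol:
                return False
        elif c <= -level:
            if e > -level + tol:
                return False
        elif abs(e - c) > tol:
            return False
    return True


# Toy signal with peak absolute value 1, clipped at level 0.2.
clean = [0.0, 0.1, 0.5, 1.0, 0.15, -0.3, -1.0, -0.05]
clipped = hard_clip(clean, level=0.2)
mask = reliable_mask(clipped, level=0.2)
```

In this toy example only the samples with magnitude below 0.2 are marked as reliable; a declipping method has to re-estimate the remaining ones.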

Audio Examples

Double Bass Solo: Clean, Clipped

Comparing Algorithms for Declipping:
Lasso
Empirical Wiener
Windowed Group Lasso
Persistent Empirical Wiener
Hard Thresholding (Kitic et al. 2013)
Orthogonal Matching Pursuit (Adler et al. 2011)

More examples: an acoustic guitar signal declipped by social sparsity.

Acoustic Guitar: Clean, Clipped

Declipping with Social Sparsity:
Lasso
Windowed Group Lasso
Empirical Wiener
Persistent Empirical Wiener

Speech is treated very similarly; only the neighborhood should be chosen differently. Here we use a neighborhood extending 3 coefficients in time. See the paper for details.

Speech: Clean, Clipped

Declipping with Social Sparsity:
Lasso
Empirical Wiener
Persistent Empirical Wiener


Structured Sparse Denoising

Bird Calls

This section shows how structured sparsity can be applied to denoising audio signals and gives an overview of the landscape of structured shrinkage operators. As an introductory example, we start with an audio sample containing a bird call perturbed by environmental noise. We use a tight Gabor dictionary with a Hann window of length 1024 samples and a hop size of 256. In this example, the sparsity levels were optimized heuristically. The fast iterative shrinkage algorithms were terminated after 50 steps. The presented denoising operators are Lasso, WGL, GL (group label=time), EL (group label=time), and PEL (group label=time) with neighborhood size [1,1,1,1].
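
To make the difference between coefficient-wise and group-wise shrinkage concrete, here is a minimal sketch of two of the operators named above: Lasso (soft thresholding of each coefficient) and group Lasso with one group per time frame, which is how "group label=time" is read here. It is plain Python written for this page, not code from the toolbox; the function names and the toy coefficients are hypothetical, and EL and PEL are left out.

```python
def soft_gain(magnitude, lam):
    """Shrinkage gain max(0, 1 - lam / magnitude); zero for zero magnitude."""
    if magnitude <= 0.0:
        return 0.0
    return max(0.0, 1.0 - lam / magnitude)


def lasso_shrink(coeffs, lam):
    """Lasso: every coefficient is shrunk according to its own magnitude."""
    return [[c * soft_gain(abs(c), lam) for c in row] for row in coeffs]


def group_lasso_shrink_time(coeffs, lam):
    """Group Lasso with one group per time frame.

    coeffs[f][t] is the coefficient at frequency index f and time index t.
    All frequencies of frame t share one gain, computed from the l2 norm
    of that frame.
    """
    n_freq, n_time = len(coeffs), len(coeffs[0])
    gains = []
    for t in range(n_time):
        norm = sum(abs(coeffs[f][t]) ** 2 for f in range(n_freq)) ** 0.5
        gains.append(soft_gain(norm, lam))
    return [[coeffs[f][t] * gains[t] for t in range(n_time)] for f in range(n_freq)]


# Toy time-frequency coefficients: 2 frequency bins, 3 time frames.
coeffs = [[3 + 4j, 0.5, 0.0],
          [0.0, 0.5, 0.0]]
lasso = lasso_shrink(coeffs, lam=0.6)
grouped = group_lasso_shrink_time(coeffs, lam=0.6)
```

With the threshold 0.6, Lasso removes both coefficients of magnitude 0.5 in the middle frame, whereas the group operator keeps them (strongly attenuated) because the frame as a whole has norm above the threshold.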

Original signal: Original

Sparse Reconstructions: Lasso, WGL, GL, EL, PEL.































Piano+Voice

In this experiment we consider the performance of Lasso- and WGL-based denoising with respect to different time-frequency transforms. We use a windowed modified discrete cosine transform (WMDCT) with window length 512, a Gabor frame as above, an onset-based non-stationary Gabor scale frame (NSGT-t), and a constant-Q resolved non-stationary Gabor frame (NSGT-f). We denoise a music signal of 5 s length, containing piano and voice perturbed by additive Gaussian white noise. The neighborhood size is [0,4,0,4], i.e. each neighborhood extends four coefficients in both directions of time.
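
The role of the four-entry neighborhood vector can be illustrated with a small sketch of windowed group Lasso (WGL) shrinkage, in which each coefficient is shrunk according to the energy of its time-frequency neighborhood. The description above implies that the second and fourth entries are extents in time; the reading of the vector as (frequency down, time back, frequency up, time forward), the uniform weights, and the truncation at the borders are assumptions of this sketch, not properties confirmed for the toolbox.

```python
def wgl_shrink(coeffs, lam, neighborhood):
    """Windowed group Lasso shrinkage on coeffs[f][t].

    neighborhood = (f_down, t_back, f_up, t_fwd) gives the number of
    neighbors included in each direction (ordering assumed for this sketch).
    """
    f_down, t_back, f_up, t_fwd = neighborhood
    n_freq, n_time = len(coeffs), len(coeffs[0])
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            # Energy of the neighborhood, truncated at the borders.
            energy = 0.0
            for ff in range(max(0, f - f_down), min(n_freq, f + f_up + 1)):
                for tt in range(max(0, t - t_back), min(n_time, t + t_fwd + 1)):
                    energy += abs(coeffs[ff][tt]) ** 2
            norm = energy ** 0.5
            gain = max(0.0, 1.0 - lam / norm) if norm > 0.0 else 0.0
            out[f][t] = coeffs[f][t] * gain
    return out


# Row 0: a weak but sustained partial. Row 1: one isolated, stronger coefficient.
partial_row = [0.5] * 9
noise_row = [0.0, 0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0]
coeffs = [partial_row, noise_row]

wgl = wgl_shrink(coeffs, lam=1.0, neighborhood=(0, 4, 0, 4))
lasso_like = wgl_shrink(coeffs, lam=0.7, neighborhood=(0, 0, 0, 0))
```

With the neighborhood (0, 4, 0, 4) the sustained partial survives and the isolated coefficient is removed, while the empty neighborhood (plain Lasso) does the opposite at a threshold of 0.7. Isolated surviving coefficients of this kind are what is typically heard as "musical noise".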

Original signals: Clean, Noisy

Sparse WMDCT Reconstruction: WMDCT-Lasso, WMDCT-WGL

Sparse Gabor Reconstruction: Stat-Gabor-Lasso, Stat-Gabor-WGL

Sparse non-stationary Gabor Reconstruction: NSGT-t-Lasso, NSGT-f-Lasso, NSGT-f-WGL.

One can clearly hear how the neighborhood smoothing of WGL reduces "musical noise". Finally, we show one denoised time-frequency representation for each transform type.































Vinyl de-noising

In this last denoising example, we show how WGL shrinkage can be used for decrackling a (quite famous) piano recording to which vinyl crackles were added by hand. By choosing a WMDCT basis with 4096 frequency channels and the neighborhood size [0,0,0,4], we aim at extracting the tonal parts of the recording. Of course, this also suppresses the onsets (which can be heard clearly in the residual); this has to be accepted when using this method.

Original signals: Clean, Noisy

Sparse WMDCT Reconstruction: Denoised, Residual.
















Structured Sparse Multilayer Decomposition

Jazz-Trio Recording

We use the structured shrinkage operators to decompose an excerpt of a jazz-trio recording into a tonal and a transient layer. The tonal layer is sparsely represented in a WMDCT dictionary with long support (4096). Transients are sparsely represented in the same type of dictionary with short support (64). Using the BCR algorithm, we extend the usual decomposition scheme with structured shrinkage operators. For the tonal layer, we use WGL with neighborhood [0,4,0,4], and for the transients we use the group Lasso with group labels in time, so that transients are preserved across a broad frequency band.
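
The alternating structure of such a two-layer decomposition can be sketched as follows, reading BCR as block coordinate relaxation. To stay short and dependency-free, the sketch swaps in much simpler ingredients than the experiment above: an orthonormal DCT basis stands in for the long-window WMDCT, the sample (identity) basis stands in for the short-window dictionary, and plain soft thresholding replaces WGL and group Lasso. All function names are hypothetical.

```python
import math


def dct_atoms(n):
    """Atoms of an orthonormal DCT-II basis of size n (one list per atom)."""
    atoms = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        atoms.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return atoms


def analyze(atoms, x):
    """Inner products of x with every atom."""
    return [sum(a * v for a, v in zip(atom, x)) for atom in atoms]


def synthesize(atoms, c):
    """Linear combination of the atoms with coefficients c."""
    n = len(atoms[0])
    return [sum(c[k] * atoms[k][i] for k in range(len(atoms))) for i in range(n)]


def soft(values, lam):
    """Coefficient-wise soft thresholding of real values."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]


def bcr(y, atoms, lam_tonal, lam_trans, n_iter=20):
    """Alternate between the two layers, each time shrinking the part of
    the signal that the other layer does not explain."""
    tonal = [0.0] * len(y)   # DCT coefficients of the tonal layer
    trans = [0.0] * len(y)   # transient layer in the sample domain
    for _ in range(n_iter):
        tonal = soft(analyze(atoms, [yi - ti for yi, ti in zip(y, trans)]), lam_tonal)
        tonal_layer = synthesize(atoms, tonal)
        trans = soft([yi - ti for yi, ti in zip(y, tonal_layer)], lam_trans)
    return tonal, trans


# Toy signal: one DCT atom (a "partial") plus a single click.
n = 64
atoms = dct_atoms(n)
y = [4.0 * atoms[5][i] for i in range(n)]
y[20] += 3.0
tonal, trans = bcr(y, atoms, lam_tonal=1.0, lam_trans=1.0)
```

In this toy case the tonal layer ends up with a single nonzero DCT coefficient and the transient layer with a single nonzero sample, so the partial and the click are separated. Replacing `soft` by a structured operator in either step gives the kind of scheme described above.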

Original signal: Original

Reconstructed Layers: Tonal, Transient, Residual, Tonal+Transient.

















The Glockenspiel

In a setup of "informed analysis", we process the famous glockenspiel signal using the group Lasso and decompose it into a tonal and a transient layer. The informed part of the analysis is introduced by an onset detection, which helps to define tonal groups of coefficients. In this case, a group consists of the coefficients of one frequency index between two consecutive onsets. In this way we hope to extract just the relevant partials over their whole length using the group Lasso. For the transient layer we also use a group Lasso, with group labels as frequency indices. We employ Gabor frames with overlap 4 and window length 4096 for the tonal layer and 256 for the transient layer.
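
The onset-informed grouping for the tonal layer can be sketched as follows: for every frequency index, the coefficients between two consecutive onsets form one group and share a single group Lasso gain. The sketch assumes that onsets are given as time-frame indices; the function name and the toy data are hypothetical, and no onset detector is included.

```python
def group_lasso_between_onsets(coeffs, onsets, lam):
    """Group Lasso on coeffs[f][t] with groups = (frequency index, inter-onset segment)."""
    n_freq, n_time = len(coeffs), len(coeffs[0])
    # Segment boundaries: start of the signal, every onset, end of the signal.
    bounds = sorted({0, n_time, *[o for o in onsets if 0 < o < n_time]})
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for start, stop in zip(bounds[:-1], bounds[1:]):
            norm = sum(abs(coeffs[f][t]) ** 2 for t in range(start, stop)) ** 0.5
            gain = max(0.0, 1.0 - lam / norm) if norm > 0.0 else 0.0
            for t in range(start, stop):
                out[f][t] = coeffs[f][t] * gain
    return out


# Toy example: a partial starts at frame 3 in frequency bin 0;
# bin 1 only contains two weak, unrelated coefficients.
coeffs = [[0.1, 0.1, 0.1, 1.0, 1.0, 1.0],
          [0.2, 0.0, 0.0, 0.0, 0.2, 0.0]]
tonal = group_lasso_between_onsets(coeffs, onsets=[3], lam=0.5)
```

Here the partial is kept as a whole from its onset to the end of the segment, and all other groups are set to zero.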

Original signal: Original

Reconstructed Layers: Tonal, Transient, Residual, Tonal+Transient.
















Structured Sparse Neighborhood-Shapes

The Glockenspiel revisited

We can define a variety of shapes for the time-frequency neighborhoods in order to promote sparsity in the structures we are "looking" for. Here, we show preliminary results on the influence of the neighborhood shape for the persistent group Lasso. We use it to extract the transient layer of the glockenspiel signal (also processed in the previous section). Consider the following figures, which show an onset of the glockenspiel signal with the neighborhoods [0,0,0,0] (plain GL), [0,3,0,3] (symmetric PGL), [0,0,0,3] (asymmetric PGL), and [0,3,0,0] (asymmetric PGL).
Whereas the symmetric neighborhoods naturally capture parts before and after the attacks (or rather the time points of maximum energy), the asymmetric ones retain components either before or after the attacks. The orientation of the neighborhood therefore systematically promotes the preservation of different temporal segments of the signal. For the estimation of transients, choosing the right neighborhood can significantly decrease artifacts such as pre-echo.
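
The effect of the neighborhood orientation can be reproduced on a one-dimensional toy example: a single row of coefficient magnitudes with one strong attack. The sketch below uses a simple neighborhood-energy shrinkage along time only, which is a stand-in and not the exact persistent group Lasso used for the figures. The parameters `t_back` and `t_fwd` are hypothetical names, and the sketch makes no claim about which entry of the four-entry neighborhood vector corresponds to which direction.

```python
def surviving_frames(mags, lam, t_back, t_fwd):
    """Indices of frames that are not set to zero when each frame is shrunk
    according to the energy of frames t - t_back ... t + t_fwd."""
    kept = []
    for t in range(len(mags)):
        lo, hi = max(0, t - t_back), min(len(mags), t + t_fwd + 1)
        norm = sum(m * m for m in mags[lo:hi]) ** 0.5
        if norm > lam:  # gain max(0, 1 - lam / norm) is positive
            kept.append(t)
    return kept


# Weak background with one attack (maximum energy) at frame 5.
mags = [0.1] * 12
mags[5] = 2.0

plain = surviving_frames(mags, lam=1.0, t_back=0, t_fwd=0)
symmetric = surviving_frames(mags, lam=1.0, t_back=3, t_fwd=3)
looks_forward = surviving_frames(mags, lam=1.0, t_back=0, t_fwd=3)
looks_backward = surviving_frames(mags, lam=1.0, t_back=3, t_fwd=0)
```

A neighborhood that looks forward in time keeps the frames leading up to the attack, because their neighborhoods already contain the attack energy; this is the mechanism behind pre-echo. A neighborhood that looks backward keeps the frames following the attack instead.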

Transient estimations: GL, PGL-[0,3,0,3], PGL-[0,3,0,0], PGL-[0,0,0,3].















Audio Denoising by Generalized Time-Frequency Thresholding

The Test Signals

In the research paper, 6 different test signals, 3 noise levels, and 4 different operators were used for the perceptually oriented evaluation. Here, we present the results with preselected (rather well balanced) sparsity levels. WGL refers to Windowed Group Lasso thresholding, WGLS to the combination of WGL with the James-Stein shrinkage operator (see the paper for details). The letter "h" refers to the horizontally oriented WGL operator, "v" to the rather vertically oriented one. "Block-T" refers to the Block-Thresholding algorithm by Yu, Mallat and Bacry.
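
For readers who want to build comparable test material, the following sketch adds white Gaussian noise to a signal at a prescribed SNR such as the 0, 10 and 20 dB levels used here. It is an assumption that the noisy test signals were generated this way; the function names, the fixed seed, and the SNR definition (ratio of signal energy to noise energy over the whole signal) are choices made for this sketch.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_energy = sum(s * s for s in signal)
    noise_energy = sum(v * v for v in noise)
    # Scale the noise so that signal_energy / (scale^2 * noise_energy) = 10^(snr_db / 10).
    scale = math.sqrt(signal_energy / (noise_energy * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal with respect to the clean reference."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - x) ** 2 for c, x in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy "clean" signal: a sinusoid of 1000 samples.
clean = [math.sin(2.0 * math.pi * 0.01 * i) for i in range(1000)]
noisy_versions = {snr: add_white_noise(clean, snr) for snr in (0, 10, 20)}
```

The helper `measured_snr_db` can also be applied to a denoised signal to obtain a simple numerical figure, although the evaluation in the paper is perceptually oriented.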

Clean Test Signals: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Noise Level: 0dB SNR

Noisy Test Signals: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGL-h: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGLS-h: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGLS-v: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, Block-T: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Noise Level: 10dB SNR

Noisy Test Signals: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGL-h: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGLS-h: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGLS-v: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, Block-T: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Noise Level: 20dB SNR

Noisy Test Signals: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGL-h: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGLS-h: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, WGLS-v: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.

Denoised, Block-T: Strings, Piano, Perc., VoiceMale, VoiceFemale, Jazz.