Audio Source Separation Based on Residual Reprojection
ETRI Journal. 2015. Aug, 37(4): 780-786
Copyright © 2015, Electronics and Telecommunications Research Institute (ETRI)
  • Received : November 12, 2014
  • Accepted : May 11, 2015
  • Published : August 01, 2015
Choongsang Cho
Je Woo Kim
Sangkeun Lee

This paper describes an audio source separation method that is based on nonnegative matrix factorization (NMF) and expectation maximization (EM). For stable and high-performance separation, an effective auxiliary source separation that extracts source residuals and reprojects them onto the proper sources is proposed by taking into account an ambiguous region among sources and a source's refinement. Specifically, an additional NMF model is designed for the ambiguous region, whose elements are not easily represented by any existing or predefined NMFs of the sources. The residual signal is extracted by inserting this model into the NMF-EM-based audio separation. It is then refined by the weighted parameters of the separation and reprojected onto the separated sources. Experimental results demonstrate that the proposed scheme is more stable and outperforms existing algorithms by, on average, 4.4 dB in terms of the source-to-distortion ratio.
I. Introduction
Audio source separation from a multichannel mixture is an active research topic. It is also challenging because it is usually a mathematically ill-posed problem; additional knowledge of the mixing process or the source signals is required to obtain successful separation results [1]–[10]. Independent component analysis has been used for such a separation under the assumption of statistical independence of sources [5].
For underdetermined source separation, probability model-based approaches with generalized Gaussian priors or $l_p$-norm minimization have been applied [6].
Nonnegative matrix factorization (NMF) [11]–[13], which is useful in music transcription, has been employed for audio source separation because it is suitable for use with polyphonic musical instruments [10]–[12]. In particular, NMF and expectation maximization (EM) were successfully combined to solve mathematically ill-posed source separation problems [8]. Additionally, a model parameter estimation procedure that uses an iterative generalized expectation maximization (GEM) algorithm was proposed with the incorporation of a priori knowledge [9], [14].
However, $l_p$-norm minimization approaches, NMF-EM-based approaches, and general flexible approaches are not suitable when a mixed audio signal includes an ambiguous area that is not easily represented by any particular source. Moreover, wrong source separation in an ambiguous area can degrade the quality of the separated sources.
In this paper, a coarse-to-fine separation structure (namely, residual extraction) and reprojection schemes are proposed for stable and efficient NMF-EM-based audio source separation. In the residual extraction step, in consideration of an ambiguous area of sources, an auxiliary NMF is attached to the NMF-EM-based system, and the residual signal is extracted using the inserted additional NMF model. Next, the residual signal is split into two categories — the remaining source components (which have similar characteristics to the original sources) and the rest of the components — using the weighted parameters and NMF-EM-based separation. Then, the remaining source components are reprojected onto the separated sources at the residual extraction step.
This paper is structured as follows. In Section II, the NMF-EM-based audio source separation is numerically explained. The proposed scheme of extracting and reprojecting the residuals is described in Section III. A comparison and analysis of the experimental results is given in Section IV. Finally, Section V concludes the paper with a summary of the proposed algorithm.
II. NMF-EM-Based Audio Source Separation
This section describes the audio source separation using NMF [7]–[9], [11]. NMF is a popular data decomposition technique in machine learning, image and audio signal processing, and audio source separation [11], [13]. Audio source separation is performed by decomposing the power spectrogram of objects, which can be represented as two nonnegative matrices (matrix W for narrow spectral patterns and matrix H for the corresponding weights); thus, an F × N audio power spectrogram, V, obtained by a short-time Fourier transform (STFT), can be expressed as follows:

$$V \approx WH,$$

where W and H are F × K and K × N nonnegative matrices, respectively, and F and N denote the numbers of frequency bins and time frames, respectively. The decomposed matrices can represent the characteristics of audio objects [7]–[9], [11], and they are commonly used to analyze music characteristics [11]. The factorization is usually performed via cost function minimization as

$$\min_{W, H \ge 0} \Phi(V \mid WH),$$

where Φ(·) is a cost function defined as

$$\Phi(V \mid WH) = \sum_{f=1}^{F} \sum_{n=1}^{N} \phi(V_{fn} \mid [WH]_{fn}),$$

where ϕ(x | y) is a scalar cost function such as the Euclidean distance, the Kullback–Leibler divergence, or the Itakura–Saito divergence. To solve the minimization problem, the EM or maximum likelihood algorithm can be applied using a statistical distribution that assumes a zero mean. In general, a mixed audio signal can be modeled as the audio sources multiplied by a mixing system, plus additive noise:

$$x_{fn} = A_f s_{fn} + b_{fn},$$

$$x_{fn} = [x_{1,fn}, \dots, x_{I,fn}]^T, \quad s_{fn} = [s_{1,fn}, \dots, s_{J,fn}]^T, \quad b_{fn} = [b_{1,fn}, \dots, b_{I,fn}]^T, \quad A_f = [a_{ij,f}] \ \text{for} \ \langle i,j \rangle \in I \times J.$$

Here, I and J indicate the number of input channels and the number of sources to be separated, respectively. Furthermore, $x_{fn}$, $A_f$, and $s_{fn}$ denote the mixed audio signal, the mixing system, and the sources, respectively. Note that in the case of the instantaneous matrix $A_{inst}$, the mixing system $A_f$ is real-valued and shared between all of the frequency sub-bands; that is, $A_f = A_{inst}$ for all f, with $A_{inst} \in \mathbb{R}^{I \times J}$ [8]. The term $b_{fn}$ represents the additive noise, which is assumed to have a Gaussian distribution with covariance $\Sigma_b$, and is defined as

$$b_{i,fn} \sim \mathcal{N}_c(0, \sigma_{i,f}^2) \quad \text{with} \quad \Sigma_{b,f} = \mathrm{diag}([\sigma_{i,f}^2]_i).$$

Here, diag( u ) returns a square matrix with the elements of vector u on the main diagonal. In an audio source separation problem, each source (object) is represented as an NMF consisting of kj -dimensional decomposition ( kj J ).

$$|s_{j,fn}|^2 \approx W_j H_j, \quad s_{j,fn} = \sum_{l=1}^{k_j} c_{l,fn}, \quad c_{l,fn} = w_{fl} h_{nl}^T,$$

$$W_j = [w_{f1}, \dots, w_{f k_j}], \quad H_j = [h_{n1}, \dots, h_{n k_j}]^T.$$

The total dimension of the objects is $k = \#\{k_1, \dots, k_J\}$.
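The mixing model above can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: it assumes an instantaneous mixing matrix shared over all frequencies, draws each source as a zero-mean complex Gaussian whose variance is given by its NMF model W_j H_j, and adds white Gaussian noise. The function name `synthesize_mixture` and the parameters `noise_var` and `seed` are hypothetical.

```python
import numpy as np

def synthesize_mixture(W_list, H_list, A, noise_var=1e-3, seed=0):
    """Simulate x_fn = A s_fn + b_fn for an instantaneous mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    F, N = W_list[0].shape[0], H_list[0].shape[1]
    J, I = len(W_list), A.shape[0]

    # Source j: zero-mean complex Gaussian STFT coefficients with variance [W_j H_j]_fn.
    S = np.empty((J, F, N), dtype=complex)
    for j, (W, H) in enumerate(zip(W_list, H_list)):
        var = W @ H
        noise = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))
        S[j] = np.sqrt(var / 2.0) * noise

    # Additive noise b_fn with the same variance on every channel (an assumption).
    B = np.sqrt(noise_var / 2.0) * (
        rng.standard_normal((I, F, N)) + 1j * rng.standard_normal((I, F, N))
    )

    # Apply the same I x J mixing matrix at every time-frequency point.
    X = np.einsum("ij,jfn->ifn", A, S) + B
    return X, S

# Example: three objects with NMF orders 6, 4, and 4 mixed into two channels.
rng = np.random.default_rng(1)
orders = [6, 4, 4]
W_demo = [rng.uniform(0.1, 1.0, (8, k)) for k in orders]
H_demo = [rng.uniform(0.1, 1.0, (k, 16)) for k in orders]
A_demo = rng.uniform(0.0, 1.0, (2, 3))
X_demo, S_demo = synthesize_mixture(W_demo, H_demo, A_demo)
```

With J = 3 sources and I = 2 channels the problem is underdetermined, which is why the NMF source models are needed as prior structure.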
The EM algorithm is widely adopted to solve the mathematically ill-posed problem and optimize the cost function. We let the complete data set and the parameter set be $Z = (X, S)$ and $\Theta = \{A_f, W, H, \Sigma_b\}$, respectively, where X and S collect the mixtures $x_{fn}$ of the I channels and the sources $s_{fn}$ of the J objects. Then, to solve the equation, we use the conditional expectation of the log-likelihood function [15], which is defined by

$$Q(\Theta, \Theta^{i-1}) = E[\log p(X, S \mid \Theta) \mid X, \Theta^{i-1}].$$

The resulting criterion can be expressed as

$$\Theta^{*} = \arg\max_{\Theta} Q(\Theta, \Theta^{i-1}),$$

where $\Theta^{i-1}$ represents the current parameters used to evaluate the expectation E[·], and Θ represents the new parameters that maximize Q. The log-likelihood function [8], [9] for audio source separation can be expressed as
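
$$Q(\Theta, \Theta^{i-1}) = -\sum_{fn} \left[ \log|\Sigma_{b,f}| + (x_{fn} - A_f s_{fn})^H \Sigma_{b,f}^{-1} (x_{fn} - A_f s_{fn}) \right] - \sum_{k} \sum_{fn} \left[ \log(w_{fk} h_{kn}) + \frac{|c_{k,fn}|^2}{w_{fk} h_{kn}} \right].$$
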
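The parameters [8], [9] are computed by setting the partial derivatives of the log-likelihood with respect to A, W, H, and $\Sigma_b$ to zero, which gives

$$A_f = R_{xs,f} R_{ss,f}^{-1}, \quad \Sigma_{b,f} = \mathrm{diag}\left[ R_{xx,f} - A_f R_{xs,f}^H - R_{xs,f} A_f^H + A_f R_{ss,f} A_f^H \right],$$

$$w_{f k_j} = \frac{1}{N} \sum_{n} \frac{u_{k_j,fn}}{h_{k_j n}}, \quad h_{k_j n} = \frac{1}{F} \sum_{f} \frac{u_{k_j,fn}}{w_{f k_j}},$$

$$R_{xx,f} = \frac{1}{N} \sum_{n} x_{fn} x_{fn}^H, \quad R_{xs,f} = \frac{1}{N} \sum_{n} x_{fn} s_{fn}^H, \quad R_{ss,f} = \frac{1}{N} \sum_{n} s_{fn} s_{fn}^H, \quad u_{k_j,fn} = |c_{k_j,fn}|^2.$$
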
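The closed-form updates above can be sketched in NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it treats the instantaneous case, so the statistics R_xx, R_xs, and R_ss are pooled over all time-frequency points instead of being computed per frequency bin, and it takes the source estimates `S` and component powers `U` as given inputs, whereas in the full EM algorithm they come from the expectation step. The function name `m_step_instantaneous` and the guard `eps` are hypothetical.

```python
import numpy as np

def m_step_instantaneous(X, S, U, H, eps=1e-12):
    """Parameter updates for an instantaneous mixture (illustrative sketch).

    X: (I, F, N) mixture STFT, S: (J, F, N) source estimates,
    U: (K, F, N) component powers u_{k,fn}, H: (K, N) current activations.
    """
    I, F, N = X.shape
    T = F * N
    Xm = X.reshape(I, T)
    Sm = S.reshape(S.shape[0], T)

    # Second-order statistics, pooled over frequency and time (assumption).
    R_xx = Xm @ Xm.conj().T / T
    R_xs = Xm @ Sm.conj().T / T
    R_ss = Sm @ Sm.conj().T / T

    # A = R_xs R_ss^{-1}
    A = R_xs @ np.linalg.inv(R_ss)

    # Sigma_b = diag(R_xx - A R_xs^H - R_xs A^H + A R_ss A^H)
    resid = R_xx - A @ R_xs.conj().T - R_xs @ A.conj().T + A @ R_ss @ A.conj().T
    sigma_b = np.real(np.diag(resid))

    # w_fk = (1/N) sum_n u_{k,fn} / h_kn, then h_kn = (1/F) sum_f u_{k,fn} / w_fk
    W_new = np.mean(U / (H[:, None, :] + eps), axis=2).T          # (F, K)
    H_new = np.mean(U / (W_new.T[:, :, None] + eps), axis=1)      # (K, N)
    return A, sigma_b, W_new, H_new
```

When the mixture is noise-free and the true sources are supplied, the update returns the true mixing matrix and a noise variance of zero, which is a convenient sanity check for the formulas.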
III. Two-Step Audio Source Separation: Residual Extraction and Reprojection Scheme
In audio source separation, a mixed audio signal, x, can be described by two concepts according to the type of source combinations, as shown in Fig. 1. When the audio sources (original objects) are disjoint (that is, $\bigcap_{j=1}^{J} s_j = \emptyset$), the mixed signal is represented as shown in Fig. 1(a). On the other hand, the signal is represented as shown in Fig. 1(b) when an intersection among the audio sources exists; that is, $\bigcap_{j=1}^{J} s_j \neq \emptyset$.
Fig. 1. Concepts to represent audio source mixing: (a) case 1 with null intersection and (b) case 2 with intersection.
To successfully separate a mixed audio signal, it should be split into each object source. In the case of null intersection among sources, a point in the mixed signal region is exactly matched to a point in a source. However, a point in the mixed signal is mapped into at least two sources when an ambiguous region among sources exists. Thus, a wrong source-separation in the mixed signal region can cause quality degradation in the original sources because a source in the mixed signal can be incorrectly mapped to a different source.
To improve the separation performance in the case where an ambiguous region exists, a coarse-to-fine audio separation structure (residual extraction and reprojection steps) is proposed, as shown in Fig. 2. To extract the residual signal in an ambiguous region, an auxiliary NMF, $W_{J+1} H_{J+1}$, is introduced for the residual signal, which cannot easily be assigned to the NMF of any single source. Adding this model for the ambiguous area of the mixed signal increases the number of objects (which are exchangeable with the sources or channels) in the NMF-EM-based separation, and a corresponding component is added to the mixing system. The initial NMF parameters for the EM-based audio source separation are then represented as

$$W'H' = \{W_1 H_1, \dots, W_J H_J, W_{J+1} H_{J+1}\}, \quad A' = \{A, a_{i(J+1)}\} = [a_{ij}] \ \text{for} \ \langle i,j \rangle \in I \times (J+1).$$

Fig. 2. Residual extraction and reprojection-based separation structure.
The residual signal, which is not represented by any particular NMF of the sources, is assumed to be a random signal because an ambiguous region can belong to two or more sources.
In this paper, the NMF components for the ambiguous region, $W_{J+1}$ and $H_{J+1}$, are generated by a normally distributed pseudorandom method [16]. Specifically, the NMF for an ambiguous region is modeled as a random signal with zero mean and a certain variance, and its power spectrogram, $|W_{J+1} H_{J+1}|$, is adopted as illustrated in Fig. 3.
Fig. 3. Power spectrogram of inserted NMF.
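The insertion of the auxiliary NMF can be sketched as follows. This is a minimal illustration, not the authors' code: the text specifies only that the components are drawn by a normally distributed pseudorandom method, so taking the absolute value to keep the factors nonnegative, and filling the new mixing column with the mean of the existing entries, are both assumptions. The default order of 4 follows the "Common set NMF order" listed in Table 1. The names `add_auxiliary_nmf` and `aux_gain` are hypothetical.

```python
import numpy as np

def add_auxiliary_nmf(W_list, H_list, A, order=4, aux_gain=None, seed=0):
    """Append a random NMF model and a mixing column for the ambiguous region."""
    rng = np.random.default_rng(seed)
    F = W_list[0].shape[0]
    N = H_list[0].shape[1]
    I = A.shape[0]

    # Zero-mean normal draws; abs() keeps the factors nonnegative (assumption).
    W_aux = np.abs(rng.standard_normal((F, order)))
    H_aux = np.abs(rng.standard_normal((order, N)))

    # New column a_{i(J+1)} of the mixing system; the initial value is an assumption.
    if aux_gain is None:
        aux_gain = float(A.mean())
    a_new = np.full((I, 1), aux_gain)

    W_ext = list(W_list) + [W_aux]
    H_ext = list(H_list) + [H_aux]
    A_ext = np.hstack([A, a_new])      # shape I x (J + 1)
    return W_ext, H_ext, A_ext
```

The extended lists and matrix play the role of W'H' and A' and can be passed to the same NMF-EM routine that is used for the J source models.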
In the first step, the residual extraction stage, the mixed audio signal is separated into source signals, $\{s_1', \dots, s_J'\}$, and a residual signal, $r_1$, using the NMF-EM-based separation. The separated signals contain the dominant components of each original source before mixing, and the residual signal $r_1$ contains a common set of sources that is difficult to split into the exact sources. To separate $r_1$ into the exact residuals of the objects, a second separation, called the “source residual projection stage,” is performed by the NMF-EM algorithm using weighted initial parameters. To reflect the characteristics of the objects estimated from the mixed audio, the weighted initial parameters, $W_n' H_n'$, are obtained from the weighted sum of the initial parameters, $W'$ and $H'$, and their updated parameters from the previous step, $W_u' H_u'$, as

$$W_n' H_n' = \omega_2 \times \left[ \omega_1 \{W'H'\} + (1 - \omega_1) \{W_u' H_u'\} \right],$$

where $\omega_1$ denotes the weight that combines the initial and updated parameters. The factor $\omega_2$ accounts for the variation of the input signal, since the input to the second step is the residual; it is defined as the ratio of the mean absolute power spectrograms of the mixed and residual signals:

$$\omega_2 = \frac{\frac{1}{F \times N} \sum_{f,n} |X_{f,n}|}{\frac{1}{F \times N} \sum_{f,n} |R_{1,f,n}|},$$

where $X_{f,n}$ and $R_{1,f,n}$ denote the power spectrograms of the mixed and residual signals, respectively.
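The two weights can be made concrete with a small numerical sketch. It is illustrative only and assumes that the weighted sum is applied elementwise to the model spectrograms (the products W'H' and W_u'H_u'); the text does not state whether the weighting acts on the products or on the factors separately. The function name `weighted_initial_parameters` is hypothetical, and the default ω1 = 0.9 is the value reported in the experiment section.

```python
import numpy as np

def weighted_initial_parameters(WH_init, WH_updated, X_pow, R1_pow, omega1=0.9):
    """Combine initial and updated model spectrograms for the second stage."""
    # omega2: ratio of mean absolute power of the mixture to that of the residual.
    omega2 = np.mean(np.abs(X_pow)) / np.mean(np.abs(R1_pow))
    # Weighted sum of initial and updated models, rescaled by omega2.
    WH_next = omega2 * (omega1 * WH_init + (1.0 - omega1) * WH_updated)
    return WH_next, omega2

# Worked example with constant spectrograms:
# omega2 = 6 / 3 = 2 and 2 * (0.9 * 2 + 0.1 * 4) = 4.4 at every time-frequency point.
shape = (4, 5)
WH_next, omega2 = weighted_initial_parameters(
    np.full(shape, 2.0), np.full(shape, 4.0), np.full(shape, 6.0), np.full(shape, 3.0)
)
```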
Finally, the mixed residual $r_1$ is refined using the NMF-EM-based separation into the remaining signals of the audio sources, $\{r_{1,s_1}, \dots, r_{1,s_J}\}$ (which can be interpreted as the real source residues), and a residual signal of the mixed source region, $r_2$. In other words, the remaining signals are the source residues sifted by the second separation with parameters $W_n' H_n'$. Here, “remaining signal” means a signal that is highly likely to belong to a source in terms of statistical probability. To improve the separation performance by considering the refined residual signal, the sifted signals, $r_{1,s_j}$, are projected onto the audio sources, $s_j'$, separated during the residual extraction step as

$$s_j = s_j' + r_{1,s_j}.$$
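The overall two-step flow can be summarized in a structural sketch. This is not the authors' implementation: the NMF-EM separation itself is abstracted behind a hypothetical callable `separate(signal, params)` that returns the J source estimates, a residual, and the updated parameters, and `make_stage2_params` is a hypothetical hook that builds the weighted initial parameters for the second stage.

```python
import numpy as np

def two_stage_separation(X, separate, params_stage1, make_stage2_params):
    """Residual extraction followed by residual reprojection (structure only)."""
    # Stage 1: coarse separation into sources s_j' and the mixed residual r_1.
    sources1, r1, updated = separate(X, params_stage1)

    # Stage 2: separate r_1 with weighted initial parameters into
    # per-source residues r_{1,s_j} and the remaining residual r_2.
    params_stage2 = make_stage2_params(params_stage1, updated, X, r1)
    residues, r2, _ = separate(r1, params_stage2)

    # Reprojection: s_j = s_j' + r_{1,s_j}
    final = [s + r for s, r in zip(sources1, residues)]
    return final, r2

# Toy stand-in for the separator: each of two "sources" receives a quarter
# of the input and half of it is left as residual. For demonstration only.
def toy_separate(signal, params):
    return [0.25 * signal, 0.25 * signal], 0.5 * signal, params

X_toy = np.arange(12.0).reshape(3, 4)
final_toy, r2_toy = two_stage_separation(
    X_toy, toy_separate, params_stage1=None,
    make_stage2_params=lambda p0, p1, x, r: p1,
)
```

With the toy separator, the final sources and the second residual still add up to the input, which shows that the reprojection only moves energy from the residual back to the sources.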
IV. Experiment
For the performance evaluation, we compare the proposed scheme with three previous approaches: the $l_p$-norm minimization approach (lpNM) [6], the NMF-EM-based approach (NMF-EM) [8], [17], and a general flexible framework-based algorithm (GFF) [9], [18]. The test set consists of five 10 s clips and one 30 s clip from a variety of Korean pop (K-pop) songs, which were commercially recorded in stereo at 44.1 kHz with 16-bit resolution for object audio services in South Korea. The original objects of the given music files are independently provided, and the mixed signals are generated from three objects of each K-pop song (vocal, drum, and keyboard) by an instantaneous mixing system using a 2 × 3 matrix whose elements are in the range [0, 1]. The fourth and fifth files are different sections of the same content. The fourth file has weak artificial effects from a vocoder and virtual sound technology [19], which are widely used for commercial music, and the fifth one has strong artificial effects for approximately 30% of the file length.
For the quantitative evaluation, the source-to-distortion ratio (SDR) and source-to-interference ratio (SIR) [20], [21] of the separated result are measured against the original files. To generate the initial parameters for the given system, the original objects of the contents are applied to the NMF generation on the basis of the EM approach with 1,000 iterations; $k_j$ for the objects is set to {6, 4, 4}. All of the parameters used in this test are summarized in Table 1. More specifically, the elements of the initial mixing system are all set to the same value (1/J), and the mixed signal is separated by the proposed approach with 20 iterations and 10 iterations for the two stages. Additionally, weight $\omega_1$ is set to 0.9 for the second stage. Note that the weight for the updated parameters is lower than that for the initial parameters because the updated parameters, which are estimated from the mixed audio at the first stage, can characterize the objects less accurately.
Table 1. Parameter settings for experiment.

| Parameter | Set value |
|---|---|
| Vocal NMF order | 6 |
| Drum NMF order | 4 |
| Keyboard NMF order | 4 |
| Common set NMF order | 4 |
| STFT window size | 1,024 |
| Iteration # of EM for initial model | 1,000 |
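The power spectrogram V that feeds the NMF models can be computed with the STFT window size of 1,024 from Table 1. The sketch below is illustrative: the Hann window and the hop size of 512 samples are assumptions, since the paper lists only the window size, and the function name `power_spectrogram` is hypothetical.

```python
import numpy as np

def power_spectrogram(x, win_size=1024, hop=512):
    """Return the F x N power spectrogram |STFT|^2 of a mono signal."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(x) - win_size) // hop
    # Slice the signal into overlapping, windowed frames (one column per frame).
    frames = np.stack(
        [x[i * hop : i * hop + win_size] * window for i in range(n_frames)],
        axis=1,
    )
    # Real FFT along the frame axis gives win_size // 2 + 1 frequency bins.
    spectrum = np.fft.rfft(frames, axis=0)
    return np.abs(spectrum) ** 2

# Example: one second of a 1 kHz tone sampled at 44.1 kHz.
fs = 44100
t = np.arange(fs) / fs
V_demo = power_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
```

Each stereo channel would be transformed separately in this way; for a 1,024-sample window the spectrogram has F = 513 frequency bins.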
The SDR and SIR scores of the proposed and existing algorithms for the three objects are shown in Tables 2 and 3, respectively. An SDR score is a measure that is used to evaluate the spatial distortion, interference, and artifacts at the initial equal weights, and an SIR score evaluates the relative amounts of interference errors between the original and recovered signals [20], [21]. Higher scores indicate better separation performance.
As shown in the SDR comparison of the existing and proposed schemes, the vocal and drum objects show better separation performance than the keyboard object. For the vocal object, the NMF-EM approach has a higher separation performance than the lpNM and GFF approaches for every file except the fifth. The average SDR of the NMF-EM approach is approximately 8.2 dB when the fifth file is excluded and approximately 6.1 dB over all files. The difference comes from the lower separation performance in the fifth file, which contains strong artificial effects. In contrast, the average SDR of the proposed method is approximately 10.7 dB and 10.5 dB without and with the fifth file, respectively. The proposed method shows higher scores than the existing schemes, and it exhibits a stable separation performance even in the presence of strong artificial effects, regardless of content. Similarly, the proposed scheme outperforms the existing schemes with a stable performance for the drum and keyboard objects. The proposed residual reprojection scheme shows an average SDR increase for the vocal, drum, and keyboard objects of approximately 1.0 dB, 1.4 dB, and 1.5 dB, respectively.
In the SIR scores for the vocal object, the NMF-EM approach again performs better than the lpNM and GFF approaches for every file except the fifth. Moreover, for the drum and keyboard objects, NMF-EM outperforms the lpNM and GFF schemes in all the test files.
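To give a feel for the decibel scale of these scores, the following sketch computes a simplified energy-ratio distortion measure. It is not the evaluation procedure of [20], which first decomposes each estimate into target, interference, and artifact components; here the whole difference between the estimate and the reference is treated as distortion. The function name `simple_sdr_db` is hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """10 log10 of reference energy over error energy (simplified, see lead-in)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

# An estimate whose amplitude is off by 10% everywhere has an error energy of
# 1% of the reference energy, which corresponds to 20 dB.
ref = np.ones(100)
score = simple_sdr_db(ref, 1.1 * ref)
```

On this scale, a gain of about 3 dB corresponds to halving the error energy relative to the reference.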
The average SIR scores of the NMF-EM approach give the best performance among the existing algorithms. The proposed method produces higher SIR scores for the three objects than the existing methods. When it is compared to the NMF-EM approach, the scores for the three objects are increased by approximately 6.6 dB, 5.2 dB, and 3.9 dB, respectively. It is noted that the residual extraction in the proposed algorithm is performed to refine the residues. Therefore, to analyze and evaluate the characteristics of the extracted residues, the cross correlation between sources is calculated by

$$c_s = \frac{\mathrm{corr}(|s_1|, |s_2|) + \mathrm{corr}(|s_1|, |s_3|) + \mathrm{corr}(|s_2|, |s_3|)}{3},$$

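Table 2. SDR scores of proposed method and existing algorithms (dB).

| Object | Method | File 1 | File 2 | File 3 | File 4 | File 5 | File 6 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Vocal | lpNM [6] | 0.2 | 3.8 | 0.7 | 0.1 | 1.7 | 0.2 | 1.1 |
| | NMF-EM [8] | 7.3 | 11.3 | 10.2 | 5.4 | −4.3 | 6.6 | 6.1 |
| | GFF [9] | 2.2 | 3.8 | 3.2 | 3.8 | 2.2 | 3.2 | 3.1 |
| | Proposed (residual extraction) | 9.7 | 10.0 | 11.1 | 9.4 | 6.8 | 9.8 | 9.5 |
| | Proposed (with reprojection) | 10.2 | 11.8 | 11.2 | 10.6 | 9.3 | 9.9 | 10.5 |
| Drum | lpNM [6] | −1.8 | 0.2 | −0.6 | −3.7 | 1.2 | −1.8 | −1.1 |
| | NMF-EM [8] | 7.0 | 5.8 | 6.8 | 4.4 | −5.6 | 1.0 | 3.2 |
| | GFF [9] | 4.7 | 4.3 | 5.6 | 6.8 | 1.2 | 2.6 | 4.2 |
| | Proposed (residual extraction) | 10.0 | 9.5 | 8.6 | 6.7 | 2.8 | 5.9 | 7.3 |
| | Proposed (with reprojection) | 11.1 | 9.8 | 8.7 | 9.8 | 5.4 | 7.4 | 8.7 |
| Keyboard | lpNM [6] | −0.4 | −3.2 | −1.6 | −8.8 | −2.8 | −0.4 | −2.9 |
| | NMF-EM [8] | 0.7 | 1.7 | 1.6 | −0.3 | −3.2 | 0.5 | 0.2 |
| | GFF [9] | −3.6 | −3.9 | −6.3 | −6.2 | −4.6 | −4.9 | −4.9 |
| | Proposed (residual extraction) | 0.9 | 0.9 | 1.5 | 0.5 | 1.9 | 0.1 | 1.0 |
| | Proposed (with reprojection) | 2.9 | 3.8 | 2.4 | 2.0 | 2.4 | 1.3 | 2.5 |
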
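Table 3. SIR scores of proposed method and existing algorithms (dB).

| Object | Method | File 1 | File 2 | File 3 | File 4 | File 5 | File 6 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Vocal | lpNM [6] | −4.6 | 6.2 | −0.7 | 0.9 | 2.0 | −4.6 | −0.1 |
| | NMF-EM [8] | 9.4 | 15.5 | 13.0 | 11.4 | −4.7 | 10.3 | 9.2 |
| | GFF [9] | 1.0 | 6.6 | 4.6 | 5.9 | 2.9 | 3.5 | 4.1 |
| | Proposed (residual extraction) | 15.9 | 16.8 | 16.3 | 17.1 | 14.1 | 12.4 | 15.4 |
| | Proposed (with reprojection) | 15.2 | 17.6 | 16.3 | 17.0 | 16.5 | 12.4 | 15.8 |
| Drum | lpNM [6] | −1.8 | 1.4 | −5.7 | −2.2 | 5.2 | −1.8 | −0.8 |
| | NMF-EM [8] | 14.0 | 8.3 | 17.7 | 17.0 | 9.8 | 13.4 | 13.4 |
| | GFF [9] | 8.7 | 8.1 | 8.5 | 14.1 | 4.8 | 7.5 | 8.6 |
| | Proposed (residual extraction) | 23.9 | 23.5 | 18.1 | 20.7 | 12.0 | 17.0 | 19.2 |
| | Proposed (with reprojection) | 22.8 | 23.0 | 17.8 | 20.6 | 11.0 | 16.6 | 18.6 |
| Keyboard | lpNM [6] | 5.9 | 2.3 | 2.7 | 0.6 | −8.0 | 5.9 | 1.6 |
| | NMF-EM [8] | 7.1 | 3.4 | 4.1 | 1.0 | −4.0 | 6.5 | 3.0 |
| | GFF [9] | −0.9 | −3.2 | −3.8 | −5.5 | −8.8 | −3.1 | −4.2 |
| | Proposed (residual extraction) | 9.4 | 5.0 | 4.4 | 6.9 | 8.5 | 7.0 | 6.9 |
| | Proposed (with reprojection) | 8.7 | 6.1 | 4.4 | 6.8 | 9.4 | 6.2 | 6.9 |
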
where s 1 , s 2 , and s 3 indicate the power spectrograms of the sources and corr(·) is the cross-correlation operator. Similarly, the cross-correlation between the residue and the source is measured as

$$c_{r,i} = \frac{\mathrm{corr}(|r_i|, |s_1|) + \mathrm{corr}(|r_i|, |s_2|) + \mathrm{corr}(|r_i|, |s_3|)}{3} \quad \text{for } i = 1, 2,$$

where ri indicates the residual power spectrogram during the residual extraction ( i = 1) and projection ( i = 2) steps. Table 4 shows that cross-correlation c r,1 is much higher than cross-correlation cs . In other words, the sources have a non-zero correlation relative to each other in the case when the proposed concept is assumed for audio source mixing, and a residual signal from a high-correlation region between sources is extracted during the residual extraction step. Cross-correlation c r,2 decreases during the residual projection stage compared with c r,1 , which indicates that the level of uncertainty in the ambiguous region decreases because the inherent signal is reprojected from the mixed residue onto each source.
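Table 4. Cross-correlation between sources and residuals.

| File # | 1 | 2 | 3 | 4 | 5 | 6 | Avg. |
|---|---|---|---|---|---|---|---|
| $c_s$ | 0.16 | 0.13 | 0.20 | 0.11 | 0.10 | 0.18 | 0.15 |
| $c_{r,1}$ | 0.48 | 0.36 | 0.49 | 0.39 | 0.35 | 0.26 | 0.39 |
| $c_{r,2}$ | 0.32 | 0.24 | 0.35 | 0.25 | 0.22 | 0.12 | 0.25 |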
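The two correlation measures reported in Table 4 can be sketched as follows. The paper does not spell out the corr(·) operator, so this sketch assumes it is the Pearson correlation coefficient between the flattened magnitude spectrograms; the function names are hypothetical.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two flattened magnitude spectrograms."""
    return np.corrcoef(np.abs(a).ravel(), np.abs(b).ravel())[0, 1]

def source_cross_correlation(sources):
    """c_s: average correlation over all distinct source pairs."""
    n = len(sources)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([corr(sources[i], sources[j]) for i, j in pairs]))

def residual_cross_correlation(residual, sources):
    """c_{r,i}: average correlation between a residual and every source."""
    return float(np.mean([corr(residual, s) for s in sources]))
```

For three sources, `source_cross_correlation` averages the three pairwise terms of the c_s formula, and `residual_cross_correlation` averages the three residual-to-source terms of c_{r,i}.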
It is noted that a source code is available at
V. Conclusion
This paper has presented an NMF-EM-based audio source separation scheme using residual extraction and reprojection. A coarse-to-fine separation structure is used to account for the ambiguous area among sources. In particular, a residual signal is extracted by inserting an auxiliary NMF that represents the area of the mixed signal that is difficult to represent by any particular NMF of the sources. The residual signal is then reprojected onto the separated sources. The experimental results for real commercial content showed that the proposed audio source separation scheme provides much higher performance than the state-of-the-art approaches. Additionally, the proposed method produced much more stable results, even for content generated with artificial sound effects. However, the proposed coarse-to-fine source separation structure may increase the computational complexity compared with the existing NMF-based approach when the original sources are more strongly correlated. Nevertheless, we believe that the proposed scheme can be a useful tool for audio source separation using the NMF-EM algorithm.
This work was partially supported by the IT R&D program of MSIP/KEIT (10044569) by the NRF funded by the Ministry of Education, Science and Technology (No. NRF-2014S1A5B6037633 and No. NRF-2014R1A2A1A11049986) and by the Chung-Ang University Research Grant in 2014.
Choongsang Cho received his BS degree in electronic engineering from Suwon University, Rep. of Korea, in 2006 and his MS degree in information and communications from Gwangju Institute of Science and Technology, Rep. of Korea. Since 2008, he has been working as a researcher at Multimedia IP Research Center, Korea Electronics Technology Institute, Seongnam, Rep. of Korea. Currently, he is pursuing his PhD degree in imaging engineering at the Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University, Seoul, Rep. of Korea. His research interests include numerical signal processing, audio separation, digital holograms, and image segmentation.
Je Woo Kim received his BS and MS degrees in control & instrumentation engineering from the University of Seoul, Rep. of Korea, in 1997 and 1999, respectively. In 1999, he joined the Korea Electronics Technology Institute, Seongnam, Rep. of Korea, where he was involved in the development of video codecs, video transcoders, and multi-view video systems. He is currently a managerial researcher at the Multimedia IP Research Center, Korea Electronics Technology Institute. His research interests include audio-visual codecs and their applications, in particular UHD systems.
Corresponding Author
Sangkeun Lee received his BS and MS degrees in electronic engineering from Chung-Ang University, Seoul, Rep. of Korea, in 1996 and 1999, respectively. He received his PhD degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, USA, in 2003. He is an associate professor at the Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University. From 2003 to 2008, he was a staff research engineer with the Digital Media Solutions Lab, Samsung Information Systems America, Irvine, CA, USA, where he was involved in the development of video processing and enhancement algorithms (DNIe) for Samsung’s HDTV. His current research and development interests include computer vision, digital video, and image processing; especially augmented reality, video analysis/synthesis, denoising, compression for HDTV and multimedia applications, and CMOS image sensors. He is a senior member of the IEEE.
References
[1] H. Attias, “New EM Algorithm for Source Separation and Deconvolution with a Microphone Array,” Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., Hong Kong, China, Apr. 6–10, 2003, pp. 297–300.
[2] C.J. Chun and H.K. Kim, “Sound Source Separation Using Interaural Intensity Difference in Real Environments,” Proc. AES Convention, 2013.
[3] N.J. Bryan and G.J. Mysore, “Interactive Refinement of Supervised and Semi-supervised Sound Source Separation Estimates,” IEEE Int. Conf. Acoustics, Speech, Signal Process., Vancouver, Canada, May 26–31, 2013, pp. 883–887.
[4] C. Fevotte and C. Doncarli, “Two Contributions to Blind Source Separation Using Time-Frequency Distributions,” IEEE Signal Process. Lett., vol. 11, no. 3, 2004, pp. 386–389. DOI: 10.1109/LSP.2003.819343
[5] G.-S. Fu, “Blind Source Separation by Entropy Rate Minimization,” IEEE Trans. Signal Process., vol. 62, no. 16, 2014, pp. 4245–4255. DOI: 10.1109/TSP.2014.2333563
[6] E. Vincent, “Complex Nonconvex lp Norm Minimization for Underdetermined Source Separation,” Int. Conf. Ind. Compon. Anal., London, UK, Sept. 9–12, 2007, pp. 430–437.
[7] P. Smaragdis, “Static and Dynamic Source Separation Using Nonnegative Factorizations: A Unified View,” IEEE Signal Process. Mag., vol. 31, no. 3, 2014, pp. 66–75. DOI: 10.1109/MSP.2013.2297715
[8] A. Ozerov and C. Fevotte, “Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, 2010, pp. 550–563. DOI: 10.1109/TASL.2009.2031510
[9] A. Ozerov, E. Vincent, and F. Bimbot, “A General Flexible Framework for the Handling of Prior Information in Audio Source Separation,” IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 4, 2012, pp. 1118–1133. DOI: 10.1109/TASL.2011.2172425
[10] P. Smaragdis, “Convolutive Speech Bases and Their Application to Supervised Speech Separation,” IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 1, 2007, pp. 1–12. DOI: 10.1109/TASL.2006.876726
[11] C. Fevotte, N. Bertin, and J.-L. Durrieu, “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis,” Neural Comput., vol. 21, no. 3, 2009, pp. 793–830. DOI: 10.1162/neco.2008.04-08-771
[12] T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria,” IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 3, 2007, pp. 1066–1074. DOI: 10.1109/TASL.2006.885253
[13] D.D. Lee and H.S. Seung, “Learning the Parts of Objects by Nonnegative Matrix Factorization,” Nature, vol. 401, 1999, pp. 788–791. DOI: 10.1038/44565
[14] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data via EM Algorithm,” J. Royal Statistic Soc. Series B (Methodological), vol. 39, no. 1, 1977, pp. 1–38.
[15] T.K. Moon, “Mathematical Methods and Algorithms for Signal Processing,” NJ, USA: Prentice Hall, 2009.
[16] C. Moler, “Numerical Computing with MATLAB,” Electronic Edition, Natick, MA, USA: The Mathworks, 2004.
[17] Example Web Page
[18] A. Ozerov, E. Vincent, and F. Bimbot, Flexible Audio Source Separation Toolbox (FASST) Version 1.0 User Guide.
[19] R. Bianchini and A. Cipriani, “Virtual Sound: Sound Synthesis and Signal Processing-Theory and Practice with Csound,” Rome, Italy: ComTempo, 2000.
[20] E. Vincent, R. Gribonval, and C. Fevotte, “Performance Measurement in Blind Audio Source Separation,” IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, 2006, pp. 1462–1469. DOI: 10.1109/TSA.2005.858005
[21] E. Vincent, “First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results,” Int. Conf. Independent Compon. Anal. Signal Separation, London, UK, Sept. 9–12, 2007, pp. 552–559.