This paper describes an audio source separation scheme based on nonnegative matrix factorization (NMF) and expectation maximization (EM). For stable, high-performance separation, an auxiliary separation step that extracts source residuals and reprojects them onto the proper sources is proposed, taking into account the ambiguous region among sources and the refinement of each source. Specifically, an additional NMF model is designed for the ambiguous region, whose elements are not easily represented by any existing or predefined NMF of the sources. The residual signal is extracted by inserting this model into the NMF-EM-based audio separation; it is then refined by the weighted parameters of the separation and reprojected onto the separated sources. Experimental results demonstrate that the proposed scheme is more stable than existing algorithms and outperforms them by, on average, 4.4 dB in terms of the source distortion ratio.
Audio source separation from a multichannel mixture is an active research topic. It is also challenging, since it is usually a mathematically ill-posed problem; additional knowledge of the mixing process or the source signals is required to obtain successful separation results [1]–[10]. Independent component analysis has been used for such separation under the assumption of statistically independent sources [5]. For underdetermined source separation, probability-model-based approaches with generalized Gaussian priors or l_p-norm minimization have been applied [6]. Nonnegative matrix factorization (NMF) [11]–[13], which is useful in music transcription, has been employed for audio source separation because it is well suited to polyphonic musical instruments [10]–[12]. In particular, NMF and expectation maximization (EM) were successfully combined to solve mathematically ill-posed source separation problems [8]. Additionally, a model parameter estimation procedure using an iterative generalized expectation maximization (GEM) algorithm was proposed that incorporates a priori knowledge [9], [14].

However, l_p-norm minimization, NMF-EM-based, and general flexible approaches are not suitable when a mixed audio signal includes an ambiguous region that is not easily represented by any particular source. Moreover, wrong source separation in such a region can degrade the quality of the separated sources.

In this paper, a coarse-to-fine separation structure (residual extraction) and a reprojection scheme are proposed for stable and efficient NMF-EM-based audio source separation. In the residual extraction step, an auxiliary NMF that accounts for the ambiguous region among sources is attached to the NMF-EM-based system, and the residual signal is extracted using this additional NMF model. Next, the residual signal is split into two categories, the remaining source components (which have characteristics similar to the original sources) and the rest, using the weighted parameters and the NMF-EM-based separation. Then, the remaining source components are reprojected onto the sources separated in the residual extraction step.

This paper is structured as follows. Section II describes the NMF-EM-based audio source separation. The proposed scheme for extracting and reprojecting the residuals is described in Section III. A comparison and analysis of the experimental results is given in Section IV. Finally, Section V concludes the paper with a summary of the proposed algorithm.
II. NMF-EM-Based Audio Source Separation
This section describes audio source separation using NMF [7]–[9], [11]. NMF is a popular data decomposition technique in machine learning, image and audio signal processing, and audio source separation [11], [13]. Audio source separation is performed by decomposing the power spectrogram of the objects into two nonnegative matrices (a matrix W of narrow spectral patterns and a matrix H of the corresponding weights); thus, an F × N audio power spectrogram V, obtained by a short-time Fourier transform (STFT), can be expressed as

$$V \approx WH,$$

where W and H are F × K and K × N nonnegative matrices, respectively, and F and N denote the numbers of frequency bins and time frames, respectively. The decomposed matrices represent the characteristics of audio objects [7]–[9], [11] and are commonly used to analyze music [11]. The factorization is usually performed via cost function minimization:

$$\min_{W,\,H \ge 0} \Phi\left(V \mid WH\right),$$

where Φ(·) is a cost function defined as

$$\Phi\left(V \mid WH\right) = \sum_{f=1}^{F}\sum_{n=1}^{N} \varphi\left(V_{fn} \mid [WH]_{fn}\right),$$

where φ(x | y) is a scalar cost function such as the Euclidean distance, the Kullback–Leibler divergence, or the Itakura–Saito divergence. To solve the minimization problem, the EM or maximum likelihood algorithm can be applied using a zero-mean statistical distribution. In general, a mixed audio signal can be modeled as the audio sources multiplied by a mixing system plus additive noise:

$$x_{fn} = A_f s_{fn} + b_{fn},$$

where
$$x_{fn} = [x_{1,fn}, \ldots, x_{I,fn}]^{\mathrm T}, \quad s_{fn} = [s_{1,fn}, \ldots, s_{J,fn}]^{\mathrm T},$$
$$b_{fn} = [b_{1,fn}, \ldots, b_{I,fn}]^{\mathrm T}, \quad A_f = [a_{ij,f}] \text{ for } \langle i,j\rangle \in I \times J.$$
Here, I and J indicate the numbers of input channels and sources to be separated, respectively. Furthermore, x_{fn}, A_f, and s_{fn} denote the mixed audio signal, the mixing system, and the sources, respectively. Note that in the case of instantaneous mixing, the mixing system A_f is real-valued and shared among all frequency sub-bands; that is, A_f = A_inst for all f, with A_inst ∈ R^{I×J} [8]. The noise b_{fn} is assumed to be Gaussian with covariance Σ_b:

$$b_{i,fn} \sim N_c\left(0, \sigma_{i,f}^2\right) \quad \text{with} \quad \Sigma_{b,f} = \operatorname{diag}\left(\left[\sigma_{i,f}^2\right]_i\right).$$

Here, diag(u) returns a square matrix with the elements of vector u on the main diagonal. In an audio source separation problem, each source (object) is represented by an NMF consisting of a k_j-dimensional decomposition (k_j ≥ J):

$$|s_{j,fn}|^2 \approx W_j H_j, \quad s_{j,fn} = \sum_{l \in k_j} c_{l,fn}, \quad c_{l,fn} = w_{fl} h_{nl}^{\mathrm T},$$
$$W_j = \left[w_{f1}, \ldots, w_{fk_j}\right], \quad H_j = \left[h_{n1}, \ldots, h_{nk_j}\right]^{\mathrm T}.$$

The total dimension of the objects is k = k_1 + ⋯ + k_J. The EM algorithm is widely adopted to solve the mathematically ill-posed problem and optimize the cost function. Let the complete data set and the parameter set be Z = (X, S) and Θ = {A_f, W, H, Σ_b}, respectively, where X = (x_{fn})_I and S = (s_{fn})_J.
Then, to solve the problem, we use the log-likelihood function [15], defined by

$$Q(\Theta, \Theta^{i-1}) = \mathrm{E}\left[\log p(X, S \mid \Theta) \mid X, \Theta^{i-1}\right].$$

The resulting criterion can be expressed as

$$\Theta^{*} = \underset{\Theta}{\arg\max}\; Q(\Theta, \Theta^{i-1}),$$

where Θ^{i−1} represents the current parameters used to evaluate the expectation E[·], and Θ represents the new parameters that maximize Q. The log-likelihood function [8]–[9] for audio source separation can be expressed as

$$Q(\Theta, \Theta^{i-1}) = \sum_{fn}\left[\log\left|\Sigma_{b,f}\right| + (x_{fn} - A_f s_{fn})^{\mathrm H}\,\Sigma_{b,f}^{-1}\,(x_{fn} - A_f s_{fn})\right] + \sum_{k}\sum_{fn}\left[\log\left(w_{fk} h_{kn}\right) + \frac{\left|c_{k,fn}\right|^{2}}{w_{fk} h_{kn}}\right].$$

The parameters [8]–[9] are computed from the partial derivatives of the log-likelihood with respect to A, W, H, and Σ_b:

$$A_f = R_{xs,f} R_{ss,f}^{-1},$$
$$\Sigma_{b,f} = \operatorname{diag}\left[R_{xx,f} - A_f R_{xs,f}^{\mathrm H} - R_{xs,f} A_f^{\mathrm H} + A_f R_{ss,f} A_f^{\mathrm H}\right],$$
$$w_{f k_j} = \frac{1}{N}\sum_{n}\frac{u_{k_j,fn}}{h_{k_j n}}, \qquad h_{k_j n} = \frac{1}{F}\sum_{f}\frac{u_{k_j,fn}}{w_{f k_j}},$$

with

$$R_{xx,f} = \frac{1}{N}\sum_{n} x_{fn} x_{fn}^{\mathrm H}, \qquad R_{xs,f} = \frac{1}{N}\sum_{n} x_{fn} s_{fn}^{\mathrm H},$$
$$R_{ss,f} = \frac{1}{N}\sum_{n} s_{fn} s_{fn}^{\mathrm H}, \qquad u_{k_j,fn} = \left|c_{k_j,fn}\right|^{2}.$$
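The E-step covariance statistics and the M-step update of the mixing system above can be sketched numerically. The following is a minimal toy example, not the authors' implementation: it assumes known sources, a noise-free instantaneous mixture, and randomly drawn stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: I = 2 channels, J = 3 sources, F = 4 bins, N = 100 frames.
I, J, F, N = 2, 3, 4, 100

# Hypothetical complex STFT sources and a real instantaneous mixing matrix.
S = rng.standard_normal((F, N, J)) + 1j * rng.standard_normal((F, N, J))
A = rng.uniform(0.0, 1.0, (I, J))
X = np.einsum("ij,fnj->fni", A, S)          # x_fn = A s_fn (noise-free toy)

def em_statistics(X, S):
    """Per-frequency empirical covariances R_xx, R_xs, R_ss (averaged over n)."""
    N = X.shape[1]
    Rxx = np.einsum("fni,fnk->fik", X, X.conj()) / N
    Rxs = np.einsum("fni,fnj->fij", X, S.conj()) / N
    Rss = np.einsum("fnj,fnk->fjk", S, S.conj()) / N
    return Rxx, Rxs, Rss

Rxx, Rxs, Rss = em_statistics(X, S)

# M-step update of the mixing system: A_f = R_xs,f R_ss,f^{-1}.
A_hat = np.stack([Rxs[f] @ np.linalg.inv(Rss[f]) for f in range(F)])

# In this noise-free toy, R_xs,f = A R_ss,f, so every A_f recovers A exactly
# (up to floating-point error).
print(np.allclose(A_hat, A[None, :, :].astype(complex), atol=1e-8))
```

In the noisy case, A_f would instead be re-estimated at each EM iteration alongside Σ_b, W, and H.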
III. Two-Step Audio Source Separation: Residual Extraction and Reprojection Scheme
In audio source separation, a mixed audio signal, x, can be described by two concepts according to the type of source combination, as shown in Fig. 1. When the audio sources (original objects) are disjoint (that is, $\bigcap_{j=1}^{J} s_j = \emptyset$), the mixed signal is represented as shown in Fig. 1(a). On the other hand, the signal is represented as shown in Fig. 1(b) when an intersection among the audio sources exists; that is, $\bigcap_{j=1}^{J} s_j \neq \emptyset$.
Fig. 1. Concepts to represent audio source mixing: (a) case 1 with null intersection and (b) case 2 with intersection.
To successfully separate a mixed audio signal, it must be split into the individual object sources. In the case of a null intersection among sources, a point in the mixed-signal region is matched exactly to a point in one source. However, a point in the mixed signal maps to at least two sources when an ambiguous region among sources exists. Thus, wrong source separation in the mixed-signal region can degrade the quality of the recovered sources, because a component of the mixed signal can be incorrectly mapped to a different source.

To improve separation performance when an ambiguous region exists, a coarse-to-fine structure-based audio separation (residual extraction and reprojection steps) is proposed, as shown in Fig. 2. To extract a residual signal in the ambiguous region, an auxiliary NMF, W_{J+1}H_{J+1}, is introduced for the residual signal, which does not easily belong to any particular single NMF of a source. By adding a model for the ambiguous area of the mixed signal, the number of objects in the NMF-EM-based separation is increased by one, and a corresponding component is added to the mixing system. The initial NMF parameters for the EM-based audio source separation are then represented as

$$W'H' = \left\{W_1 H_1, \ldots, W_J H_J, W_{J+1} H_{J+1}\right\},$$
$$A' = \left\{A, a_{i(J+1)}\right\} = [a_{ij}] \text{ for } \langle i,j\rangle \in I \times (J+1).$$
Fig. 2. Residual extraction and reprojection-based separation structure.
The residual signal, which is not represented by any particular NMF of the sources, is assumed to be a random signal because the ambiguous region can belong to two or more sources. In this paper, the NMF components of the ambiguous region, W_{J+1} and H_{J+1}, are generated by a normally distributed pseudorandom method [16]. Specifically, the NMF for the ambiguous region is modeled as a random signal with zero mean and a certain variance, and its power spectrogram, |W_{J+1}H_{J+1}|, is adopted and illustrated in Fig. 3.
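The augmentation of the parameter set with a random auxiliary NMF can be sketched as below. This is an illustrative toy with hypothetical pretrained factors; the residual dimension `k_res` and the initial value of the extra mixing column are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

F, N, I, J = 6, 50, 2, 3
k_j = [6, 4, 4]           # per-source NMF dimensions, as in the paper's setup

# Hypothetical pretrained per-source NMF factors W_j (F x k_j), H_j (k_j x N).
W = [np.abs(rng.standard_normal((F, k))) for k in k_j]
H = [np.abs(rng.standard_normal((k, N))) for k in k_j]

# Auxiliary NMF W_{J+1} H_{J+1} for the ambiguous region: zero-mean normal
# pseudorandom values made nonnegative via the absolute value.
k_res = 4                 # assumed dimension for the residual model
W.append(np.abs(rng.standard_normal((F, k_res))))
H.append(np.abs(rng.standard_normal((k_res, N))))

# Augmented parameter set: one extra "object" and one extra mixing column.
W_aug = np.concatenate(W, axis=1)            # F x (k_1 + ... + k_J + k_res)
H_aug = np.concatenate(H, axis=0)
A = rng.uniform(size=(I, J))
extra_col = np.full((I, 1), 1.0 / (J + 1))   # assumed initial a_{i,J+1}
A_aug = np.concatenate([A, extra_col], axis=1)

print(W_aug.shape, H_aug.shape, A_aug.shape)
```

The EM iterations then run over J + 1 objects; the (J + 1)-th reconstruction is the extracted residual r_1.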
In the first step, the residual extraction stage, a mixed audio signal is separated into source signals, {s'_1, …, s'_J}, and a residual signal, r_1, using the NMF-EM-based separation. The separated signals contain the dominant components of each original source before mixing, and the residual signal r_1 contains a common set of components that is difficult to split into the exact sources. To separate the residual signal r_1 into exact residuals of the objects, a second separation is performed by the NMF-EM algorithm using weighted initial parameters; this is called the "source residual projection stage." To consider the characteristics of objects estimated from the mixed audio, the weighted initial parameters, W'_n and H'_n, are obtained from a weighted sum of the initial parameters, W' and H', and the parameters updated in the previous step, W'_u and H'_u, as

$$W'_{\text n} H'_{\text n} = \omega_2 \times \left[\omega_1 \left\{W'H'\right\} + (1 - \omega_1)\left\{W'_{\text u} H'_{\text u}\right\}\right], \qquad (13)$$

where ω_1 denotes the weighting value used to combine the initial and updated parameters. In (13), ω_2, which accounts for the variation of the input signal (the input to the second step is the residual), is defined by the ratio of the mean absolute power spectrograms of the mixed and residual signals:

$$\omega_2 = \sqrt{\frac{\frac{1}{F \times N}\sum_{f,n}\left|X_{f,n}\right|}{\frac{1}{F \times N}\sum_{f,n}\left|R_{1_{f,n}}\right|}}, \qquad (14)$$

where X_{f,n} and R_{1_{f,n}} denote the power spectrograms of the mixed and residual signals, respectively.

Finally, the mixed residual r_1 is refined by the NMF-EM-based separation into the remaining signals of the audio sources, {r_{1,s_1}, …, r_{1,s_J}} (which can be interpreted as the real source residues), and a second residual signal, r_2. In other words, the remaining signals are the source residues sifted by the second separation with parameters W'_n H'_n. Here, "remaining signal" means a signal that is highly likely, in terms of statistical probability, to belong to a source. To improve separation performance using the refined residual, the sifted signals, r_{1,s_j}, are projected onto the audio sources, s'_j, separated during the residual extraction step:

$$s_j = s'_j + r_{1,s_j}. \qquad (15)$$
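The weighting and reprojection steps above can be sketched as follows. This is an illustrative toy with random stand-in spectrograms; applying the ω-weighting directly to a single W factor (rather than the full parameter set W'H') is a simplification for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

F, N = 6, 50
# Hypothetical magnitude spectrograms of the mixture and first-step residual.
X_mag = np.abs(rng.standard_normal((F, N)))
R1_mag = 0.3 * np.abs(rng.standard_normal((F, N)))

# omega_2: square root of the ratio of mean absolute power spectrograms.
omega2 = np.sqrt(X_mag.mean() / R1_mag.mean())

# omega_1: fixed blend of initial and updated parameters (0.9 in Section IV).
omega1 = 0.9
W_init = np.abs(rng.standard_normal((F, 8)))   # initial parameters
W_upd = np.abs(rng.standard_normal((F, 8)))    # parameters updated in step 1
W_new = omega2 * (omega1 * W_init + (1.0 - omega1) * W_upd)

# Final reprojection: add each sifted residual back to its separated source.
s_prime = rng.standard_normal((3, F, N))       # first-step source estimates
r1_s = 0.1 * rng.standard_normal((3, F, N))    # residuals sifted per source
s_final = s_prime + r1_s

print(omega2 > 0, W_new.shape, s_final.shape)
```

Since the residual is weaker than the mixture, ω_2 > 1 here, scaling the second-stage parameters down to the residual's level is handled implicitly by the ratio.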
IV. Experiment
For the performance evaluation, we compare the proposed scheme with previous approaches: the l_p-norm minimization approach (l_pNM) [6], the NMF-EM-based approach (NMF-EM) [8], [17], and the general flexible framework-based algorithm (GFF) [9], [18]. For the test set, five 10 s clips and one 30 s clip from a variety of Korean pop (K-pop) songs, commercially recorded at 44.1 kHz with 16-bit resolution in stereo for object audio services in South Korea, are used. The original objects of the music files are provided independently, and the mixed signals are generated from three objects of each K-pop song (vocal, drum, and keyboard) by an instantaneous mixing system using a 2 × 3 matrix whose elements lie in the range [0, 1]. The fourth and fifth files are different sections of the same content: the fourth has weak artificial effects from a vocoder and virtual sound technology [19], which are widely used in commercial music, and the fifth has strong artificial effects for approximately 30% of its length.

For the quantitative evaluation, the source distortion ratio (SDR) and source-to-interference ratio (SIR) [20]–[21] of the separated results are measured against the original files. To generate the initial parameters for the given system, the original objects of the contents are applied to NMF generation based on the EM approach with 1,000 iterations; k_j for the objects is set to {6, 4, 4}. All parameters used in this test are summarized in Table 1. The elements of the initial mixing system are set to the same value (1/J), and the mixed signal is separated by the proposed approach with 20 and 10 iterations for the two stages, respectively. Additionally, weight ω_1 is set to 0.9 for the second stage. Note that the weight for the updated parameters is lower than that for the initial parameters because the updated parameters, estimated from the mixed audio at the first stage, can characterize the objects less accurately.
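The mixing setup described above (three objects mixed to stereo through an instantaneous 2 × 3 matrix with elements in [0, 1]) can be reproduced with a short sketch; the random signals here merely stand in for the real K-pop objects.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: J = 3 objects (vocal, drum, keyboard), I = 2 channels.
fs, dur = 44100, 1.0
n = int(fs * dur)
sources = rng.standard_normal((3, n))        # stand-ins for the real objects

# Instantaneous 2 x 3 mixing matrix with elements in [0, 1], as in Section IV.
A_inst = rng.uniform(0.0, 1.0, (2, 3))
mixture = A_inst @ sources                   # stereo mixture, shape (2, n)

print(mixture.shape)
```

Because the mixing is instantaneous, the same real-valued matrix applies in every STFT frequency sub-band, which is exactly the A_f = A_inst condition of Section II.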
To compare SDR and SIR scores, the separation performances of the proposed algorithm for the three objects are shown in Tables 2 and 3, respectively. An SDR score measures the overall distortion of a recovered signal, including spatial distortion, interference, and artifacts, and an SIR score evaluates the relative amount of interference between the original and recovered signals [20]–[21]. Higher scores indicate better separation performance. As the SDR comparison of the existing and proposed schemes shows, the vocal and drum objects are separated better than the keyboard object. For the vocal object, the NMF-EM approach has higher separation performance than the l_pNM and GFF approaches for all files except the fifth. The average SDR of the NMF-EM approach is approximately 8.2 dB excluding the fifth file and approximately 6.1 dB over the entire content; the difference comes from the lower separation performance on the fifth file, which contains strong artificial effects. In contrast, the average SDR of the proposed method is approximately 10.7 dB and 10.5 dB without and with the fifth file, respectively. The proposed method scores higher than the existing schemes and exhibits stable separation performance even in the presence of strong artificial effects, regardless of content. Similarly, the proposed scheme outperforms the existing schemes with stable performance on the drum and keyboard objects. The proposed residual reprojection scheme increases the average SDR of the vocal, drum, and keyboard objects by approximately 1.0 dB, 1.4 dB, and 1.5 dB, respectively. In the vocal SIR scores, the NMF-EM approach has higher performance than the l_pNM and GFF approaches for all files except the fifth; moreover, for the drum and keyboard objects, NMF-EM outperforms the l_pNM and GFF schemes on all test files. The average SIR scores of the NMF-EM approach are the best among the existing algorithms. The proposed method produces higher SIR scores for all three objects than the existing methods; compared with the NMF-EM approach, the scores for the three objects increase by approximately 6.6 dB, 5.2 dB, and 3.9 dB, respectively. Note that the residual extraction in the proposed algorithm is performed to refine the residues. Therefore, to analyze and evaluate the characteristics of the extracted residues, the cross-correlation between sources is calculated as

$$c_s = \frac{\operatorname{corr}(|s_1|,|s_2|) + \operatorname{corr}(|s_1|,|s_3|) + \operatorname{corr}(|s_2|,|s_3|)}{3}, \qquad (16)$$
Table 3. SIR scores of proposed method and existing algorithms (dB).

Object     Term          1      2      3      4      5      6    Avg.
Vocal      l_pNM [6]   −4.6    6.2   −0.7    0.9    2.0   −4.6   −0.1
           NMF-EM [8]   9.4   15.5   13.0   11.4   −4.7   10.3    9.2
           GFF [9]      1.0    6.6    4.6    5.9    2.9    3.5    4.1
           Proposed    15.9   16.8   16.3   17.1   14.1   12.4   15.4
                       15.2   17.6   16.3   17.0   16.5   12.4   15.8
Drum       l_pNM [6]   −1.8    1.4   −5.7   −2.2    5.2   −1.8   −0.8
           NMF-EM [8]  14.0    8.3   17.7   17.0    9.8   13.4   13.4
           GFF [9]      8.7    8.1    8.5   14.1    4.8    7.5    8.6
           Proposed    23.9   23.5   18.1   20.7   12.0   17.0   19.2
                       22.8   23.0   17.8   20.6   11.0   16.6   18.6
Keyboard   l_pNM [6]    5.9    2.3    2.7    0.6   −8.0    5.9    1.6
           NMF-EM [8]   7.1    3.4    4.1    1.0   −4.0    6.5    3.0
           GFF [9]     −0.9   −3.2   −3.8   −5.5   −8.8   −3.1   −4.2
           Proposed     9.4    5.0    4.4    6.9    8.5    7.0    6.9
                        8.7    6.1    4.4    6.8    9.4    6.2    6.9
where s_1, s_2, and s_3 indicate the power spectrograms of the sources and corr(·) is the cross-correlation operator. Similarly, the cross-correlation between the residue and the sources is measured as

$$c_{r,i} = \frac{\operatorname{corr}(|r_i|,|s_1|) + \operatorname{corr}(|r_i|,|s_2|) + \operatorname{corr}(|r_i|,|s_3|)}{3} \quad \text{for } i = 1, 2, \qquad (17)$$

where r_i indicates the residual power spectrogram during the residual extraction (i = 1) and projection (i = 2) steps. Table 4 shows that cross-correlation c_{r,1} is much higher than cross-correlation c_s. In other words, the sources have non-zero correlation with each other when the proposed concept of audio source mixing is assumed, and a residual signal is extracted from a high-correlation region between sources during the residual extraction step. Cross-correlation c_{r,2} decreases during the residual projection stage compared with c_{r,1}, indicating that the level of uncertainty in the ambiguous region decreases because the inherent signal is reprojected from the mixed residue onto each source.
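The two correlation measures can be sketched as below. Implementing corr(·) as the normalized (Pearson) cross-correlation of the flattened power spectrograms is an assumption about the operator, and the spectrograms here are random stand-ins.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

def corr(a, b):
    """Normalized cross-correlation of two flattened power spectrograms."""
    a, b = a.ravel(), b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

F, N = 8, 60
S = [np.abs(rng.standard_normal((F, N))) for _ in range(3)]  # |s_1|, |s_2|, |s_3|
r1 = np.abs(rng.standard_normal((F, N)))                     # residual, step i = 1

# c_s: mean pairwise correlation between the three sources.
c_s = np.mean([corr(a, b) for a, b in combinations(S, 2)])

# c_{r,1}: mean correlation between the residual and each source.
c_r1 = np.mean([corr(r1, s) for s in S])

print(-1.0 <= c_s <= 1.0 and -1.0 <= c_r1 <= 1.0)
```

With the real separated signals, c_{r,1} > c_s would indicate that the extracted residual indeed comes from the high-correlation (ambiguous) region.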
It is noted that a source code is available at http://mmc.cau.ac.kr/publications/publications.php.
V. Conclusion
This paper has presented an NMF-EM-based audio source separation scheme using residual extraction and reprojection. A coarse-to-fine separation structure is used to account for the ambiguous area among sources. In particular, a residual signal is extracted by inserting an auxiliary NMF to represent the region of the mixed signal that is difficult to represent by any particular NMF of the sources. The residual signal is then refined and reprojected onto the separated sources. Experimental results on real commercial content showed that the proposed audio source separation scheme provides much higher performance than state-of-the-art approaches. Additionally, the proposed method produces much more stable results, even for content generated with artificial sound effects. However, the proposed coarse-to-fine source separation structure may increase computational complexity compared with the existing NMF-based approach when the original sources are more correlated. Nevertheless, we believe that the proposed scheme is a useful tool for audio source separation using the NMF-EM algorithm.
This work was partially supported by the IT R&D program of MSIP/KEIT (10044569), by the NRF funded by the Ministry of Education, Science and Technology (No. NRF-2014S1A5B6037633 and No. NRF-2014R1A2A1A11049986), and by the Chung-Ang University Research Grant in 2014.
BIO
Choongsang Cho (ideafisher@keti.re.kr) received his BS degree in electronic engineering from Suwon University, Rep. of Korea, in 2006 and his MS degree in information and communications from Gwangju Institute of Science and Technology, Rep. of Korea. Since 2008, he has been working as a researcher at the Multimedia IP Research Center, Korea Electronics Technology Institute, Seongnam, Rep. of Korea. Currently, he is pursuing his PhD degree in imaging engineering at the Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University, Seoul, Rep. of Korea. His research interests include numerical signal processing, audio separation, digital holograms, and image segmentation.

Je Woo Kim (jwkim@keti.re.kr) received his BS and MS degrees in control & instrumentation engineering from the University of Seoul, Rep. of Korea, in 1997 and 1999, respectively. In 1999, he joined the Korea Electronics Technology Institute, Seongnam, Rep. of Korea, where he was involved in the development of video codecs, video transcoders, and multi-view video systems. He is currently a managerial researcher at the Multimedia IP Research Center, Korea Electronics Technology Institute. His research interests include audio-visual codecs and their applications, in particular UHD systems.

Sangkeun Lee (corresponding author, sangkny@cau.ac.kr) received his BS and MS degrees in electronic engineering from Chung-Ang University, Seoul, Rep. of Korea, in 1996 and 1999, respectively. He received his PhD degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, USA, in 2003. He is an associate professor at the Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University. From 2003 to 2008, he was a staff research engineer with the Digital Media Solutions Lab, Samsung Information Systems America, Irvine, CA, USA, where he was involved in the development of video processing and enhancement algorithms (DNIe) for Samsung's HDTV. His current research and development interests include computer vision, digital video, and image processing, especially augmented reality, video analysis/synthesis, denoising, compression for HDTV and multimedia applications, and CMOS image sensors. He is a senior member of the IEEE.
References

H. Attias, "New EM Algorithm for Source Separation and Deconvolution with a Microphone Array," Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., Hong Kong, China, Apr. 6–10, 2003, pp. 297–300.

C. Fevotte and C. Doncarli, "Two Contributions to Blind Source Separation Using Time-Frequency Distributions," IEEE Signal Process. Lett., vol. 11, no. 3, 2004, pp. 386–389. doi: 10.1109/LSP.2003.819343.

P. Smaragdis, "Static and Dynamic Source Separation Using Nonnegative Factorizations: A Unified View," IEEE Signal Process. Mag., vol. 31, no. 3, 2014, pp. 66–75. doi: 10.1109/MSP.2013.2297715.

A. Ozerov, E. Vincent, and F. Bimbot, "A General Flexible Framework for the Handling of Prior Information in Audio Source Separation," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 4, 2012, pp. 1118–1133. doi: 10.1109/TASL.2011.2172425.

P. Smaragdis, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 1, 2007, pp. 1–12. doi: 10.1109/TASL.2006.876726.

C. Fevotte, N. Bertin, and J.-L. Durrieu, "Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis," Neural Comput., vol. 21, no. 3, 2009, pp. 793–830. doi: 10.1162/neco.2008.04-08-771.

T. Virtanen, "Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 3, 2007, pp. 1066–1074. doi: 10.1109/TASL.2006.885253.

A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via EM Algorithm," J. Royal Statistic Soc. Series B (Methodological), vol. 39, no. 1, 1977, pp. 1–38.

A. Ozerov, E. Vincent, and F. Bimbot, Flexible Audio Source Separation Toolbox (FASST) Version 1.0 User Guide. http://bass-db.gforge.inria.fr/fasst/FASST_UserGuide_v1.pdf

E. Vincent, R. Gribonval, and C. Fevotte, "Performance Measurement in Blind Audio Source Separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, 2006, pp. 1462–1469. doi: 10.1109/TSA.2005.858005.
@article{HJTODO_2015_v37n4_780,
  title     = {Audio Source Separation Based on Residual Reprojection},
  author    = {Cho, Choongsang and Kim, Je Woo and Lee, Sangkeun},
  journal   = {ETRI Journal},
  volume    = {37},
  number    = {4},
  year      = {2015},
  month     = {Aug},
  publisher = {Electronics and Telecommunications Research Institute},
  url       = {http://dx.doi.org/10.4218/etrij.15.0114.1311},
  doi       = {10.4218/etrij.15.0114.1311}
}