A Design of Matching Engine for a Practical Query-by-Singing/Humming System with Polyphonic Recordings
KSII Transactions on Internet and Information Systems (TIIS). 2014. Feb, 8(2): 723-736
Copyright © 2014, Korean Society For Internet Information
  • Received : December 20, 2013
  • Accepted : January 21, 2014
  • Published : February 27, 2014
About the Authors
Seok-Pil Lee
Department of Digital Media Technology, Sangmyung University, Seoul, South Korea
Hoon Yoo
Department of Digital Media Technology, Sangmyung University, Seoul, South Korea
Dalwon Jang
Korea Electronics Technology Institute, Seoul, South Korea

This paper proposes a matching engine for a query-by-singing/humming (QbSH) system that works with polyphonic music files such as MP3 files. The pitch sequences extracted from polyphonic recordings may be distorted, so we use chroma-scale representation, pre-processing, compensation, and asymmetric dynamic time warping to reduce the influence of the distortions. In an experiment with a 28-hour music DB, our QbSH system based on a polyphonic database achieves a mean reciprocal rank (MRR) of 0.725, which is very promising in comparison with published QbSH systems based on monophonic databases. Our matching engine can also be used for QbSH systems based on MIDI DBs; that performance was verified in MIREX 2011.
1. Introduction
With the proliferation of digital music, efficient indexing and retrieval tools are required for finding desired music in a large digital music database (DB). Music information retrieval (MIR) has therefore been a matter of wide interest [1, 2, 3, 4]. The area of MIR includes challenging tasks such as music genre classification [5], melody transcription [6, 7, 8], cover-song identification [9], tempo estimation [10], audio identification [11, 12, 13, 14, 15], query-by-singing/humming (QbSH) [16, 17, 18, 19, 20, 21, 22, 23], and so on. Among these tasks, QbSH is differentiated by taking a user's acoustic query as input. Conventional QbSH systems have generally been developed with DBs constructed from MIDI files. Since most QbSH systems use pitch sequences, an automatic transcription process is needed for a QbSH system based on polyphonic recordings, but automatically transcribed pitch sequences from polyphonic recordings are not as reliable as the sequences from MIDI files [24, 25, 26, 27]. Thus the accuracy of a QbSH system is higher when using MIDI files than polyphonic recordings. However, manual transcription is laborious and time-consuming.
This paper proposes a matching engine for a practical QbSH system that works with polyphonic recordings such as MP3 files as well as monophonic sources such as MIDI files. Various automatic melody transcription algorithms are being developed, but the transcriptions are not yet perfect. Fig. 1 shows three different processes for obtaining pitch sequences; each process can be thought of as a channel in which the pitch sequence is distorted. Sequence A, which is extracted from manually transcribed MIDI files, is not distorted, but sequences B and C are distorted by the user's singing/humming and by automatic transcription, respectively. The QbSH system based on polyphonic recordings must match sequences B and C, while the system based on MIDI files must match sequences A and B.
Fig. 1. Three different pitch sequences from a melody: sequence A is extracted from MIDI files, sequence B is extracted from a user's singing/humming, and sequence C is extracted from a polyphonic music recording.
The main contribution of this paper is the design of a matching engine that accounts for these inaccuracies. We use previously developed algorithms for the pitch transcription [6, 7, 8]. The matching engine of our QbSH system is based on the asymmetric dynamic time warping (DTW) algorithm [28, 29, 30]. To reduce the influence of distortions on the extracted pitch sequences, we use pre-processing and compensation.
Numerous MIR algorithms and systems are evaluated and compared annually under controlled conditions in the music information retrieval evaluation exchange (MIREX) contest [1, 2, 31]. However, MIREX has no task for QbSH systems based on polyphonic recordings, so we compare the performance of the proposed system with the same system without chroma representation. We also verified the feasibility of our matching engine without chroma representation for the QbSH system based on MIDI files in MIREX 2011 [32].
The rest of this paper is organized as follows. Section 2 presents the overall structure of our QbSH system. Sections 3 and 4 explain the two processes of extracting pitch sequences from polyphonic and monophonic recordings and the proposed matching engine, respectively. Sections 5 and 6 present experimental results and conclusions, respectively.
2. Overall Structure of the QbSH System
A list of symbols used in this paper is summarized in Table 1 .
Table 1. A list of symbols
Fig. 2 shows the block diagram of our QbSH system. The processes shown on the left side of Fig. 2 are performed offline, and the processes on the right side are performed on the user's query input. The feature DB is constructed from the MIDI DB or the music DB; the music DB contains a number of polyphonic recordings such as MP3 files. From the data in the DBs, a frame-based pitch sequence S_DB^(i) is extracted, and the feature DB stores the pitch sequence and the name of the music data ID(i) for i = 1, 2, ..., I. From the user's query, a frame-based pitch sequence S_q is extracted, and the matching process based on the DTW algorithm [18] finds the DB feature that best matches S_q. The song name associated with that feature is then output.
Fig. 2. Overall structure of the QbSH system
3. Pitch Sequence Extraction
In this section, two pitch sequence extraction processes, one from polyphonic music and one from a monophonic query, are briefly explained. Pitch extraction from MIDI is nothing more than decoding the MIDI file. The two processes commonly start with a decoding step, since the input is generally compressed by an audio codec such as MP3. The decoded signal is re-sampled at 8 kHz and divided into 64 ms frames. For each frame, the pitch frequency is obtained and converted into the MIDI note number domain. Strictly, a rounding operator is needed to obtain an integer MIDI note number, but in our implementation a real number is kept without rounding. Mathematically, the conversion is formulated as
M = 69 + 12 · log2(f / 440),

where M and f are a MIDI note number and a frequency in Hz, respectively [19]. However, the pitch sequence extracted from MIDI files has integer MIDI note values.
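The conversion above can be sketched in a few lines. This is the standard frequency-to-MIDI mapping with A4 = 440 Hz pinned to note 69, kept as a real number without rounding, as in our implementation:

```python
import math

def freq_to_midi(f_hz):
    """Convert a frequency in Hz to a (real-valued) MIDI note number.

    Uses the standard reference A4 = 440 Hz = MIDI note 69; no rounding,
    so fractional note values are preserved.
    """
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)
```

For example, `freq_to_midi(440.0)` gives 69.0 and `freq_to_midi(880.0)` gives 81.0, one octave (12 semitones) higher.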
- 3.1 Pitch Sequence Extraction from Monophonic Query
Since a humming query contains uninformative signal such as silence, a voice-activity detection (VAD) algorithm is used to detect informative frames. The improved minima controlled recursive averaging (IMCRA) algorithm, which is based on the ratio of the spectra of the humming and noise signals, is used for VAD [28]. A pitch value of 0 is assigned to non-informative frames. After VAD, the pitch value of each frame is determined by a method based on spectro-temporal autocorrelation [29]: the temporal autocorrelation and the spectral autocorrelation are computed and linearly combined.
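The frame-level pitch decision can be illustrated with a minimal sketch of the temporal-autocorrelation half of the method, assuming the paper's 8 kHz sampling rate and an assumed 50-400 Hz search range; the full method of [29] linearly combines this with a spectral autocorrelation term, which is omitted here:

```python
import numpy as np

FS = 8000                # sampling rate used in the paper
F_MIN, F_MAX = 50, 400   # assumed pitch search range for singing/humming

def pitch_by_autocorr(frame):
    """Estimate the pitch of one voiced frame by temporal autocorrelation.

    The paper's method [29] combines temporal and spectral autocorrelation;
    only the temporal part is sketched here.
    """
    x = frame - np.mean(frame)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[lag] for lag >= 0
    lo, hi = FS // F_MAX, FS // F_MIN                  # lag range 20..160
    lag = lo + int(np.argmax(r[lo:hi + 1]))            # best period in range
    return FS / lag                                     # pitch in Hz

# Synthetic 200 Hz tone: the detected period is 40 samples, i.e. 200 Hz.
t = np.arange(512) / FS
f0 = pitch_by_autocorr(np.sin(2 * np.pi * 200 * t))
```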
- 3.2 Pitch Sequence Extraction from Polyphonic Music
A pitch is estimated based on the average harmonic structure (AHS) of pitch candidates [30]. From the spectrum of the polyphonic music, peaks are selected and the pitch candidates are determined. Based on the AHS values of the candidates and the continuity of the pitch sequence, a pitch value is estimated for each frame. For polyphonic music, voicing detection is not as reliable as the VAD algorithm used for monophonic queries, so a pitch is estimated in every frame without voicing detection, and a frame is ignored when it is silent. This causes temporal distortion of the pitch sequence, but the matching algorithm based on DTW is robust against such distortion. Very long silences can degrade the performance of the QbSH system, but silence is generally very short (<< 1 sec).
- 3.3 Melody Extraction
The main melody extracted from the polyphonic signal serves as the reference dataset for our query system. Multiple fundamental frequencies (multi-F0) have to be estimated before the main melody can be extracted from a polyphonic music signal, which contains various instrument sources together with the singer's vocal. This topic has been researched extensively over the last decade, but the literature shows that multi-F0 estimation is not an easy task, especially when the accompaniment is stronger than the main vocal, as often happens in current popular music such as dance and rock. Keeping this in mind, we propose a method of tracking the main melody from the multi-F0 using the harmonic structure, which is a key property of the vocal signal. All musical instruments except percussion instruments exhibit a harmonic structure, as does the human voice.
The vocal region is detected using the zero-crossing rate, frame energy, and the deviation of spectral peaks. We introduce a vocal enhancement module based on multi-frame processing and a noise suppression algorithm to improve the accuracy of the vocal pitch. It is a modification of the adaptive noise suppression algorithm of the IS-127 EVRC speech codec, which offers enhanced performance at relatively low complexity. For each channel, a gain is calculated from the SNR between the input signal and the noise level predicted by a pre-determined method, and the input signal in that channel is scaled by this gain. The noise-suppressed signal is then obtained by inverse transformation.
The multi-F0 candidates are estimated from the predominant multiple pitches calculated by harmonic structure analysis. The multi-F0 is decided by grouping the pitches into several sets and checking the validity of their continuity and AHS. The melody is obtained by tracking the estimated F0. Voiced or unvoiced frames are determined in the pre-processing stage: if a frame is judged unvoiced, F0 is assumed not to exist; otherwise harmonic analysis is performed. Multi-F0 is estimated through three processing modules: peak picking, F0 detection, and harmonic structure grouping. Because the polyphonic signal mixes several instrument sources, several peak combinations may correspond to an F0.
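The harmonic-structure idea behind the F0 candidates can be illustrated with a toy average-harmonic-magnitude score. This is a deliberate simplification of AHS-based estimation [30], with an assumed frame length and candidate range; averaging over harmonics (rather than summing) penalizes candidates an octave below the true F0, whose odd "harmonics" are absent:

```python
import numpy as np

FS, N = 8000, 800            # 8 kHz sampling, 100 ms frame (10 Hz bins)

def f0_by_harmonic_avg(frame, n_harm=5, f_lo=100, f_hi=400):
    """Score each F0 candidate by the average spectral magnitude at its
    first n_harm harmonics and return the best-scoring candidate (Hz).

    Toy stand-in for AHS-based estimation: a sub-octave candidate averages
    in the zero bins between true harmonics and loses to the true F0.
    """
    spec = np.abs(np.fft.rfft(frame))
    df = FS / N                                           # 10 Hz per bin
    best_f0, best_score = 0.0, -1.0
    for b in range(int(f_lo / df), int(f_hi / df) + 1):   # candidate bins
        harms = [k * b for k in range(1, n_harm + 1) if k * b < len(spec)]
        score = np.mean(spec[harms])
        if score > best_score:
            best_f0, best_score = b * df, score
    return best_f0

# A tone with harmonics at 200, 400, 600 Hz: the 200 Hz candidate wins
# over both the octave-up (400 Hz) and octave-down (100 Hz) candidates.
n = np.arange(N)
frame = sum(np.sin(2 * np.pi * f * n / FS) for f in (200, 400, 600))
```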
4. Matching Engine
- 4.1 Problems to Solve
A matching engine should be robust against the mismatch between S_q and S_DB^(i). As shown in Fig. 1, the two sequences originate from a melody that can be written in a musical score. However, the information in the musical score, which is almost perfectly preserved in a polyphonic melody, may be distorted in the process of extracting pitch sequences. Fig. 3 shows examples of pitch sequences and illustrates the tendency of the distortions: the pitch sequence from MIDI is clean, and the other sequences are distorted. The following subsections review the distortions induced by each process.
Fig. 3. Example of three different pitch sequences: (a) extracted from MIDI, (b) extracted from a user's query, and (c) extracted from a polyphonic recording.
- 4.1.1 Inaccuracy of Singing/Humming
Commonly, users cannot sing/hum exactly as written in a musical score. Most users sing/hum at inaccurate absolute/relative pitch and with a wrong tempo; sometimes the timing of individual notes is also wrong. Thus, even before the pitch sequence extraction of the QbSH system, the information written in the musical score is distorted by the user's singing/humming. The extraction process itself also introduces distortion due to algorithmic imperfection, but it is not as severe as the distortion due to the user. As shown in Fig. 3(b), the pitch values of a note differ from the original values in Fig. 3(a) and are unstable, and some notes last longer than others. In short, the user's inaccurate singing/humming leads to errors in both pitch value and timing.
- 4.1.2 Inaccuracy of Extracting Pitch from Polyphonic Music
Polyphonic music contains the main melody mixed with human voices and the sounds of various instruments, which makes it difficult to extract the pitch sequence of the main melody. For example, the pitch sequence of the accompaniment melody is extracted in frames where the main melody does not exist, and in frames where the accompaniment signal is strong it is very difficult to find the pitch value of the main melody even though it is present. Pitch doubling and pitch halving effects also lead to inaccurate pitch extraction; these are unavoidable since pitch extraction is based on the detection of periodic signals.
- 4.2 Outline of Matching Engine
By matching S_q and S_DB^(i), the name of the song that was sung/hummed can be found. The objective of the matching engine is formulated as

î = argmin_i d_M(S_q, S_DB^(i)),

where it is assumed that S_DB^(i) for i = 1, 2, ..., I is stored in the DB. The design of the distance d_M(·) computed in the matching engine is the main concern of this paper. The name ID(î) associated with the pitch sequence S_DB^(î) is output after deciding î.
In the matching engine for the QbSH system based on the polyphonic music DB, d_M(·) is designed as

d_M(S_q, S_DB^(i)) = min_{c ∈ C} d_DTW(ψ(S_q + c), ψ(S_DB^(i))),

where c and ψ(·) are a compensation coefficient and a pre-processing function, respectively, which are explained in the following subsections. In the matching engine for the QbSH system based on the MIDI DB, d_M(·) is designed as

d_M(S_q, S_DB^(i)) = min_{c ∈ C} d_DTW(S_q + c, S_DB^(i)).
Fig. 4 shows block diagrams of matching engines.
Fig. 4. Block diagrams of matching engines: (a) for the QbSH system based on the polyphonic music DB and (b) for the QbSH system based on the MIDI DB
- 4.3 Pre-processing
As noted above, pitch doubling and pitch halving effects lead to inaccurate pitch extraction, and they are unavoidable since pitch extraction is based on the detection of periodic signals. As reported in [7], the accuracy of raw chroma is much higher than that of raw pitch; pitch doubling and halving make S_DB^(i) inaccurate. In the MIDI note number domain, these effects shift pitch values by 12 up or down. To avoid them, the pitch value is divided by 12 and the remainder is used: ψ(·) is a function that outputs the remainder of the input sequence, so the pitch sequence is represented in chroma scale. For the QbSH system based on the MIDI DB it is better not to use pre-processing: since the pitch sequence extracted from MIDI is not affected by pitch doubling and halving, pre-processing cannot improve performance there. Use of ψ(·) leads to use of the cyclic function ξ(·). Assume two pitch values a and b with ψ(a) = 11 and ψ(b) = 0. The absolute difference |ψ(a) − ψ(b)| is 11, but since we use the chroma scale it is reasonable to regard the difference as 1. Thus, we introduce ξ(a) = min(a, 12 − a) and use ξ(|a − b|) instead of |a − b| together with the pre-processing in the matching engine for the QbSH system based on the polyphonic music DB.
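The pre-processing function ψ(·) and the cyclic distance ξ(·) described above reduce to two one-liners; a minimal sketch:

```python
def psi(pitch_seq):
    """Chroma-scale pre-processing: keep the remainder modulo 12, which
    removes octave (pitch doubling/halving) errors."""
    return [p % 12 for p in pitch_seq]

def xi(a):
    """Cyclic chroma distance: on a 12-step circle the distance between
    chroma values 11 and 0 is 1, not 11."""
    return min(a, 12 - a)
```

For the example in the text, with ψ(a) = 11 and ψ(b) = 0, `xi(abs(11 - 0))` gives 1 rather than 11. Note that `%` works on real-valued pitches as well, so the non-rounded MIDI values of Section 3 are handled directly.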
- 4.4 Compensation
As mentioned in Section 4.1.1, accurate absolute pitch cannot be extracted from a user's query, which is why compensation is needed. A compensation coefficient c is added to the pitch sequence in the definition of d_M, and the minimum distance over candidates is selected. Finding an appropriate c analytically is very difficult because the DTW path is hard to predict, so we use a brute-force search that selects the coefficient with the minimum distance among candidate values.
Candidate values are selected considering the trade-off between performance and speed. For the matching engine of the QbSH system based on the polyphonic DB, the range of c is limited to 0 ≤ c < 12 because of the pre-processing; in our experiment we used C1 = {0, 1, 2, ..., 11} considering the execution time. More compensation candidates lead to better results, but also to longer computation time. For the matching engine of the QbSH system based on the MIDI DB, we used C2 = avg(S_DB^(i)) − avg(S_q) + {−5, −4, −3, ..., 5}.
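The brute-force search over C1 can be sketched as follows. The `dtw_distance` helper here is a plain symmetric DTW with the cyclic chroma difference ξ(|a − b|), standing in for the asymmetric d_DTW of Section 4.5; only the search loop itself reflects the compensation scheme described above:

```python
def dtw_distance(p, q):
    """Plain symmetric DTW with the cyclic chroma difference xi(|a - b|);
    a simplified stand-in for the asymmetric variant of Section 4.5."""
    K, L = len(p), len(q)
    INF = float("inf")
    D = [[INF] * (L + 1) for _ in range(K + 1)]
    D[0][0] = 0.0
    for k in range(1, K + 1):
        for l in range(1, L + 1):
            diff = abs(p[k - 1] - q[l - 1]) % 12
            cost = min(diff, 12 - diff)                  # xi on chroma values
            D[k][l] = cost + min(D[k - 1][l], D[k][l - 1], D[k - 1][l - 1])
    return D[K][L]

def match_with_compensation(s_q, s_db):
    """Brute-force search over C1 = {0, ..., 11}: shift the query pitch,
    map both sequences to chroma, and keep the best compensation."""
    return min((dtw_distance([(p + c) % 12 for p in s_q],
                             [p % 12 for p in s_db]), c)
               for c in range(12))     # returns (distance, best c)
```

A query transposed three semitones up relative to the DB melody is recovered by the shift c = 9, since adding 9 modulo 12 undoes a +3 transposition on the chroma circle.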
We concentrate on variation in the pitch domain using compensation. This is the opposite of linear scaling [17], which applies a brute-force search over timing variation. In our system, timing error is handled by the DTW algorithm alone.
- 4.5 DTW Algorithm
The DTW algorithm is widely used in QbSH systems since it gives robust matching results against local timing variation and inaccurate tempo. For ease of explanation, we denote P = [P(1) P(2) ... P(K)] for an arbitrary vector P. We assume P = ψ(S_q + c), Q = ψ(S_DB^(i)), L(S_q) = K, and L(S_DB^(i)) = L. For the DTW algorithm, we define distance matrices D1 = [D1(k, l)] and D2 = [D2(k, l)] for k = 1, 2, ..., K and l = 1, 2, ..., L. The procedure to compute d_DTW(P, Q) for two sequences P and Q is summarized as follows.
DTW distance computation
Input sequences are P, Q, and S_q.
Do for k = 1, ..., K and l = 1, ..., L
  • D1(k, l) = d(P(k), Q(l)), with D1(k, l) = 0 when S_q(k) = 0
Do for l = 1, ..., L
  • D2(1, l) = D1(1, l)
Do for k = 2, ..., K
  • D2(k, 1) = D2(k − 1, 1) + D1(k, 1)
Do for k = 2, ..., K and l = 2, ..., L
  • 1. If k == 2 or l == 2
  • D2(k, l) = D1(k, l) + min(D2(k, l − 1), D2(k − 1, l), D2(k − 1, l − 1))
  • 2. otherwise
  • D2(k, l) = min(D1(k, l) + D2(k − 1, l − 1), α × D1(k, l) + D2(k − 2, l − 1), β × D1(k, l) + D2(k − 1, l − 2))
Output: d_DTW(P, Q) = min over l of D2(K, l)
As in [18] , DTW path is asymmetric and the difference is represented by weighting coefficients α and β. When α = β = 1, the DTW algorithm is symmetric and every path has the same weight. In [18] , α = 2 and β = 1 are used, and this is reasonable since the weighted path passing two elements. Actually decision of the value of α and β is dependent on the query dataset. In our system, the distance is not computed for the frames where the pitch is not extracted. As explained in section 3, every element in S(i)DB has a pitch value, thus there is no element of 0. But there can be element of 0 in S q .
- 4.6 Distance Metric
The performance of the QbSH system depends on d(·). In general, the absolute difference or the squared difference is used for QbSH systems based on the DTW algorithm. As discussed in Section 4.1.2, the pitch sequences in the DB are distorted, so a distance metric insensitive to the distortion can perform better. In our work, various distance metrics are investigated together with the absolute difference and the squared difference. They are formulated as follows:
d_ABS(a, b) = |a − b|

d_SQ(a, b) = (a − b)^2

d_HINGE^(λ)(a, b) = min(|a − b|, λ)

d_LOG(a, b) = log(1 + |a − b|)

d_SIG(a, b) = 1 / (1 + exp(−γ(|a − b| − λ)))
This paper proposes the last three distances for the QbSH system, since they are less sensitive to the distortion. The first and second distances are commonly used distance measures and are included in our experiment for comparison. The third metric limits the maximum value to λ, the fourth is based on a log scale, and the fifth is based on a sigmoid function.
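The metric family can be sketched as below. The hinge form follows directly from its description ("limits the maximum value to λ"); the log and sigmoid forms are plausible reconstructions, since the original formulas survive only as images, and the sigmoid's slope parameter `gamma` is an assumption:

```python
import math

def d_abs(a, b):
    return abs(a - b)

def d_sq(a, b):
    return (a - b) ** 2

def d_hinge(a, b, lam=2.0):
    """Saturated distance: never exceeds lam (d_HINGE^(2) in Table 2)."""
    return min(abs(a - b), lam)

def d_log(a, b):
    """Log-scale distance (assumed form): grows slowly for large errors."""
    return math.log(1.0 + abs(a - b))

def d_sig(a, b, gamma=1.0, lam=2.0):
    """Sigmoid distance (assumed parametrization): soft saturation at 1."""
    return 1.0 / (1.0 + math.exp(-gamma * (abs(a - b) - lam)))
```

The practical effect of the saturating metrics is that one badly distorted frame cannot dominate the accumulated DTW distance, which is exactly the insensitivity to DB distortion argued for above.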
5. Experiment
- 5.1 Polyphonic Case
- 5.1.1 Experimental Setup
Our DB contains 450 songs amounting to over 28 hours of audio data. The lengths of the songs range from 34 sec to 396 sec, and the DB covers various genres: rock, dance, carols, children's songs, and so on. For the test set, 1000 query clips of about 10-12 sec, recorded from 32 volunteers (14 women and 18 men), are used. The test set covers various parts of songs, including introductions and climaxes, and consists of 298 humming query clips and 702 singing query clips. Experiments were run on a desktop computer with an i7-2600 CPU and 64-bit Windows 7. Our implementation is written entirely in C++.
As performance measures, we use the mean reciprocal rank (MRR) over the top-20 returns, the top-1 hit rate, the top-10 hit rate, and the top-20 hit rate. The MRR is widely used for measuring the accuracy of QbSH systems in the MIREX contest [8]; it is calculated as the mean over queries of the reciprocal of the rank of the correctly identified song [17].
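The MRR over the top-20 returns can be computed as below; a query whose correct song falls outside the top 20 (or is missed entirely) contributes zero to the mean:

```python
def mrr_top20(ranks):
    """Mean reciprocal rank over top-20 returns.

    ranks: rank of the correct song for each query (1 = best);
           ranks beyond 20, or None for a miss, contribute 0.
    """
    rr = [1.0 / r if r is not None and r <= 20 else 0.0 for r in ranks]
    return sum(rr) / len(rr)
```

For example, three queries whose correct songs rank 1, 2, and 4 give an MRR of (1 + 1/2 + 1/4) / 3 ≈ 0.583.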
- 5.1.2 Results
We tested our system with various combinations of α and β; the best result was obtained with α = 3 and β = 1. The experimental results for the various distance metrics are summarized in Table 2. These results compare favorably with previously reported MRRs for 427 music recordings [19] and 140 music recordings [31], despite our use of polyphonic music. As shown in Table 2, d_HINGE^(2)(·) performs best among the distances proposed in this paper.
Table 2. Results of the matching engine for the QbSH system based on the polyphonic music DB
To examine the influence of the chroma representation, we also tested the matching engine without the chroma representation and ξ(·). As Table 3 shows, the chroma representation is very helpful for the QbSH system, especially when it is based on the polyphonic DB.
Table 3. Results with/without chroma representation and ξ(·)
- 5.2 MIDI Case
We submitted our QbSH system based on MIDI files to MIREX 2011 [32] .
- 5.2.1 Experimental Setup
As presented on the MIREX website, we use MRR and the top-10 hit rate as performance measures [31]. For the proposed system, results are reported on Roger Jang's corpus, which consists of 4431 8-sec queries and 48 ground-truth MIDI files. As in the MIREX task, 2000 Essen Collection MIDI noise files are added to the DB [33]. For the system submitted to MIREX 2011, we present the results from the MIREX website. There are two tasks based on MIDI datasets: Task 1a, based on Roger Jang's corpus, and Task 1b, based on the IOACAS corpus with 106 ground-truth songs and 355 humming queries.
- 5.2.2 Results
Table 4 shows the experimental results of the system we submitted to MIREX 2011 alongside other systems submitted in the last three years [31]. As shown in Table 4, our system gives moderate performance on Task 1a but the best performance on Task 1b.
Table 4. Results of the system submitted to MIREX 2011: comparison with other systems
6. Conclusions
This paper proposes a practical QbSH system based on polyphonic recordings. When the DB is constructed from polyphonic recordings, building a large DB becomes easier, but the reliability of the data in the DB worsens. Considering the problems originating from pitch sequence extraction, the matching engine proposed in this paper is designed using chroma-scale representation, compensation, asymmetric dynamic time warping, and saturated distances. The results of the experiment with a 28-hour music DB are very promising. Our matching engine can also be used for QbSH systems based on MIDI DBs, as verified in MIREX 2011. As future work, we will focus on the design of a more accurate pitch extractor using advanced learning algorithms.
This research was supported by a 2013 Research Grant from Sangmyung University.
Seok-Pil Lee received B.S. and M.S. degrees in Electrical Engineering from Yonsei University, Seoul, South Korea, in 1990 and 1992, respectively. In 1997, he earned a Ph.D. degree in Electrical Engineering, also at Yonsei University. From 1997 to 2002, he worked as a Senior Research Staff member at Daewoo Electronics, Seoul, Korea. From 2002 to 2012, he was head of the Digital Media Research Center of the Korea Electronics Technology Institute. He also worked as a Research Staff member at Georgia Tech, Atlanta, USA, from 2010 to 2011. He is currently a Professor at the Dept. of Digital Media Technology, Sangmyung University. His research interests include audio signal processing and multimedia searching.
Hoon Yoo received B.S., M.S., and Ph.D. degrees in electronic communications engineering from Hanyang University, Seoul, Korea, in 1997, 1999, and 2003, respectively. From 2003 to 2005, he was with Samsung Electronics Co., Ltd., Korea. From 2005 to 2008, he was an assistant professor at Dongseo University, Busan, Korea. Since 2008, he has been an associate professor at Sangmyung University, Seoul, Korea. His research interests are in the area of multimedia processing.
Dalwon Jang received B.S., M.S., and Ph.D. degrees from the Korea Advanced Institute of Science and Technology in 2002, 2003, and 2010, respectively, all in electrical engineering. He is now with the Smart Media Research Center, Korea Electronics Technology Institute, Korea. His research interests include content identification, music information retrieval, multimedia analysis, and machine learning.
[1] N. Orio, "Music Retrieval: A Tutorial and Review," Foundations and Trends in Information Retrieval, vol. 1, no. 1, pp. 1-90, 2006. DOI: 10.1561/1500000002
[2] J. S. Downie, "The Music Information Retrieval Evaluation eXchange (MIREX) Next Generation Project," project prospectus, 2011.
[3] R. Typke, F. Wiering, R. C. Veltkamp, "A survey of music information retrieval systems," in Proc. of ISMIR, pp. 153-160, 2005.
[4] G. Tzanetakis, G. Essl, P. Cook, "Automatic musical genre classification of audio signals," in Proc. of Int. Conf. Music Information Retrieval, Bloomington, IN, pp. 205-210, 2001.
[5] D. Jang, M. Jin, C. D. Yoo, "Music genre classification using novel features and a weighted voting method," in Proc. of ICME, 2008.
[6] R. Typke, P. Giannopoulos, R. C. Veltkamp, F. Wiering, R. V. Oostrum, "Using transportation distances for measuring melodic similarity," in Proc. of Int. Conf. Music Information Retrieval, pp. 107-114, 2003.
[7] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich, B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1247-1256, 2007. DOI: 10.1109/TASL.2006.889797
[8] S. Jo, C. D. Yoo, "Melody extraction from polyphonic audio based on particle filter," in Proc. of ISMIR, 2010.
[9] D. P. W. Ellis, G. E. Poliner, "Identifying cover songs with chroma features and dynamic programming beat tracking," in Proc. of Int. Conf. Acoustics, Speech and Signal Processing, Honolulu, HI, 2007.
[10] J.-S. R. Jang, H.-R. Lee, "A general framework of progressive filtering and its application to query by singing/humming," IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 350-358, 2008. DOI: 10.1109/TASL.2007.913035
[11] J. S. Seo, M. Jin, S. Lee, D. Jang, S. Lee, C. D. Yoo, "Audio fingerprinting based on normalized spectral subband moments," IEEE Signal Processing Letters, vol. 13, no. 4, pp. 209-212, 2006. DOI: 10.1109/LSP.2005.863678
[12] D. Jang, C. D. Yoo, S. Lee, S. Kim, T. Kalker, "Pairwise boosted audio fingerprint," IEEE Trans. on Information Forensics and Security, vol. 4, no. 4, pp. 995-1004, 2009. DOI: 10.1109/TIFS.2009.2034452
[13] Y. Liu, K. Cho, H. S. Yun, J. W. Shin, N. S. Kim, "DCT based multiple hashing technique for robust audio fingerprinting," in Proc. of ICASSP, 2009.
[14] P. Cano, E. Batlle, T. Kalker, J. Haitsma, "A review of audio fingerprinting," Journal of VLSI Signal Processing, vol. 41, no. 3, pp. 271-284, 2005. DOI: 10.1007/s11265-005-4151-3
[15] W. Son, H.-T. Cho, K. Yoon, S.-P. Lee, "Sub-fingerprint masking for a robust audio fingerprinting system in a real-noise environment for portable consumer devices," IEEE Trans. on Consumer Electronics, vol. 56, no. 1, pp. 156-160, 2010. DOI: 10.1109/TCE.2010.5439139
[16] A. Ghias, J. Logan, D. Chamberlin, "Query by humming: musical information retrieval in an audio database," in Proc. of ACM Multimedia, pp. 231-236, 1995.
[17] L. Wang, S. Huang, S. Hu, J. Liang, B. Xu, "An effective and efficient method for query by humming system based on multi-similarity measurement fusion," in Proc. of ICALIP, 2008.
[18] H. M. Yu, W. H. Tsai, H. M. Wang, "A query-by-singing system for retrieving karaoke music," IEEE Trans. on Multimedia, vol. 10, no. 8, pp. 1626-1637, 2008. DOI: 10.1109/TMM.2008.2007345
[19] M. Ryynänen, A. Klapuri, "Query by humming of MIDI and audio using locality sensitive hashing," in Proc. of ICASSP, 2008.
[20] X. Wu, M. Li, "A top down approach to melody match in pitch contour for query by humming," in Proc. of International Symposium on Chinese Spoken Language Processing, 2006.
[21] K. Kim, K. R. Park, S. J. Park, S. P. Lee, M. Y. Kim, "Robust query-by-singing/humming system against background noise environments," IEEE Trans. on Consumer Electronics, vol. 57, no. 2, pp. 720-725, 2011. DOI: 10.1109/TCE.2011.5955213
[22] J. Song, S. Y. Bae, K. Yoon, "Mid-level music melody representation of polyphonic audio for query by humming system," in Proc. of Int. Conf. Music Information Retrieval, 2002.
[23] C. C. Wang, J.-S. R. Jang, W. Wang, "An improved query by singing/humming system using melody and lyrics information," in Proc. of Int. Society for Music Information Retrieval Conf., pp. 45-50, 2010.
[24] A. P. Klapuri, "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 804-816, 2003. DOI: 10.1109/TSA.2003.815516
[25] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[26] R. E. Schapire, Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999. DOI: 10.1023/A:1007614523901
[27] D. Jang, C. D. Yoo, T. Kalker, "Distance metric learning for content identification," IEEE Trans. on Information Forensics and Security, vol. 5, no. 4, pp. 932-944, 2010. DOI: 10.1109/TIFS.2010.2064769
[28] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Trans. on Speech and Audio Processing, vol. 11, pp. 466-475, 2003. DOI: 10.1109/TSA.2003.811544
[29] Y. D. Cho, M. Y. Kim, S. R. Kim, "A spectrally mixed excitation (SMX) vocoder with robust parameter determination," in Proc. of ICASSP, pp. 601-604, 1998.
[30] Z. Duan, Y. Zhang, C. Zhang, Z. Shi, "Unsupervised single-channel music source separation by average harmonic structure modeling," IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 766-778, 2008. DOI: 10.1109/TASL.2008.919073
[31] MIREX website. http://www.music-ir.org/mirex/wiki/MIREX_HOME
[32] D. Jang, S.-P. Lee, "Query by singing/humming system based on the combination of DTW distances for MIREX 2011," 2011.
[33] Essen associative code and folk song database.