Computer aided tool for diagnosis of ENT pathologies using digital signal processing of speech and stroboscopic images
© Zorrilla et al; licensee Springer. 2012
Received: 1 September 2012
Accepted: 8 December 2012
Published: 13 December 2012
The development of computer software and other technologies greatly facilitates the evaluation of patients with pathological voices. It reduces exploration time, improves the reproducibility of results and makes it possible to standardize test protocols, which is needed for communication between different voice specialists. The proposed application encompasses the most important aspects to be taken into account for dysphonic patients. It has a multidimensional scope which involves subjective questionnaires and perceptual, aerodynamic, acoustic and stroboscopic evaluations. In this system, the authors have designed and created simple tools for recording and automatic acoustic analysis, and for the acquisition and editing of stroboscopic images. The purpose is to have all the necessary tools running in a single application, without having to export and import data from other computer programs. The objective is therefore to bring together the basic voice study and the exploration of the vocal folds, simplifying them through a program which helps to analyze each aspect of the vocal pathology step by step. The tool has been evaluated by otolaryngologists through periodic medical appointments with 25 patients over one year, and the results are promising both for the professionals and for the patients, who receive a detailed report with objective information on the features of their voice and vocal cords.
Considering the number of people suffering from voice pathologies, between 3% and 9% in the USA (Nerrière et al. 2009) and 5% in Spain (SEORL, http://www.seorl.net/, accessed 26 August 2012), it is clear that these problems affect a significant share of the population. For this reason the authors believe that the study and development of techniques for the detection of vocal pathologies is important.
Advances in diagnosis in otorhinolaryngology and speech therapy have focused on improving image acquisition devices aimed at observing the vocal folds and their functioning. Initially the ENT (Ear, Nose and Throat) specialist used a laryngeal mirror and based the evaluation and diagnosis on the information obtained with this device. Today other techniques are available for the acquisition and recording of images, which allow a posteriori evaluation. The main ones are Videolaryngostroboscopy (VLS) and Digital High-Speed Videoendoscopy (HSV).
It is important that the specialist knows the limitations of the techniques used, as VLS can only capture images at 25–30 frames/s (Lee et al. 2001). These limitations mainly affect the diagnosis of movement-related pathologies and the analysis of the vocal fold vibration cycle, which in VLS images is only an optical illusion for the human eye.
HSV (Kiritani et al. 1986; Kiritani et al. 1993), in contrast, captures images at 5000–8000 frames/s, thus providing a great amount of physiological and movement-related vocal fold information. Its general use is limited, however: very few hospitals own the necessary hardware and, unlike VLS, its price is very high. There is an intermediate solution called Videokymography (VKG) which provides high resolution images at high speed at a more accessible price than HSV (Kim et al. 2003).
Videolaryngostroboscopy is the most used technique for the diagnosis of some vocal pathologies (Braunschweig et al. 2008) and the most common in hospitals. For that reason, we have chosen it for this study.
Against this background, it is necessary to provide objective parameters and tools for the practice of ENT specialists, not only for the diagnosis of vocal pathology but also for evaluating the evolution of a rehabilitation process or a surgical intervention.
ENTs as well as speech therapists use more information in their daily practice than that obtained through image acquisition alone. Acoustic analysis is another of their information sources, and although it does not serve as a diagnostic method by itself (Hadjitodorov and Mitev 2002), the pitch, jitter, shimmer and harmonics-to-noise ratio (HNR) parameters are accepted for evaluating voice quality and measuring the efficiency of rehabilitation. Both techniques are used for diagnosis and treatment, as in the majority of cases people go to ENT and speech therapist appointments because of hoarseness.
The main aim of this work is to speed up the habitual practice of the specialists during appointments through the automation of study methods, in response to the lack of applications which combine audio and video processing functionalities with the patient's psychological aspect and result analysis.
The specific objective is to design and develop a tool combining acoustic analysis and digital image processing which provides a report of final results according to the ELS (European Laryngological Society) guidelines.
The article is organized as follows: Section 2 describes materials, methods and functionalities of the designed and evaluated software. Section 3 shows the experimental results of the otolaryngologists in their daily practice using the application, and finally the authors offer concluding remarks in Section 4.
Materials and methods
In this section the technical and human resources used during the development of the application are described, together with the methodology followed in its design to provide the professional with a useful tool for the daily practice.
PC with integrated or external video capture device.
Microphone for PC
In this study, 25 patients participated over one year and all the data obtained were recorded. Of the 25 patients, 18 were women (72%) and 7 were men (28%), with an average age of 42; the youngest was 19 and the oldest 58 (standard deviation 11.43). All the patients whose sessions (voices and images) were recorded in the Hospital signed the ethical consent to be included in this study.
One of the aims of the developed tool is to give the professional (in this case medical) transparency about where the data resides and how the system works with it. To fulfill that purpose the interface has been designed to be as simple as possible, with a pleasant, user-friendly visual appearance that is easy to use for non-technical users.
The design of the application (made in collaboration with otolaryngologists) encompasses all the tests to be done during an appointment for voice exploration, which are organized following a comprehensive sequential structure.
This design follows the structure proposed by the protocols (Teatinos) as well as other internationally accepted standards such as VHI (Rodríguez-Parra et al. 2009) and Dejonckere (ELS), all adopted by the SEORL (Spanish Society of Otorhinolaryngology and Cervico-facial Pathology) (Núñez 2007) in its Basic Protocol for the functional evaluation of vocal pathology. The Basic Protocol proposes five basic steps: subjective analysis, perceptual analysis, acoustic analysis, aerodynamic analysis and stroboscopy.
Respiratory diseases: chronic laryngitis and bronchitis, asthma, adenoids, sinusitis, tonsillitis.
Surgeries: tracheostomy, excisions of nodules or polyps, intubations.
Laryngeal trauma. Produced by frequent yelling.
Breathing and vocal misuse. Caused by a faulty breathing technique, or by the continued use of a tone that does not correspond to the patient's organic characteristics.
Trauma: frights, abandonment, accidents.
Characteristics of behavior: hyperactivity, shyness, inhibition.
Family and social environment: family screaming, sport competitions, recreation, environments with excessive noise.
Changes in hearing.
In order to store a full record of the patient, as in any other consultation, a patient anamnesis must be performed containing the following information: name and surname, age, sex, personal and family medical history, and toxic habits. The case history data and the examinations are stored and classified according to the date of the medical visit.
The subjective analysis section includes the Voice Handicap Index (VHI-30) questionnaire (Weigelt et al. 2004). It is a valid instrument for assessing the harm associated with dysphonia as perceived by the patient. The questionnaire is presented in short and full versions. For the numerical calculation of the subjective analysis, it was divided into two general parameters: Impact and Vocal Quality.
The VHI contains 30 items organized into three groups of 10, called physical subscale, emotional subscale and functional subscale. It has been subsequently proved that these subscales are not separate measurements of speech impairment and that they have no validity as such.
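The scoring of the full questionnaire can be sketched as follows. This is a minimal illustration, not the application's own code: it assumes the usual VHI-30 convention in which each item is answered from 0 to 4, and the function name `score_vhi` and the way answers are grouped by subscale are hypothetical choices made for the example.

```python
from typing import Dict, Sequence

# Assumption: each item is answered on a 0-4 scale (0 = never, 4 = always).
MAX_ITEM_SCORE = 4
ITEMS_PER_SUBSCALE = 10
SUBSCALES = ("functional", "physical", "emotional")


def score_vhi(answers: Dict[str, Sequence[int]]) -> Dict[str, int]:
    """Sum the item answers per subscale and overall.

    `answers` maps each subscale name to its 10 item answers; the
    assignment of items to subscales is supplied by the caller.
    """
    if set(answers) != set(SUBSCALES):
        raise ValueError(f"expected subscales {SUBSCALES}")
    scores: Dict[str, int] = {}
    for name in SUBSCALES:
        items = answers[name]
        if len(items) != ITEMS_PER_SUBSCALE:
            raise ValueError(f"{name}: expected {ITEMS_PER_SUBSCALE} items")
        if any(not 0 <= a <= MAX_ITEM_SCORE for a in items):
            raise ValueError(f"{name}: answers must lie in 0..{MAX_ITEM_SCORE}")
        scores[name] = sum(items)
    # Overall assessment is the sum of the three partial scores.
    scores["total"] = sum(scores.values())
    return scores
```

Under these assumptions each partial score ranges from 0 to 40 and the overall score from 0 to 120.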
Hsiung et al. (2002) studied the correlation between voice laboratory measures and VHI results in patients with dysphonia. They found a large discrepancy between the two assessments, from which they inferred that a patient's feelings about a voice problem cannot be evaluated through objective measures.
The short form questionnaire is composed of two questions: “How do you rate the quality of your voice?” and “How does it affect your social life or work?”.
Voice Analysis is divided into the recording of the patient's voice for processing, and the calculation of the voice quality parameters: pitch, jitter, shimmer and signal-to-noise ratio (García et al. 2009). A perceptual analysis is also performed.
“Platero y Yo“ text
Held “a” letter. The patient must emit an “a” vowel with a moderate and clear voice tone. The acoustic analysis with the Jitter, Shimmer, HNR and Pitch parameters is calculated on the basis of this recording.
Low-pitched “a”. The patient must emit the lowest pitch which the patient can produce. This recording is used to calculate the frequency range of the patient.
High-pitched “a”. The patient must emit the highest pitch which the patient can produce. This recording, together with the low-pitch data, is used to calculate the frequency range of the patient, measured in Hertz.
Projected voice. The patient must emit a continuous sentence with strength and clarity.
Sang voice. The patient must emit short words strongly.
At this point in the examination it is not necessary to play the recording in order to process it, although listening to it before processing is recommended to make sure that it was made correctly.
From the recordings, the software can automatically calculate the acoustic parameters of healthy patients, of patients with functional pathologies, and even of laryngectomized patients.
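As an illustration of how such parameters can be derived from the held “a” and from the low and high pitch recordings, the sketch below uses the common "local" definitions of jitter and shimmer (mean absolute difference between consecutive cycles, relative to the mean). The article does not state which variants the application uses, so these formulas, and the assumption that cycle periods and peak amplitudes have already been extracted, are illustrative only. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def _relative_perturbation(values: Sequence[float]) -> float:
    """Mean absolute cycle-to-cycle difference as a percentage of the mean."""
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def local_jitter_percent(periods_s: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the period durations (seconds)."""
    return _relative_perturbation(periods_s)


def local_shimmer_percent(peak_amplitudes: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the peak amplitudes."""
    return _relative_perturbation(peak_amplitudes)


def mean_pitch_hz(periods_s: Sequence[float]) -> float:
    """Mean fundamental frequency, the inverse of the mean period."""
    return len(periods_s) / sum(periods_s)


def frequency_range(low_hz: float, high_hz: float) -> Tuple[float, float]:
    """Range between the lowest and highest pitch, in Hz and in semitones."""
    if not 0 < low_hz <= high_hz:
        raise ValueError("expected 0 < low_hz <= high_hz")
    return high_hz - low_hz, 12.0 * math.log2(high_hz / low_hz)
```

For example, a patient whose lowest and highest sustained pitches are 110 Hz and 220 Hz has a range of 110 Hz, which is one octave (12 semitones).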
As mentioned above, the images analyzed by the application are VLS images. Stroboscopy represents the vibratory motion of the vocal folds by capturing individual still images with flashes of light fired at different points of successive vibration cycles, so that the vocal cords appear to vibrate at a frequency much lower than their real frequency of oscillation.
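The slow-motion effect can be made concrete with a small calculation. The sketch below assumes an idealized strobe whose flash rate is close to the fold frequency; the function names and the example frequencies are hypothetical and are not taken from the equipment used in the study.

```python
def apparent_frequency_hz(fold_f0_hz: float, flash_rate_hz: float) -> float:
    """Apparent vibration frequency when the flash rate is close to f0.

    Assumption: the flash rate differs only slightly from the fold
    frequency, so each flash samples a slightly later point of the cycle.
    """
    if fold_f0_hz <= 0 or flash_rate_hz <= 0:
        raise ValueError("frequencies must be positive")
    return abs(fold_f0_hz - flash_rate_hz)


def phase_step_per_flash(fold_f0_hz: float, flash_rate_hz: float) -> float:
    """Fraction of a vibratory cycle by which successive flashes advance."""
    if fold_f0_hz <= 0 or flash_rate_hz <= 0:
        raise ValueError("frequencies must be positive")
    return (fold_f0_hz / flash_rate_hz) % 1.0
```

With folds vibrating at 200 Hz and flashes at 198.5 Hz, the apparent motion completes 1.5 cycles per second, each of them assembled from many different real cycles.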
Stroboscopic image processing can provide objective information on the size of the glottal space at a certain point in time, or on the size of a polyp when that is the pathology the patient is suffering from. It is important to bear in mind that, given the features of the images, all measurements of this kind have to be made in pixels rather than centimeters, since no real-world reference is available.
Glottal closure. Improper closure of the vocal folds occurs when the free edges of both vocal folds do not meet in the midline at maximum adduction.
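A pixel-based measurement of this kind can be sketched as follows. The example assumes that a binary mask of the glottal space (or of a lesion) has already been obtained by some segmentation step, which is not shown; the mask representation and function names are hypothetical.

```python
from typing import Sequence


def area_px(mask: Sequence[Sequence[int]]) -> int:
    """Number of pixels marked as belonging to the segmented region."""
    return sum(1 for row in mask for value in row if value)


def max_width_px(mask: Sequence[Sequence[int]]) -> int:
    """Largest number of marked pixels found on a single image row."""
    return max((sum(1 for value in row if value) for row in mask), default=0)


# Usage: a tiny 4x5 mask in which 1 marks the glottal space.
example_mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
example_area = area_px(example_mask)
example_width = max_width_px(example_mask)
```

Because the results are pixel counts, they are only comparable between images taken at the same magnification and distance.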
Mucosal wave. The mucosal wave propagates from the subglottis and travels through the glottis to the upper surface of the vocal cord. It is a fine and delicate movement of the epithelium that spreads through the vocal cord. The normal mucosal wave is linear and continuously travels parallel to the edge of the vocal cord. Localized lesions can interfere with the mucosal wave propagation in the affected area.
Regularity. The normal oscillation of the vocal cord should appear as periodical during the production of a sustained vowel. It is a severity classification parameter in the function. The regularity may be altered in spasmodic dysphonia, polyps and vocal cord cancer.
Symmetry. It is the relation between the vibration phase of one vocal cord and that of the contralateral one. Normally the vocal cords open and close symmetrically, i.e. they vibrate in phase.
G-grade: measure of the overall severity of dysphonia.
R-rough, hoarse: measure of the hoarseness, the dysphonia which is caused by the absence of mucosal wave vibration and irregularities due to mass effect on the vocal cord.
A-Asthenic: measure of vocal fatigue, the inability to use the voice for long periods of time. The tone becomes lower pitched and the voice turns monotonous.
B-Breathy: measure of the air in the voice. It is produced by the leakage of air between the vocal cords by an incomplete glottal closure. This type of voice typically appears in the vocal cord paralysis.
S-Strain: measure of vocal effort, which corresponds to hyperphonation or excessive tension of the laryngeal muscles. It also produces difficulty in speaking, contraction of the cervical muscles, venous engorgement and mandibular projection. It is difficult to assess.
Each of the sections is rated on a scale of 0 to 3 points, where 0 would be normal, 1 mild, 2 moderate and 3 severe.
In general the parameters B and R are easier to evaluate (audible) and typically indicate an organic pathology. A (anamnesis) and S (visual) are more difficult to evaluate, hence it is advisable to introduce only the G, R and B parameters.
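One way to represent such a rating in software is sketched below. The class and field names are hypothetical; the only facts taken from the text are the five parameters, the 0 to 3 scale, and the advice that A and S may be left unrated.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SEVERITY_LABELS = {0: "normal", 1: "mild", 2: "moderate", 3: "severe"}
OPTIONAL_FIELDS = ("asthenia", "strain")


@dataclass(frozen=True)
class GRBASRating:
    grade: int
    roughness: int
    breathiness: int
    asthenia: Optional[int] = None  # harder to evaluate, may be omitted
    strain: Optional[int] = None    # harder to evaluate, may be omitted

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if value is None and name in OPTIONAL_FIELDS:
                continue
            if value not in SEVERITY_LABELS:
                raise ValueError(f"{name} must be an integer from 0 to 3")

    def describe(self) -> Dict[str, str]:
        """Map every rated parameter to its verbal severity label."""
        return {
            name: SEVERITY_LABELS[value]
            for name, value in vars(self).items()
            if value is not None
        }
```

For instance, `GRBASRating(grade=2, roughness=1, breathiness=3).describe()` reports a moderate grade, mild roughness and severe breathiness, and omits the two unrated parameters.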
Grade I: Irregular harmonic components mixed with noise components, especially in the region of the formants of vowels.
Grade II: Moderately hoarse voices with noise components in the second formants of vowels, predominant over the harmonic components, along with the appearance of some noise components in high frequency regions above 3000 Hz.
Grade III: The second formants of vowels are completely replaced by components of noise, while the high frequency noise is intensified.
Grade IV: There is loss of the periodic components of the first formants as a result of the present noise, and the high frequency noise is even more intensified.
Personal data analysis
Before assessing the results provided by the software, we review the reasons for attending the consultation and the distribution by sex of the patients studied.
In these tests, 18 of the 25 patients included in the study were women (72%) and 7 were men (28%), with an average age of 42 years and a range between 19 and 58 years (standard deviation 11.43).
Table: Vocal use results (categories: exclusive or special use; professional voice use; workers without vocal use).
Subjective assessment analysis
The subjective assessment performed by the patient is a medical concept that is increasingly important when making decisions and is of particular relevance for voice specialists. We have included two questionnaires in this software.
Table: Subjective assessment results (subjective voice quality; subjective social and occupational repercussion; classification of VHI in partial domains; VHI classification in the overall assessment, i.e. the sum of the partials).
Morpho-functional study by laryngostroboscopy
The examiner subjectively assessed alterations of glottal closure, mucosal wave, regularity and symmetry with a score between 0 and 10, zero being the normal value. In cases of incomplete glottal closure, up to 5 different types of incomplete closure were evaluated, whose images appear with the rest of the stroboscopic parameters. The laryngoscopic images were captured in Step 3 of the ANALYSIS VOX software: Image Capture.
52% of patients with nodules (13 cases)
12% of patients with vocal polyp (3 cases)
16% of patients with Reinke’s edema (4 cases)
12% of patients with Sulcus vocalis (3 cases)
8% of patients without functional injury (2 cases)
Table: Morpho-functional study results (including the category "No injury (functional)").
Nodules and Reinke's edema are the injuries that have the least impact on the symmetry parameter (Eysholdt et al. 2003).
Four of the 25 cases show complete closure (no glottal hiatus). The most frequent type of hiatus was the hourglass glottal hiatus (13 cases), followed by the posterior hiatus (4 cases) and the irregular (4 cases).
Right Size and Left Size. These parameters refer to the size in pixels of the pathology, if there is one.
Deviation. This parameter measures the angle formed by the right and left vocal folds.
Movement. Combines the measurements carried out to parameterize the movement of the vocal folds (Mendez et al. 2009; Mendez et al. 2012).
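The Deviation parameter can be illustrated geometrically. The sketch below assumes that the edge of each vocal fold has been approximated by a straight segment given by two image points, starting at the anterior commissure; how the application actually fits these segments is not described in the text, so the representation and the function name are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def fold_angle_deg(right_fold: Segment, left_fold: Segment) -> float:
    """Unsigned angle, in degrees, between the two fold segments."""
    (rx1, ry1), (rx2, ry2) = right_fold
    (lx1, ly1), (lx2, ly2) = left_fold
    v_right = (rx2 - rx1, ry2 - ry1)
    v_left = (lx2 - lx1, ly2 - ly1)
    if v_right == (0, 0) or v_left == (0, 0):
        raise ValueError("each fold needs two distinct points")
    # atan2 of the cross and dot products is stable for all angles.
    cross = v_right[0] * v_left[1] - v_right[1] * v_left[0]
    dot = v_right[0] * v_left[0] + v_right[1] * v_left[1]
    return abs(math.degrees(math.atan2(cross, dot)))
```

Since an angle does not depend on image scale, it can be compared between sessions even though lengths and areas are only available in pixels.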
Perceptual analysis of dysphonia
Table: Perceptual analysis results. Patients rated according to the perceptual analysis of dysphonia, from 0 to 10, from low to severe grade.
Acoustic and spectrographic analysis
The vocal acoustic analysis is performed by emitting the letter /a/ in a comfortable way in tone and intensity, with the microphone at a distance of 15 cm and an angle of 45° in relation to the mouth.
We chose a homogeneous fragment of the recording (approximately 50 vibratory cycles) corresponding to a second of the middle portion of each vocal register and we proceeded to its analysis with the program included in our software tool.
Once the signal was converted from analog to digital, the program calculated the following parameters: F0, Jitter, Shimmer and Glottal Noise (HNR).
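The selection of a middle fragment and the estimation of F0 can be sketched as follows. The article does not specify the pitch algorithm used by the program, so this example swaps in a plain autocorrelation peak search; the function names, the 60 to 500 Hz search band and the synthetic test signal are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def middle_segment(samples: Sequence[float], sample_rate: int,
                   seconds: float = 1.0) -> List[float]:
    """Return the central `seconds` of the recording (all of it if shorter)."""
    n = int(sample_rate * seconds)
    if n >= len(samples):
        return list(samples)
    start = (len(samples) - n) // 2
    return list(samples[start:start + n])


def estimate_f0_hz(samples: Sequence[float], sample_rate: int,
                   fmin: float = 60.0, fmax: float = 500.0) -> float:
    """Estimate F0 as the lag with the highest autocorrelation.

    Only lags corresponding to fmin..fmax are searched (assumed band).
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    if max_lag < min_lag:
        raise ValueError("segment too short for the requested band")
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Usage with a synthetic 200 Hz tone standing in for a sustained /a/.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
        for i in range(2 * SAMPLE_RATE)]
fragment = middle_segment(tone, SAMPLE_RATE, seconds=0.25)
f0 = estimate_f0_hz(fragment, SAMPLE_RATE)
```

Real voices are noisier than this tone, and production pitch trackers add normalization and octave-error checks that are left out here for brevity.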
Acoustic analysis results
Acoustic parameter values of the 25 patients, 95% confidence intervals:
F0: 165.45 to 204.63
Jitter: 0.20 to 0.29
Shimmer: 3.42 to 5.25
HNR: 17.80 to 23.95
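For reference, a 95% confidence interval for a mean can be computed as in the sketch below. The article does not say how its intervals were obtained; this example assumes a Student t interval, with the critical value 2.064 for 24 degrees of freedom (25 patients). The function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical value of Student's t for 24 degrees of freedom.
T_CRIT_95_DF24 = 2.064


def mean_ci95(values: Sequence[float],
              t_crit: float = T_CRIT_95_DF24) -> Tuple[float, float]:
    """Return (lower, upper) bounds of the t-based interval for the mean.

    The default critical value is only valid for 25 observations; pass
    another `t_crit` for other sample sizes.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    half_width = t_crit * standard_error
    return mean - half_width, mean + half_width
```

The interval narrows with the square root of the sample size, so a study with four times as many patients would roughly halve its width.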
Regarding the results for spectrogram classification of the 25 records, Yanagihara criteria for classification, based on the absence of harmonics in the spectrum, signal void in the spectrogram without noise substitution and other pathological paths were considered.
The results were classified as follows: type I: 3 cases, type II: 6 cases, type III: 7 cases and type IV: 4 cases. Finally, three records were classified as normal.
Analysis of aerodynamic efficiency
The measurement of aerodynamic efficiency showed that the mean maximum phonation time (MFT) for the vowel /a/ was 11.42 seconds, with a range between 6 and 28 (standard deviation: 5.60). Values below 10 seconds should be considered pathological. Pathological cases arise from two circumstances: either bronchial pathology with low volume or laryngeal pathology with loss of glottic efficiency, as in the case of our patients.
Phonorespiratory Index: calculated as MFT "s" / MFT "a". The normal limit is between 1.4 and 1.5. Values greater than 1.5 are related to closure defects due to glottic incompetence. In 20 of the 25 cases the phonorespiratory index was pathological (greater than 1.5), with a mean of 1.7 and a range between 1.4 and 1.9 (standard deviation 4.40).
The results are summarized in a multidimensional diagram which represents the most important parameters of the analysis. Parameters within the normal threshold appear inside the red circle, while pathological values appear outside it.
A multidimensional graph of the patient is thus obtained in which a larger area indicates a greater vocal lesion and a smaller area indicates a lesser vocal pathology.
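The two aerodynamic checks can be written down directly from the thresholds given in the text. The function and constant names below are hypothetical, and the example times are invented for illustration.

```python
MFT_PATHOLOGICAL_BELOW_S = 10.0   # threshold stated in the text
INDEX_PATHOLOGICAL_ABOVE = 1.5    # threshold stated in the text


def phonorespiratory_index(mft_s_seconds: float, mft_a_seconds: float) -> float:
    """Ratio of the maximum sustained time of "s" to that of "a"."""
    if mft_s_seconds <= 0 or mft_a_seconds <= 0:
        raise ValueError("times must be positive")
    return mft_s_seconds / mft_a_seconds


def is_mft_pathological(mft_a_seconds: float) -> bool:
    """True when the vowel /a/ is sustained for less than 10 seconds."""
    return mft_a_seconds < MFT_PATHOLOGICAL_BELOW_S


def is_index_pathological(index: float) -> bool:
    """True when the index exceeds 1.5 (suggesting a glottal closure defect)."""
    return index > INDEX_PATHOLOGICAL_ABOVE


# Usage: "s" held for 17 s and "a" for 10 s (invented values).
example_index = phonorespiratory_index(17.0, 10.0)
```

In this invented case the index is 1.7, which is above the limit, while the /a/ time of 10 seconds is not itself below the MFT threshold.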
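The area of such a diagram can be computed as sketched below. The example assumes that every parameter has first been divided by its normal threshold, so that the threshold circle has radius 1, and that the axes are equally spaced; the article does not describe the normalization actually used, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(values: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Scale each parameter so that its normal threshold maps to radius 1."""
    if len(values) != len(thresholds):
        raise ValueError("values and thresholds must have the same length")
    return [v / t for v, t in zip(values, thresholds)]


def radar_area(radii: Sequence[float]) -> float:
    """Area of the polygon drawn on equally spaced radar axes.

    Each pair of neighbouring axes spans a triangle of area
    0.5 * r_i * r_next * sin(2*pi/n).
    """
    n = len(radii)
    if n < 3:
        raise ValueError("a radar diagram needs at least three axes")
    if any(r < 0 for r in radii):
        raise ValueError("radii must be non-negative")
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))
```

With four axes, a patient exactly at every threshold yields an area of 2, and doubling every parameter quadruples it to 8. A parameter for which lower values are worse, such as HNR, would need to be inverted before normalization for the "larger area, greater lesion" reading to hold.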
Stroboscopy: Mucosal wave
Stroboscopic Parameters: Models of stroboscopic images are shown with the scores of the various parameters.
Vocal Pathology: consists of laryngoscopic video recordings with their corresponding lesion diagnoses.
Perceptual Evaluation: includes audio recordings of pathological voices with evaluations of the GRBAS parameters.
Spectrograms: standard images consisting of different types of pathological spectrograms (I-IV) according to the criteria of Yanagihara.
Besides the objective results corresponding to the acoustic parameters analyzed and to the features of the vocal cords, the software has been reviewed and analyzed by experts in otolaryngology, who in this way help the authors in the continuous improvement of the application.
Survey made to otolaryngologists
The use of the AnalisisVOX software is easy
The acoustic parameters analyzed are appropriate
The graphic representation of the acoustic parameters is appropriate
The analyzed parameters of the vocal fold images are appropriate
The graphic representation of the parameters obtained automatically through the images is appropriate
Quality of the generated reports
Each of the evaluated items has been rated from 1 to 5, with 5 meaning totally agree and 1 totally disagree.
As can be seen in the table, otolaryngologists rate especially highly the graphic representations provided by the software (4.55 and 4.18), which make it possible to detect abnormalities in the voice and/or in the vocal cords in a very visual way.
Regarding the ease of use of ANALISISVOX and its interface, the evaluation remains good: 3.53 and 3.14 respectively. Users adapt to the software in less than one week, and the feedback provided relates more to the inclusion of new functionalities than to problems with the existing ones.
Comparative features of ANALISISVOX
Features of the current applications compared with features of the ANALISISVOX application:
Current applications: Insufficient. They do not provide the complete information the specialist needs when carrying out a medical consultation.
ANALISISVOX: Complete. It integrates within a single application just the analyses the specialist requires for diagnosis (according to the ELS protocol).
Current applications: Inefficient. They do not provide the specific information the specialist needs.
ANALISISVOX: Efficient. It establishes an automated, sequential system which makes the consultation easy and quick.
Current applications: They lead the specialist to carry out the exploration in parts, gathering the information through different methods and generating files and documents from different programs, which take up usable computer memory and slow down the study.
ANALISISVOX: The files of any patient can be accessed without having to search the hard disk manually for any kind of information.
Current applications: Too technical when showing the calculated information.
ANALISISVOX: Shows the values obtained graphically and in an easy-to-digest way.
These applications have clearly aroused great interest among ENT specialists, speech pathologists and speech therapists, but the big question is how many of them respond to the needs of specialists. In the proposed tool, the vision of the experts who participated throughout the development process has been implemented, which supports its applicability.
The proposed application performs a post-processing of the signals supplied during a clinic session, and it unifies concepts and results which are usually analyzed independently. The preparation of all reports with all the provided objective parameters allows the study of the evolution of the patients during the rehabilitation process or after surgery. We could even evaluate the effectiveness of treatments and suggest modifications.
The authors wish to acknowledge the University of Deusto, which kindly lent infrastructure and material for this research work. This research is partially supported by the Basque Country Department of Education, Universities and Research. The authors also acknowledge the support provided by Ainara Sudupe.
- Braunschweig T, Flaschka J, Schelhorn-Neise P, Döllinger M: High-speed video analysis of the phonation onset, with an application to the diagnosis of functional dysphonias. Med Eng Phys 2008, 30(1): 59-66. 10.1016/j.medengphy.2006.12.007
- Eysholdt U, Rosanowski F, Hoppe U: Vocal fold vibration irregularities caused by different types of laryngeal asymmetry. European Arch Otorhinolaryngol 2003, 260(1): 412-417.
- García B, Ruiz I, Méndez A, Mendezona M: Objective characterization of oesophageal voice supporting medical diagnosis, rehabilitation and monitoring. Comput Biol Med 2009, 39: 97-105. 10.1016/j.compbiomed.2008.11.009
- Gelzinis A, Verikas A, Bacauskiene M: Automated speech analysis applied to laryngeal disease categorization. Comput Methods Programs Biomed 2008, 91(1): 36-47. 10.1016/j.cmpb.2008.01.008
- Hadjitodorov S, Mitev P: A computer system for acoustic analysis of pathological voices and laryngeal diseases screening. Med Eng Phys 2002, 24(6): 419-429. 10.1016/S1350-4533(02)00031-0
- Hsiung M-W, Pai L, Wang H-W: Correlation between voice handicap index and voice laboratory measurements in dysphonic patients. Eur Arch Otorhinolaryngol 2002, 259(2): 97-99. 10.1007/s004050100405
- Kim DY, Kim LS, Kim KH, Sung MW, Roh JL, Kwon TK, Lee SJ, Choi SH, Wang SG, Sung MY: Videostrobokymographic analysis of benign vocal fold lesions. Acta Otolaryngol 2003, 123(9): 1102-1109. 10.1080/00016480310001880
- Kiritani S, Honda K, Imagawa H, Hirose H: Simultaneous high-speed digital recording of vocal fold vibration and speech signal. Proc IEEE ICASSP'86 1986, 11: 1633-1636.
- Kiritani S, Hirose H, Imagawa H: High-speed digital image analysis of vocal cord vibration in diplophonia. J Speech Commun 1993, 13: 23-32. 10.1016/0167-6393(93)90056-Q
- Lee JS, Kim E, Sung MW, Kim KH, Park KS: A method for assessing the regional vibratory pattern of vocal folds by analysing the video recording of stroboscopy. Med Biol Eng Comput 2001, 39(3): 273-278. 10.1007/BF02345279
- Matassini L, Manfredi C: Software corrections of vocal disorders. Comput Methods Programs Biomed 2002, 68(2): 135-145. 10.1016/S0169-2607(01)00161-4
- Mendez A, Ismaili Alaoui EM, García B, Ibn-ElHaj E: Glottal Space Segmentation from Motion Estimation and Gabor Filtering. In Proceedings of EMBC09. Minneapolis, USA; 2009.
- Mendez A, Lopetegui E, Garcia B: Vocal Folds Paralysis Classification using FLDA and PCA algorithms supported by an Adapted Block Matching Algorithm. In Proceedings of ISCCSP12. Rome, Italy; 2012.
- Nerrière E, Vercambre ML, Gilbert F, Kovess-Masféty V: Voice disorders and mental health in teachers: a cross-sectional nationwide study. BMC Publ Health 2009, 9: 370. 10.1186/1471-2458-9-370
- Núñez BF: Validación de la versión traducida al español del índice de incapacidad vocal (voice handicap index). Acta Otorrinolaringol Esp 2007, 58(9): 385. 10.1016/S0001-6519(07)74953-1
- Rodríguez-Parra MJ, Adrián JA, Casado JC: Voice therapy used to test a basic protocol for multidimensional assessment of dysphonia. J Voice 2009, 23(3): 304-318. 10.1016/j.jvoice.2007.05.001
- SEORL: http://www.seorl.net/. Accessed 26 August 2012.
- Weigelt S, Krischke S, Klotz M, Hoppe U, Köllner V, Eysholdt U, Rosanowski F: Voice handicap in patients with organic and functional dysphonia. HNO 2004, 52(8): 751-756.
- Yanagihara N: Significance of harmonic changes and noise components in hoarseness. J Speech Hear Res 1967, 10: 531-541.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.