The human voice is a rich medium and a primary channel of communication between individuals. It is one of the most natural, energy-efficient ways of interacting with each other. The voice, a complex array of sounds produced by our vocal cords, carries a wealth of information and plays a fundamental role in social interaction, allowing us to convey our emotions, fears, feelings, and excitement by modulating its tone or pitch.
The development of artificial intelligence (AI) and computer science toward human-like capabilities has opened new opportunities for digital health, whose ultimate purpose is to ease the lives of patients and healthcare professionals by leveraging technology. Voice is no exception. Today, voice technology is considered one of the most promising sectors, with healthcare predicted to be a dominant vertical in voice applications. By 2024, the global voice market is expected to reach USD 5,843.8 million.
Virtual/vocal assistants on smartphones or in smart home devices such as connected speakers are now mainstream and have opened the way for considerable use of voice-controlled search. In 2019, 31% of smartphone users worldwide used voice technology at least once a week, and 20% of queries on Google’s mobile app and Android devices were voice searches. While current voice searches are mostly restricted to basic questions, the prospects for rapid expansion in the healthcare sector are numerous. The evolution of voice technology, audio signal analysis, and natural language processing/understanding methods has opened the way to numerous potential applications of voice, such as the identification of vocal biomarkers for diagnosis, classification, or remote patient monitoring, or to enhance clinical practice.
In this review, we offer a comprehensive overview of all the present and future applications of voice for health-related purposes, whether it be from a research, patient, or clinical perspective. We also discuss the key challenges to overcome in the near future for a large, efficient, and ethical use of voice in healthcare (Table 1).
References for this review were identified through searches of PubMed/Medline and Web of Science with search terms related to voice, vocal biomarker, voice signature, conversational agents, chatbot, and famous brands or vocal assistants (see the full list of keywords in online suppl. material 1; for all online suppl. material, see www.karger.com/doi/10.1159/000515346). The search was performed on December 26, 2020. Only articles, reviews, and editorials referring to studies in humans and published in English were finally considered. Articles were also identified through searches of the authors’ own files and in the grey literature. The final reference list was generated on the basis of originality and relevance to the broad scope of this review.
A biomarker is an objectively measured and evaluated characteristic that represents a biological or pathogenic process, or a pharmacological response to a therapeutic intervention, and that can be used as a surrogate marker of a clinical endpoint. In the context of voice, a vocal biomarker is a signature, a feature, or a combination of features from the audio signal of the voice that is associated with a clinical outcome and can be used to monitor patients, diagnose a condition, grade the severity or stages of a disease, or support drug development. It must have all the properties of a traditional biomarker: it must be analytically validated, qualified through an evidentiary assessment, and utilized.
Work on vocal biomarkers has so far mainly been performed in the field of neurodegenerative disorders, Parkinson’s disease in particular, where voice disorders are very frequent (as high as 89%) and where voice changes are expected to serve as an early diagnostic biomarker [9, 10] or as a marker of disease progression [11, 12], and could one day supplement the state-of-the-art manual exam used to assess symptoms, guide treatment initiation, or monitor treatment efficacy. These voice disorders mostly affect phonation and articulation, including pitch variations, decreased energy in the higher parts of the harmonic spectrum, and imprecise articulation of vowels and consonants, leading to decreased intelligibility. Even though changes in voice are often overlooked by both patients and physicians in early stages of the disease, objective measures reveal changes in voice features in up to 78% of patients with early-stage Parkinson’s disease.
Alzheimer’s Disease and Mild Cognitive Impairment
Subtle changes in voice and language can be observed years before the appearance of prodromal symptoms of Alzheimer’s disease and are also detected in early stages of mild cognitive impairment. Both mild cognitive impairment and Alzheimer’s disease affect verbal fluency, reflected by the patient’s hesitation to speak and slow speech rate, and cause other impairments such as word-finding difficulties, leading to circumlocution and frequent use of filler sounds (e.g., uh, um), semantic errors, indefinite terms, revisions, repetitions, neologisms, lexical and grammatical simplification, as well as a general loss of semantic abilities. Discourse in patients with Alzheimer’s disease is characterized by reduced coherence, with implausible and irrelevant details. Alterations have also been observed in prosodic features (pitch variation and modulation, speech rhythm) and may affect the patient’s emotional responsiveness [17, 20]. Voice features have the potential to become simple and noninvasive biomarkers for the early diagnosis of conditions associated with dementia.
Multiple Sclerosis and Rheumatoid Arthritis
Voice impairment and dysarthria are frequent comorbidities in people with multiple sclerosis. It has also been suggested that voice characteristics and phonatory behaviors should be monitored in the long term to indicate the best window of time to initiate a treatment such as deep brain stimulation in people with multiple sclerosis. Some voice features have already been identified as top candidates to monitor multiple sclerosis: articulation, respiration, and prosody. In people with rheumatoid arthritis, pathological changes in the larynx occur with disease progression; tracking voice quality features has therefore already been shown to be useful for patient monitoring.
Mental Health and Monitoring Emotions
Stress is an established risk factor for vocal symptoms. Smartphone-based self-assessed stress has been shown to correlate with voice features. A positive correlation between stress levels and the duration of verbal interaction has also been reported. Voice symptoms seem more frequent in people with high levels of cortisol, which is common in patients with depression; voice characteristics have therefore been used to detect depression symptoms or estimate depression severity. The second dimension of a Mel-frequency cepstral coefficient (MFCC) decomposition of the audio signal has been shown to discriminate depressive patients from controls. An automated telephone system has been successfully tested to assess biologically based vocal acoustic measures of depression severity and treatment response, or to compute a post-traumatic stress disorder mental health score. Beyond acoustic measures, the linguistic aspects of voice are also likely to be affected in mental diseases. Discourse tends to be incoherent in schizophrenia, manifested by a disjointed flow of ideas, nonsensical associations between words, or digressions from the topic. Circumstantial speech is prominent in patients with bipolar and histrionic personality disorders. Recent methodological developments have also improved emotion recognition accuracy, bringing the field sufficient maturity for medical research to monitor patients between visits or to gather real-life information in clinical or epidemiological studies.
Cardiometabolic and Cardiovascular Diseases
A team from the Mayo Clinic has identified several vocal features associated with a history of coronary artery disease. Regarding diabetes, only one study has examined vocal characteristics in people with and without type 2 diabetes, showing differences between the 2 groups for many features (jitter, shimmer, smoothed amplitude perturbation quotient, noise-to-harmonic ratio, relative average perturbation, amplitude perturbation quotient). It has been demonstrated that people with type 2 diabetes with poor glycemic control or with neuropathy had more straining, voice weakness, and a different voice grade, and that the most common phonatory symptoms in type 2 diabetes were vocal tiring or fatigue and hoarseness.
COVID-19 and Other Conditions with Respiratory Symptoms
More recently, considerable research activity has emerged around the use of respiratory sounds (e.g., coughs, breathing, and voice) as primary sources of information in the context of the COVID-19 pandemic. COVID-19 is a respiratory condition, affecting breathing and voice, and causing, among other symptoms, dry cough, sore throat, excessively breathy voice, and typical breathing patterns. All of these symptoms can make patients’ voices distinctive, creating recognizable voice signatures and enabling the training of algorithms to predict the presence of a SARS-CoV-2 infection or to grade the severity of the disease. Results on vocal biomarkers to aid the diagnosis of COVID-19 by Cambridge University (area under the ROC curve, AUC = 80%), or more recently by MIT scientists (AUC = 97%, based on cough recordings only), are promising. Other projects based on cough sounds are ongoing, with the objective of developing a robot-based COVID-19 infection risk evaluation system. Future work should focus on the impact of age category and cultural background on the performance of cough-based algorithms before launching such pre-screening tools on a large scale.
The Process to Identify a Vocal Biomarker
Below is a description of the typical approach to identify a vocal biomarker (Fig. 1).
Types of Voice Recordings
There is no standard protocol for voice recording to identify vocal biomarkers, but the sounds emitted from the human mouth and analyzed for disease diagnostics can be classified into 3 main categories: verbal (isolated words, short sentence repetition, reading passage, running speech), vowel/syllable (sustained vowel phonation, diadochokinetic task), and nonverbal vocalizations (coughing, breathing). In a paper from the Mayo Clinic, study participants were asked to perform three separate 30-s voice recordings: reading a prespecified text, describing a positive emotional experience, and describing a negative emotional experience. There is an ongoing debate on the relative merits of read speech (isolated words or text read aloud) versus spontaneous conversational speech recordings [15, 42]. To retain control over the recorded vocal task while letting patients choose their own words to preserve naturalness, semi-spontaneous voice tasks have been designed in which the patient is instructed to talk about a particular topic (e.g., picture description or story narration). Sustained vowel phonations are another common type of recording, in which participants are requested to sustain voicing of a vowel for as long and as steadily as they can. Sustained vowel phonations carry information for evaluating dysphonia and enable estimation of a patient’s voice without articulatory influences, unaffected by speaking rate, stress, or intonation, and less influenced by the speaker’s dialect. This is particularly helpful for multilingual analyses, avoiding confusion caused by different languages or accents. Diadochokinetic tasks are frequently used to determine articulatory impairment and include fast repetition of syllables combining plosives and vowels (e.g., /pa/-/ta/-/ka/). This task requires rapid movements of the lips, tongue, and soft palate, and reveals the patient’s ability to retain their speech rate and/or intelligibility.
Sustained vowels and diadochokinetic tasks provide a greater level of control than conversational speech, since they have reduced psychoacoustic complexity with less variability in vocal amplitude, frequency, and quality. However, voice performance is altered to a greater extent in spontaneous speech than in controlled tasks. For example, voice disruptions and voice quality fluctuations are much more evident in conversational speech. Spontaneous speech better elicits the dynamic attributes of voice and the varying voice patterns of daily voice use, but feature extraction is more difficult. Thus, the choice of voice recording type also depends on the objective: primarily diagnosing a condition, or developing a more comprehensive understanding of a voice disorder.
Data Collection Techniques
Different data collection techniques have been developed over the past decades. They can be grouped into 4 main categories:
Studio-based recording involves recording speech in a controlled environment, which reduces unwanted acoustics. However, the closeness of the sound source to the microphone often induces a proximity effect, i.e., an exaggeration of low-frequency sounds; the recommended distance is generally between 15 and 30 cm. Data collected with this technique are in general not representative of real speech application environments.
Telephone-based recording collects data from a variety of speakers and handsets, and comes with several frequent disadvantages, such as handset noise, a lack of control over the speaker’s environment, and bandwidth limitations.
Web-based recording is a very popular technique for large-scale data collection campaigns and relies on internet access, which is becoming readily available.
Smartphone-based recording provides broadband quality using smartphone devices, which are widely available at low cost. Smartphone/web-based recording has the same potential drawbacks as telephone-based recording, apart from the bandwidth limitation.
A pre-processing step is therefore necessary to overcome most of these limitations.
A first step before analyzing the data is audio pre-processing. This includes steps such as resampling, normalization, noise reduction, framing, and windowing, as described in Figure 2. The normalization step improves the performance of feature detection by rescaling values to a common range without distorting differences between them. In traditional, non-machine-learning-based approaches to noise detection and reduction, a clean voice estimate is obtained by passing the noisy voice through a linear filter, whereas many recent methods learn mapping functions between clean and noisy voice signals using neural networks. The framing step divides the voice signal into short segments of samples. These are multiplied by a window function to reduce spectral leakage, i.e., the artifacts that discontinuities at frame boundaries would introduce in the subsequent fast Fourier transform. Once these steps have been performed, feature extraction can start.
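As a minimal illustration of the normalization, framing, and windowing steps above (the frame and hop lengths are arbitrary example values, not a recommendation):

```python
import numpy as np

def preprocess(signal, frame_len=400, hop_len=160):
    """Minimal pre-processing sketch: peak normalization,
    framing, and Hamming windowing of a raw audio signal."""
    # Peak normalization: rescale to [-1, 1] without distorting relative differences
    signal = signal / (np.max(np.abs(signal)) + 1e-12)
    # Framing: split the signal into overlapping frames
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Windowing: taper frame edges to reduce spectral leakage in the FFT
    return frames * np.hamming(frame_len)

# Example: 1 s of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
frames = preprocess(0.5 * np.sin(2 * np.pi * 440 * t))
spectra = np.abs(np.fft.rfft(frames))  # magnitude spectrum per frame
```

With a 16,000-sample signal, 400-sample frames, and a 160-sample hop, this yields 98 windowed frames ready for feature extraction.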
Audio Feature Extraction
Prior to data analysis, the audio signal needs to be converted into “features,” i.e., the most dominant and discriminating characteristics of the signal, which will later be used to train machine learning algorithms. Various methods have been proposed in the literature to identify acoustic features in the temporal, frequency, cepstral, wavelet, and time-frequency domains. Prosodic (pitch, formants, energy, jitter, shimmer), spectral (spectral flux, slope, centroid, entropy, roll-off, and flatness), voice quality (zero-crossing rate, harmonic-to-noise ratio, noise-to-harmonic ratio), and phonation (fundamental frequency, pitch period entropy) parameters can be extracted and analyzed. Nonlinear dynamic features, such as correlation dimension, fractal dimension, recurrence period density entropy, or Lempel-Ziv complexity, can describe the nonlinear aerodynamic phenomena generated during voice production. Segmental features, such as MFCCs, may be the most frequently used in speech analysis, followed by perceptual linear prediction coefficients and linear frequency cepstral coefficients. Usually, the first 8–13 MFCCs are sufficient to represent the shape of the spectrum, although some applications need a higher order to capture tone information.
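To make the cepstral pipeline concrete, here is a compact, self-contained MFCC sketch in plain NumPy. It is a simplified version of what standard audio toolkits provide (filterbank normalization and pre-emphasis are deliberately omitted), intended only to show the power spectrum → mel filterbank → log → discrete cosine transform chain:

```python
import numpy as np

def hz_to_mel(f):  # standard mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """Compute MFCCs for a single windowed frame (simplified sketch)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    log_energy = np.log(fbank @ power + 1e-10)       # log filterbank energies
    # DCT (type II) decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy

# One 512-sample Hamming-windowed frame of a 300 Hz tone at 16 kHz
frame = np.hamming(512) * np.sin(2 * np.pi * 300 * np.arange(512) / 16000)
cepstra = mfcc(frame)
```

Keeping `n_ceps=13` matches the 8–13 coefficients typically retained to represent the spectral shape.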
In contrast to acoustic features, which capture motor speech impairments, cognitive impairments may require the analysis of linguistic features, which reflect the parts of speech, vocabulary diversity, lexical and grammatical complexity, syntactic structures, semantic skills, and sentiment. Before linguistic feature extraction and analysis, linguistic annotation is a necessary step to define sentence boundaries, parts of speech, named entities, numeric and time values, and dependency and constituency parses. Linguistic analyses often require extended speech production to extract features at all linguistic levels: phonetic and phonological (number of pauses, total pause time, hesitation ratio, speech rate), lexico-semantic (average rate of occurrence for each part of speech, number of repetitions, semantic errors, and closed-class word errors), morphosyntactic and syntactic (number of words per clause, number of dependent and simple clauses, number of clauses per utterance, mean length of utterances), and discourse-pragmatic (cohesion, coherence).
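A toy example of lexico-semantic feature extraction from a transcript; the filler list and feature definitions below are illustrative simplifications, not a validated annotation scheme:

```python
import re

FILLERS = {"uh", "um", "er", "hm"}  # illustrative filler-sound list

def linguistic_features(transcript):
    """Toy lexical features from a transcript: filler-sound rate,
    immediate word repetitions, and vocabulary diversity (type-token ratio)."""
    words = re.findall(r"[a-z']+", transcript.lower())
    fillers = sum(w in FILLERS for w in words)
    repeats = sum(a == b for a, b in zip(words, words[1:]))  # e.g., "the the"
    content = [w for w in words if w not in FILLERS]
    ttr = len(set(content)) / len(content) if content else 0.0
    return {"filler_rate": fillers / len(words),
            "repetitions": repeats,
            "type_token_ratio": round(ttr, 3)}

feats = linguistic_features("I went to the... um, the the store, uh, yesterday")
```

Real pipelines would compute such counts over annotated, extended speech samples rather than a single utterance.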
The correct choice of features heavily depends on the voice disorder, disease, and type of voice recording. For example, acoustic features extracted from sustained vowel phonations or diadochokinetic recordings are common in the detection of Parkinson’s disease, whereas linguistic features extracted from spontaneous or semi-spontaneous speech may be a more appropriate choice for the estimation of Alzheimer’s disease or mental health disorders.
Audio Feature Selection and Dimensionality Reduction
Feature selection methods such as minimum redundancy maximum relevance (mRMR) or Gram-Schmidt orthogonalization allow a subset of the original feature set to be selected without transforming the features, as illustrated in Figure 2. These methods remove highly correlated features as well as features with missing values or low variance, helping to select, for a given outcome of interest, the most relevant set of features for the prediction or classification task. In addition, to avoid the “curse of dimensionality,” dimensionality reduction methods such as principal component analysis, linear discriminant analysis, random forests, or stochastic neighbor embedding can be used to transform features and perform data visualization.
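As a sketch of the filter-style selection described above (the variance and correlation thresholds are arbitrary illustrative values, and this simple filter is not mRMR itself):

```python
import numpy as np

def filter_features(X, var_min=1e-3, corr_max=0.95):
    """Simple filter-style feature selection: drop near-constant features,
    then keep only one of each highly correlated group."""
    keep = np.where(X.var(axis=0) > var_min)[0]   # remove low-variance features
    X = X[:, keep]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        # keep feature j only if weakly correlated with all already-kept ones
        if all(corr[j, s] < corr_max for s in selected):
            selected.append(j)
    return keep[selected]                          # indices into the original X

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 1))
X = np.hstack([a,
               a + 1e-4 * rng.normal(size=(100, 1)),  # near-duplicate of column 0
               rng.normal(size=(100, 1)),
               np.zeros((100, 1))])                    # constant feature
idx = filter_features(X)  # drops the constant column and the near-duplicate
```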
Training of Algorithms
Following feature selection, machine or deep learning algorithms, such as support vector machines, hidden Markov models, and convolutional or recurrent neural networks, to name a few, can be trained to automatically predict or classify any clinical, medical, or epidemiological outcome of interest, from vocal features alone or in combination with other health-related data. Algorithms are usually trained on one dataset and then tested on a separate dataset. External validation is still rare in the literature, mainly due to a lack of available data. Although supervised learning algorithms are commonly used as predictive models, extracting the implicit structures and patterns from voice data using unsupervised learning techniques is also possible. Transfer learning is another promising approach, which benefits from pre-training the model on a large voice dataset in a different domain where data are easier to collect and fine-tuning it on the target voice dataset, which is typically much smaller.
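A minimal supervised-training sketch: the nearest-centroid model below is a deliberately simple stand-in for the support vector machines or neural networks named above, and the two-dimensional "vocal features" are synthetic:

```python
import numpy as np

class NearestCentroid:
    """Toy classifier: assign each sample to the class whose
    training-set mean (centroid) is closest in feature space."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        # Euclidean distance from every sample to every centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return self.classes_[d.argmin(axis=1)]

# Synthetic features (e.g., standing in for jitter and shimmer) for two groups
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

model = NearestCentroid().fit(X_train, y_train)
pred = model.predict(np.array([[0.1, -0.2], [2.9, 3.2]]))
```

The fit/predict split mirrors the train-on-one-dataset, test-on-another workflow described above.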
Testing of Algorithms
Collection of large-scale datasets of people with voice impairments is rarely feasible; therefore, in order to obtain reliable performance estimates, cross-validation and out-of-bootstrap validation techniques can be used. In cross-validation, the dataset is randomly partitioned into k approximately equally sized subsets (folds), one being used for testing and the remaining ones for training; the performance is averaged over all folds. Leave-one-out cross-validation is the extreme case in which the number of folds equals the number of data instances, meaning that the model is trained on all data except one instance. In bootstrap validation, data instances are sampled with replacement from the original dataset, producing surrogate datasets of the same size that may contain repeated instances and miss others from the original dataset. When the unsampled instances are used for testing, the method is called out-of-bootstrap validation.
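The two validation schemes can be sketched as index-splitting utilities over a dataset of n instances:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split n sample indices into k roughly equal folds for cross-validation;
    each fold serves once as the test set while the rest train the model."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def out_of_bootstrap(n, seed=0):
    """One bootstrap round: sample n indices with replacement for training;
    the unsampled ('out-of-bootstrap') indices form the test set."""
    rng = np.random.default_rng(seed)
    train = rng.integers(0, n, size=n)            # may contain repeats
    test = np.setdiff1d(np.arange(n), train)      # instances never sampled
    return train, test

folds = kfold_indices(100, 5)
train, test = out_of_bootstrap(100)
```

On average roughly a third of the instances are left out of each bootstrap sample and are therefore available for testing.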
Various performance metrics are used depending on the specific application and dataset, including accuracy, specificity, sensitivity (recall), precision, F measure, and AUC, to name a few. The choice of metric is important, since it guides the selection of the prediction model and affects the interpretation of the results. For example, using accuracy for a heavily imbalanced classification problem can be misleading, since high performance can be reached by a model that always predicts the majority class. Sensitivity-specificity and precision-recall metrics are better choices in such cases.
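A small example of computing these metrics from binary predictions, illustrating how accuracy can be misleading on an imbalanced dataset:

```python
def clf_metrics(y_true, y_pred):
    """Sensitivity (recall), specificity, precision, and accuracy
    computed from the binary confusion-matrix counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp),
            "accuracy": (tp + tn) / len(y_true)}

# Imbalanced example: 90 controls, 10 cases; the model finds only 1 case
y_true = [0] * 90 + [1] * 10
m = clf_metrics(y_true, [0] * 99 + [1])
```

Here accuracy is 0.91 while sensitivity is only 0.10: the model misses 9 of 10 cases, which accuracy alone would hide.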
From Research to Clinical Practice
Once a vocal biomarker has been identified, as with any biomarker, the path to routine clinical use is still long. Vocal biomarkers face additional challenges, as their validity may be restricted to certain languages or accents. Neither the US Food and Drug Administration nor the European Medicines Agency has approved any vocal biomarker yet. Therefore, we can only speculate on the theoretical framework of such a process, drawing on analogous cases among traditional biomarkers and on the challenges of digital health. The first step would be to develop standards for vocal biomarker collection and to create large-scale voice sample repositories for clinical use. This should be followed by integrating the algorithm into a user-friendly device (smartphone app, smart home device, connected medical device, etc.), co-designed with end-users if possible. It should then enter sequentially into a feasibility study, one or several clinical trials, and real-world studies. It is not the algorithm alone but its embedding in a connected medical device that will be approved by the agencies, and this major step has not been taken yet. Moreover, given the technical constraints, we suspect that the first validated vocal biomarkers will be restricted to a specific language or a specific sub-group of the population. A relevant template to help standardize and evaluate speech-based digital biomarkers has recently been proposed. Health check-ups could one day be performed directly on an everyday device such as a smart mirror to track digital biomarkers, including vocal biomarkers, activity, healthcare status, and body movement. For seniors, voice can also be a preferred medium to communicate inside a smart home with remote family members, in case of an emergency, or for telemedicine [53, 54].
In pilot studies, such voice-based interaction has been shown to be well accepted overall, although acceptance is highly dependent on task complexity and the cognitive abilities of the individuals.
Future of Voice for Health
In this review, we have summarized the main fields in which voice is used today and will likely be used in the coming years. Soon, the field will likely move from audio only to video; adding images to the voice will help to better characterize patients, including their emotions or other health characteristics derived from facial recognition, which, in combination with vocal biomarkers, will ease the remote monitoring of health [56-61]. The increase in data transfer capabilities through 5G networks and future upgrades, combined with a growing proportion of the population owning a smartphone equipped with a vocal assistant or at-home devices, will ease the collection and processing of large vocal samples in raw format or high definition. From a research point of view, we can expect further inclusion of voice-related secondary endpoints in trials and real-world studies. From a healthcare point of view, the inclusion of voice analysis in health call centers will enable augmented consultations, more accurate authentication of the caller, and real-time analysis of health-related features. Voice technologies will soon be further integrated into the development of virtual doctors and virtual/digital clinics (Fig. 3).
Ethical and Technological Challenges to Tackle
Voice technologies and vocal biomarkers have to take language and accent into account before being used on a large scale; otherwise, they may reinforce systemic biases against people from specific regions or backgrounds, or with a specific accent, and could widen the digital and socioeconomic divide already affecting some minorities (Table 2). In this respect, the voice technology field can learn from other fields, such as radiology, where the use of AI is much more advanced and systemic biases have already been documented. On top of that, some voice-specific issues will have to be addressed: for many applications of vocal biomarkers, it is likely that language-, accent-, age-, or culture-specific features will be identified first, before moving to more universal, language- and accent-independent features. The right balance will have to be found between hyper-personalization for a given user and universal assessment of the clinical benefit of a vocal biomarker. There is also a need to improve the natural language processing and understanding capabilities, relevance, and accuracy of the answers of vocal assistants, to increase the fluidity of human-chatbot vocal interaction, and to include emotions and empathy in the dialogue, if massive and long-term adoption is ever to be reached.
The validation of vocal biomarkers against gold standards is mandatory for a safe use of voice to monitor health-related outcomes. Too few studies are available yet to enable the field to move from novel results in small feasibility studies to large-scale clinical development.
Proper evaluation of the usability, adaptability, efficacy, and safety, as well as the sociological and ethical implications, of vocal biomarkers and voice technologies is now needed. The questions of interoperability with existing technologies, integration within the various health systems, and long-term business models also remain to be solved. Gathering more data is required to obtain reliable estimates; we therefore strongly recommend the establishment of large banks of labelled audio datasets with associated clinical outcomes. The next step will be to embed the algorithms in a digital device (be it a vocal assistant, a smartphone, or a smart mirror) and run prospective randomized controlled trials, real-world evaluations, and qualitative studies before envisaging a scale-up. The field needs to move towards a standardization of vocal biomarker collection in terms of data and formats, to ensure cross-comparisons, compatibility, and transferability. Data sharing is also needed, as it will enable the development of more accurate vocal biomarkers and voice technologies. As in any field impacted by AI, voice technologies and vocal biomarkers need to rely on algorithms trained on diverse datasets to limit biases against under-represented groups of the population.
Voice data are considered sensitive, as they can reveal a person’s identity, demographic or ethnic origin, and, in the case of vocal biomarkers, health status. Measures such as encrypting voice data, splitting the data into random components that are processed independently so that voice can be handled without privacy leakage, or learning data representations from which sensitive identifiable information has been removed, to name a few, should be used to address the ethical concerns related to voice data collection and processing.
We have discussed numerous applications in healthcare, both for patients and for healthcare professionals. It becomes clear that voice will be increasingly used in future health systems: vocal biomarkers will track key health parameters remotely and will be used for deep phenotyping patients or designing innovative trials, opening the way to precision medicine , while voice technologies will be integrated into clinical practice to ease the lives of both patients and healthcare professionals. For the field to reach maturity, we need to move from a technology-oriented approach to a more health-oriented one, by creating studies and high-value datasets for providing evidence of the benefits of such an approach.
Conflict of Interest Statement
The authors have no conflicts of interest to declare.
G.F. and A.F. are supported by the Luxembourg Institute of Health, Luxembourg National Research Fund (FNR; Predi-COVID, grant No. 14716273), and the André Losch Foundation. M.I. and V.D. are supported by Luxembourg Institute of Science and Technology (LIST) and University of Luxembourg (UL), respectively, as well as by FNR (CDCVA, grant No. 14856868).
All authors designed the study and drafted the first version, critically revised, and approved the final version of the manuscript.