F0 scale factor, FF scale factor
The graph is a scatterplot of the geometric mean of the formant
frequencies (F1, F2, F3) as a function of the voice fundamental
frequency (F0) for a sample of more than 3000 vowels recorded from 10
adult males, 10 adult females, and 3 groups of children aged 3, 5, and
7 years (Assmann & Katz, 2000, 2005). The graph image is divided into a
25 x 25 grid, yielding
625 combinations of F0 and formant frequency (FF). Clicking on a point
in the grid plays a synthesized version of a sentence originally
spoken by an adult male but frequency-shifted to have the selected
F0 and FF values. Each point sounds like a slightly different voice.
Voices selected from the left side of the graph are lower in pitch than
those on the right. Voices selected from the bottom of the graph appear
to come from larger individuals (i.e., people with larger vocal tracts)
than those selected near the top. Frequency-shifted versions whose
F0 and FF combinations fall on or near the measured data points
sound more natural than combinations that are not found in natural
voices.
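As a rough illustration of the two quantities described above, the
Python sketch below computes the geometric mean of three formant
frequencies and maps a grid cell to an (F0, FF) pair. The 25 x 25 grid
size comes from the text. The axis ranges, the logarithmic spacing, and
the function names are assumptions made for this example only, and are
not taken from the demo itself.

```python
GRID_SIZE = 25  # from the text: a 25 x 25 grid, 625 cells in total


def geometric_mean_formants(f1, f2, f3):
    """Return the geometric mean (Hz) of three formant frequencies."""
    if min(f1, f2, f3) <= 0:
        raise ValueError("formant frequencies must be positive")
    return (f1 * f2 * f3) ** (1.0 / 3.0)


def grid_cell_to_values(col, row, f0_range, ff_range, size=GRID_SIZE):
    """Map a grid cell to an (F0, FF) pair in Hz.

    col 0 is the leftmost column (lowest F0) and row 0 is the bottom
    row (lowest FF). The ranges are (low, high) tuples in Hz.
    ASSUMPTION: values are spaced logarithmically along each axis; the
    real demo may use different ranges or spacing.
    """
    if not (0 <= col < size and 0 <= row < size):
        raise ValueError("cell index outside the grid")

    def log_interp(low, high, index):
        fraction = index / (size - 1)
        return low * (high / low) ** fraction

    return log_interp(*f0_range, col), log_interp(*ff_range, row)


if __name__ == "__main__":
    # Hypothetical axis ranges, chosen only to exercise the functions.
    f0_range = (80.0, 400.0)
    ff_range = (800.0, 2400.0)
    print(GRID_SIZE * GRID_SIZE, "grid cells")
    print(geometric_mean_formants(500.0, 1500.0, 2500.0))
    print(grid_cell_to_values(0, 0, f0_range, ff_range))
    print(grid_cell_to_values(24, 24, f0_range, ff_range))
```

With these assumptions, cells further to the right get a higher F0 and
cells further up get a higher FF, matching the layout described above.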
Analysis-resynthesis was performed using the STRAIGHT vocoder
(Kawahara, 1997; Kawahara et al., 1999). For further details see
Assmann et al. (2002) and Assmann and Nearey (2003).
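One way to describe a frequency shift of this kind is as a pair of
multiplicative scale factors, one for F0 and one for FF, each being the
ratio of the target value to the value in the original recording. The
Python sketch below shows only that arithmetic. It does not reproduce
the STRAIGHT vocoder or its interface, and the function names and the
source-speaker values in the usage lines are hypothetical.

```python
def scale_factors(target_f0, target_ff, source_f0, source_ff):
    """Return (F0 scale factor, FF scale factor) as target/source ratios.

    All arguments are frequencies in Hz. The source values stand for
    the original recording; the target values are the selected ones.
    """
    if min(target_f0, target_ff, source_f0, source_ff) <= 0:
        raise ValueError("frequencies must be positive")
    return target_f0 / source_f0, target_ff / source_ff


def shift_formants(formants, ff_scale):
    """Scale every formant frequency by the same FF scale factor."""
    return [f * ff_scale for f in formants]


if __name__ == "__main__":
    # Hypothetical source speaker: F0 of 110 Hz, formant mean of 1000 Hz.
    f0_scale, ff_scale = scale_factors(220.0, 1200.0, 110.0, 1000.0)
    print(f0_scale, ff_scale)
    print(shift_formants([500.0, 1500.0, 2500.0], ff_scale))
```

Because every formant is multiplied by the same factor, the geometric
mean of the formants is scaled by that factor as well. A factor above 1
raises the frequencies and a factor below 1 lowers them.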
References
Assmann, P.F. and Katz, W.F. (2000). Time-varying spectral change in the vowels of children and adults. Journal of the Acoustical Society of America 108(4), 1856-1866.
Assmann, P.F. and Katz, W.F. (2005). Synthesis fidelity and vowel identification. Journal of the Acoustical Society of America 117(2), 886-895.
Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. Proceedings of the ICASSP, pp. 1303-1306.
Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction. Speech Communication 27, 187-207.