Posts made by tamasg@mindly.social

Tamas G

Wow, so you're telling me a GitHub Action could build me Linux x86 and AArch64 tarballs, already packaged and ready for when I draft a new release? Life changed.

Tamas G

I'm building language scanning into the phoneme editor. If you are modifying a global phoneme or one tied to a language, the phoneme editor should know and warn you which languages that phoneme exists for. This should keep folks from making stray edits that change a phoneme in ways they didn't intend.
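A minimal sketch of that kind of check, not the editor's actual code. The data model here is hypothetical: `uses` maps each language to the phonemes it uses, and `overrides` maps each language to the phonemes it tunes separately. Both names, and the warning wording, are assumptions for illustration.

```python
def languages_affected(phoneme, uses, overrides):
    """Languages that would hear a change to the global definition of `phoneme`."""
    return sorted(
        lang
        for lang, phonemes in uses.items()
        # A language with its own override is shielded from global edits.
        if phoneme in phonemes and phoneme not in overrides.get(lang, set())
    )


def edit_warning(phoneme, uses, overrides):
    """Return a warning string if a global edit touches several languages, else None."""
    affected = languages_affected(phoneme, uses, overrides)
    if len(affected) > 1:
        return (
            f"Editing '{phoneme}' will change how it sounds in: "
            + ", ".join(affected)
        )
    return None


# Hypothetical data: Spanish overrides "a", Hungarian is the only user of "x".
uses = {"en": {"a", "t"}, "es": {"a", "t"}, "hu": {"a", "x"}}
overrides = {"es": {"a"}}
print(edit_warning("a", uses, overrides))
```

The point of keeping the scan separate from the warning is that the editor can also show the affected-language list without blocking the edit.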

Tamas G

OK, the phoneme editor is going to get a lot of bug fixes: over 800 lines of code changed in it. After PR number 13, which was someone's Spanish work, the bugs are clear. We won't destroy the comments, we won't destroy the ordering of the YAML, and most important, custom voices won't break when you save phonemes. We will also detect when you're trying to modify a phoneme that isn't tuned for a specific language, and warn you: "you're about to make the phoneme sound different in multiple languages."

Tamas G

Think of it like this: EQ is turning up the volume knob for an entire frequency range. A pole-zero pair is more like carving a shelf into the wall: it changes the shape of the room without making it louder. The highs and mids pass straight through because the pole and zero are both narrow and both down at 600 Hz. By the time you get to F2 territory at 1200+ Hz, the pole-zero pair's influence has rolled off to basically nothing. So the latest master branch changes are aimed at adding back exactly the chest voice that was missing, which is what made it sound hollow.
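A rough way to see that roll-off numerically. This is a sketch using a textbook analog pole-zero pair, not the engine's actual filter; the 80 Hz pole bandwidth and 200 Hz zero bandwidth are made-up values, and only the 600 Hz center comes from the description above.

```python
import math


def pole_zero_gain_db(freq_hz, center_hz=600.0, pole_bw_hz=80.0, zero_bw_hz=200.0):
    """Gain in dB of a second-order pole-zero pair sharing one center frequency."""
    s = 1j * 2.0 * math.pi * freq_hz
    w0 = 2.0 * math.pi * center_hz
    # Zero (numerator) and pole (denominator) differ only in their damping term,
    # so they cancel each other away from the center frequency.
    num = s * s + 2.0 * math.pi * zero_bw_hz * s + w0 * w0
    den = s * s + 2.0 * math.pi * pole_bw_hz * s + w0 * w0
    return 20.0 * math.log10(abs(num / den))


for f in (100.0, 600.0, 1200.0, 2400.0):
    print(f"{f:6.0f} Hz: {pole_zero_gain_db(f):+.2f} dB")
```

With these assumed bandwidths the boost is several dB at 600 Hz and a small fraction of a dB from 1200 Hz upward, which is the "shelf in the wall" behavior rather than a broad EQ lift.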

Tamas G

eSpeak gives "dictionary" and "ordinary" the correct American pronunciation with ˌ on "-ary" but drops it on "secondary." That secondary stress on the -ary syllable is what makes it "secondAIRy." How ironic: the word that should have secondary stress is the word that drops it in eSpeak.

Tamas G

Just fully in upset, angry, bull-in-a-china-shop engineer mode now. I didn't start out that way, but that's it. Engineer bull-in-a-china-shop mode activated.

Tamas G

I'm honestly not excited for CSUN because I know all the blind people will be there to tell me how TGSpeechBox sucks and I should crawl under the ground for making it not like Espeak enough. I get it.

Tamas G

@rommix0 ah, now that filename suddenly makes sense from earlier, nice handle change

Tamas G

@rommix0 so looks like DECTalk used a 3-layer approach to modifying this. I have Layer 2 well defined, but not the first layer and the third one. Layer 1 is specific, large Hz offsets for known phoneme pairs. Not computed from a formula though but hard-coded. Then Layer 3, the forward and backward rules. Very helpful there to know.

Tamas G

@rommix0 yeah, I think the CMUDict work is leading me towards this. It's shown me that part of my problem is just getting vowel stress and cluster targets from eSpeak's IPA rather than from words actually broken down by linguists who studied them deeply. So the data file is just all the words, converted by a Python script into IPA notation with eSpeak-style tie-bars and such. Things like, "'frisco ˈfɹɪskoʊ". Rewriting the rules like that first, and then not doing an overlay to "correct" for eSpeak's quirks, is where this is headed, and I think some more formant target passes can come along with it. We have the EndCF1-3 and EndPF1-3 wired up per frame now, but obviously wiring it up isn't the same thing as using it right.
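A minimal sketch of that kind of conversion, not the actual script. It assumes CMUDict's ARPAbet notation with stress digits on vowels (1 = primary, 2 = secondary, 0 = unstressed). The mapping table is deliberately tiny, the stress placement is naive (the mark goes before the whole consonant run leading into the stressed vowel), and tie-bar handling is left out.

```python
# Partial ARPAbet-to-IPA table: only enough symbols for the example below.
ARPABET_TO_IPA = {
    "F": "f", "R": "ɹ", "S": "s", "K": "k", "T": "t", "D": "d",
    "IH": "ɪ", "OW": "oʊ", "AA": "ɑ",
}
STRESS_MARKS = {"1": "ˈ", "2": "ˌ"}  # "0" means unstressed: no mark


def arpabet_to_ipa(pronunciation):
    """Convert a space-separated ARPAbet string to IPA with stress marks."""
    out = []
    onset_start = 0  # index in `out` where the current consonant run began
    for token in pronunciation.split():
        if token[-1].isdigit():  # vowels carry a trailing stress digit
            mark = STRESS_MARKS.get(token[-1], "")
            if mark:
                out.insert(onset_start, mark)
            out.append(ARPABET_TO_IPA[token[:-1]])
            onset_start = len(out)  # next consonant run starts after this vowel
        else:
            out.append(ARPABET_TO_IPA[token])
    return "".join(out)


print(arpabet_to_ipa("F R IH1 S K OW0"))
```

Placing the mark before the entire preceding consonant run is only right for word-initial clusters like this one; a real pass would need syllabification rules to split medial clusters.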

Tamas G

@rommix0 well, now I think the DSP isn't the issue, it's the phonemes and the passes. DECTalk knows things like "desktop" gets a 60% reduction in fricative burst compared with the word "start." That's what I'm missing. Coarticulation is great, but it just pulls vowels and consonants more naturally to their targets. These other engines had way more sophisticated rules than just, "same fricative / affricate durations" passed to the DSP. I've understood a lot more about how pitch controls prosody, but there's this ingredient missing and I don't think I'll ever find it lol

Tamas G

@FreakyFwoof ha, SpeechBox (mostly) sounds the same, I've reduced a lot of the clickiness the old Sawtooth engine had, especially on the "D" phoneme and "T" endings. Very clicky. I am glad at least that I could accomplish that in the SpeechBox version, but I still couldn't move the needle on it sounding "more like Eloquence from the 90s than Eloquence from the 2000s" as someone commented to me the other day.

Tamas G

@FreakyFwoof lol but Eloquence is the gold standard, the one that so many can listen to without their ears going into exhaustion. I think in the community DECTalk is probably a close second. I usually find people in either group, and then there are the more niche groups who like harsher things like eSpeak or the hybrid formant-concatenated stuff. SpeechBox for many just falls too far below DECTalk: it's above eSpeak, but not comfortably enough to become their daily driver, because RHVoice or the other options are "good enough" for their ears and formant stuff is too robotic. Fair point in that way.

Tamas G

Sadness fills my soul. SpeechBox is never gonna sound like Eloquence. Might as well abandon it. People still find it too sharp and sibilant. Better synths will come along, perhaps something neural anyway. It's just a time waste in the end. Sadness.

Tamas G

@raph My trapezoidal SVF has the same issue since it's mathematically equivalent. At 11025 Hz the upper formants (F5/F6) are getting squeezed noticeably near Nyquist. Pre-warping fixes center frequency but not bandwidth shape.

The voiced lowpass idea has merit. I have spectral tilt on the source already but a proper LP at 4-5 kHz before the resonator bank could help. Oversampling just the voiced path is tempting too, since noise sources don't alias the same way.
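The squeeze can be estimated with standard bilinear-transform math. This is a sketch, not the engine's code: it treats the resonator as a second-order band-pass with Q = center / bandwidth, pre-warps the center with tan(), finds the half-power edges in the warped domain, and maps them back to Hz. The 4500 Hz / 200 Hz formant is an assumed example value.

```python
import math


def effective_bandwidth(center_hz, bw_hz, sample_rate):
    """Half-power bandwidth in Hz that actually results after pre-warping."""
    q = center_hz / bw_hz
    # Pre-warped center: this is why the center frequency lands exactly.
    omega0 = math.tan(math.pi * center_hz / sample_rate)
    # Half-power edges of an analog band-pass with this Q.
    spread = math.sqrt(1.0 + 1.0 / (4.0 * q * q))
    half = 1.0 / (2.0 * q)
    lo = omega0 * (spread - half)
    hi = omega0 * (spread + half)
    # Map both edges back through the inverse warp; near Nyquist the
    # atan() curve flattens, so the edges land closer together.
    to_hz = sample_rate / math.pi
    return to_hz * (math.atan(hi) - math.atan(lo))


for fs in (11025.0, 22050.0, 44100.0):
    print(f"fs={fs:7.0f} Hz: requested 200 Hz, got {effective_bandwidth(4500.0, 200.0, fs):.1f} Hz")
```

A narrower effective bandwidth means a sharper, louder resonance than the phoneme data asked for, which fits the idea of compensating with either bandwidth widening or oversampling.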

Tamas G

Wow. This has smoothed out the voice a lot. No more thump on "glottal sharpness."
DSP v7: Replace biquad resonators with trapezoidal SVF, add cosine/log-domain interpolation

== Trapezoidal State-Variable Filter (SVF) Resonators ==

The cascade and parallel formant resonators now use Andrew Simper's
trapezoidal-integrated SVF method (Cytomic) instead of the classic
Klatt biquad (Direct Form II).

Benefits:
- Zipper-free formant sweeps: frequency and Q are independent parameters,
  so continuously varying formant frequencies during coarticulation and
  diphthongs no longer risk coefficient discontinuities or clicks.
- Better low-sample-rate stability: the old biquad lost coefficient
  precision near Nyquist, contributing to "old cell phone" artifacts at
  11025 Hz. The SVF spreads precision more evenly across the spectrum.
- Nyquist proximity damping: a quadratic bandwidth widening curve above
  60% of Nyquist prevents the SVF from over-resonating where the old
  biquad naturally degraded. This keeps fricatives clean at 11025 Hz
  while having zero effect at 22050+ Hz.
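A minimal sketch of the two ideas above, not the actual DSP v7 code. The per-sample update follows Andrew Simper's published trapezoidal SVF formulation. Using the low-pass output as the resonator output, the widening `strength` value, and the exact shape of the widening curve are all assumptions; only the 60% threshold and the quadratic form come from the description.

```python
import math


def widened_bandwidth(freq_hz, bw_hz, sample_rate, start=0.6, strength=2.0):
    """Quadratic bandwidth widening above `start` * Nyquist (strength is hypothetical)."""
    ratio = freq_hz / (sample_rate / 2.0)
    if ratio <= start:
        return bw_hz  # untouched below the threshold
    t = min((ratio - start) / (1.0 - start), 1.0)
    return bw_hz * (1.0 + strength * t * t)


class TrapezoidalSVF:
    """Trapezoidal-integrated state-variable filter used as a formant resonator."""

    def __init__(self, freq_hz, bw_hz, sample_rate):
        self.ic1eq = 0.0  # integrator states
        self.ic2eq = 0.0
        self.set_formant(freq_hz, bw_hz, sample_rate)

    def set_formant(self, freq_hz, bw_hz, sample_rate):
        bw_hz = widened_bandwidth(freq_hz, bw_hz, sample_rate)
        g = math.tan(math.pi * freq_hz / sample_rate)  # frequency term only
        k = bw_hz / freq_hz                            # damping term, k = 1/Q
        self.a1 = 1.0 / (1.0 + g * (g + k))
        self.a2 = g * self.a1
        self.a3 = g * self.a2

    def tick(self, x):
        v3 = x - self.ic2eq
        v1 = self.a1 * self.ic1eq + self.a2 * v3                  # band-pass
        v2 = self.ic2eq + self.a2 * self.ic1eq + self.a3 * v3     # low-pass
        self.ic1eq = 2.0 * v1 - self.ic1eq
        self.ic2eq = 2.0 * v2 - self.ic2eq
        return v2  # low-pass output: unity gain at DC, peak near freq_hz
```

Because the state lives in `ic1eq` and `ic2eq` rather than in delayed output samples, `set_formant` can be called every frame during a sweep without resetting or rescaling the filter memory.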

The anti-resonator (nasal zero, rN0) uses a dedicated FIR (all-zero)
filter rather than the SVF notch output. The SVF notch places zeros on
the unit circle (infinitely deep null), which is too aggressive for
speech nasalization. The FIR places zeros inside the unit circle at a
depth controlled by bandwidth, matching the behavior expected by existing
phoneme data.
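A sketch of such an all-zero section, assuming it follows the classic Klatt anti-resonator construction (resonator coefficients inverted); whether the engine computes its coefficients exactly this way is an assumption. The zero radius is exp(-pi * bw / fs), so it sits inside the unit circle and a narrower bandwidth gives a deeper but still finite null.

```python
import cmath
import math


def antiresonator_coeffs(freq_hz, bw_hz, sample_rate):
    """FIR taps (a0, a1, a2) for y[n] = a0*x[n] + a1*x[n-1] + a2*x[n-2]."""
    r = math.exp(-math.pi * bw_hz / sample_rate)  # zero radius, always < 1
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c
    # Inverting the resonator's coefficients keeps the gain at DC equal to 1.
    return 1.0 / a, -b / a, -c / a


def fir_gain(coeffs, freq_hz, sample_rate):
    """Magnitude response of the three-tap FIR at freq_hz."""
    a0, a1, a2 = coeffs
    z1 = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    return abs(a0 + a1 * z1 + a2 * z1 * z1)


fs = 22050.0
for bw in (50.0, 200.0):
    taps = antiresonator_coeffs(450.0, bw, fs)
    print(f"bw={bw:5.0f} Hz: gain at 450 Hz = {fir_gain(taps, 450.0, fs):.3f}")
```

The 450 Hz nasal zero and the two bandwidths are illustrative values only.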

Tamas G

@kaveinthran oh yeah, all of it is one package but if you want to feel better, can always check in %appdata%\nvda\addons\nvSpeechPlayer\synthDrivers\nvSpeechPlayer\x86 - the date modified should be today on them.

Tamas G

Ah. Our Voice Tilt already controls radiation mix, this is true, if you put the slider below 50. This is why we made Voice Tilt a bit of a hybrid: positive values = Klatt-style voice tilt, negative values = radiation mix jumps to 1.0. Good. No new slider. Y'all have been spared.

Tamas G

oh no. another voicing tone slider and this one makes a huge difference too. Called it:
Voice edge (Radiation mix)
We will let you change the 30% to 70% blend. If you want pure synthetic buzz, you can have it.

Tamas G

@RainbowFyre you got it! The cascade is a conga line where everyone holds the shoulders of the person in front, so if everyone squeezes tighter (narrower bandwidth), the whole line gets compressed together. The parallel path is more like six people standing side by side each doing their own dance, then you just add up all their moves at the end! Squeezing one dancer doesn't affect the others.
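The conga line versus side-by-side picture can be sketched with frequency responses. This is an illustration using the textbook Klatt two-pole resonator, not the engine's code, and the three formant values are made up. In the cascade the responses multiply, so changing one resonator rescales the whole chain; in the parallel bank each branch is computed on its own and only summed at the end.

```python
import cmath
import math


def resonator_response(freq_hz, center_hz, bw_hz, sample_rate=22050.0):
    """Complex frequency response of a Klatt-style two-pole resonator."""
    r = math.exp(-math.pi * bw_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a = 1.0 - b - c
    z1 = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    return a / (1.0 - b * z1 - c * z1 * z1)


def cascade_response(freq_hz, formants):
    """Conga line: every resonator filters the previous one's output."""
    h = 1.0 + 0.0j
    for center_hz, bw_hz in formants:
        h *= resonator_response(freq_hz, center_hz, bw_hz)
    return h


def parallel_branches(freq_hz, formants):
    """Side by side: each resonator sees the same input; sum these to mix."""
    return [resonator_response(freq_hz, c, bw) for c, bw in formants]


wide = [(500.0, 80.0), (1500.0, 90.0), (2500.0, 120.0)]
squeezed = [(500.0, 40.0), (1500.0, 90.0), (2500.0, 120.0)]  # only F1 narrowed

# Cascade: narrowing F1 changes the chain's response even at F2's frequency.
print(abs(cascade_response(1500.0, wide)), abs(cascade_response(1500.0, squeezed)))
# Parallel: the F2 and F3 branches are exactly what they were before.
print(parallel_branches(1500.0, wide)[1:] == parallel_branches(1500.0, squeezed)[1:])
```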