Is anyone aware of techniques or research on this?
Personally I don’t know of any specific research, but afaik the reason our voices sound terrible to us when recorded is that they’re missing the bass that’s transmitted through the skull, i.e. they sound higher pitched on recording. So I’d try boosting the lower frequencies on the EQ; it’ll take a bit of fiddling to get the right balance.
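If you'd rather script the bass boost than fiddle with EQ sliders, here's a rough sketch of a low-shelf filter in plain Python, using the well-known RBJ "Audio EQ Cookbook" low-shelf formulas with shelf slope fixed at 1. The function names and the example values (300 Hz corner, +6 dB) are my own assumptions, not measured bone-conduction figures, so treat them as a starting point for fiddling.

```python
import math

def low_shelf_coeffs(sample_rate, cutoff_hz, gain_db):
    """Return normalized biquad coefficients (b, a) for a low-shelf boost/cut.

    Uses the RBJ cookbook low-shelf formulas with shelf slope S = 1.
    cutoff_hz and gain_db are illustrative knobs, not known-good values.
    """
    amp = 10 ** (gain_db / 40)               # square root of the linear shelf gain
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / 2 * math.sqrt(2)  # S = 1 simplifies the cookbook term
    k = 2 * math.sqrt(amp) * alpha

    b0 = amp * ((amp + 1) - (amp - 1) * cos_w0 + k)
    b1 = 2 * amp * ((amp - 1) - (amp + 1) * cos_w0)
    b2 = amp * ((amp + 1) - (amp - 1) * cos_w0 - k)
    a0 = (amp + 1) + (amp - 1) * cos_w0 + k
    a1 = -2 * ((amp - 1) + (amp + 1) * cos_w0)
    a2 = (amp + 1) + (amp - 1) * cos_w0 - k

    # Normalize so the filter's leading denominator coefficient is 1.
    return [b0 / a0, b1 / a0, b2 / a0], [a1 / a0, a2 / a0]

def biquad(samples, b, a):
    """Run mono float samples through a direct-form I biquad."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

You'd call something like `b, a = low_shelf_coeffs(44100, 300.0, 6.0)` and then `biquad(samples, b, a)` on your mono samples, and adjust the corner frequency and gain by ear until it sounds like "you".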
Probably add some reverb as well.
and a tinnitus hum
baby you got a stew goin
Here is an alternative Piped link(s): https://piped.video/ySQQdEJPGfE
I’ve found in my experience that over the years my internal voice updated to match how I sound when recorded. So when I hear myself speaking on a recording it’s much less jarring now since it feels much more like how I predicted it would sound.
I know this doesn’t help your current situation but it’s a fun fact I recognized since you mentioned it.
Yup same, after a bunch of podcasting I don’t hear any meaningful difference between my voice in my head and in a recording.
Which means that modifying a recording to match what you hear really can’t be boiled down to a series of steps, because it’s highly subjective and probably not consistent.
As a singer who has recorded myself not only singing, but also talking for thousands of collective hours of recording time, this is my experience as well. I can’t even hear the difference anymore, and I haven’t been able to since probably hour 10.
A very simple approximation of your voice as it’s heard by someone facing you when you speak would be to use a unidirectional mic and record yourself with the mic pointing in the opposite direction from how it normally would (in other words, with its pickup direction reversed).
A slightly better approximation would be to do the same thing with two unidirectional mics pointed at slight angles (still facing the reversed direction) to simulate the placement of your ears.
Obviously the quality of the mic would factor in as well—you’d want mics with a flat frequency curve. To get even pickier you’d also want to use headphones or speakers with a flat frequency response to listen to it. Once you had the recording you could even take impulse responses of certain rooms and process the audio to get an idea of how you sound to others in specific rooms!
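The impulse response idea boils down to convolution: you convolve the dry recording with the room's impulse response and optionally blend the result back with the dry signal. Here's a minimal plain-Python sketch; the function names and the `wet_level` parameter are made up for illustration, and both inputs are assumed to be mono float lists at the same sample rate.

```python
def convolve(dry, impulse_response):
    """Direct (time-domain) convolution of a dry signal with an impulse response."""
    if not dry or not impulse_response:
        return []
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        if x == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out

def mix_wet_dry(dry, wet, wet_level):
    """Blend the dry signal with the convolved (wet) one.

    wet_level is a hypothetical 0..1 knob: 0 keeps only the dry voice,
    1 keeps only the room-processed voice.
    """
    length = max(len(dry), len(wet))
    mixed = []
    for n in range(length):
        d = dry[n] if n < len(dry) else 0.0
        w = wet[n] if n < len(wet) else 0.0
        mixed.append((1.0 - wet_level) * d + wet_level * w)
    return mixed
```

The direct loop costs O(N·M), so it's only practical for short clips. For a real recording you'd want an FFT-based convolution instead (for example `scipy.signal.fftconvolve`, if you have SciPy installed).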
“…the way we hear ourselves?”
Yes they want to hear their own voice on a recording but make the audio sound close to the way it usually does from their perspective as the speaker…
Yeah that wouldn’t really do it. That ignores the bone and body conduction, which would be a significant contribution to the sound you hear. I’d expect a huge low frequency boost.
Here you go.
You record the vibration of all the little receiver hairs in your ear as you speak, and at the same time you record yourself with a microphone.
You train a neural network to transform the sound file from the recording into the set of vibration states it corresponds with in your ear hairs.
Then you just run the resulting model on the sound file, and magnetically vibrate the ear hairs in the way the model predicts.
Boom! 1st person speaking voice experience.
Please God no. I cannot fathom how people manage to hear my real voice.
I sound so much better inside my skull
I mean isn’t that what OP is trying to figure out? A way for people to hear how you sound inside your skull.
So to clarify, I am really curious about speech comprehension if a Text To Speech AI is trained to sound like ourselves (myself).
Over the last couple of weeks I have been working on learning about and using offline AI. I don’t know why, but I decided to ask here first.
If anyone is curious, this is what my current favorite LLM has to say (Llama 2 70B Q4 GGML):
You
what does it take to tune a recording of my voice and match it to how I hear my own voice?
Assistant
To tune your recorded voice so that it matches the way you perceive your own voice, there are several steps involved in audio processing. Here’s an overview of the process:
- Recording: Firstly, ensure that you have a high-quality recording of your voice using proper equipment such as a microphone and soundproofing to minimize background noise. This will provide a good starting point for further editing.
- Equalization (EQ): Use an equalizer to adjust the frequency balance in your recorded audio. You can boost or cut specific frequencies that are too prominent or lacking, respectively. For example, if you feel like your voice sounds nasally, you may want to reduce some of the mid-range frequencies around 1kHz and above.
- Compression: Apply compression to even out the dynamic range in your recording. This will help balance the loudness between different parts of your speech or singing, making it sound more consistent.
- Reverb/Delay: Add a small amount of reverb or delay to create a sense of space and depth around your voice. This can make it feel like you’re hearing yourself in a room rather than through headphones or speakers.
- Stereo Imaging: Use stereo imaging tools to adjust the width and positioning of your voice within the mix. This will help create an illusion that your voice is coming from a specific location, making it sound more natural.
- Monitoring: Finally, listen back to your processed audio on different devices such as headphones or speakers in various environments (e.g., quiet room vs noisy cafe) to ensure that the changes you’ve made translate well across different listening scenarios. Make further adjustments if necessary until you are satisfied with how it sounds.
Remember, everyone perceives their own voice differently due to bone conduction and other factors, so there may be some trial-and-error involved in finding a sound that feels right for you. It’s also important to keep in mind that the goal is not necessarily to make your recorded voice identical to how it sounds when speaking but rather to create an enjoyable listening experience for others who will hear it through speakers or headphones.
You
Is there any formulaic approximation or rule regarding bone conduction and how humans perceive their own voice?
Assistant
Bone conduction is the process by which sound vibrations are transmitted directly to the inner ear via the skull, bypassing the outer and middle ears. This occurs when we speak or sing, causing us to hear our voices differently than others do because of this direct transmission pathway.
There isn’t a specific formulaic approximation or rule that can accurately predict how each individual perceives their own voice through bone conduction since it varies from person to person due to factors such as head shape and size, vocal tract resonance, and the acoustics of the surrounding environment. However, in general, people tend to hear their voices as deeper or richer when speaking because low-frequency sounds are more efficiently transmitted through bone conduction than high frequencies.
It’s important to note that while understanding how we perceive our own voice can be helpful for tuning a recording of it, the goal should ultimately be to create an enjoyable listening experience for others who will hear your recorded voice through speakers or headphones rather than trying to perfectly replicate what you hear when speaking.
My guess is there are some subjective components to it that simple changes wouldn’t be able to account for, so each person will have to make the changes themselves to match what they’re hearing…
I always wondered if this is possible.
I’m not convinced it’s possible. My internal voice has a different accent, not just different acoustic properties.
Not your inner monologue, they mean your actual physical spoken voice, as heard from inside your head.
If you’ve never done this before, a really great way to hear what your voice sounds like to other people is to stand about a foot away from a 90° corner in your house and talk normally.
Your actual voice will bounce off of the walls, making it much easier for you to hear what you sound like to others.
Professional singers use this trick early on in their training so that they can practice their tone for their singing.
Ah I see, thanks
I’ve been thinking about this exact same thing just today. Spooky.