This newsletter is originally in Spanish, and what you're about to read is an attempt to adapt some of my posts into English. I'll be uploading some English versions, which will be available in a separate section.
Since the content is repeated, I promise this will be the only time all my subscribers will receive it. If you know someone who might be interested, feel free to share it.
Each week brings new updates about the use of AI, both in general and specifically in dubbing. It’s quite challenging to keep track and stay up-to-date.
In this post, I’d like to explore a concept I believe is mistaken and show how it ties into current dubbing technologies, in order to reveal their limitations. It might sound ambitious, but let’s see how it turns out. I hope you’ll stay for the entire read.
About creativity and an approach focused only on outcomes:
Several months ago, something happened that some of you may have caught on social media. It crystallized an idea I had been mulling over about how content is created in many of the latest innovations.
It started when Apple unveiled its new iPad Pro with this announcement:
The ad generated significant controversy, leading the company to apologize and decide not to air it on television. Many viewers found the imagery disturbing and saw it as disrespectful to traditional creative tools. I remember a tweet (or is it a "post," now that Twitter is X?) that went something like this: "The destruction of the human experience. Brought to you by Silicon Valley."
The disappointment was so widespread that some users shared a "fixed" version of the ad, made simply by playing it in reverse. Samsung, with a cleverer approach, took up the challenge and offered its own interpretation of the situation.
Despite its failure, the concept behind Apple’s commercial suggests that creativity can be achieved using tools that combine multiple functions and deliver amazing results. Hold on to that idea: results.
I’m going to include two more related examples.
It’s the "first short film created entirely with Sora." To put it simply, Sora lets you create realistic video content from nothing more than a prompt. A small side note here (I’ll leave a link if you want to take a detour): announcements about these new technologies often come with a lot of hype. Just watch the behind-the-scenes video of this short film to see that "entirely" isn’t quite accurate.
Another example: Suno.ai is a platform that allows you to create entire songs, including music, lyrics, and vocals, from simple text descriptions or, once again, prompts. By using AI, it facilitates the rapid and personalized generation of musical content.
Here’s a video that showcases the potential of this technology. It’s certainly impressive, no doubt.
These tools are developed and marketed under the premise of democratizing access to artistic creation, allowing anyone to produce music and videos without prior technical knowledge. But this can lead to the dehumanization of the creative process. Art isn’t just about the final product; it’s about the emotional and collaborative journey. Human interactions, emotional interpretations, and shared experiences are key to the authenticity of art.
Creating a song or a video involves interactions between musicians, directors, actors, and other collaborators. These aren’t just technical interactions; they’re emotional and contextual too. A singer’s emotions while interpreting lyrics or an actor’s reactions on set bring a richness that AI cannot replicate. And it’s not just about the capability—this should remain something for humans, because that’s where our humanity lies.
The traditional creative process is also full of mistakes, revisions, and spontaneous moments that often lead to breakthroughs. AI, by sticking to predefined parameters, tends to eliminate this randomness and imperfection.
Guillermo del Toro's opinion on the matter is very interesting:
At the risk of sounding more philosophical: The result may be pleasing, but can music truly move us if it’s not performed by a real person? Sung by a voice that is a blend or sampling of many others? This also brings up a much broader and unresolved issue: the ethical dilemma of whose work forms the basis for this generation, because nothing is created from scratch.
Reducing art to parameters.
"In that empire, the art of cartography attained such perfection that the map of a single province occupied an entire city, and the map of the empire, an entire province. In time, these outsized maps no longer satisfied, and the colleges of cartographers raised a map of the empire that was the size of the empire and coincided with it point for point." Suárez Miranda, Viajes de Varones Prudentes, Libro Cuarto, Cap. XLV, Lérida, 1658. («Del rigor en la ciencia» / "On Exactitude in Science," Jorge Luis Borges)
In Borges' short story "Del rigor en la ciencia" (On Exactitude in Science), he presents the paradox of a map so detailed and accurate that it perfectly matches the territory it represents. While theoretically flawless, this map becomes useless in practice due to its overwhelming scale and complexity.
In AI, prompts (instructions given to a model to generate a specific outcome) can be extremely detailed. When a highly specific result is needed, the prompt must contain numerous details. However, if taken to the extreme, creating such a detailed prompt can become as laborious as doing the task manually, defeating the purpose of simplifying or automating the process.
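This dilemma can be made concrete with a small sketch. The function and parameter names below are purely illustrative, not taken from any real dubbing or generation API: the point is only that each creative constraint we pin down in text makes the prompt grow, until specifying the work rivals doing it by hand.

```python
# Hypothetical sketch of Borges' "map as large as the territory" applied to
# prompts: every extra creative constraint makes the instructions longer.

def build_prompt(line: str, constraints: dict) -> str:
    """Assemble a generation prompt from a dialogue line and its constraints."""
    parts = [f"Dub the line: '{line}'."]
    for name, value in constraints.items():
        parts.append(f"The {name} must be {value}.")
    return " ".join(parts)

line = "We have to leave. Now."
constraints = {
    "duration": "1.8 seconds",
    "intonation": "falling, then a sharp rise on 'Now'",
    "emphasis": "strongest on the final word",
    "lip sync": "closed vowels on the last syllable",
}

prompt = build_prompt(line, constraints)

# Even with only four constraints, the prompt is several times longer than
# the line itself, and a real scene would need this per sentence or per word.
print(len(line), len(prompt))
```

With four constraints the instructions already dwarf the line being dubbed; scale that to every sentence of a feature film and the "automation" starts to look like manual labor in a different notation.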
This raises the risk of losing the essence of art through excessive parameterization. Art is not just about meeting technical requirements; it's about expressing emotions, ideas, and experiences in ways that deeply resonate with people. Over-parameterization can strip the creative process of its richness, reducing it to a set of mechanical tasks.
Deep learning technologies often function as a "black box": we input data, the machine processes it, and then delivers a result.
Art and creativity, on the other hand, are evolutionary processes developed over time through interaction with others. While a prompt may deliver an immediate result, it cannot replace the gradual evolution of an idea through reflection, constructive criticism, and collaboration. This evolutionary process is essential to both personal and artistic growth.
How does this relate to dubbing?
First, I need to clarify which part of the dubbing process I’m referring to and provide a broad overview of the technologies available today. The translation and adaptation process is so creative and complex that, at least for now, the best results are still achieved by humans. This is why most companies rely on a human-in-the-loop approach.
Now, I want to focus on voice generation/recording to explore how much of it is driven solely by the result and how much of it could be parameterized.
We have the following options (plus some possible combinations):
AI-Generated Full Dubbing: The AI generates the entire dubbing track from scratch, including voice, intonation, and synchronization with the visual content. Here, there aren’t many adjustable parameters, and the result or quality will depend solely on the algorithms.
Text-to-Speech (TTS) Dubbing: This method generates synthetic voices from text scripts. Advanced TTS systems can produce increasingly natural speech, but synchronization with lip movements remains complex. Much depends on the adaptation, since there is no "actor" adjusting rhythm or lip movements on the fly. And the more adjustable parameters there are, the longer the prompt or the work required, which brings us back to the dilemma mentioned earlier. For products that don’t require high precision or emotional nuance, this could be a viable solution, but not if we’re aiming for quality.
Voice Guide with AI Adjustment: Actors record a guide track (with the proper interpretation and syncing as done in traditional dubbing), and the AI modifies this track by cloning the original voice or adjusting its tone as needed. In terms of quality, this approach allows for the best possible product. Note that in this case, AI is an added tool to the conventional process (or a combination of both), but it doesn’t replace what is typically done.
In a separate category are applications that modify the image to achieve better phonetic synchronization.
For those interested in diving deeper into these topics, here are a few articles I’ve written on the subject:
https://www.ata-divisions.org/AVD/ai-and-its-implementation-in-the-dubbing-process/
https://apuntesdedoblaje.substack.com/p/innovacion-en-el-doblaje-las-tecnologias
(in Spanish)
Here I’m making a bit of a prediction, and of course, I could be wrong: Even if we achieve technical perfection and everything becomes fully parameterizable (duration, speed, intonation, rhythm, emphasis, phonetic synchronization, etc.), this technology will still be incapable of conveying emotions. Or perhaps it will be able to convey emotions, but only with the right amount of parameters for each individual case (and those cases could be sentences, segments, or even individual words). If that’s the case, wouldn’t it be better to continue using conventional dubbing and instead focus our efforts on improving other aspects of the process?
What can we expect for the future?
Just as Sora can be used to create a short video, even within a larger fiction, or Suno to compose a podcast jingle or a humorous song, certain materials are more susceptible to being dubbed with these tools.
For projects where the focus is on the information, meaning the "what" rather than the "how" (or content over form), such as e-learning courses, institutional videos, or voice-overs where interpretation isn’t crucial, replacement by AI is imminent. In fact, it's already happening. Critics will argue that the quality is inferior, which is true. Defenders, however, will likely point out that this material would probably never have been dubbed in the first place due to the high cost (compared to subtitling). Some companies using this technology already promote their services by acknowledging that AI dubbing doesn’t aim for quality, but rather for scale and speed.
It’s important to clarify something about the so-called quality of conventional dubbing, and I’ve witnessed this myself. I’ve been in studios that dubbed for cable networks, recording hundreds of hours of programming per month with hardly anyone keeping track of what was going on. In some cases the translation was a mess, and technical dialogue for a show about car mechanics was recorded by people speaking on autopilot, without understanding what they were saying. Comparing both final products, I don’t see a substantial difference.
Materials where human interpretation and performance are essential will be much harder to replace. I believe, at least for a while, these projects will continue to be done traditionally or with the assistance of these technologies, simply because it’s the best way to achieve quality.
To wrap up, I believe the best recommendation I can give is to focus on studying, training, and practicing everything that defines our humanity. In the case of dubbing, the nuances required to interpret a role, understand subtext, and convey emotions are things that, for now, only belong to humans. These elements demand a deep understanding of emotions and experiences, something machines are not yet able to achieve with the same precision and sensitivity.
Therefore, my advice to those who work in dubbing or other arts requiring deep human skills is to continue investing in their training and development. Continuous practice in interpretation, empathy, and emotional communication is invaluable. Those who master these skills will be better equipped to work alongside technology, using the tools available to enhance their art without losing the human touch that makes their work special and meaningful.
A comparable case to keep in mind:
To better understand this issue, I always try to find analogies and/or similar examples. We can draw a parallel between our field and what has been happening with production design in film and its evolution (and in some cases, replacement) with CGI.
I’ll illustrate this with a brief example: for the scene where Nick Fury surprises Peter Parker in his bedroom in "Spider-Man: Far From Home" (2019), none of what you see there was real. Samuel L. Jackson and Tom Holland had to act as if they were in a fully furnished room while delivering their lines. This scene was filmed at Warner Bros. Studios Leavesden, located in Watford, England.
Consequences:
The Art Directors Guild (ADG) has decided to suspend its Production Design Initiative (PDI) program for the year 2024. This decision was made due to the high unemployment rate among its members, which some sources estimate at 75%. The ADG stated that they cannot encourage new aspiring professionals to enter the field while many of their current members remain out of work. (Source: Indiewire)
This newsletter will always remain free, but if you believe the content is worth supporting, feel free to collaborate with me by clicking this button. You can do so through PayPal (with a monthly subscription or a one-time donation). Your support helps back my work and fuels the growth of this project.
If you’re not able to contribute, please don’t leave. You’re invited to keep reading the rest of the posts; it’s on me!
A BIG THANK YOU to everyone who is already contributing.