Vereint
2023, Karlsruhe, Germany

An homage to Peter Weibel's song Entzweit. It was generated purely from the lyrics through audio-responsive latent space interpolations. The video premiered at ZKM's internal memorial ceremony for Peter Weibel.
Vereint (Engl. united) is an unofficial music video for the song Entzweit (Engl. divided) by Hotel Morphila Orchester, a band founded in the 1970s by Peter Weibel, the former director of the ZKM. In the song, Peter Weibel sings about functional parts of the human body that come in pairs: eyes, ears, legs, hands, and so on. These pairs could just as well be divided to form two separate entities. In contrast, the video was produced by a human and an artificial intelligence joining together to form a single author. The content of the images connects to the lyrics, while the movement in the video depends on the audio. The production was fully automated, and no post-processing was needed.
How do we create a movie with Stable Diffusion?
First, let's examine Stable Diffusion's two input vectors: a text embedding and a latent vector (noise). While the text embedding defines the semantic content, the noise vector can be viewed as the structural seed of the generated image. To create moving images, we can interpolate between two vectors before feeding them into the model, resulting in smooth transitions between the starting images.

As the basis of the image generation, I transcribed the lyrics and chopped them line by line into sections of about 3 to 4 seconds, aligned with the audio cues. Each section was then given its own noise vector, allowing me to search for a suitable starting image for that phrase. Furthermore, I chopped the audio into sections of the same length and calculated the accumulated audio volume for each frame in the segment. This volume information is used to control the speed of the transition between the two noise vectors, resulting in a more dynamic and responsive animation.

I did play around with interpolating the text prompts as well, but figured it would be beneficial to keep one of the two inputs constant and use it to create sharp transitions on certain audio cues. I decided to keep the text prompt constant within each section so that the image content would stay relevant to the current line rather than the previous or next one.
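The interpolation scheme described above can be sketched in a few lines of numpy. This is a minimal illustration, not the production code: the helper names, the spherical interpolation (a common choice for Gaussian noise vectors), and my reading of "accumulated audio volume" as a normalized cumulative RMS are all assumptions. The resulting latents would be fed, frame by frame, into a Stable Diffusion pipeline together with the section's fixed text prompt.

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-7):
    # Spherical linear interpolation between two noise vectors.
    # Unlike plain lerp, this keeps intermediate latents on a path
    # whose norm stays close to that of Gaussian noise.
    v0n = v0 / np.linalg.norm(v0)
    v1n = v1 / np.linalg.norm(v1)
    dot = np.clip(np.sum(v0n * v1n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:  # vectors nearly parallel: fall back to lerp
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

def frame_volumes(samples, sr, fps):
    # RMS loudness per video frame (my interpretation of the
    # "accumulated audio volume for each frame" in the text).
    hop = sr // fps
    n = len(samples) // hop
    return np.array([np.sqrt(np.mean(samples[i * hop:(i + 1) * hop] ** 2))
                     for i in range(n)])

def audio_driven_schedule(volumes):
    # Cumulative loudness normalized to [0, 1]: loud frames advance the
    # interpolation quickly, quiet frames nearly freeze the image.
    cum = np.cumsum(volumes)
    return cum / cum[-1]

# Two hand-picked seeds for one lyric section (seed values are examples).
z0 = np.random.default_rng(42).standard_normal((4, 64, 64))    # SD latent
z1 = np.random.default_rng(1337).standard_normal((4, 64, 64))  # shape at 512px

# Synthetic stand-in for one ~1 s audio section: a swelling 220 Hz tone.
sr, fps = 44100, 30
t = np.linspace(0, 1, sr, endpoint=False)
samples = np.sin(2 * np.pi * 220 * t) * np.linspace(0, 1, sr)

ts = audio_driven_schedule(frame_volumes(samples, sr, fps))
latents = [slerp(u, z0, z1) for u in ts]  # one latent per video frame
```

Because the schedule is a normalized cumulative sum, it is monotonically increasing and always ends at exactly 1, so every section finishes on its target noise vector regardless of how the loudness is distributed.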
Once this algorithm was in place, it was a simple task of finding suitable noise vectors for each line of the song and coming up with a global theme that is applied to all text prompts.
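That per-line setup might look like the following sketch. Everything here is hypothetical: the theme string, the placeholder lyric lines, and the seed values are my own stand-ins, not the actual prompts or seeds used for the video.

```python
# Hypothetical per-section configuration: each lyric line gets a prompt
# (line + shared global theme) and a hand-picked noise seed.
GLOBAL_THEME = "surreal anatomical painting, muted colors"  # invented example

lyric_lines = ["zwei Augen", "zwei Ohren"]  # placeholder lines, not the real lyrics
chosen_seeds = {0: 42, 1: 1337}             # picked after browsing candidate images

def build_prompt(line):
    # The prompt stays constant within a section, so the hard cut when it
    # switches lands exactly on the audio cue between two lines.
    return f"{line}, {GLOBAL_THEME}"

sections = [
    {"prompt": build_prompt(line), "seed": chosen_seeds[i]}
    for i, line in enumerate(lyric_lines)
]
```

Each section dict would then drive one interpolation segment: the seed fixes the starting noise vector, and the prompt fixes the semantic content until the next line begins.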