Nvidia enters the text-to-image race with eDiff-I, taking on DALL-E, Imagen


The realm of artificial intelligence (AI) text-to-image generators is the new battleground for tech conglomerates. Every AI-focused organization is now striving to create a generative model that can show extraordinary detail and conjure up mesmerizing visuals from relatively simple text prompts. After OpenAI’s DALL-E 2, Google’s Imagen, and Meta’s Make-a-Scene made headlines with their image synthesis capabilities, Nvidia has entered the race with its text-to-image model called eDiff-I.


Unlike other large generative text-to-image models that perform image synthesis through an iterative denoising process using a single network, Nvidia’s eDiff-I uses an ensemble of expert denoisers that specialize in denoising different intervals of the generative process.

Nvidia’s unique image synthesis algorithm

The developers behind eDiff-I describe the text-to-image model as “a new generation of generative AI content creation tool that offers unprecedented text-to-image synthesis with instant style transfer and intuitive paint-with-words capabilities.”


In a recently published paper, the authors observe that current image synthesis algorithms rely heavily on the text prompt early in the generation process to create text-aligned content, while in later stages the text conditioning is almost entirely ignored and the synthesis task shifts toward producing output with high visual fidelity. This led to the realization that sharing model parameters across the entire generation process may not be the best way to capture these distinct phases of generation.

“Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized in different stages of synthesis,” the Nvidia research team said in their paper. “To maintain training efficiency, we initially train a single model, which is then gradually broken down into specialized models that are further trained for the specific stages of the iterative generation process.”

eDiff-I’s image synthesis pipeline consists of a combination of three diffusion models: a base model that can synthesize samples at 64 x 64 resolution, and two super-resolution stacks that progressively upsample the images to 256 x 256 and 1024 x 1024 resolution, respectively.

These models process an input caption by first computing the T5 XXL text embedding and the CLIP text embedding. The model architecture for eDiff-I also uses CLIP image encodings computed from a reference image. These image embeddings serve as a style vector, which is fed into the cascaded diffusion models to progressively generate images at 1024 x 1024 resolution.
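The cascade described above can be sketched in a few lines. This is an illustrative skeleton, not Nvidia’s implementation: the `denoise` and `upsample` functions are stand-ins for full diffusion models, and the embedding dimensions are assumptions; only the 64 → 256 → 1024 staging reflects the paper.

```python
import numpy as np

def denoise(noise, cond, stage):
    """Stand-in for one diffusion model; a real model would run an
    iterative denoising loop conditioned on `cond` at this stage."""
    return noise  # identity here; only the shapes matter for the sketch

def upsample(img, factor):
    """Nearest-neighbour upsampling as a placeholder for a learned
    super-resolution diffusion stack."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def ediff_i_cascade(t5_emb, clip_text_emb, clip_image_emb=None):
    """Hypothetical sketch of the three-stage pipeline: a 64x64 base
    model plus 64->256 and 256->1024 super-resolution stacks. All
    conditioning embeddings are passed to every stage."""
    cond = {
        "t5": t5_emb,               # T5 XXL text embedding
        "clip_text": clip_text_emb, # CLIP text embedding
        "clip_image": clip_image_emb,  # optional style vector from a reference image
    }
    x = denoise(np.random.randn(64, 64, 3), cond, "base-64")
    x = denoise(upsample(x, 4), cond, "sr-256")    # 64 -> 256
    x = denoise(upsample(x, 4), cond, "sr-1024")   # 256 -> 1024
    return x

img = ediff_i_cascade(t5_emb=np.zeros(4096), clip_text_emb=np.zeros(768))
print(img.shape)  # (1024, 1024, 3)
```

The key structural point is that, unlike some cascades that condition only the base model on text, every stage here receives the full conditioning dictionary.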

Because of these unique aspects, eDiff-I offers much more control over the generated content. In addition to synthesizing images from text, the eDiff-I model has two additional capabilities: style transfer, which lets the user control the style of the generated sample using a reference image, and “paint with words,” in which the user creates images by drawing segmentation maps on a virtual canvas. The latter is useful when the user wants to compose a specific desired scene.

Image source: Nvidia AI.
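One plausible way to realize “paint with words” is to let each painted region boost the cross-attention scores between its pixels and the tokens of the phrase assigned to it. The sketch below is an assumption-laden illustration of that additive-bias idea, not Nvidia’s exact formulation; all names and shapes are hypothetical.

```python
import numpy as np

def paint_with_words_bias(attn_logits, region_masks, token_spans, weight=1.0):
    """Illustrative sketch: `attn_logits` has shape (num_pixels, num_tokens);
    `region_masks` is a list of flattened binary masks over pixels; and
    `token_spans` gives the (start, end) token indices of the phrase the
    user assigned to each painted region. Painted pixels get their
    attention logits toward the phrase's tokens increased by `weight`."""
    biased = attn_logits.copy()
    for mask, (start, end) in zip(region_masks, token_spans):
        biased[mask.astype(bool), start:end] += weight
    return biased

# Toy example: 4 pixels, 5 tokens; the first two pixels are painted with
# a phrase occupying tokens 0-1.
logits = np.zeros((4, 5))
mask = np.array([1, 1, 0, 0])
out = paint_with_words_bias(logits, [mask], [(0, 2)], weight=2.0)
print(out[0, 0], out[2, 0])  # 2.0 0.0
```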

A new denoising process

Synthesis in diffusion models generally occurs through a series of iterative denoising steps that gradually generate an image from random noise, with the same neural network denoiser used throughout the process. eDiff-I instead trains an ensemble of denoisers, each specialized in denoising a different interval of the generative process. Nvidia refers to these specialized networks as “expert denoisers” and claims this approach dramatically improves the quality of image generation.

The denoising architecture used by eDiff-I. Image source: Nvidia AI.
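At inference time, routing between expert denoisers amounts to picking the model responsible for the current noise level. The sketch below illustrates that dispatch; the three-expert split, the 1,000-step schedule, and all names are assumptions for illustration, not the paper’s actual configuration.

```python
def select_expert(t, experts):
    """Pick the denoiser specialized for the interval containing
    timestep t. `experts` maps (lo, hi) timestep intervals to models."""
    for (lo, hi), model in experts.items():
        if lo <= t < hi:
            return model
    raise ValueError(f"no expert covers t={t}")

# Three hypothetical experts over a 1,000-step noise schedule
# (high t = high noise, i.e., early in generation):
experts = {
    (0, 300): "expert_low_noise",     # late steps: refine fine detail
    (300, 700): "expert_mid_noise",
    (700, 1000): "expert_high_noise", # early steps: rely heavily on text
}
print(select_expert(850, experts))  # expert_high_noise
```

This mirrors the paper’s motivation: the high-noise expert handles the text-driven early phase, while the low-noise expert focuses on visual fidelity, without increasing the per-step compute cost since only one expert runs at each step.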

Scott Stephenson, CEO of Deepgram, says the new methods presented in eDiff-I’s training pipeline could be incorporated into new versions of DALL-E or Stable Diffusion, enabling significant advances in the quality and control of the synthesized images.

“It certainly adds to the complexity of training the model, but doesn’t significantly increase the computational complexity in production use,” Stephenson told VentureBeat. “The ability to segment and define what each part of the resulting image should look like could meaningfully speed up the creation process. In addition, it allows man and machine to work together more closely.”

Better than contemporaries?

While other advanced contemporaries such as DALL-E 2 and Imagen use only a single text encoder (CLIP and T5, respectively), eDiff-I’s architecture uses both encoders in the same model. Such an architecture allows eDiff-I to generate substantially diverse images from the same text input.

CLIP embeddings give the generated image a stylized appearance; however, the output often fails to follow details of the text. Images created with T5 text embeddings, on the other hand, render individual objects better. By combining them, eDiff-I produces images with both synthesis qualities.
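One straightforward way to feed two text encoders into one denoiser is to project both token sequences into a shared width and concatenate them along the sequence axis, so cross-attention can attend to either. This sketch is illustrative only; the projection dimension and sequence lengths are assumptions, not eDiff-I’s actual values.

```python
import numpy as np

def build_context(t5_tokens, clip_tokens, d=1024):
    """Project T5 and CLIP token embeddings to a shared width `d`
    (assumed) and concatenate along the sequence axis, producing one
    context matrix for the denoiser's cross-attention layers."""
    rng = np.random.default_rng(0)
    proj_t5 = rng.standard_normal((t5_tokens.shape[-1], d))
    proj_clip = rng.standard_normal((clip_tokens.shape[-1], d))
    return np.concatenate([t5_tokens @ proj_t5, clip_tokens @ proj_clip], axis=0)

# e.g. 77 T5 tokens at width 4096 plus 77 CLIP tokens at width 768:
ctx = build_context(np.zeros((77, 4096)), np.zeros((77, 768)))
print(ctx.shape)  # (154, 1024)
```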

Generate variations from the same text input. Image source: Nvidia AI.

The development team also found that the more descriptive the text prompt, the more T5 outperforms CLIP, and that combining the two yields better synthesis output. The model was also evaluated on standard datasets such as MS-COCO, showing that CLIP+T5 embeddings provide significantly better trade-off curves than either embedding alone.

Nvidia’s study shows eDiff-I outperformed competitors such as DALL-E 2, Make-a-Scene, GLIDE and Stable Diffusion on the Fréchet Inception Distance (FID), a metric for evaluating the quality of AI-generated images in which lower scores are better. eDiff-I also achieved a better FID score than Google’s Imagen and Parti.

Zero-shot FID comparison with recent state-of-the-art models on the COCO 2014 validation dataset. Image source: Nvidia AI.
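For readers unfamiliar with the metric: FID fits a Gaussian to the Inception features of real images and another to those of generated images, then measures the Fréchet distance between the two. Given the feature means and covariances, the computation is short; this sketch shows the standard formula, not Nvidia’s evaluation code.

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians fitted to
    Inception features of real vs. generated images; lower is better.
    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*sqrt(S1 @ S2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)

# Sanity check: identical feature distributions give distance 0.
mu, sigma = np.zeros(3), np.eye(3)
print(round(float(fid(mu, sigma, mu, sigma)), 6))  # 0.0
```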

When comparing images generated from both simple and long, detailed captions, Nvidia’s study claims that both DALL-E 2 and Stable Diffusion failed to accurately synthesize the text described in the captions. The study also found that other generative models either produce spelling errors or ignore some attributes. Meanwhile, eDiff-I was able to correctly render English text across a wide variety of samples.

But with that said, the research team also noted that they generated multiple outputs from each method and picked the best ones to include in the figure.

Comparison of image generation through detailed captions. Image source: Nvidia AI.

Current Challenges for Generative AI

Modern text-to-image diffusion models have the potential to democratize artistic expression by allowing users to produce detailed, high-quality images without specialized skills. However, they can also be used for sophisticated photo manipulation for malicious purposes, or to create misleading or harmful content.

The recent advancement of generative modeling and AI-driven image editing has profound implications for image authenticity and beyond. Nvidia says such challenges can be addressed by automatically validating authentic images and detecting manipulated or fake content.

The training datasets of current large-scale text-to-image generative models are usually unfiltered and may contain biases that are captured by the model and reflected in the generated data. Therefore, it is critical to be aware of such biases in the underlying data and counter them by actively collecting more representative data or using bias correction methods.

“Generative AI imagery models face the same ethical challenges as other areas of artificial intelligence: the provenance of training data and understanding how it is used in the model,” said Stephenson. “Large datasets of labeled images can contain copyrighted material, and it’s often impossible to explain how (or if) copyrighted material was incorporated into the final product.”

According to Stephenson, the speed of model training is another challenge generative AI models still face, especially during their development phase.

“If a model takes between 3 and 60 seconds to generate an image on some of the most advanced GPUs on the market, production-scale implementations will require a significant increase in GPU offerings, or figuring out how to generate images in a fraction of the time. The status quo is not scalable if demand grows at 10x or 100x,” Stephenson told VentureBeat.

The future of generative AI

Kyran McDonnell, founder and CEO of Revolt, said that while today’s text-to-image models do abstract art exceptionally well, they lack the architecture required to construct the priors needed to properly understand reality.

“They’ll be able to approximate reality with enough training data and better models, but they won’t really understand it,” he said. “Until that underlying problem is addressed, we will still see these models make common sense mistakes.”

McDonnell believes that next-generation text-to-image architectures such as eDiff-I will solve many of today’s quality problems.

“We can still expect compositional errors, but the quality will be comparable to where specialized GANs are today in terms of face generation,” said McDonnell.

Similarly, Stephenson said we would see more applications of generative AI in different application areas.

“Generative models trained on a brand’s style and general vibe can generate an infinite variety of creative resources,” he said. “There’s plenty of room for enterprise applications and generative AI hasn’t had its ‘mainstream moment’ yet.”
