Not long after Sora was released, Stability AI released Stable Diffusion 3. For anyone who uses artificial intelligence for creative design, this is quite a New Year's gift, and this article is prepared especially for those users. It explains, in plain terms, the two headline features of Stable Diffusion 3, the "diffusion transformer model" and "flow matching", so that you can make better use of the model for creation once it is released.
Let's start with the diffusion transformer model (Diffusion Transformers), which we will refer to as DiTs below. As the name suggests, this is a latent diffusion model for images built on the transformer architecture. If you have read Silicon Star Pro's article "Uncovering Sora: Using a Large Language Model to Understand Videos and Realize the 'Emergence' of the Physical World", then you can already count as a "class representative" for what follows. Like Sora, DiTs also works with "patches", but since DiTs is used to generate still images, it does not need to maintain logical consistency across frames the way Sora does, so it does not have to produce spacetime patches that span both time and space.
DiTs is close in spirit to the Vision Transformer (ViT) that took the computer vision field by storm four or five years ago: the image is divided by DiTs into multiple patches, each patch is embedded into a continuous vector space, and the result is a token sequence for the transformer to process. One thing to note is that DiTs has real work to do here: for conditional image generation, it must receive and fuse external condition information such as class labels or text descriptions. This is usually achieved either by appending additional input tokens or through a cross-attention mechanism, letting the model steer the generation process according to the given conditions.
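To make the patchify step concrete, here is a minimal PyTorch sketch (my own illustration, not Stability AI's code) of how a latent image might be cut into patches and combined with a condition embedding; class names such as `PatchEmbed` and the tensor sizes are assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_size=2, in_channels=4, dim=512):
        super().__init__()
        # A strided convolution cuts the latent into non-overlapping patches
        # and projects each patch into the transformer's embedding space.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W) image latent
        x = self.proj(x)                        # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim) token sequence

latent = torch.randn(1, 4, 32, 32)              # e.g. a 32x32x4 latent
tokens = PatchEmbed()(latent)                   # (1, 256, 512)

# Condition info (a class label or pooled text embedding) can simply be
# projected and prepended as an extra input token...
cond = torch.randn(1, 768)                      # hypothetical text/class embedding
cond_token = nn.Linear(768, 512)(cond).unsqueeze(1)
sequence = torch.cat([cond_token, tokens], dim=1)   # (1, 257, 512)
# ...or injected later through cross-attention inside each DiT block.
```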
Once these patches arrive inside DiTs, they are processed into the required content by the DiT blocks, which are the core of DiTs. A DiT block is a transformer structure specially designed for diffusion models that can process image tokens together with condition information. (In Chinese, "block" would normally be translated with the same word as "patch", so to avoid confusion the English word "block" is kept here.)
DiT blocks come in three variants: cross-attention, adaLN, and adaLN-Zero. The cross-attention variant adds an extra multi-head cross-attention layer after the multi-head self-attention layer. Its job is to use the condition information to guide image generation so that the output matches the prompt more closely, at the cost of roughly 15% more computation.
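As a rough illustration of that variant (an assumption-laden sketch, not the official implementation), the cross-attention layer sits between self-attention and the feed-forward MLP, with queries coming from the image tokens and keys/values from the condition sequence:

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        # Self-attention over the image patch tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention: queries from image tokens, keys/values from the
        # condition sequence, steering generation toward the prompt.
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond)[0]
        return x + self.mlp(self.norm3(x))

x = torch.randn(1, 256, 512)        # patch tokens
cond = torch.randn(1, 77, 512)      # e.g. projected text-encoder tokens
out = CrossAttnDiTBlock()(x, cond)
```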
The LN in adaLN stands for layer normalization, which normalizes the outputs of the units inside each network layer to reduce internal covariate shift, thereby improving convergence speed and training performance. adaLN (adaptive layer normalization) extends standard layer normalization by letting its parameters be adjusted dynamically based on the input data or additional condition information. It works a bit like a car's suspension, increasing the model's stability and adaptability.
The adaLN DiT block is then improved further: in addition to regressing γ and β, the model also regresses dimension-wise scaling parameters α that are applied immediately before every residual connection inside the DiT block. This variant is adaLN-Zero. The purpose is to mimic the beneficial initialization strategy of residual networks, so that each block starts out close to an identity mapping, which promotes effective training and optimization of the model.
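Here is a hedged PyTorch sketch of how adaLN-Zero modulation is commonly implemented: a small regression layer produces γ, β, and the gating scale α from the condition embedding, and that layer is zero-initialized so each block starts as an identity mapping; plain adaLN would keep only γ and β. The exact layer names and sizes are guesses, not the SD3 code:

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress (gamma, beta, alpha) for both the attention and MLP branches.
        self.modulation = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.modulation.weight)   # adaLN-Zero: block starts as identity
        nn.init.zeros_(self.modulation.bias)

    def forward(self, x, c):                     # c: condition embedding, shape (B, dim)
        g1, b1, a1, g2, b2, a2 = self.modulation(c).chunk(6, dim=-1)
        # Plain adaLN would only use gamma/beta; alpha gates each residual branch.
        h = self.norm1(x) * (1 + g1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + a1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + g2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + a2.unsqueeze(1) * self.mlp(h)

x = torch.randn(1, 256, 512)
c = torch.randn(1, 512)                          # e.g. timestep + class/text embedding
out = AdaLNZeroBlock()(x, c)
```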
After passing through the DiT blocks, the token sequence is decoded into a noise prediction and a diagonal covariance prediction. Using a standard linear decoder, both predictions have the same shape as the spatial dimensions of the input. Finally, the decoded tokens are rearranged back into their original spatial layout to obtain the predicted noise and covariance values.
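A small sketch of that final step, with assumed shapes rather than the actual SD3 decoder: each token is linearly decoded into a patch carrying both noise and covariance channels, then the patches are rearranged ("unpatchified") back onto the image grid:

```python
import torch
import torch.nn as nn

def decode_tokens(tokens, patch_size=2, channels=4, grid=16):
    B, N, dim = tokens.shape                        # N = grid * grid patches
    # Each token decodes to patch_size*patch_size*2C values: C channels of
    # noise prediction plus C channels of covariance prediction. In a real
    # model this linear layer would be a trained module, not built per call.
    decoder = nn.Linear(dim, patch_size * patch_size * 2 * channels)
    x = decoder(tokens)
    x = x.reshape(B, grid, grid, patch_size, patch_size, 2 * channels)
    x = x.permute(0, 5, 1, 3, 2, 4)                 # (B, 2C, grid, ps, grid, ps)
    x = x.reshape(B, 2 * channels, grid * patch_size, grid * patch_size)
    noise_pred, cov_pred = x.chunk(2, dim=1)        # each (B, C, H, W)
    return noise_pred, cov_pred

noise, cov = decode_tokens(torch.randn(1, 256, 512))
```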
Chapter two: flow matching (hereinafter FM). According to Stability AI, it is an efficient, simulation-free method for training CNF models that allows general probability paths to supervise the CNF training process. Crucially, FM breaks the barrier that kept CNF training from scaling beyond diffusion models: it can operate on the probability path directly, without requiring a deep understanding of the diffusion process, thus sidestepping the difficulties of traditional training.
CNF stands for Continuous Normalizing Flows, a probabilistic and generative modeling technique in deep learning. In a CNF, a simple probability distribution is transformed into the distribution of complex, high-dimensional data through a series of reversible, continuous transformations. These transformations are usually parameterized by a neural network, so that the original random variables are continuously transformed to approximate the target data distribution. In plain terms, a CNF generates data a bit like rolling dice: you start from simple randomness and reshape it into the data you want.
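To make the "rolling dice" picture concrete, here is a toy, untrained PyTorch sketch (purely illustrative, not from any paper) of what a CNF does at sampling time: draw simple Gaussian noise, then integrate a learned vector field forward in time so the samples flow toward the target distribution:

```python
import torch
import torch.nn as nn

# Toy vector field v_theta(x, t): input is 2-D point plus time, output a velocity.
v_theta = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))

def sample_cnf(n=16, steps=100):
    x = torch.randn(n, 2)                        # "roll the dice": simple Gaussian start
    for i in range(steps):                       # Euler integration of dx/dt = v_theta(x, t)
        t = torch.full((n, 1), i / steps)
        x = x + v_theta(torch.cat([x, t], dim=1)) / steps
    return x                                     # after training, these points would
                                                 # approximate the target distribution

samples = sample_cnf()
```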
However, CNFs demand a lot of computing resources and time in practice, so Stability AI asked: is there a method that produces results nearly identical to a CNF, but with a stable training process and a much lower computational cost? That is how FM was born. In essence, FM is a technique for training CNF models to adapt to and reproduce the evolution of a given data distribution, even when we do not know the distribution's mathematical expression or the corresponding generating vector field in advance. By optimizing the FM objective, the model gradually learns a vector field that generates a probability distribution approximating the real data distribution.
Compared with CNF itself, FM is best thought of as an optimization method: its goal is to make the vector field learned by the CNF model as close as possible to the vector field that generates the ideal target probability path.
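For readers who want the formal statement, the flow matching literature writes the training objective roughly as below (this is the general form from the FM papers, not necessarily the exact loss used in Stable Diffusion 3): the model's vector field v_θ is regressed onto the target field u_t that generates the probability path p_t.

```latex
% General flow matching objective (from the FM literature, not SD3-specific):
% v_\theta is the learned vector field, u_t the field generating the path p_t.
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x \sim p_t(x)}
    \left\| v_\theta(x, t) - u_t(x) \right\|^2
```

Since u_t is generally intractable, the conditional flow matching variant replaces it with a per-sample conditional vector field, which yields the same gradients while being easy to compute.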
Having walked through the two core technical features of Stable Diffusion 3, you will find it is actually very close to Sora. Both are transformer-based models (Stable Diffusion previously used a U-Net), both break their inputs into patches, both represent a step change in stability and optimization, and their release dates are remarkably close. It does not seem too much to say they are related by blood.
However, there is a fundamental difference between the "two brothers": Sora is closed source, while Stable Diffusion 3 is open source. In fact, whether it is Midjourney or DALL·E, they are all closed source; only Stable Diffusion is open. If you follow open-source artificial intelligence, you will have noticed that the open-source community has been stuck for quite a while with no obvious breakthrough, and many people have lost confidence in it. Stable Diffusion 2 and Stable Diffusion XL only improved the aesthetics of the generated images, something Stable Diffusion 1.5 already did well. Seeing the revolutionary improvements in Stable Diffusion 3 could rekindle many developers' confidence in the open-source community.
Another exciting note: Stability AI CEO Emad Mostaque (মোহম্মদ ইমাদ মোশতাক) said on Twitter that although Stability AI has as much as 100 times fewer resources than some other companies in the artificial intelligence field, the Stable Diffusion 3 architecture can already accept content other than videos and images, though he cannot reveal much more yet.
Pictures and videos I can understand, but what is this "other" content? Honestly, the only thing I can think of is audio, generating images from a piece of sound. It is puzzling, but once Stability AI releases its latest research results, we will interpret them as soon as possible.