DeepSeek Unveils Janus Pro, Taking Aim at OpenAI's DALL-E 3

Published January 28, 2025

Just a week after making waves with the release of its R1 large language model (LLM), the Chinese company DeepSeek is back in the spotlight with a new product aimed at rivaling OpenAI's DALL-E 3.

On Monday, against a backdrop of ongoing cyberattacks, DeepSeek launched Janus Pro, a pair of new multimodal LLMs: Janus Pro 1B and Janus Pro 7B. The models are built specifically for image generation and vision processing tasks. As with DALL-E 3, users supply a text prompt and Janus Pro generates a corresponding image.

The new models reportedly improve on DeepSeek's earlier Janus model, a 1.3-billion-parameter system introduced last year. The key change is decoupling visual encoding into a distinct pathway while retaining a single, unified transformer architecture to process the resulting data.
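That decoupling can be pictured as two separate visual pathways, one for understanding and one for generation, both feeding the same transformer. The PyTorch sketch below is purely illustrative: every module name is invented for this example, and it is not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a decoupled-encoder design: understanding and
# generation each get their own visual pathway, while a single shared
# transformer processes the resulting token sequences. All names here
# are hypothetical; this is not DeepSeek's code.

class DecoupledMultimodalModel(nn.Module):
    def __init__(self, dim=512, layers=4, heads=8, vocab=1024):
        super().__init__()
        # Pathway 1: visual understanding (a ViT-style patch encoder)
        self.understand_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                  # (B, dim, N)
        )
        # Pathway 2: generation, operating on discrete image token IDs
        self.gen_embed = nn.Embedding(vocab, dim)
        # One unified transformer consumes tokens from either pathway
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, layers)

    def understand(self, images):
        tokens = self.understand_encoder(images).transpose(1, 2)  # (B, N, dim)
        return self.transformer(tokens)

    def generate(self, image_token_ids):
        tokens = self.gen_embed(image_token_ids)                  # (B, N, dim)
        return self.transformer(tokens)

model = DecoupledMultimodalModel()
feats = model.understand(torch.randn(2, 3, 384, 384))   # understanding path
out = model.generate(torch.randint(0, 1024, (2, 64)))   # generation path
print(feats.shape, out.shape)
```

The design point the sketch tries to capture is that the encoder used for analyzing images no longer has to double as the representation used when producing them, even though both routes share one transformer.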

In a research paper detailing Janus Pro's design and capabilities, DeepSeek's developers noted that the initial Janus model showed potential but struggled with short prompts, inconsistent image generation, and overall text-to-image quality. With Janus Pro, the team claims to have addressed many of these issues by training on a larger dataset and scaling up the models' parameter counts.

DeepSeek asserts that, in tests against various multimodal and task-specific models, Janus Pro 7B slightly surpasses both Stable Diffusion 3 Medium and OpenAI's DALL-E 3 on the GenEval and DPG-Bench benchmarks. It is worth noting, however, that the models' image analysis is currently limited to inputs of 384x384 pixels.

The company says these benchmark results were achieved using a few hundred GPUs running its HAI-LLM framework on PyTorch. According to the paper, training the 1.5-billion and 7-billion-parameter models took approximately 7 to 14 days on clusters of 16 to 32 nodes, each containing eight Nvidia A100 GPUs, which works out to 128 to 256 GPUs in total.

Training times may also have benefited from reusing components of earlier models rather than starting from scratch. DeepSeek has been contacted for clarification on this point.

While Janus Pro is competitive with other multimodal LLMs and diffusion models, DeepSeek acknowledges there is still room for improvement. The researchers note that the limited 384x384-pixel input resolution can hinder performance on fine-grained tasks such as optical character recognition (OCR). The same constraint means generated images can lack intricate detail.
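To make the resolution constraint concrete, any larger input has to be downscaled to 384x384 before the model sees it, which throws away most of the pixels in a typical photo or document scan. A minimal Pillow sketch (the filenames are hypothetical; the 384 figure comes from the paper, while the resampling filter is an arbitrary choice):

```python
from PIL import Image

# Downscale an arbitrary image to a 384x384 model input.
# A 3000x2000 scan keeps only ~2.5% of its original pixels
# (147,456 of 6,000,000), which is why small text and fine
# textures become hard to recover, hurting tasks like OCR.
img = Image.open("document_scan.png").convert("RGB")
img_small = img.resize((384, 384), Image.LANCZOS)
img_small.save("model_input.png")
```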

The Janus codebase has been made freely available under an MIT license, though the use of the Pro models is governed by DeepSeek's specific Model License, which is accessible on their website.

For those interested in trying out either of the Janus Pro models, DeepSeek provides quick-start scripts for local testing on its GitHub page. A demo is also available on Hugging Face Spaces, though it can take several minutes to fully load.
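As a rough idea of what local testing involves, loading a Hugging Face-hosted checkpoint typically looks like the sketch below. This is a generic transformers idiom, not a reproduction of DeepSeek's quick-start scripts, which rely on model-specific classes in their repository; consult the GitHub page for the supported path.

```python
import torch
from transformers import AutoModelForCausalLM

# Generic pattern for loading a Hugging Face checkpoint; DeepSeek's own
# quick-start scripts use custom classes from their repo instead, so
# treat this purely as a sketch of the general workflow.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/Janus-Pro-7B",    # published model ID on Hugging Face
    trust_remote_code=True,        # architecture is not built into transformers
    torch_dtype=torch.bfloat16,    # half precision to reduce memory use
).eval()
```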

DeepSeek's latest offerings have rattled the market, sending Silicon Valley stocks into a notable dive as investors reassess the perceived US lead in AI and the need for massive infrastructure spending. The company has faced challenges of its own, however, including censorship concerns.

Additionally, DeepSeek was compelled to restrict new signups for its AI chatbot service on Monday due to the ongoing cyberattack.
