There is no shortage of powerful open source Chinese large language models (LLMs) available now for enterprises around the world to download, customize, and deploy commercially.
Yet here comes another model family worth considering: Meituan, the Chinese food delivery and e-commerce giant, attracted the attention of AI experts around the world in late August 2025 with the release of its first open source LLM, LongCat-Flash (also known as LongCat-Flash-Chat), a 560-billion-parameter Mixture-of-Experts (MoE) foundation model.
Then, this week, the company released LongCat-Flash-Thinking, a large-scale open source reasoning model designed to push forward the state of complex problem-solving in AI. It builds on the foundation of the base LongCat-Flash model and introduces a training pipeline specifically tuned for advanced reasoning.
Meituan has also made LongCat-Flash-Thinking freely accessible via its API, with a quota of up to 500,000 tokens per day and an option to extend to 5 million (also free!). However, Western and U.S. enterprises may find it more beneficial from a security and geopolitical standpoint to download the models from the AI code-sharing sites Hugging Face or GitHub, where they're also freely available, and run them locally on their own hardware or in virtual private clouds on U.S. cloud providers.
Already, one commentator says it rivals OpenAI’s flagship proprietary model, GPT-5 — yet is available entirely for free.
Together, these releases form a cohesive ecosystem that spans research innovation, deployment efficiency, and broad accessibility.
For enterprises in China, the U.S., and around the globe, these releases provide yet another powerful pair of open source LLMs to consider deploying in their own operations.
What Is Meituan, and Why Is It Moving From Food Delivery Into LLMs?
Meituan, founded in March 2010 by serial entrepreneur Wang Xing, has evolved from a Groupon-style deals site into one of China’s dominant “super apps” bridging local services, retail, and logistics.
The company merged with Dianping in 2015 to solidify its reach in consumer reviews and local services, before reverting to the “Meituan” brand in 2020. Headquartered in Beijing, Meituan is publicly traded on the Hong Kong Stock Exchange under ticker 3690.HK and is a component of the Hang Seng Index.
Today, Meituan’s business spans food delivery, in-store services, instant retail (groceries and local consumer goods), hotel and travel booking, and more. It claims a user base of over 770 million annual transacting users and supports more than 14.5 million merchants on its platform.
Its scale and logistics infrastructure give it a near-monopoly in many urban delivery corridors. Financially, Meituan has faced deep margin compression and a dramatic profit slide amid fierce domestic competition. The company has also publicly committed to investing “billions” into AI and chip capabilities as it shifts toward more technology-driven offerings.
The Initial Launch: LongCat-Flash
That aforementioned AI investment and focus is already bearing fruit: on August 30, Meituan released LongCat-Flash, an open-sourced large language model built with a Mixture-of-Experts (MoE) architecture.
Despite a nominal parameter count of 560 billion, LongCat’s design dynamically activates only 18.6-31.3 billion parameters per token (averaging ~27 billion), balancing scale and efficiency.
It also leverages innovations like “zero-computation experts” and a shortcut-connected MoE (ScMoE) structure to manage communication overhead and optimize large-scale training. More on these below, followed by an illustrative sketch:
* Zero-computation experts: no-op experts inside MoE blocks that allow the system to allocate compute budget only where needed.
* Shortcut-connected MoE (ScMoE): overlapping computation and communication to ease scaling bottlenecks.
* PID-controlled expert bias: maintaining stable average activation across tokens.
* Training efficiency mechanisms: including hyperparameter transfer from proxy models, deterministic computation to guard against silent data corruption, and a refined initialization strategy.
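To make the zero-computation idea concrete, here is a deliberately simplified PyTorch sketch. It is not Meituan's implementation; the layer sizes, expert counts, and routing scheme are illustrative assumptions. The key move is that some "experts" are simple identity functions, so tokens routed to them consume essentially no compute in that layer:

```python
# A minimal, illustrative sketch of an MoE layer with "zero-computation"
# experts. Sizes, counts, and routing here are assumptions for illustration,
# not Meituan's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyZeroComputeMoE(nn.Module):
    def __init__(self, d_model=64, n_ffn_experts=4, n_zero_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_ffn = n_ffn_experts
        self.n_total = n_ffn_experts + n_zero_experts
        # Ordinary feed-forward experts that actually do work.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts))
        # The router scores every expert, including the zero-computation ones.
        self.router = nn.Linear(d_model, self.n_total)

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(self.n_total):
                mask = idx[:, slot] == e
                if not mask.any():
                    continue
                w = weights[mask, slot].unsqueeze(-1)
                if e < self.n_ffn:
                    out[mask] += w * self.experts[e](x[mask])
                else:
                    # "Zero-computation" expert: pass the token through
                    # unchanged, so easy tokens cost almost no FLOPs here.
                    out[mask] += w * x[mask]
        return out

tokens = torch.randn(8, 64)
print(ToyZeroComputeMoE()(tokens).shape)  # torch.Size([8, 64])
```

In a real MoE model, the router learns which tokens need the full feed-forward experts and which can pass through cheaply, which is how the compute budget gets spent "only where needed."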
In benchmarking, LongCat-Flash-Chat performs competitively with, and in some cases better than, established proprietary systems across instruction following, reasoning, code generation, and agentic tasks. It reached 89.7% on MMLU, 96.4% on MATH500, and 73.2% on GPQA-Diamond, while excelling in agentic tool use with results such as 73.7% on τ²-Bench Telecom, outperforming several larger closed models. Importantly, inference speed topped 100 tokens per second, balancing performance with deployment practicality.
For Meituan, LongCat signals a push to embed advanced AI capabilities into its core commerce, logistics, and platform services — and to position itself not just as a delivery giant, but as a frontier tech player in China’s AI race.
Deployment Report: Optimizing for Scale
Two days later, on September 1, 2025, Meituan released a detailed technical report focused on deploying LongCat-Flash at scale using SGLang. The report addressed the dual challenges of throughput and latency in large MoE models, noting that traditional strategies often forced trade-offs between the two.
Key deployment innovations included:
* PD Disaggregation: separating the prefilling and decoding phases, reducing time-to-first-token for interactive workloads.
* Single Batch Overlap (SBO): a four-stage execution pipeline that overlaps communication-heavy operations (all-to-all dispatch and combine) with dense computation, hiding latency within a single batch.
* Wide Expert Parallelism: increasing parallelism and batch sizes while leveraging overlap to offset communication overhead.
* Multi-step overlapped scheduling: boosting GPU utilization by fusing multiple forward passes into a single scheduling cycle.
* Multi-Token Prediction (MTP): employing a lightweight dense head and fusing verification kernels to optimize speculative decoding.
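To illustrate the speculative-decoding pattern that MTP builds on, here is a toy sketch. Both "models" below are stand-ins (assumptions), not SGLang's API or Meituan's fused verification kernels; a lightweight head proposes several tokens cheaply, and the main model keeps the longest prefix it agrees with:

```python
# A toy sketch of the speculative-decoding pattern behind Multi-Token
# Prediction (MTP). Both functions are illustrative stand-ins.
import random
random.seed(0)

VOCAB = list("abcde")

def draft_head(context, n):
    # Cheap dense-head stand-in: quickly proposes n candidate tokens.
    return [random.choice(VOCAB) for _ in range(n)]

def main_model_next(context):
    # Expensive main-model stand-in: the token it would actually emit next.
    return VOCAB[hash(tuple(context)) % len(VOCAB)]

def speculative_step(context, n_draft=4):
    """Keep the longest drafted prefix the main model agrees with, plus one
    corrected token at the first mismatch. In a real system all drafted
    tokens are verified in a single batched forward pass; that parallel
    verification is where the speedup comes from."""
    accepted = []
    for tok in draft_head(context, n_draft):
        expected = main_model_next(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft matched: token comes "for free"
        else:
            accepted.append(expected)  # mismatch: fall back to the main model
            break
    return accepted

context = list("ab")
for _ in range(3):
    step = speculative_step(context)
    context += step
    print("accepted:", "".join(step), "-> context:", "".join(context))
```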
The result was a system capable of running at over 100 tokens per second on NVIDIA H800 clusters, with a cost per token less than half that of some smaller peers. By bridging throughput and latency, Meituan made LongCat-Flash deployable in environments where both efficiency and responsiveness are critical.
Advanced Reasoning Arrives with LongCat-Flash-Thinking
The most significant leap arrived on September 23, 2025, with the unveiling of LongCat-Flash-Thinking, a variant of the Flash model family tailored for complex reasoning.
Where Flash-Chat was designed as a strong general-purpose foundation, Flash-Thinking was specialized through a two-phase training process, according to Meituan's technical report on the new model: a reasoning-focused cold-start phase to instill long chain-of-thought behavior, followed by large-scale reinforcement learning run in parallel across domains such as math, coding, and agentic tasks before being fused into a single model.
Benchmarks and Performance
Evaluations place LongCat-Flash-Thinking among the strongest open-source reasoning models to date, often approaching or even surpassing the performance of proprietary systems such as GPT-5-Thinking, OpenAI-o3, and Gemini 2.5-Pro.
In mathematical reasoning, Flash-Thinking delivers standout results. It reaches 99.2 percent on MATH500, effectively tied with GPT-5 and ahead of both OpenAI-o3 and Gemini 2.5-Pro.
On VitaBench, Meituan's own benchmark of demanding real-world agentic tasks, Flash-Thinking scores 29.5, narrowly edging out GPT-5 at 29.3.
On competition-style benchmarks like AIME25 and HMMT25, it trails GPT-5 slightly but still ranks ahead of most other peers, underscoring its ability to handle difficult quantitative problems efficiently.
The model also shows strength in general reasoning. On the GPQA-Diamond benchmark, it lands close to GPT-5 and OpenAI-o3, while on ARC-AGI it pushes past both OpenAI and Gemini, even if it does not yet reach GPT-5’s higher score.
Coding tasks reveal a similar pattern: Flash-Thinking achieves 79.4 percent on LiveCodeBench, placing it just a point below GPT-5 and ahead of other leading models, with OJBench results also showing it competitive at the top tier.
Performance in agentic tool use is more uneven. Flash-Thinking posts high results on the τ²-Bench series, including 83.1 percent on Telecom — the best among open-weight systems — though GPT-5 maintains a clear lead in this category. Still, the results suggest strong progress in tool-augmented reasoning and adaptability to real-world agentic tasks.
Where Flash-Thinking truly distinguishes itself is in formal theorem proving. On MiniF2F, it reaches 81.6 percent at pass@32 (a problem counts as solved if any of 32 sampled attempts yields a verified proof), far outpacing GPT-5's 51.2 percent and setting a new state of the art among open models. Even under stricter evaluation thresholds, it maintains a commanding lead, showing that the domain-parallel reinforcement learning strategy translates into tangible advantages for logic-heavy tasks.
Safety scores further highlight its capabilities. Flash-Thinking achieves 93.7 percent on harmful content filtering and 93.0 percent on misinformation detection, both far higher than GPT-5 and its peers, while maintaining comparable privacy protection. These results suggest that the model not only matches proprietary systems in core reasoning domains but also demonstrates stronger safeguards against misuse.
Together, these benchmarks paint a picture of a model that holds its own against GPT-5 in mathematics and coding, surpasses it in theorem proving and safety, and narrows the gap in broader reasoning and agentic tool use.
An important efficiency result showed Flash-Thinking reducing token consumption by 64.5% on AIME25 compared to baseline reasoning models, cutting average tokens from 19,653 to 6,965 without lowering accuracy.
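That percentage checks out against the reported averages:

```python
# Quick check of the reported AIME25 token-efficiency figure.
baseline, optimized = 19_653, 6_965
print(f"reduction: {(baseline - optimized) / baseline:.1%}")
# -> reduction: 64.6% (from the rounded averages; the report cites 64.5%)
```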
Opening Access: API and Developer Quotas
Alongside the technical release, Meituan announced expanded API access for Flash-Thinking, making it usable beyond research downloads. Developers can now:
* Access the model with a daily free quota of 500,000 tokens, up from the earlier 100,000.
* Apply for 5 million tokens per day for higher-volume needs, subject to approval.
* Integrate with Claude Code configurations, expanding compatibility with established coding environments.
* Use updated documentation, including a Quick Start guide, change log, and FAQ.
The API platform complements the Hugging Face distribution of model weights and the GitHub repositories for both Flash and Flash-Thinking, creating multiple onramps for developers at different levels of scale.
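For developers sizing up the API, integration follows the familiar chat-completions pattern. The snippet below is a sketch only: the base URL and model identifier are placeholders (assumptions, not confirmed values) under the assumption of an OpenAI-compatible interface; Meituan's Quick Start guide has the actual endpoints and names:

```python
# Illustrative sketch only. The base_url and model name are placeholders,
# assuming an OpenAI-compatible interface; consult Meituan's Quick Start
# guide for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-longcat-endpoint/v1",  # placeholder endpoint
    api_key="YOUR_LONGCAT_API_KEY",
)

resp = client.chat.completions.create(
    model="LongCat-Flash-Thinking",  # placeholder model identifier
    messages=[
        {"role": "user",
         "content": "Prove that the sum of two even integers is even."},
    ],
)
print(resp.choices[0].message.content)
```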
Building the LongCat Ecosystem
Taken together, the August and September releases illustrate Meituan’s systematic approach:
* Foundation: LongCat-Flash-Chat, establishing a performant, efficient base model trained on massive data.
* Engineering: Deployment optimizations via SGLang, making large-scale MoE models practical for interactive and batch workloads.
* Reasoning specialization: LongCat-Flash-Thinking, pushing state-of-the-art in open-source reasoning, theorem proving, and agentic tool use.
* Accessibility: API launch with generous free quotas, encouraging developer experimentation and adoption.
Each stage built upon the last, moving from research release to deployment strategy to specialized training and finally to open access.
What It Means For Enterprise Technical Decision-Makers
Meituan frames LongCat as an open ecosystem under the MIT license, while cautioning that these models are not tuned for every downstream application.
Developers are urged to assess safety, fairness, and domain fit before deploying in sensitive settings.
Still, with open weights, technical transparency, system-level deployment strategies, and broad API availability, LongCat now represents one of the most comprehensive open-source efforts at the frontier of large-scale reasoning models.
For AI engineers responsible for managing the lifecycle of large language models, reasoning-specialized systems like Flash-Thinking have direct impact on daily workflows, cutting token usage by more than half without compromising accuracy. That means fewer resources spent on inference, faster iteration during prototyping, and the ability to meet business demands without scaling up infrastructure costs dramatically. In practice, it means less time firefighting compute bottlenecks and more time aligning models with specific product needs.
For those working on AI orchestration and operational pipelines, these models can slot more easily into existing multi-model environments while still offering top-tier reasoning ability. Their value lies in stability and scalability. The domain-parallel reinforcement learning strategy used in Flash-Thinking points to more predictable model behavior across specialized tasks, which can reduce the complexity of pipeline validation and monitoring. Combined with deployment optimizations from the Flash family — such as single batch overlap and wide expert parallelism — these advances align with orchestration specialists’ priorities: reliable throughput, minimized latency, and budget-conscious scaling.
Meanwhile, data engineers and IT security leaders see different but complementary benefits. For data-focused roles, efficiency in handling logic-heavy reasoning and structured tasks makes it easier to integrate language models into pipelines that depend on accurate transformations or quality assurance.
For security functions, the high safety scores achieved by Flash-Thinking — particularly in harmful content filtering and misinformation detection — offer reassurance that integrating such models into enterprise systems does not increase organizational risk.
By weaving together architectural efficiency, reinforcement learning at scale, and accessible developer tools, Meituan positions the LongCat family as both a research contribution and a practical platform — one that bridges cutting-edge reasoning with real-world usability.