
Synopsis
From October 27 to 30, 2025, the 2025 Financial Street Forum Annual Conference was held on Beijing's Financial Street, with the FinTech Conference held concurrently as a dedicated segment of the forum. On the morning of October 30, the "AI+Finance Special Forum," a parallel forum of the 2025 FinTech Conference jointly organized by the Financial Committee of the Beijing Committee of the China National Democratic Construction Association, the Capital University of Economics and Business Beijing Digital Economy Development Research Institute, and the Zhongguancun FinTech Industry Development Alliance, was held at the Zhongguancun FinTech Characteristic Industrial Park under the theme "Intelligence Leading the Future of FinTech Innovation." Liu Jun, Xinghua Distinguished Chair Professor and Chair of the Department of Statistics and Data Science at Tsinghua University, attended the forum and delivered a keynote speech titled "Reflections on AI Implementation in the Digital Economy and Finance." Professor Liu argued that "the implementation of artificial intelligence must be deeply integrated with statistics and data science," noting that "sole reliance on big data without rigorous statistical thinking may lead to model bias and decision-making errors" and that "statistics and data science are the keys to AI implementation." Through cases in urban governance, financial risk control, and other areas, he demonstrated how "AI + Statistics" can provide reliable, interpretable solutions for the digital economy and finance.
Liu Jun, Xinghua Distinguished Chair Professor and Chair of the Department of Statistics and Data Science, Tsinghua University
The following is compiled from the transcript:
Thank you, Professor Li Ping, for the invitation; it is an honor to participate in this seminar. Having just listened to the profound insights from Mayor Sima and Minister Li on the framework for FinTech development, I am greatly inspired. I will add a technical perspective, discussing some of the underlying logic that can help AI achieve practical implementation in the digital economy and financial sectors.
I. The Four Foundational Ideas of AI: Understanding Large Models from the Underlying Logic
To understand how AI is implemented, we must first clarify the evolution of its core technologies and grasp how the technology iterates. I often use this analogy: imagine a large AI model as a Boeing aircraft. Statistics and machine learning are like the fundamental disciplines, such as aerodynamics and thermodynamics, that directly support the aircraft. Developing the next-generation aircraft cannot focus solely on improving the windows or making the seats more comfortable; aircraft manufacturing involves many fields of engineering as well as economic considerations. We are currently experimenting with various specific Agents, but we also need to understand their working principles. I have therefore summarized four foundational ideas in AI development:
1. Deep Learning Models: Starting from Linear Regression
The most primitive AI idea is linear regression—using x to infer y, which is essentially "prediction" in statistics. The linear model proposed by Gauss centuries ago is the foundation of statistics. Nonlinear models developed later, but the core remains "using simple functions to fit a complex world."
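To make the idea concrete, here is a minimal sketch in Python of fitting a linear model by ordinary least squares, the method that traces back to Gauss (the data here are toy numbers invented for illustration):

```python
import numpy as np

# Toy data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, size=100)

# Ordinary least squares: solve for slope and intercept.
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"slope ~ {beta[0]:.2f}, intercept ~ {beta[1]:.2f}")
```

Everything that follows, however complex, still rests on this "use x to predict y" template.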
Although neural networks were developed early on, the use of very large networks truly began with the emergence of Deep Neural Networks (DNNs), a development of the last two decades. The rise of large models has prompted reflection on some long-held scientific intuitions. Previously, the scientific community generally believed that simpler models were better, that understanding should be simple and direct, and so people hoped to reason from very simple models. The emergence of large models challenged this thinking: a model with a vast number of parameters can also describe a subject very well, and often even better. Large models have simultaneously given rise to many statistical learning methods, such as contrastive learning, mixture of experts, and self-supervised learning. But what exactly are large models? How can we build them well? How do they achieve generalization? These are all questions awaiting exploration. Many are aware that large models suffer from issues like hallucination while also exhibiting characteristics like emergent abilities. But how exactly do these characteristics arise, and why do they appear? Some treat this as metaphysics, believing machines are gradually gaining intelligence; I don't subscribe to that view, and frankly I think it is nonsense. In fact, similar phenomena exist in physics and mathematics. The theory of phase transitions attempts to explain them: think of how water suddenly becomes ice at a certain temperature, or ice becomes water. The reasoning capabilities of large models may exhibit similar phenomena, and some mathematicians are researching this.
2. Foundation of Large Language Models: Embedding Representations of Words and Text in Euclidean Space
Another revolutionary idea is transforming discrete textual data into vectors, so that the relationship between two words is represented by the distance between their vectors in space. For example, the words "father" and "mother" are separated by a small spatial distance; we consider them closely connected. Sometimes "father" and "emperor" also share a connection, though it is less obvious how to place them near each other. This is called deep knowledge representation, or embedding, and it is a truly revolutionary concept.
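As a minimal sketch of the idea, with made-up 4-dimensional vectors (real embeddings are learned from data and have hundreds or thousands of dimensions), closeness between words becomes closeness between vectors, here measured by cosine similarity:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, for illustration only;
# real models learn much higher-dimensional vectors from text.
emb = {
    "father":  np.array([0.9, 0.1, 0.3, 0.0]),
    "mother":  np.array([0.8, 0.2, 0.3, 0.1]),
    "emperor": np.array([0.6, 0.7, 0.1, 0.2]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["father"], emb["mother"]))   # close: high similarity
print(cosine(emb["father"], emb["emperor"]))  # related, but less similar
```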
This approach, so-called Next Token Prediction, is the foundation of models like DeepSeek and ChatGPT, because it allows extending from words to sentences or short texts. Transforming vocabulary into spatial vectors makes Next Token Prediction implementable with machine learning methods. This is a fundamental framework, and it also suggests that the value of large models extends far beyond conversational interaction. The seemingly "easily understandable" generated results are merely the tip of the iceberg. What is truly worth exploring are the deep knowledge representations (embeddings) dynamically generated during the underlying training process. Integrating these organically inside the model, or calling on them externally, may yield better results. This approach of "seeking value from the underlying layers" can often unlock more precise and efficient application potential. For example, companies using AI for hedge funds employ this idea: they never use existing models as-is for the task, but rather dismantle them and apply the components to specific problems.
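Mechanically, next-token prediction amounts to scoring every token in the vocabulary given the context and normalizing the scores into probabilities. A minimal sketch, with a made-up four-word vocabulary and random weights standing in for a trained network:

```python
import numpy as np

vocab = ["father", "mother", "emperor", "throne"]
context_vec = np.array([0.7, 0.5, 0.2, 0.1])      # made-up context embedding
W = np.random.default_rng(1).normal(size=(4, 4))  # stand-in output layer

# Score each candidate token, then apply softmax to get probabilities.
logits = W @ context_vec
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"P(next = {word!r}) = {p:.2f}")
```

In a real model the weights are trained so that these probabilities track actual language, and generation simply samples from them token by token.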
3. Stable Diffusion: The Foundation of Generative AI
The core idea of Stable Diffusion can be understood through an intuitive analogy: it's like a "reverse noise-addition experiment." We first gradually add random noise to a clear image until it completely turns into chaotic noise. The model's training objective is to reversely learn "how to restore the original image step-by-step from this noise."
Specifically, this process is achieved through iterative noise-addition and denoising training. The model first learns the forward transformation "from clear image to noise" (akin to "adding noise" to the image), then uses deep neural networks to reversely deduce the generative path "from noise to image." Each noise addition is a "destruction" of the image information, and the model's task is to capture the pattern of this destruction, ultimately mastering the ability to "reconstruct order from chaos"—this is the underlying logic of generative models.
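The forward, noise-adding half of this process has a simple closed form in the standard (DDPM-style) setup. A minimal sketch, with a random array standing in for an image and a commonly used linear noise schedule taken as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, size=(8, 8))   # stand-in for a clear image

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule (an assumption)
alpha_bar = np.cumprod(1.0 - betas)

def noisy_image(x0, t):
    """Sample x_t in one shot: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

print(noisy_image(x0, 10).std())    # early step: still mostly image
print(noisy_image(x0, 999).std())   # late step: essentially pure noise
```

Training the reverse direction, a network that predicts and removes the added noise at each step, is the hard part that this sketch omits.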
This iterative framework of noise addition and denoising has become the cornerstone of almost all generative models (e.g., image generation, text generation). Its ingenuity lies in letting the model master, from the bottom up, the underlying laws of generating high-quality content by simulating the process of "destruction and reconstruction," rather than by directly learning "what a good image looks like." This is a remarkable idea, and essentially all generative models now use this framework.
4. Reinforcement Learning
Take AlphaGo as an example. Strategies like this are also used in finance, and the same idea underlies the fine-tuning of conversational models. You can start with a policy, use the model to roll out the outcomes, and then optimize the policy based on those results; reinforcement learning follows this logic. The most important point, however, is that the forward search requires some randomized Monte Carlo search; it is not necessary to cover all scenarios.
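As a minimal sketch of that last point, with a deliberately fake environment: rather than enumerating every future scenario, we can estimate each candidate action's value from a handful of random rollouts and pick the best estimate.

```python
import random

def rollout_return(action, horizon=10):
    """One random rollout; the dynamics here are invented for illustration."""
    return sum(random.gauss(0.1 * action, 1.0) for _ in range(horizon))

def choose_action(actions, n_rollouts=200):
    """Monte Carlo action selection: average a few rollouts per action."""
    def estimate(a):
        return sum(rollout_return(a) for _ in range(n_rollouts)) / n_rollouts
    return max(actions, key=estimate)

random.seed(0)
print(choose_action(actions=[0, 1, 2, 3]))
```

AlphaGo's Monte Carlo tree search and the policy-optimization loops used in model fine-tuning are far more elaborate versions of this same estimate-then-improve cycle.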
II. On Applying Large Models: Beware of Data Fallacies
The above four points summarize the core ideas and underlying logic of current AI technology. I would like to add one point that requires attention: the use of data. Even small datasets need to be carefully curated, because data often contains bias. This is not to say that large models themselves are flawed; rather, we must emphasize data quality in their application, or the conclusions drawn may be biased.
Here is an example. As you may know, Google has been developing large models continuously for years. A very high-profile piece of work was published in Nature, its core being the use of user search information to predict flu outbreaks. For one of the wealthiest and largest internet companies this should not have been a very difficult task, yet they still made some basic errors: for example, letting season-related signals dominate the flu prediction, effectively turning flu prediction into season prediction and thereby introducing bias. Errors like these, masked by "big data" and "artificial intelligence," are often better concealed and more deceptive, and for that reason more harmful and severe.
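A toy numerical illustration of the pitfall (not the actual Google analysis): if search volume is driven mainly by the season, a model fit on ordinary years effectively learns the season, and it badly underpredicts an off-season outbreak.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(104)                                   # two ordinary years
season = np.sin(2 * np.pi * weeks / 52)
searches = season + 0.1 * rng.normal(size=weeks.size)    # season-driven signal
flu = season + 0.5 * rng.normal(size=weeks.size)         # season + own noise

# Regress flu on searches; the fit looks fine on ordinary years.
X = np.column_stack([searches, np.ones_like(searches)])
beta, *_ = np.linalg.lstsq(X, flu, rcond=None)

# An off-season outbreak: flu spikes while seasonal searches stay flat.
pred = beta[0] * 0.0 + beta[1]
print(f"predicted {pred:.2f} vs actual 2.00")   # the model misses the outbreak
```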
Overall, understanding data is a very important aspect of large model application.
III. AI + Statistics Empowering FinTech
My department and I have made some attempts at applications.
First, digital governance. We assisted the General Administration of Customs with a supervision plan. Because their data volume is very small, directly using large models is difficult; the task instead requires very sophisticated Agents, encompassing many aspects such as experimental design and analysis grounded in statistical principles, as sketched below. Ideas like these can also be applied in market regulation.
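The specifics of that project are not public here, so as a purely hypothetical sketch of one statistical ingredient such a plan might use, consider stratified random sampling, where scarce inspection capacity is allocated by risk stratum (all strata and rates below are invented):

```python
import random

random.seed(0)

# Hypothetical shipments, each assigned to a risk stratum.
shipments = [{"id": i, "stratum": random.choice(["low", "med", "high"])}
             for i in range(1000)]

# Illustrative per-stratum inspection rates: spend effort where risk is higher.
rates = {"low": 0.01, "med": 0.05, "high": 0.20}

to_inspect = [s for s in shipments if random.random() < rates[s["stratum"]]]
print(len(to_inspect), "shipments selected for inspection")
```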
Second, risk control. In China, risk control is both an important goal and an important instrument. We use large models to integrate vast amounts of textual data, then mine the connections between enterprises from that text. This employs technical means including stochastic analysis, Monte Carlo simulation, and graph neural networks for risk prediction, achieving intelligent risk control. We are currently trying this, and it appears to be a very promising direction. Many banks are also using such methods to predict USD risk. Under current technological conditions, integrating data from more modalities could yield even better predictions. Anti-money laundering is a crucial goal in financial regulation, and integrating statistical and AI techniques can help financial institutions identify risky funds.
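As a simplified, hypothetical stand-in for the graph-based part of such a pipeline: treat enterprises as nodes and text-mined connections as edges, and let known risk scores propagate along the edges. A trained graph neural network learns a far richer version of this same message-passing idea.

```python
import numpy as np

# Adjacency matrix of a tiny enterprise graph (edges are invented).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
risk = np.array([1.0, 0.0, 0.0, 0.0])    # firm 0 is known to be risky

P = A / A.sum(axis=1, keepdims=True)      # row-normalized propagation matrix
for _ in range(3):                        # a few rounds of neighbor averaging
    risk = 0.5 * risk + 0.5 * (P @ risk)  # mix own score with neighbors'

print(np.round(risk, 3))                  # risk leaks along enterprise links
```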
Finally, a point on public management: this includes using generative AI to mine the needs of disadvantaged groups, and using dynamic models for traffic congestion control.
Due to time constraints, let me close with a few points. First, to re-emphasize: understanding data is of great significance for building Agents that are genuinely useful to us. Second, our country places great importance on data security, but analyzing and mining data is also very important and should be further encouraged. Statistics and data science are the keys to AI implementation.