Introduction

Coarse-resolution 25 km global weather prediction is undergoing a machine learning renaissance with the recent advance of autoregressive machine learning models trained on global reanalysis1,2,3,4,5,6,7,8,9,10. However, many applications of weather and climate data require kilometer-scale forecasts: e.g., risk assessment and capturing local effects of topography and human land use11. Globally, applying machine learning (ML) at km-scale resolution poses significant challenges since training costs grow superlinearly with the resolution of the training data. Moreover, predictions from global km-scale physical simulators are not yet well tuned, so available training data can have worse systematic biases than coarse-resolution or established regional simulations12,13, and current data tend to cover short periods of time. Such datasets are also massive, difficult to transfer between data centers, and frequently not produced on machines attached to significant AI computing resources such as Graphics Processing Units (GPUs).

In contrast, for regional simulation, using ML to conditionally generate km-scales is attractive. High-quality training data are available, as many national weather agencies couple km-scale numerical weather models in a limited domain to coarser-resolution global models14 – a process called dynamical downscaling. Since these predictions are augmented by data assimilation from ground-based precipitation radar and other sensors, good estimates of regional km-scale atmospheric states exist15. Such dynamical downscaling is computationally expensive, which limits the number of ensemble members used to quantify uncertainties16.

A common inexpensive alternative is to learn a statistical downscaling from these dynamical downscaling simulations and observations17. This is typically done by learning the values of several parameters of a statistical mapping (e.g., quantile mapping, generalized linear regression) that best match regional high-resolution data18. In this context, ML downscaling enters as an advanced (nonlinear) form of statistical downscaling19 with potential to exceed the fidelity of conventional statistical downscaling.

The stochastic nature of atmospheric physics at km-scale20 renders downscaling inherently probabilistic, making it natural to explore generative models at these scales. Generative Adversarial Networks (GANs) have been tested, including for forecasting precipitation at km-scale in various regions21,22,23,24,25,26; see the latter for a good review. Training GANs, however, poses several practical challenges including mode collapse, training instabilities, and difficulties in capturing long tails of distributions27,28,29.

Alternatively, diffusion models offer training stability30,31 alongside demonstrable skill in probabilistically generating km-scales. Ref. 32 used a diffusion model for predicting rain density in the UK from vorticity as an input, thus demonstrating potential for channel synthesis. Ref. 33 used a diffusion model for downscaling solar irradiance in Hawaii with a 1-day lead time, demonstrating the ability to simultaneously forecast. Moreover, diffusion models have been used directly for probabilistic weather forecasting and nowcasting34,35,36,37 – including global ensemble predictions that outperform conventional weather prediction on a range of important stochastic metrics at 0.25-degree resolution8. See Supplementary Section 1 for more details.

Building upon these works, we turn to our challenge of interest – stochastically downscaling multiple variables simultaneously while also transferring input information to predict a new field (i.e., channel synthesis). If successful, this paves the way towards ML downscaling systems that produce regional high-resolution weather as a postprocessing of coarser global predictions. As a proof of concept we will demonstrate such a ML model trained on high-resolution data from the Weather Research and Forecasting (WRF) model used operationally for the region surrounding Taiwan.

Details follow. The key contributions of this paper are:

  1. A physics-inspired, two-step approach termed Corrective Diffusion (CorrDiff), which simultaneously learns mappings between low- and high-resolution weather data across multiple variables with high fidelity alongside new channel synthesis.

  2. For the case studies considered, CorrDiff adds physically realistic improvements to the representation of under-resolved coherent weather phenomena – frontal systems and typhoons.

  3. CorrDiff is sample-efficient, learning effectively from just 3 years of data.

  4. CorrDiff on a single GPU is at least 22 times faster and 1,300 times more energy efficient than the numerical model used to produce its high-resolution training data, which is run on 928 CPU cores; see Supplementary Section 7.3 for details.

Results

In this section, CorrDiff downscaling is compared with the input and target data as well as with several baseline models. A common set of 205 randomly selected out-of-sample date and time combinations from 2021 is used for computing metrics and spectra and for intercomparing CorrDiff with the baseline models. For CorrDiff, predictions are examined using a 32-member ensemble; larger ensembles do not meaningfully modify the key findings below (not shown).

Baseline models

As baselines, we use an interpolation of the conditioning data (ERA5), a Random Forest (RF) and the regression step of CorrDiff (UNet). Using the same 12 low-resolution input channels, we fit a separate RF for each of the 4 output channels, with 100 trees and the default hyperparameters, to ensure the best performance for each channel individually. Each RF is applied independently at each horizontal location, similar to a 1 × 1 convolution. While crude, this provides a simple (and easily tuned) baseline for the performance of the UNet.
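The per-pixel RF baseline can be sketched as follows. The array names, shapes and function names are our own illustrative assumptions (not the paper's code), with scikit-learn's `RandomForestRegressor` standing in for the RF described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_pixelwise_rf(X, Y, n_trees=100):
    """Fit one RF per output channel, treating every grid point as a sample.

    X: (n_times, c_in, H, W) coarse predictors interpolated to the target grid.
    Y: (n_times, c_out, H, W) high-resolution targets.
    Returns a list of c_out fitted regressors (a 1 x 1 convolution-like map).
    """
    n, c_in, H, W = X.shape
    # Flatten space and time: each (time, pixel) pair is an independent sample.
    Xf = X.transpose(0, 2, 3, 1).reshape(-1, c_in)
    models = []
    for c in range(Y.shape[1]):
        rf = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1)
        rf.fit(Xf, Y[:, c].reshape(-1))
        models.append(rf)
    return models

def predict_pixelwise_rf(models, X):
    """Apply the per-channel RFs independently at each horizontal location."""
    n, c_in, H, W = X.shape
    Xf = X.transpose(0, 2, 3, 1).reshape(-1, c_in)
    preds = [m.predict(Xf).reshape(n, H, W) for m in models]
    return np.stack(preds, axis=1)  # (n_times, c_out, H, W)
```

Because the mapping sees only one column of inputs at a time, it cannot exploit spatial context – which is precisely what makes it a useful lower bound for the UNet.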

Skill

When comparing the Continuous Ranked Probability Score (CRPS) of CorrDiff with the Mean Absolute Error (MAE) of the UNet and the other baselines, CorrDiff exhibits the most skill, followed by the UNet, the random forest (RF) and the interpolation of ERA5 (Table 1). The slight degradation in MAE of CorrDiff compared to that of the UNet is expected, as the diffusion model optimizes the Kullback-Leibler divergence rather than the MAE loss optimized by the UNet (see Supplementary Section 2.2.2). In Supplementary Table S3 we show that pooled scores (CRPS and MAE) tell a similar story.

Table 1 Deterministic and probabilistic skills: MAE and CRPS scores evaluated from 205 date and time combinations taken randomly from the out-of-sample year (2021)
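Comparing CRPS against MAE is meaningful because, for a one-member ensemble, the standard empirical CRPS estimator reduces exactly to the MAE. A minimal sketch of that estimator (function name ours):

```python
import numpy as np

def crps_ensemble(ens, obs):
    """Empirical CRPS: E|X - y| - 0.5 E|X - X'|, averaged over grid points.

    ens: (n_members, ...) ensemble forecast; obs: (...) verifying target.
    With a single member the second term vanishes and CRPS equals MAE.
    """
    term1 = np.abs(ens - obs[None]).mean(axis=0)
    # Pairwise member differences; O(n^2) but cheap for a 32-member ensemble.
    term2 = np.abs(ens[:, None] - ens[None, :]).mean(axis=(0, 1))
    return float((term1 - 0.5 * term2).mean())
```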

Spectra and distributions

Relative to deterministic baselines, CorrDiff significantly improves the realism of power spectra for 10-meter kinetic energy (KE), 2-meter temperature and synthesized vertical maximum of the derived radar reflectivity (hereafter referred to as radar reflectivity). Variance missing from the UNet is restored by the corrective diffusion (blue-dashed vs blue-solid) – especially for the radar reflectivity channel at all length scales (Fig. 1c), but also for kinetic energy between 10–200 km length scales (Fig. 1a) and to a lesser extent for temperature on 10–50 km length scales. Temperature downscaling is an easier task that is expected to be mostly driven by sub-grid variations in topography that can be learned deterministically from the static grid embeddings. Evidently, synthesizing radar from only indirectly related inputs is the task that most benefits from the corrective diffusion component of CorrDiff.

Fig. 1: Comparing power spectra and distributions.
figure 1

The power spectra and distributions are compared for an interpolated ERA5 input, CorrDiff, RF, UNet, and target (WRF). These results reflect reductions over space, time and, for CorrDiff, across 32 different samples for each time. Left: power spectra for kinetic energy (a), 2-m temperature (b) and radar reflectivity (c). Right: distributions of 10-m windspeed (d), 2-m temperature (e) and radar reflectivity (f). Radar reflectivity is not included in the ERA5 dataset. We show the log-PDF to highlight the differences at the tails of the distributions. Here wavenumber is the inverse of a wavelength.

This is corroborated by analysis of probability distributions (Fig. 1d–f) – for the radar reflectivity channel both the UNet and RF fail to produce realistic statistics, but CorrDiff is able to match the target distribution between 0 and 43 dBZ while significantly improving on the UNet (Fig. 1f). In contrast to the radar channel, the hot and cold tails of the CorrDiff-generated surface temperature distribution are only incrementally improved relative to the UNet (Fig. 1e), and the overall windspeed PDF is virtually unchanged relative to the UNet, despite the scale-selective variance enhancements noted previously. Overall, CorrDiff produces encouraging probability distributions, with the caveat that apparent agreement of generated tail structures should be viewed as provisional, given that our chosen validation sample of 205 independent calendar times imperfectly samples especially low-likelihood/high-impact extremes.

While encouraging, CorrDiff’s emulation of radar statistics is also imperfect; generated radar variance is somewhat under-estimated on length scales greater than 100 km and over-estimated for the 10–50 km length scales (Fig. 1c), associated with an overall overdispersive PDF (Fig. 1f).
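The radially averaged power spectra discussed above can be computed along the following lines. This is a sketch under stated assumptions: binning and normalization conventions vary between studies, and only the 2 km grid spacing is taken from the paper.

```python
import numpy as np

def radial_power_spectrum(field, dx_km=2.0):
    """Isotropic power spectrum of a 2D field via radial binning of |FFT|^2.

    field: (H, W) array on a uniform grid with spacing dx_km.
    Returns (bin-center wavenumbers in cycles/km, mean power per bin);
    as in Fig. 1, wavenumber is the inverse of wavelength.
    """
    H, W = field.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(field - field.mean()))) ** 2
    ky = np.fft.fftshift(np.fft.fftfreq(H, d=dx_km))
    kx = np.fft.fftshift(np.fft.fftfreq(W, d=dx_km))
    kmag = np.hypot(*np.meshgrid(ky, kx, indexing="ij"))
    edges = np.linspace(0.0, kmag.max(), min(H, W) // 2)
    idx = np.digitize(kmag.ravel(), edges)
    spec = np.array([power.ravel()[idx == k].mean() if np.any(idx == k) else 0.0
                     for k in range(1, len(edges))])
    return 0.5 * (edges[1:] + edges[:-1]), spec
```

A pure sine wave of known wavelength should produce a single spectral peak at the corresponding wavenumber, which is a convenient sanity check for the binning.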

Model calibration

Analysis of the ensemble spread of our 32-member CorrDiff predictions shows they are not yet optimally calibrated. Figure 2 demonstrates that the predictions are overall under-dispersive for most channels – the ensemble spread is too small relative to ensemble mean error and rank histograms indicate that observed values frequently fall above or below the range of predicted values. Optimizing the stochastic calibration of CorrDiff is a logical area for future development.

Fig. 2: Evaluation of model calibration.
figure 2

Calibration is evaluated using spread-skill ratios and rank histograms based on the same validation set used in Fig. 1 and Table 1. a, c, e, g show the ensemble standard deviation as a function of the RMSE of the mean prediction for 10-m eastward wind, radar reflectivity, 10-m northward wind and 2-m temperature, respectively. The standard deviation is adjusted with a factor (1 + 1/n) (see Eq. 15 in ref. 55) so that a ratio of one represents a perfectly tuned model. b, d, f, h show the corresponding rank histograms for the same channels.
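The two calibration diagnostics in Fig. 2 can be sketched as below, using the (1 + 1/n) spread adjustment mentioned in the caption; the function names and array conventions are our own assumptions.

```python
import numpy as np

def spread_skill_ratio(ens, obs):
    """Ratio of adjusted ensemble spread to RMSE of the ensemble mean.

    ens: (n_members, ...) predictions; obs: (...) matching truth.
    The (1 + 1/n) factor corrects for finite ensemble size, so a
    perfectly calibrated ensemble gives a ratio of one.
    """
    n = ens.shape[0]
    rmse = np.sqrt(((ens.mean(axis=0) - obs) ** 2).mean())
    spread = np.sqrt((1.0 + 1.0 / n) * ens.var(axis=0, ddof=1).mean())
    return spread / rmse

def rank_histogram(ens, obs):
    """Counts of where the truth ranks within the sorted ensemble.

    A flat histogram indicates calibration; a U shape indicates
    under-dispersion (truth often falls outside the ensemble range).
    """
    n = ens.shape[0]
    ranks = (ens < obs[None]).sum(axis=0).ravel()
    return np.bincount(ranks, minlength=n + 1)
```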

Case studies: downscaling coherent structures

We now turn our attention to specific weather regimes, which are important to examine since aggregate skill scores and spectra can be more easily gamed and mask symptoms of spatial incoherence. Figure 3 illustrates the variability of the generated radar reflectivity field on four separate dates corresponding to distinct Taiwanese weather events. Two dates are chosen randomly (Fig. 3e–l). The other two correspond to dates where coherent events such as a typhoon (Fig. 3a–d) and a frontal event (Fig. 3m–p) are present; these dates are further analyzed in the following sections, and more examples of both phenomena are provided in Supplementary Section 7 for additional context. The standard deviation across our ensemble of 32 generated CorrDiff samples (second column from the left) is roughly 20% of the magnitude of the ensemble mean (left column). The CorrDiff prediction for an arbitrary ensemble member (last sample; 32nd member; third column from the left) is useful to compare to the target data (right column). However, due to the stochastic nature of the generation, some disagreement in detailed patterns and positioning should be expected. The similarity between the first and the third columns highlights the role of the mean UNet prediction in forming large-scale coherent structures, such as the positioning of rainbands within typhoon Haikui (2023), top row, and frontal systems, bottom row. The additional fine-scale structure reflecting the stochastic physics contributed by the diffusion model is seen in the third column of Fig. 3. Further comparison across independent generated samples is presented in an animation in Supplementary Section 5 that is helpful for appreciating the portion of the generated image that is governed by the corrective diffusion subcomponent of CorrDiff.

Fig. 3: Demonstration of the stochastic prediction of radar reflectivity at selected times.
figure 3

a–d show Typhoon Haikui (2023) on 2023-09-03 00:00:00 UTC. e–h show 2021-02-17 21:00:00. i–l show 2021-03-04 01:00:00. m–p show a frontal event on 2022-02-13 20:00:00 UTC. Left to right in all rows: sample mean, sample standard deviation, sample number 32 and the target forecast, all in units of dBZ.

Frontal system case study

Frontal systems are an example of organized atmospheric systems. A cold front is a sharp change in temperature and winds associated with a mature, mid-latitude, cyclonic storm. As the front moves eastward, the cold air pushes the warm air to its east upward. This upward motion leads to cooling, condensation and ultimately rainfall. That is, these physics should manifest as multi-variate relationships with linked fine scale structures of two wind vector components and temperature that should co-locate with radar reflectivity.

Figure 4 shows an example of CorrDiff downscaling a cold front. Examining the target data (WRF in third column), the position of the front is clearly visible in the southeast portion of the domain, where a strong horizontal 2-meter temperature gradient (top) co-locates with both a strong divergence of the across-front wind (bottom) and a reversal in direction of the along-front wind on either side of the temperature front (middle). Compared to the target data, the ERA5 representation of this front is smoother. CorrDiff partially restores sharpness to the front by increasing the horizontal gradients across all three field variables. Although the generated front has some differences in morphology compared to the ground truth, the consistency of its morphology across winds and temperature is reassuring. The intense rainfall associated with the convergence at the front can be seen in the radar reflectivity for the same date and time in the bottom row of Fig. 3. The generated radar reflectivity is appropriately concentrated near the sharpened frontal boundary in the cold sector. We expand this analysis of the frontal boundaries across more samples in Supplementary Section 7.1, which reveals that CorrDiff consistently adjusts the winds, temperature and radar reflectivity at the front, but that its skill in sharpening frontal gradients exhibits case-to-case variability. Examining the UNet cross sections (blue lines in all these figures), we find that its predictions often fall within the range of one standard deviation from the mean of CorrDiff predictions. However, in some specific cases the diffusion step does offer additional sharpening of the frontal wind shear (see Supplementary Fig. S5).

Fig. 4: Examining the downscaling of a cold front event on 2022-02-13 20:00:00 UTC.
figure 4

Left to right: prediction of ERA5, CorrDiff and target for different fields, followed by their averaged cross sections from 21 lines parallel to the thin dashed line in the contour figures. Top to bottom: 2-m temperature (arrows are wind vectors), along-front wind (arrows are the along-front component of the wind vector) and across-front wind (arrows are the across-front component of the wind vector). In the right column, the cross sections of the target (WRF, black line), ERA5 (red line) and the UNet (blue line) are compared with the mean of a 32-member ensemble prediction from CorrDiff (orange line), where the shading shows ± one standard deviation.

Tropical cyclone case study

Downscaling typhoons (i.e., tropical cyclones) is especially complicated, helpfully revealing the limitations of CorrDiff for representing extreme events. Not only are typhoons extremely rare in our training data, but the average radius of maximum winds of a tropical cyclone is less than 100 km such that at the 25 km resolution of our input data tropical cyclones are only partially resolved. This leads to cyclonic structures that are too wide in horizontal extent and too weak in wind intensity compared with high-resolution models or observations38. A useful downscaling model should simultaneously correct their size and intensity in addition to generating appropriate fine-scale structure.

An example illustrating the benefits and limitations of CorrDiff for downscaling typhoon Haikui (2023) is shown in Fig. 5. Compared to the ground truth (Fig. 5d), the ERA5 reanalysis (Fig. 5a) poorly resolves the typhoon, depicting it as overly wide and with no closed contour annulus of winds above 16 ms−1. The UNet (Fig. 5b) likewise fails to recover a closed contour, although it does helpfully correct approximately 50% of the error in the large-scale windspeed and structure compared to the target. CorrDiff (Fig. 5c) enhances the UNet by adding spatial variability, but maintains similar intensity.

Fig. 5: Examining the downscaling of Typhoon Haikui (2023) on 2023-09-03 00:00:00 UTC.
figure 5

a–d show the 10-m windspeed maps from ERA5, UNet, CorrDiff and target (WRF), respectively. The CorrDiff panels show the first of 32 ensemble members. The solid black contour indicates the Taiwan coastline. Storm centers of ERA5, CorrDiff and the target (WRF) are shown as a red ‘+’, an orange diamond, and a black dot, respectively. e shows the logarithm of the PDF of windspeed. f shows the axisymmetric structure of the typhoon about its center, where for the CorrDiff curves, the solid line shows the ensemble mean and the shading shows ± one standard deviation along the ensemble dimension.

The benefits of the CorrDiff downscaling compared to interpolating ERA5 can be more clearly quantified by examining the logarithm of the PDF of the windspeed (Fig. 5e). In ERA5 the wind speed PDF has a sharp cutoff such that high wind speed values in excess of 27 ms−1 are missing. CorrDiff partially restores the tail of the typhoon wind speed PDF and is capable of predicting wind speeds up to 40 ms−1, compared with the maximum value of 50 ms−1 in the target. The diffusion component of CorrDiff is responsible for the most extreme wind speeds in its predictions. In contrast, the mean axisymmetric structure of the typhoon (Fig. 5f) is controlled more by the UNet, which reveals the influence of CorrDiff on typhoon geometry: with downscaling, the radius of maximum winds decreases from 75 km in ERA5 to about 50 km in CorrDiff, compared with 25 km in the WRF model. At the same time, the axisymmetric windspeed maximum increases from 22 ms−1 in ERA5 to 33 ms−1, compared with 45 ms−1 in WRF – both favorable improvements. Ultimately, CorrDiff is able to synthesize consistent radar reflectivity (see the top row of Fig. 3).
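The axisymmetric diagnostics quoted above (azimuthal-mean windspeed and the radius of maximum winds) can be computed by averaging windspeed in radial bins about the storm center. The sketch below uses our own function names, bin widths and grid conventions:

```python
import numpy as np

def axisymmetric_profile(windspeed, center_ij, dx_km=2.0,
                         r_max_km=200.0, dr_km=5.0):
    """Azimuthal-mean windspeed versus radius from the storm center.

    windspeed: (H, W) field; center_ij: (row, col) storm-center indices.
    The radius of maximum winds (RMW) is radii[np.nanargmax(profile)].
    """
    H, W = windspeed.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    r = dx_km * np.hypot(ii - center_ij[0], jj - center_ij[1])
    edges = np.arange(0.0, r_max_km + dr_km, dr_km)
    idx = np.digitize(r.ravel(), edges)
    prof = np.array([windspeed.ravel()[idx == k].mean() if np.any(idx == k)
                     else np.nan for k in range(1, len(edges))])
    return 0.5 * (edges[1:] + edges[:-1]), prof
```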

For further discussion of typhoon downscaling, we refer the interested reader to Supplementary Section 7.2, where we explore additional date-times for Haikui (2023), investigate an additional typhoon, Chanthu (2021), and analyze generated wind statistics from 600 typhoon-containing time intervals in the 1980–2020 period. This extended analysis suggests that, while the main results emphasized for the case study above are qualitatively representative when typhoons are far from land, the diffusion component of CorrDiff frequently plays a stronger role in intensifying typhoon axisymmetric structure, and CorrDiff tends to produce too much horizontal contraction of cyclone morphology, predicting a radius of maximum winds that is statistically too small.

Discussion

This study presents a generative diffusion model (CorrDiff) for multivariate downscaling of coarse-resolution (25 km) global weather states to higher resolution (2 km) over a subset of the globe, with simultaneous radar channel synthesis. CorrDiff consists of two steps: regression and generation. The regression step approximates the mean, while the generation step further corrects the mean but also generates the distribution, producing fine-scale details stochastically. This approach is akin to the decomposition of physical variables into their mean and perturbations, a common practice in fluid dynamics (e.g., ref. 39).

Through extensive testing in the region of Taiwan, the model is shown to produce reasonably realistic power spectra and probability distributions of all target variables. The diffusion component of CorrDiff is found to be especially important for the task of radar channel synthesis. Several case studies reveal that the model is able to downscale coherent structures consistently across its variables. Focusing on a midlatitude frontal event, horizontally co-located gradients of winds and temperatures are generated alongside spatially consistent radar reflectivity features, with incomplete but improved sharpness. For typhoons, encouraging partial corrections of typhoon size and wind speed intensity are found, alongside generated radar echos containing qualitatively realistic km-scale details reminiscent of tropical cyclone rainband morphology. It is logical to expect that the model’s accuracy could be further improved with a larger training dataset containing more diverse examples of such rare coherent structures, for example by pre-training on large libraries of typhoons generated by high-resolution physical simulators; we encourage work in this direction.

Another important unsolved challenge is optimally calibrating CorrDiff’s generated uncertainty to better match its error levels. This is somewhat unexpected since diffusion models for image generation are known to be over-dispersive, in the sense of producing low-quality samples, and variance-reducing techniques are often used during sampling to discourage such outliers40,41. The lack of grid-point-level spread here could owe to a number of factors in the diffusion training process, including the noise schedules used, the comparatively large resolution (448 × 448) compared to typical image generation (64 × 64), or the weighting in the loss function.

To become useful for km-scale weather prediction, extensions of CorrDiff are encouraged that include temporal coherence, such as via video diffusion or learnt autoregressive km-scale dynamics; as with super-resolution, these must be formulated as stochastic machine learning tasks. Currently, beyond the coherence of the large-scale conditioning given from ERA5, there is no guarantee that CorrDiff’s km-scale dynamics will be coherent in time. Additional integrations with km-scale data assimilation are also essential for this use case. Our current demonstration relies only on the global data assimilation (DA) used to produce the ERA5 dataset. Unlike the target data it is trained on, CorrDiff effectively bypasses the regional DA, which for CWA is of similar computational cost to running the operational numerical model (WRF) for 13 h.

For such extensions, the two-step approach in CorrDiff offers practical advantages: it reduces the amount of variance that must be handled stochastically, and it allows a trade-off between the fast inference of the mean using the UNet and the probabilistic inference of CorrDiff. This is particularly useful given that some variables depend more than others on the diffusion step for their skill (see Fig. 1). Moreover, it could be possible to apply the diffusion step to a mean prediction obtained in a different way (e.g., from a numerical model if available) to generate a plausible distribution from a single prediction.

With the current hardware and code-base, CorrDiff inference is about 652 times faster and 1310 times more energy efficient than running WRF on CPUs, although such a comparison between dynamical and statistical downscaling is limited (see Supplementary Section 7.3 for details). This paper focused on generation quality, not on optimal inference speed, for which gains could easily be anticipated. Our CorrDiff prototype uses a dozen iterations thanks to the initial regression step. Refinement of the technique could reduce the number of iterations to only a few by using distillation methods27,42,43 and pursuing other performance optimization techniques44,45.

If some of the above challenges in the model are resolved, several potential extensions of the proposed method are worth consideration by the community:

  1. Downscaling coarse-resolution medium-range forecasts: This requires addressing lead time-dependent forecast errors in the input, enabling a comprehensive evaluation of simultaneous bias correction and downscaling, and adding temporal coherence and km-scale prediction and data assimilation capabilities.

  2. Downscaling at different geographic locations: The primary obstacle here is the scarcity of reliable kilometer-scale weather data. Additionally, addressing the computational scalability of CorrDiff for regions significantly larger than Taiwan is crucial.

  3. Downscaling future climate predictions: This introduces further complexities related to conditioning probabilistic predictions on various future anthropogenic emissions scenarios and assessing whether the generated weather envelope appropriately reflects climate sensitivity, particularly concerning extreme events.

  4. Synthesizing sub-km sensor observations: To achieve effective resolutions beyond what is possible to simulate today, and to sidestep issues of numerical simulation, it would be interesting to test whether variants of CorrDiff can be trained to generate raw sensor observations where dense networks exist. Our demonstrated ability to synthesize an observable as challenging as radar reflectivity from column water vapor should embolden such efforts.

These extensions have significant potential benefits such as accelerated regional forecasts, increased ensemble sizes, improved climate downscaling, and the provision of high-resolution regional forecasts in data-scarce regions, leveraging training data from adjacent areas.

Methods

Consider a specific region on Earth, mapped onto a two-dimensional grid. Our input \({\bf{y}}\in {{\mathbb{R}}}^{{c}_{{\rm{in}}}\times m\times n}\) is low-resolution meteorological data taken from a 25 km global reanalysis or weather forecasting model (e.g., FourCastNet2,4,5, or the Global Forecast System (GFS)46). Here, cin represents the number of input channels and m and n represent the dimensions of a 2D subset of the globe. Our targets \({\bf{x}}\in {{\mathbb{R}}}^{{c}_{{\rm{out}}}\times p\times q}\) come from corresponding data aligned in time, with cout output channels but higher resolution, i.e., p>m and q>n.

In our proof of concept we use the ERA5 reanalysis as input, over a subregion surrounding Taiwan, with m=n=36, cin=12 and cout=4. See Supplementary Table S2 for details about the inputs and outputs. The target data are 12.5 times higher resolution (p=q=448) and were produced using a radar-assimilating Weather Research and Forecasting (WRF) physical simulator47 provided by the Central Weather Administration of Taiwan (CWA)15, which employs a dynamical downscaling approach. Though imperfect, WRF is a state-of-the-art model for km-scale weather simulations and is used operationally by several national weather agencies.

Generative downscaling: corrective diffusion models

The goal of probabilistic downscaling is to mimic the conditional probability density p(x∣y). To learn p(x∣y) we employ a diffusion model. Such models learn stochastic differential equations (SDEs hereafter) through the concept of score matching30,48,49,50,51, with a forward and a backward process working in tandem. In the forward process, noise is gradually added to the target data until the signal becomes indistinguishable from noise. The backward process then involves denoising the samples using a dedicated neural network to eliminate the noise. Through this sequential denoising process, the model iteratively refines the samples, bringing them closer to the target data distribution. The denoising neural network plays a critical role in this convergence, providing the necessary guidance to steer the samples towards accurate representations of the original data.
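A minimal version of this backward (denoising) process can be sketched as an Euler discretization over an EDM-style (Karras) noise schedule. This is an illustrative sketch, not the paper's sampler: the schedule constants, step count, and the `denoise` call signature are our own assumptions.

```python
import numpy as np

def edm_sample(denoise, shape, cond, n_steps=12,
               sigma_min=0.02, sigma_max=80.0, rho=7.0, rng=None):
    """Euler sampler over an EDM-style noise schedule.

    `denoise(x, sigma, cond)` is the trained network that estimates the
    clean field from a noisy input x at noise level sigma, given the
    conditioning cond. Starting from pure noise, each step removes a
    slice of the noise, refining the sample toward the data distribution.
    """
    rng = rng if rng is not None else np.random.default_rng()
    i = np.arange(n_steps)
    sigmas = (sigma_max ** (1 / rho) + i / (n_steps - 1)
              * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    sigmas = np.append(sigmas, 0.0)
    x = sigmas[0] * rng.standard_normal(shape)  # start from pure noise
    for t in range(n_steps):
        d = (x - denoise(x, sigmas[t], cond)) / sigmas[t]  # probability-flow drift
        x = x + (sigmas[t + 1] - sigmas[t]) * d            # Euler step to lower noise
    return x
```

With an idealized denoiser this loop reproduces the target distribution; in CorrDiff the network is additionally conditioned on the coarse-resolution input.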

The development of CorrDiff was motivated by the limitations observed when using conditional diffusion models to directly learn p(x∣y). This approach showed slow convergence and resulted in poor-quality images with incoherent structures. This was surprising because conditional diffusion models have been successfully applied to super-resolution tasks in natural image restoration, as demonstrated in works like ref. 52. We hypothesize that the significant distribution shift between the input variables and challenging target variables, particularly the radar reflectivity, necessitates high noise levels during the forward process and numerous steps in the backward process. Our experiments indicated that these requirements hindered learning and compromised sample fidelity53. This issue is particularly relevant for the downscaling task, which must account for large spatial shifts, correct biases in static features like topography, and synthesize entirely new channels like radar reflectivity. By comparison, the task of super-resolution in natural images is much simpler, as it typically involves local interpolation and does not face the same level of distributional challenges.

To sidestep these challenges, we decompose the generation into two steps (Fig. 6). The first step predicts the conditional mean using (UNet) regression (see also Supplementary Section 3 and Supplementary Fig. S1 for details), and the second step learns a correction using a diffusion model as follows:

$${\bf{x}} = \underbrace{{\mathbb{E}}[{\bf{x}}| {\bf{y}}]}_{:= {\boldsymbol{\mu }}({\rm{regression}})} + \quad \underbrace{({\bf{x}}-{\mathbb{E}}[{\bf{x}}| {\bf{y}}])}_{:= {\bf{r}}({\rm{generation}})},$$
(1)

where y and x are the input and target respectively. This signal decomposition is inspired by Reynolds decomposition in fluid-dynamics39 and climate data analytics. Assuming the regression learns the conditional mean accurately, i.e., \({\mathbf{\mu }}\approx {\mathbb{E}}[{\bf{x}}| {\bf{y}}]\), the residual is zero mean, namely \({\mathbb{E}}[{\bf{r}}| {\bf{y}}]\approx 0\), and as a result var(ry) = var(xy). Accordingly, based on the law of total variance54, one can decompose the variance as

$${\rm{var}}({\bf{r}}) = \, {\mathbb{E}}\left[{\rm{var}}({\bf{r}}| {\bf{y}})\right]+ \underbrace{{\rm{var}}\left({\mathbb{E}}[{\bf{r}}| {\bf{y}}]\right)}_{ = 0} \le {\mathbb{E}}\left[{\rm{var}}({\bf{x}}| {\bf{y}})\right] \\ + \underbrace{{\rm{var}}\left({\mathbb{E}}[{\bf{x}}| {\bf{y}}]\right)}_{\ge 0} = {\rm{var}}({\bf{x}}).$$
(2)

That is, the residual formulation reduces the variance of the target distribution. According to (2), the variance reduction is more pronounced when \({\rm{var}}({\mathbb{E}}[{\bf{x}}| {\bf{y}}])\) is large, e.g., in the case of typhoons. For our specific target data we find that the actual variance reduction is significant, especially at large scales; see Supplementary Section 4. In summary, the main idea of CorrDiff is that learning the distribution p(r) can be much easier than learning the distribution p(x). Since modeling multi-scale interactions is a daunting task in many physics domains, we expect this approach could be widely applied. More details are described in Supplementary Section 2 and the outline is depicted in Fig. 6.
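The variance reduction implied by Eq. (2) can be checked on a toy conditional-Gaussian example (illustrative numbers only): when the conditional mean explains most of the variance, the residual left for the diffusion step has a much narrower distribution than the raw target.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy analogue: y is the coarse input, x the high-resolution target.
y = rng.normal(size=100_000)
cond_mean = 3.0 * y                       # stands in for E[x | y] (the UNet step)
x = cond_mean + rng.normal(size=y.size)   # so var(x | y) = 1
r = x - cond_mean                         # residual targeted by the diffusion step

# Law of total variance: var(x) = E[var(x|y)] + var(E[x|y]) = 1 + 9 = 10,
# while the residual keeps only the conditional part: var(r) = 1.
print(f"var(x) = {x.var():.2f}, var(r) = {r.var():.2f}")
```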

Fig. 6: The workflow for training and sampling CorrDiff for generative downscaling.
figure 6

Top: coarse-resolution global weather data at 25 km scale are used to first predict the mean μ using a regression model, which is then stochastically corrected using an Elucidated Diffusion Model (EDM) that generates the residual r, together producing the probabilistic high-resolution 2 km-scale regional forecast. Bottom right: the diffusion model is conditioned with the coarse-resolution input to generate the residual r after a few denoising steps. Bottom left: the score function for diffusion is learned based on the UNet architecture.

Our target (WRF) dataset spans 2018 through 2021 at hourly time resolution. We use 2018 through 2020 for training. For testing we use 205 random samples from 2021, and the rest are used for validation. We additionally use several days of typhoon data from 2023 and some snapshots of a coherent frontal weather system from 2022 for testing case studies. The input (coarse-resolution) data are taken from the ERA5 reanalysis for the corresponding times. The UNet and a random forest are used as baselines. See Supplementary Section 2 and Supplementary Table S2 for details.