🥭CUDA(C) Multi-GPU Parallel Computing Analysis of Magnetic-State Monte Carlo and Transfer Matrices

1. Monte Carlo simulation on NVIDIA GPUs using the Metropolis and parallel tempering algorithms (see the kernel sketch after this list). 2. Computation of transfer-matrix eigenvalues with the Lanczos algorithm. 3. Normalization of quantum disordered systems via the Suzuki-Trotter formula. 4. Algorithmic model features: multi-CUDA-thread, multi-GPU, and multi-task parallel computing.
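As a sketch of item 1, the kernel below performs one checkerboard Metropolis half-sweep on a 2D Ising lattice. The kernel name, the inline linear-congruential random numbers, and the J = 1, zero-field Hamiltonian are illustrative assumptions rather than the article's exact code; a production version would draw random numbers from cuRAND and pair this update with parallel-tempering replica exchanges across GPUs.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One checkerboard Metropolis half-sweep on an L x L Ising lattice with
// periodic boundaries. 'color' (0 or 1) selects the sublattice so that
// concurrently updated spins are never nearest neighbors.
__global__ void metropolis_half_sweep(signed char* spins, int L, float beta,
                                      unsigned int seed, int color)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != color) return;

    int idx = y * L + x;
    // Periodic nearest neighbors.
    int xp = (x + 1) % L, xm = (x + L - 1) % L;
    int yp = (y + 1) % L, ym = (y + L - 1) % L;
    int nsum = spins[y * L + xp] + spins[y * L + xm]
             + spins[yp * L + x] + spins[ym * L + x];

    // Energy change for flipping this spin (J = 1, zero external field).
    float dE = 2.0f * spins[idx] * nsum;

    // Cheap per-site LCG, for illustration only; use cuRAND in real runs.
    unsigned int r = (seed ^ (unsigned int)idx) * 1664525u + 1013904223u;
    float u = (r & 0x00FFFFFFu) / 16777216.0f;

    // Metropolis acceptance rule.
    if (dE <= 0.0f || u < expf(-beta * dE))
        spins[idx] = (signed char)(-spins[idx]);
}
```

A full lattice sweep launches this kernel twice per step, once with `color = 0` and once with `color = 1`, so that every update sees a consistent neighborhood.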

🏈Guidance | Brief

🍁Magnetic-State Analysis Perspectives

  • Generalized equations for spin-glass portfolio neural networks in Python and MATLAB

  • Machine learning models of low-temperature magnetic states in Python, C++, and MATLAB

🍪Language Content Breakdown

🍇CUDA Matrix Multiplication: 2D Block-Tiling Implementation

Mathematically, given a general matrix multiplication operation $D = AB + C$, where $D \in R^{m \times n}$, $A \in R^{m \times k}$, $B \in R^{k \times n}$, and $C \in R^{m \times n}$, the matrices can be partitioned into smaller blocks.

$$A=\left[\begin{array}{cccc} A_{1,1}^{d_{bm} \times d_{bk}} & A_{1,2}^{d_{bm} \times d_{bk}} & \cdots & A_{1, k/d_{bk}}^{d_{bm} \times d_{bk}} \\ A_{2,1}^{d_{bm} \times d_{bk}} & A_{2,2}^{d_{bm} \times d_{bk}} & \cdots & A_{2, k/d_{bk}}^{d_{bm} \times d_{bk}} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m/d_{bm}, 1}^{d_{bm} \times d_{bk}} & A_{m/d_{bm}, 2}^{d_{bm} \times d_{bk}} & \cdots & A_{m/d_{bm}, k/d_{bk}}^{d_{bm} \times d_{bk}} \end{array}\right]$$

$$B=\left[\begin{array}{cccc} B_{1,1}^{d_{bk} \times d_{bn}} & B_{1,2}^{d_{bk} \times d_{bn}} & \cdots & B_{1, n/d_{bn}}^{d_{bk} \times d_{bn}} \\ B_{2,1}^{d_{bk} \times d_{bn}} & B_{2,2}^{d_{bk} \times d_{bn}} & \cdots & B_{2, n/d_{bn}}^{d_{bk} \times d_{bn}} \\ \vdots & \vdots & \ddots & \vdots \\ B_{k/d_{bk}, 1}^{d_{bk} \times d_{bn}} & B_{k/d_{bk}, 2}^{d_{bk} \times d_{bn}} & \cdots & B_{k/d_{bk}, n/d_{bn}}^{d_{bk} \times d_{bn}} \end{array}\right]$$

$$C=\left[\begin{array}{cccc} C_{1,1}^{d_{bm} \times d_{bn}} & C_{1,2}^{d_{bm} \times d_{bn}} & \cdots & C_{1, n/d_{bn}}^{d_{bm} \times d_{bn}} \\ C_{2,1}^{d_{bm} \times d_{bn}} & C_{2,2}^{d_{bm} \times d_{bn}} & \cdots & C_{2, n/d_{bn}}^{d_{bm} \times d_{bn}} \\ \vdots & \vdots & \ddots & \vdots \\ C_{m/d_{bm}, 1}^{d_{bm} \times d_{bn}} & C_{m/d_{bm}, 2}^{d_{bm} \times d_{bn}} & \cdots & C_{m/d_{bm}, n/d_{bn}}^{d_{bm} \times d_{bn}} \end{array}\right]$$

$$D=\left[\begin{array}{cccc} D_{1,1}^{d_{bm} \times d_{bn}} & D_{1,2}^{d_{bm} \times d_{bn}} & \cdots & D_{1, n/d_{bn}}^{d_{bm} \times d_{bn}} \\ D_{2,1}^{d_{bm} \times d_{bn}} & D_{2,2}^{d_{bm} \times d_{bn}} & \cdots & D_{2, n/d_{bn}}^{d_{bm} \times d_{bn}} \\ \vdots & \vdots & \ddots & \vdots \\ D_{m/d_{bm}, 1}^{d_{bm} \times d_{bn}} & D_{m/d_{bm}, 2}^{d_{bm} \times d_{bn}} & \cdots & D_{m/d_{bm}, n/d_{bn}}^{d_{bm} \times d_{bn}} \end{array}\right]$$
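With this partitioning, each output block of $D$ can be computed independently from one block row of $A$ and one block column of $B$, which is exactly the work assigned to a single CUDA thread block:

$$D_{b_m, b_n}^{d_{bm} \times d_{bn}} = \sum_{b_k=1}^{k/d_{bk}} A_{b_m, b_k}^{d_{bm} \times d_{bk}} B_{b_k, b_n}^{d_{bk} \times d_{bn}} + C_{b_m, b_n}^{d_{bm} \times d_{bn}}$$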

The code implementation is as follows.
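Below is a minimal sketch of such a 2D block-tiled FP32 GEMM kernel. The kernel name `gemm_2d_block_tiled`, the 32 x 32 tile sizes, and the assumption that m, n, k are multiples of the tile sizes are illustrative choices, not the exact benchmarked implementation.

```cuda
#include <cuda_runtime.h>

// Block tile sizes (d_bm, d_bn, d_bk in the text); illustrative choices.
#define D_BM 32
#define D_BN 32
#define D_BK 32

// Computes D = A * B + C for row-major FP32 matrices, assuming m, n, k are
// multiples of the tile sizes.
// Launch: dim3 block(D_BN, D_BM); dim3 grid(n / D_BN, m / D_BM);
__global__ void gemm_2d_block_tiled(int m, int n, int k,
                                    const float* A, const float* B,
                                    const float* C, float* D)
{
    __shared__ float As[D_BM][D_BK];  // shared-memory tile of A
    __shared__ float Bs[D_BK][D_BN];  // shared-memory tile of B

    const int row = blockIdx.y * D_BM + threadIdx.y;
    const int col = blockIdx.x * D_BN + threadIdx.x;

    float acc = 0.0f;
    // Iterate over the k / d_bk block tiles along the shared dimension.
    for (int bk = 0; bk < k / D_BK; ++bk) {
        // Each thread stages one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * k + bk * D_BK + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(bk * D_BK + threadIdx.y) * n + col];
        __syncthreads();

        // Partial dot product over this tile.
        for (int tk = 0; tk < D_BK; ++tk)
            acc += As[threadIdx.y][tk] * Bs[tk][threadIdx.x];
        __syncthreads();
    }
    D[row * n + col] = acc + C[row * n + col];
}
```

Each thread block computes one $d_{bm} \times d_{bn}$ block of $D$ and each thread computes a single output element; staging the tiles in shared memory cuts global-memory traffic by a factor of the tile size.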

This FP32 GEMM implementation reaches 2.66 TFLOPS on an NVIDIA GeForce RTX 3090 GPU, a large improvement over the previous implementation. However, it still falls far short of the GPU's theoretical peak performance.

Implementation with 2D Block Tiling and 1D Thread Tiling

We first cache the data of matrix B from shared memory into registers. Each thread with block thread index $(t_m, t_n)$, where $t_m \in [1, d_{bm}/d_{tm}]$ and $t_n \in [1, d_{bn}]$, is now responsible for computing $d_{tm}$ elements of the small output matrix, where $d_{tm}$ is the thread tile size.

$$\begin{aligned} \left(D_{b_m, b_n}^{d_{bm} \times d_{bn}}\right)_{t_m d_{tm} : t_m d_{tm} + d_{tm},\, t_n} &= \left(\sum_{b_k=1}^{k/d_{bk}} A_{b_m, b_k}^{d_{bm} \times d_{bk}} B_{b_k, b_n}^{d_{bk} \times d_{bn}} + C_{b_m, b_n}^{d_{bm} \times d_{bn}}\right)_{t_m d_{tm} : t_m d_{tm} + d_{tm},\, t_n} \\ &= \sum_{b_k=1}^{k/d_{bk}} \left(\sum_{t_k=1}^{d_{bk}} \left(A_{b_m, b_k}^{d_{bm} \times d_{bk}}\right)_{t_m d_{tm} : t_m d_{tm} + d_{tm},\, t_k} \left(B_{b_k, b_n}^{d_{bk} \times d_{bn}}\right)_{t_k, t_n}\right) + \left(C_{b_m, b_n}^{d_{bm} \times d_{bn}}\right)_{t_m d_{tm} : t_m d_{tm} + d_{tm},\, t_n} \end{aligned}$$
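A minimal sketch of the combined 2D block tiling and 1D thread tiling is shown below; the tile sizes $d_{bm} = d_{bn} = 64$, $d_{bk} = d_{tm} = 8$, the kernel name, and the divisibility assumptions are illustrative. The essential change is in the inner loop: each element of the B tile is read from shared memory into a register once and reused $d_{tm}$ times.

```cuda
// Illustrative tile sizes (d_bm, d_bn, d_bk, d_tm in the text).
#define D_BM 64
#define D_BN 64
#define D_BK 8
#define D_TM 8   // each thread computes D_TM consecutive rows of one column

// Launch: dim3 block(D_BN, D_BM / D_TM); dim3 grid(n / D_BN, m / D_BM);
// assumes row-major FP32 matrices with m, n, k multiples of the tile sizes.
__global__ void gemm_1d_thread_tiled(int m, int n, int k,
                                     const float* A, const float* B,
                                     const float* C, float* D)
{
    __shared__ float As[D_BM][D_BK];
    __shared__ float Bs[D_BK][D_BN];

    const int tn = threadIdx.x;                    // column within the tile
    const int tm = threadIdx.y;                    // thread-tile row group
    const int row0 = blockIdx.y * D_BM + tm * D_TM;
    const int col  = blockIdx.x * D_BN + tn;

    float acc[D_TM] = {0.0f};                      // per-thread register tile

    for (int bk = 0; bk < k / D_BK; ++bk) {
        // Cooperative loads: the 512 threads stage the 64x8 A tile and the
        // 8x64 B tile, one element per thread.
        int tid = tm * D_BN + tn;
        As[tid / D_BK][tid % D_BK] =
            A[(blockIdx.y * D_BM + tid / D_BK) * k + bk * D_BK + tid % D_BK];
        Bs[tid / D_BN][tid % D_BN] =
            B[(bk * D_BK + tid / D_BN) * n + blockIdx.x * D_BN + tid % D_BN];
        __syncthreads();

        for (int tk = 0; tk < D_BK; ++tk) {
            // Cache the B element in a register and reuse it D_TM times.
            float b_val = Bs[tk][tn];
            for (int i = 0; i < D_TM; ++i)
                acc[i] += As[tm * D_TM + i][tk] * b_val;
        }
        __syncthreads();
    }
    for (int i = 0; i < D_TM; ++i)
        D[(row0 + i) * n + col] = acc[i] + C[(row0 + i) * n + col];
}
```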
