🫐Python光电光对光神经网络非相干光图像低维映射模拟

Python光电光对光神经网络非相干光图像低维映射模拟 | 亚图跨际viadean on Notion

🎯要点

光学散射用例 | 亚图跨际viadean on Notion

🍪语言内容分比

🍇Python矩阵矢量乘法优化

数学定义

矩阵乘矩阵

如果A是 $m \times n$ 矩阵并且B是 $n \times p$ 矩阵

A =\left(\begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1 n} \\ a_{21} & a_{22} & \cdots & a_{2 n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m 1} & a_{m 2} & \cdots & a_{m n} \end{array}\right), \quad B =\left(\begin{array}{cccc} b_{11} & b_{12} & \cdots & b_{1 p} \\ b_{21} & b_{22} & \cdots & b_{2 p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n 1} & b_{n 2} & \cdots & b_{n p} \end{array}\right)

矩阵乘积 $C = A B$ （不带乘号或点表示）被定义为 $m \times p$ 矩阵

C =\left(\begin{array}{cccc} c_{11} & c_{12} & \cdots & c_{1 p} \\ c_{21} & c_{22} & \cdots & c_{2 p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m 1} & c_{m 2} & \cdots & c_{m p} \end{array}\right)

于是

c_{i j}=a_{i 1} b_{1 j}+a_{i 2} b_{2 j}+\cdots+a_{i n} b_{n j}=\sum_{k=1}^n a_{i k} b_{k j}

对于 $i=1, \ldots, m$ 和 $j=1, \ldots, p$ 。

也就是说，乘积的条目 $c_{i j}$ 是通过将A的第i行和B的第j列的条目逐项相乘，然后求和得到的这些 $n$ 乘积。换句话说， $c_{i j}$ 是A 的第i 行和B 的第j 列的点积。

因此， $A B$ 也可以写成

C =\left(\begin{array}{cccc} a_{11} b_{11}+\cdots+a_{1 n} b_{n 1} & a_{11} b_{12}+\cdots+a_{1 n} b_{n 2} & \cdots & a_{11} b_{1 p}+\cdots+a_{1 n} b_{n p} \\ a_{21} b_{11}+\cdots+a_{2 n} b_{n 1} & a_{21} b_{12}+\cdots+a_{2 n} b_{n 2} & \cdots & a_{21} b_{1 p}+\cdots+a_{2 n} b_{n p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m 1} b_{11}+\cdots+a_{m n} b_{n 1} & a_{m 1} b_{12}+\cdots+a_{m n} b_{n 2} & \cdots & a_{m 1} b_{1 p}+\cdots+a_{m n} b_{n p} \end{array}\right)

因此，当且仅当 A 中的列数等于 B 中的行数（在本例中为 n）时，才定义乘积 A B。

矩阵乘矢量

长度为 $n$ 的向量 $x$ 可以被视为列向量，对应于 $n \times 1$ 矩阵 X，其条目由 $X _{i 1}= x _i$ 给出如果 A 是一个 $m \times n$ 矩阵， $A x$ 表示的矩阵乘以向量乘积就是向量 $y$ ，将其视为列向量，等于 $m \times 1$ 矩阵 $A X$ 。在索引表示法中，这相当于：

y_i=\sum_{j=1}^n a_{i j} x_j

看待这个问题的一种方法是假设从“普通”向量到列向量以及返回的变化是隐式的。

类似地，长度为 n 的向量 x 可以被视为行向量，对应于 $1 \times n$ 矩阵。为了清楚地表明行向量的含义，在这种情况下通常将其表示为列向量的转置；因此，人们会看到诸如 $x ^{ T } A$ 之类的符号。

恒等式 $x ^{ T } A =\left( A ^{ T } x \right)^{ T }$ 成立。在索引表示法中，如果 $A$ 是 $n \times p$ 矩阵，则 $x ^{ T } A = y ^{ T }$ 相当于： $y_k=\sum_{j=1}^n x_j a_{ j k}$ 。

代码优化示例

许多数值计算库都具有高效的矢量化操作实现。矩阵乘法、查找点积等操作非常高效。这些操作的实现是为了利用 CPU 中的多个核心，并将计算卸载到 GPU（如果可用）。通常，矩阵和向量的操作由 BLAS（基本线性代数子程序）提供。一些示例是 Intel MKL、OpenBLAS、cuBLAS 等。

首先我们创建两个矩阵，以便我们可以用它来检查我们的实现是否正确。

 import tensorflow as tf
 import numpy as np
 tf.__version__ # 2.0.0
 
 a = np.random.normal(size=(200, 784)).astype('float32')
 b = np.random.normal(size=(784, 10)).astype('float32')
 
 expected = np.matmul(a, b)

不使用外部库实现功能，

 def py_matmul1(a, b):
     ra, ca = a.shape
     rb, cb = b.shape
     assert ca == rb, f"{ca} != {rb}"
     
     output = np.zeros(shape=(ra, cb))
     for i in range(ra):
         for j in range(cb):
             for k in range(rb):
                 output[i, j] += a[i, k] * b[k, j]
                 
     return output
 
 %time result = py_matmul1(a, b)
 assert result.shape == expected.shape
 assert np.allclose(result, expected, rtol=1e-02), (result, expected)

执行时，在我的计算机上需要 1.38 秒。它相当慢，可以大大改进。如果您注意到，最内层循环基本上是在计算两个向量的点积。在这种情况下，两个向量分别是 a 和 b 的第 i 行和第 j 列。所以让我们用点积实现删除最内层循环。

矢量优化一：

 def py_matmul2(a, b):
     ra, ca = a.shape
     rb, cb = b.shape
     assert ca == rb, f"{ca} != {rb}"
     
     output = np.zeros(shape=(ra, cb))
     for i in range(ra):
         for j in range(cb):
             # we replaced the loop with dot product
             output[i, j] = np.dot(a[i], b[:,j])
                 
     return output
 
 %time result = py_matmul2(a, b)
 assert result.shape == expected.shape
 assert np.allclose(result, expected, rtol=1e-02), (result, expected)

此实现仅需 6 毫秒。与原始实现相比，这是一个巨大的改进。由于内部循环本质上是在计算点积，我们将其替换为 np.dot 函数，并传递矩阵 a 中的第 i 行和矩阵 b 中的第 j 列。

矢量优化二：

现在让我们删除迭代矩阵 b 的列的 for 循环。

 def py_matmul3(a, b):
     ra, ca = a.shape
     rb, cb = b.shape
     assert ca == rb, f"{ca} != {rb}"
     
     output = np.zeros(shape=(ra, cb))
     for i in range(ra):
         output[i] = np.dot(a[i], b)
         
                 
     return output
 
 %time result = py_matmul3(a, b)
 assert result.shape == expected.shape
 assert np.allclose(result, expected, rtol=1e-02), (result, expected)

此实现耗时 2.97 毫秒。使用称为广播的技术，我们可以从本质上消除循环，仅使用一行 output[i] = np.dot(a[i], b)，我们就可以计算输出矩阵第 i 行的整个值。numpy 所做的是广播向量 a[i]，使其与矩阵 b 的形状相匹配。然后它计算每对向量的点积。广播规则在 numpy、tensorflow、pytorch 等主要库中几乎相同。

库优化三：

现在让我们使用 numpy 的内置 matmul 函数。

 def py_matmul4(a, b):
     ra, ca = a.shape
     rb, cb = b.shape
     assert ca == rb, f"{ca} != {rb}"
     
     return np.matmul(a, b)
     
 
 %time result = py_matmul4(a, b)
 assert result.shape == expected.shape
 assert np.allclose(result, expected, rtol=1e-02), (result, expected)

使用numpy的内置matmul函数，需要999 μ 𝜇 s。这是迄今为止我们实施的最快的。

库优化四：

在张量流中，它也与 numpy 非常相似。我们只需要调用 matmul 函数即可。

 def py_matmul5(a, b):
     ra, ca = a.shape
     rb, cb = b.shape
     assert ca == rb, f"{ca} != {rb}"
     
     return tf.matmul(a, b)
     
 tf_a = tf.constant(a)
 tf_b = tf.constant(b)
 %time result = py_matmul5(tf_a, tf_b)
 assert result.shape == expected.shape
 assert np.allclose(result, expected, rtol=1e-02), (result, expected)

TensorFlow 计算结果大约需要 999 μ s。我们可以直接传递 numpy 数组，而不必转换为 TensorFlow 张量，但执行速度会慢一些。在我的实验中，如果我只调用 py_matmul5(a, b)，大约需要 10 毫秒，但使用 tf.constant 函数将 numpy 数组转换为 tf.Tensor 可以获得更好的性能。

PreviousPython和C++及MATLAB低温磁态机器学习模型 NextPython神经模型评估微分方程图算法

Last updated 1 year ago

Was this helpful?