🥦GPU(CUDA)异构众核数值计算

关键词

GPU(CUDA)异构众核数值计算 | Notionviadean on Notion

🏈指点迷津 | Brief

要点

CUDA（C代码）图像处理：使用cutil库编辑全局内存管理算法处理CPU和GPU端之间数据传输。
1. 实现图像的中值过滤方法，cuda代码：
  1. 通用CUDA内核方法，
  2. 3*3中值过滤
    每个邻域像素使用一个寄存器并进行冒泡排序
    使用最少6个寄存器数通过遗忘选择方法找到中值
    使用组合遗忘选择处理每个线程 2 个输出像素值
  3. 5*5中值过滤：通过组合遗忘选择处理每个线程 2 个输出像素值
  4. 快速近似 m × n 中值滤波
2. 实现图像的卷积处理，cuda代码：
  1. 使用硬编码掩码值实现卷积处理
  2. 使用符号内存中的掩码及其作为参数传递的半径来实现卷积运算
  3. 使用符号内存中的掩码和纹理内存中的直接数据获实现 3 * 3 卷积
  4. 在共享内存中预加载数据后实现通用卷积运算
  5. 调用 1D 卷积核之间的数据复制实现 2D 可分离卷积运算
  6. 数据预加载到共享内存后，内核实现水平和垂直一维卷积运算
开发CUDA偏微分方程求解器库文件（C++代码）：矢量类，矩阵类，线性方程组的迭代求解器，迭代求解器的预处理器类，时间积分器类。
CUDA集群GPU开发：
1. 隐式/显示重叠 MPI 通信与 CUDA GPU 计算
2. 流序列明确重叠 MPI 通信与 CUDA GPU 计算
3. 显式重叠 MPI 通信、CUDA CPU/GPU 传输和 CUDA GPU 计算、交错计算通信迭代
4. 使用MPI，OpenMP和CUDA异步通讯\
5. 使用广义最小残差法和共轭梯度法方法求解稀疏线性系统
6. 求解大型稀疏线性系统的整数分解

Python使用CUDA优化图形分割算法示例

在此，我将向您介绍一种借助 GPU 加速图割算法的方法。算法使用情景：

像素强度从背景到前景的急剧变化
前景部分不会与背景部分纠缠太多
图像不太模糊

使用 CUDA 可以将从收缩图到创建残差图的整个过程的运行时间降低到几毫秒。

图看起来像一个网格，每个节点对应于每个像素。这种矩形形状的图允许我们使用二维数组来存储有关节点和边的信息。大小为 image_width x image_length 的四个矩阵将用于存储权重。请注意，权重表示残差图中的容量。对于每个节点，都有指向四个不同方向的边，因此，我们使用每个矩阵来存储每个方向的权重。除了权重之外，我们还需要使用矩阵来存储节点的信息，例如剩余流和高度。

当图像很大时，CPU 上的串行推送重新标记非常慢，因此需要通过在不同节点上并行执行推送或重新标记操作来利用 GPU 的计算能力。在推送重新标记期间，我们只需更新上面介绍的矩阵。为了避免数据依赖问题，每个节点同时向一个方向推送。这个概念被称为“wave update”，由 NVIDIA 提出。

Push功能被编写为CUDA内核。事实上，我有一个不同的推送功能用于在每个方向上推送。每个内核线程将在该行（或列，如果内核向下推送）中的接下来的 m 个节点上工作。

需要调用重新标记，以便更多节点的高度得到纠正，以推送到相邻节点，前提是其任何向外边缘上都留有剩余容量。与 Push 中每个线程必须在多个连续节点上工作不同，我们可以使用一个线程来更新每个节点。重新标记不必在每次迭代中进行，如果您希望每次迭代更快，您可以每两次迭代进行一次。

最终代码：

 import numpy as np
 from numba import *
 from time import sleep
 from build_graph import *
 from min_cut import *
 import cv2
 import sys
 import time
 from operator import *
 
 img = cv2.imread("rose.jpg", cv2.IMREAD_GRAYSCALE)
 GRAPH_SIZE = len(img)
 BLOCK_SIZE = 25
 TILE_SIZE = BLOCK_SIZE
 GRID_SIZE = GRAPH_SIZE // BLOCK_SIZE
 
 WEIGHTS_UP   = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 WEIGHTS_DOWN = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 WEIGHTS_LEFT = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 WEIGHTS_RIGHT= np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 
 #WEIGHTS_SINK = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 
 HEIGHTS= np.zeros((GRAPH_SIZE,GRAPH_SIZE),dtype=np.int32)
 HEIGHTS_BUFFER= np.zeros((GRAPH_SIZE, GRAPH_SIZE),dtype=np.uint32)
 
 EXCESS_FLOW = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 
 EXCESS_TILE = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 
 ACTIVE_NODES = np.zeros((GRAPH_SIZE,GRAPH_SIZE), dtype=bool)
 ACTIVE_TILES = np.ones((GRID_SIZE,GRID_SIZE), dtype=bool)
 COMPR_CAP = np.zeros((GRAPH_SIZE, GRAPH_SIZE), dtype=np.uint8)
 #FINAL OUTPUT
 SEG = np.zeros((GRAPH_SIZE,GRAPH_SIZE))
 
 @cuda.jit(device=True)
 def isactive(height,excess):
     return (height < GRAPH_SIZE*GRAPH_SIZE) and (excess > 0)
 
 @cuda.jit(device=True)
 def right_wave(sH, sE, x, y, WEIGHTS_RIGHT, CC, EXCESS_FLOW, EXCESS_TILE, BORDER_H):
     offy=cuda.blockIdx.y*cuda.blockDim.x; offx=cuda.blockIdx.x*cuda.blockDim.x
     tile_len=cuda.blockDim.x; ef = 0; nei_h=0
     
     for i in range(0,tile_len): 
         if offx+i==GRAPH_SIZE-1: #hit right bound, can't push! 
             sE[y, i] = ef 
             break
         h=sH[y,i]
         e=sE[y,i]
         if isactive(h,e):
             #print("active, gonna push\n")
             #print("neigbor h for  ", offx+i," ", offy+y, " is ",sH[y,i+1])
             if i<tile_len-1:
                 nei_h=sH[y,i+1]
             else:
                 nei_h=BORDER_H
             #print("active, gonna push\n")
             #print("neigbor h for  ", offx+i," ", offy+y, " is ", nei_h)
 
             if h==nei_h+1:    #problem: out of bound
             #   print("good h, gonna push\n")
                 w  = WEIGHTS_RIGHT[offy+y, offx+i]
                 ef = ef + e
                 flow = min(w, ef)
                 cc = CC[offy+y,offx+i] 
                 if (w>flow):
                     CC[offy+y,offx+i]=cc|1
                 else:
                     CC[offy+y,offx+i]=cc&(~1)
                 WEIGHTS_RIGHT[offy+y, offx+i] = w-flow
                 sE[y, i] = ef - flow
                 ef = flow
                 
     if offx+tile_len < GRAPH_SIZE:
         EXCESS_TILE[offy+y][offx+tile_len] = ef
     
 @cuda.jit(device=True)
 def up_wave(sH, sE, x, y, WEIGHTS_UP, CC, EXCESS_FLOW, EXCESS_TILE, BORDER_H):
     # note: assume square image, and because dim.y is 1, we use dim.x for y length
     offy=cuda.blockIdx.y*cuda.blockDim.x;offx=cuda.blockIdx.x*cuda.blockDim.x
     tile_len=cuda.blockDim.x;ef = 0;nei_h=0
     for i in range(1,tile_len+1):
         if offy+tile_len-i==0:   #hit img top, don't push
             sE[tile_len-i, x] = ef
             break
         h=sH[tile_len-i,x]
         e=sE[tile_len-i,x]
         if isactive(h,e):
             if tile_len-i>0:
                 nei_h=sH[tile_len-i-1,x]
             else:
                 nei_h=BORDER_H
             if  h==nei_h+1: 
                 w=WEIGHTS_UP[offy+tile_len-i,offx+x]
                 ef=ef+e
                 flow=min(w,ef)
                 cc=CC[offy+tile_len-i,offx+x]
                 if (w>flow):
                     CC[offy+tile_len-i,offx+x]=cc|8
                 else:
                     CC[offy+tile_len-i,offx+x]=cc&(~8)
                 WEIGHTS_UP[offy+tile_len-i,offx+x] = w-flow
                 sE[tile_len-i, x]=ef-flow
                 ef=flow
     if (offy-1)>=0:
         EXCESS_TILE[offy-1][offx+x]=ef
 
 @cuda.jit(device=True)
 def down_wave(sH, sE, x, y, WEIGHTS_DOWN, CC, EXCESS_FLOW, EXCESS_TILE, BORDER_H):
     offy=cuda.blockIdx.y*cuda.blockDim.x;offx=cuda.blockIdx.x*cuda.blockDim.x
     tile_len=cuda.blockDim.x;ef = 0;nei_h=0
     for i in range(0,tile_len):
         if offy+i==GRAPH_SIZE-1: #hit img bottom, keep excess flow, don't push
             sE[i,x] = ef
             break
         h=sH[i,x]
         e=sE[i,x]
         if isactive(h,e):
             if i<tile_len-1:
                 nei_h=sH[i+1,x] 
             else:
                 nei_h=BORDER_H
             if h==nei_h+1:
                 w=WEIGHTS_DOWN[offy+i,offx+x]
                 ef=ef+e
                 flow = min(w,ef)
                 cc=CC[offy+i,offx+x]
                 if (w>flow):
                     CC[offy+i,offx+x]=cc|2
                 else:
                     CC[offy+i,offx+x]=cc&(~2)
                 WEIGHTS_DOWN[offy+i,offx+x]=w-flow
                 sE[i,x]=ef-flow
                 ef=flow
     if offy+tile_len<GRAPH_SIZE:
         EXCESS_TILE[offy+tile_len][offx+x] = ef
 
 #this will be updatedi, or deleted
 @cuda.jit() # for this kernel, use (gs,gs), (bs,bs)
 # before launching this kernel, copy HEIGHTS into HEIGHTS_BUFFER, and empty HEIGHTS for writing
 def local_relabel(EXCESS_FLOW, CC, H, HB, ACTIVE_NODES):
     # this is a seperarte kernel, each thread relabes one node
     
     offx=cuda.blockIdx.x*cuda.blockDim.x; offy=cuda.blockIdx.y*cuda.blockDim.y
     x=cuda.threadIdx.x; y=cuda.threadIdx.y; tile_len=cuda.blockDim.x
     e=EXCESS_FLOW[offy+y,offx+x]
     h=HB[offy+y,offx+x]
     c = CC[offy+y,offx+x]
     if isactive(h,e):
         min_h=GRAPH_SIZE*GRAPH_SIZE
         if offx+x+1<GRAPH_SIZE:
             if c == (c|1): #right edge available
                 min_h=min(min_h, HB[offy+y, offx+x+1]+1)
         #       print("right  h  for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x])
         if offy+y-1>=0:
             if c == (c|8): #up edge available
                 min_h=min(min_h, HB[offy+y-1,offx+x]+1)
         #       print("up   h  for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x])
         if offy+y+1<GRAPH_SIZE:
             if c == (c|2): #down edge available
                 min_h=min(min_h, HB[offy+y+1,offx+x]+1)
         #       print("down  h for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x] )
         if offx+x-1>=0:
             if c == (c|4): #left edge available
                 min_h=min(min_h, HB[offy+y, offx+x-1]+1)
         #       print("left  h for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x])
 
         H[offy+y,offx+x]=min_h
         
     else:
         H[offy+y,offx+x]=h
     ACTIVE_NODES[offy+y, offx+x]=isactive(H[offy+y,offx+x],e)           
     
     
 @cuda.jit(device=True)
 def build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE,  y, TYPE):
     offx=cuda.blockIdx.x * cuda.blockDim.x; offy=cuda.blockIdx.y * cuda.blockDim.x
     tile_len=cuda.blockDim.x
     for x in range(tile_len):
         sE[y, x]  = EXCESS_FLOW[offy+y, offx+x] 
         sH[y, x]  = HEIGHTS[offy+y, offx+x]
     if TYPE==0: #after a right wave
         if offx>0: #left bound  
             sE[y, 0] = sE[y, 0] + EXCESS_TILE[offy+y,offx]
             EXCESS_TILE[offy+y,offx]=0
     if TYPE==2: #after a left wave
         if offx+tile_len-1<GRAPH_SIZE:  #right bound
             sE[y, tile_len-1] = sE[y, tile_len-1] + EXCESS_TILE[offy+y,offx+tile_len-1]
             EXCESS_TILE[offy+y,offx+tile_len-1]=0
     if TYPE==1: #aftaer down push
         if offy>0:  #upper bound
             sE[0,y] = sE[0, y] + EXCESS_TILE[offy, offx+y]
             EXCESS_TILE[offy,offx+y]=0
     if TYPE==3: #AFTER up push
         if offy+tile_len-1<GRAPH_SIZE: #lower bound
             sE[tile_len-1,y] = sE[tile_len-1, y] + EXCESS_TILE[offy+tile_len-1, offx+y]
             EXCESS_TILE[offy+tile_len-1,offx+y]=0
 
 @cuda.jit()
 def push_down(WEIGHTS,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,TYPE):
     
     sH = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32)
     sE = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32)
     lx = cuda.threadIdx.x; tile_len = cuda.blockDim.x   
     bx = cuda.blockIdx.x; by = cuda.blockIdx.y
     offx = bx*cuda.blockDim.x;offy = by*cuda.blockDim.x
     BORDER_H=HEIGHTS[offy+tile_len,offx+lx]
 
     bt_type=0
     build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, lx, bt_type)
     cuda.syncthreads()
     if ACTIVE_TILES[by,bx]:
         down_wave(sH, sE, lx,0, WEIGHTS, COMPR_CAP, EXCESS_FLOW, EXCESS_TILE,BORDER_H)
         for i in range(tile_len):
             EXCESS_FLOW[offy+lx, offx+i]= sE[lx, i]
     
 @cuda.jit()
 def push_left(WEIGHTS,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,TYPE):
     
     sH = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32)
     sE = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32)
     lx = cuda.threadIdx.x; tile_len = cuda.blockDim.x   
     bx = cuda.blockIdx.x; by = cuda.blockIdx.y
     offx = bx*cuda.blockDim.x;offy = by*cuda.blockDim.x
     BORDER_H=HEIGHTS[offy+lx,offx-1]
 
     bt_type=1
     build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, lx, bt_type)
     cuda.syncthreads()
     if ACTIVE_TILES[by,bx]:
         left_wave(sH, sE, 0,lx, WEIGHTS, COMPR_CAP, EXCESS_FLOW, EXCESS_TILE, BORDER_H)
         for i in range(tile_len):
             EXCESS_FLOW[offy+lx, offx+i]= sE[lx, i]
     
 @cuda.jit()
 def push_up(WEIGHTS,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,TYPE):
     
     sH = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32)
     sE = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32)
     lx = cuda.threadIdx.x; tile_len = cuda.blockDim.x
     bx = cuda.blockIdx.x; by = cuda.blockIdx.y
     offx = bx*cuda.blockDim.x;offy = by*cuda.blockDim.x
     BORDER_H=HEIGHTS[offy-1,offx+lx]
 
     bt_type=2
     build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, lx, bt_type)
     cuda.syncthreads()
     if ACTIVE_TILES[by,bx]:
         up_wave(sH, sE, lx,0, WEIGHTS, COMPR_CAP, EXCESS_FLOW, EXCESS_TILE, BORDER_H)
         for i in range(tile_len):
             EXCESS_FLOW[offy+lx, offx+i]= sE[lx, i]
     
 
 @cuda.jit()
 def global_bfs_t(H,EXCESS_FLOW,PF,PB,ASSIGNED_MASK):
     
     offx = cuda.blockIdx.x*cuda.blockDim.x; offy = cuda.blockIdx.y*cuda.blockDim.x
     x = cuda.threadIdx.x; y = cuda.threadIdx.y
     h=H[offy+y,offx+x];e=EXCESS_FLOW[offy+y,offx+x]
     if EXCESS_FLOW[offy+y,offx+x]<0:
         H[offy+y,offx+x]=1
         ASSIGNED_MASK[offy+y,offx+x]=True
     else: 
         H[offy+y,offx+x]=GRAPH_SIZE*GRAPH_SIZE
 
 
 def global_bfs(H, CC, iters, isover, ASSIGNED_MASK):
     
     offx = cuda.blockIdx.x*cuda.blockDim.x;offy = cuda.blockIdx.y*cuda.blockDim.x
     x = cuda.threadIdx.x; y = cuda.threadIdx.y
     c=CC[offy+y,offx+x]; 
     h = H[offy+y,offx+x]; # if h<GRAPH_SIZE*GRAPH_SIZE and not ASSIGNED_MASK[offy+y,offx+x]:
     if not ASSIGNED_MASK[offy+y,offx+x]:
         if offx+x+1<GRAPH_SIZE:
             if (c==(c|1)) & H[offy+y,offx+x+1]==iters and ASSIGNED_MASK[offy+y,offx+x+1]:
                 isover=True
                 H[offy+y,offx+x]=iters+1 
                 ASSIGNED_MASK[offy+y,offx+x]=1
         if offx+x-1>=0:
             if (c==(c|4)) & H[offy+y,offx+x-1]==iters and ASSIGNED_MASK[offy+y,offx+x-1]:
                 isover=True
                 H[offy+y,offx+x]=iters+1
                 ASSIGNED_MASK[offy+y,offx+x]=1
         if offy+y+1<GRAPH_SIZE:
             if (c==(c|2)) & H[offy+y+1,offx+x]==iters and ASSIGNED_MASK[offy+y+1,offx+x]:
                 isover=True
                 H[offy+y,offx+x]=iters+1
                 ASSIGNED_MASK[offy+y,offx+x]=1
         if offy+y-1>=0:
             if (c==(c|8)) & H[offy+y-1,offx+x]==iters and ASSIGNED_MASK[offy+y-1,offx+x+1]:
                 isover=True
                 H[offy+y,offx+x]=iters+1
                 ASSIGNED_MASK[offy+y,offx+x]=1
 
     # fix: update ACTIVE_NODES here
     ACTIVE_NODES[offy+y,offx+x]=isactive(H[offy+y,offx+x])
     cuda.syncthreads()
     
 
     
 
 @cuda.jit()
 def reduce_active_nodes(ACTIVE_NODES, ACTIVE_TILES, L):
     
     offx = cuda.threadIdx.x*L
     offy = cuda.threadIdx.y*L
     x = cuda.threadIdx.x
     y = cuda.threadIdx.y
     for j in range(L):
         for i in range(L):
             if ACTIVE_NODES[offy+j,offx+i]==True:
                 ACTIVE_TILES[y,x]=True
                 return  
     ACTIVE_TILES[y,x]=False
     
 @cuda.jit()
 def initialize_compressed_capacities(WL,WR,WU,WD,CC):
     
     # set initilial compressed capacity for first relabelling
     x=cuda.blockDim.x*cuda.blockIdx.x+cuda.threadIdx.x
     y=cuda.blockDim.y*cuda.blockIdx.y+cuda.threadIdx.y
     R=0;D=0;L=0;U=0;
     R=1&(WR[y,x]>0)
     D=2&(WD[y,x]>0)
     L=4&(WL[y,x]>0)
     U=8&(WU[y,x]>0)
     CC[y,x] = (R | D | L | U)
     
 def has_converged():
     #return np.sum(ACTIVE_NODES)==0
     return np.sum(ACTIVE_TILES)==0
 
 
 
 def maxflow():
     cuda.select_device(0)
     threadsperblock = (BLOCK_SIZE,1)
     blockspergrid = (GRID_SIZE,GRID_SIZE)
     iters=0
     EXCESS_TILE=np.zeros((GRAPH_SIZE,GRAPH_SIZE))
     ASSIGNED_MASK=np.zeros((GRAPH_SIZE,GRAPH_SIZE),dtype=bool)
 
     initialize_compressed_capacities[blockspergrid,(BLOCK_SIZE,BLOCK_SIZE)](WEIGHTS_LEFT,WEIGHTS_RIGHT,WEIGHTS_UP,WEIGHTS_DOWN,COMPR_CAP)   
 
     start=cuda.event(timing=True)
     stop=cuda.event(timing=True)
     start.record()
     begin=time.time()
     
     
     while (not has_converged()) or iters<5:
     
         push_right[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_RIGHT,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,0)  
         cuda.synchronize()
 
         push_down[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_DOWN,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,1)
         cuda.synchronize()
     
         push_left[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_LEFT,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,2)
         cuda.synchronize()
         
         push_up[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_UP,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,3)
         cuda.synchronize()
 
         HEIGHTS_BUFFER=np.copy(HEIGHTS)
         HEIGHTS.fill(0)
         local_relabel[blockspergrid, (BLOCK_SIZE,BLOCK_SIZE)](EXCESS_FLOW, COMPR_CAP, HEIGHTS, HEIGHTS_BUFFER, ACTIVE_NODES)
         cuda.synchronize()
 
         if iters%6==0:
             isover=True
             ASSIGNED_MASK.fill(0)
             global_bfs_t[(GRID_SIZE,GRID_SIZE),(BLOCK_SIZE,BLOCK_SIZE)](HEIGHTS,EXCESS_FLOW,PF,PB,ASSIGNED_MASK)
             cuda.synchronize()
             gl_count=1
             while (isover):
                 isover=False
                 global_bfs[(GRID_SIZE,GRID_SIZE),(BLOCK_SIZE,BLOCK_SIZE)](HEIGHTS,COMPR_CAP,gl_count,isover,ASSIGNED_MASK)
                 gl_count=gl_count+1
                 cuda.synchronize()
 
         reduce_active_nodes[1,(GRID_SIZE, GRID_SIZE)](ACTIVE_NODES,ACTIVE_TILES,BLOCK_SIZE)
         cuda.synchronize()
     
         iters=iters+1
 
 
         print("This is iteration %d"%iters)
     end=time.time()
     print("-Done in "+str(end-begin))
 
     stop.record()
     stop.synchronize()
     GPU_TIME=cuda.event_elapsed_time(start,stop)
 
     print("It took %d iterations to converge"%iters)
     print("Total GPU time: %dms"%(GPU_TIME))
 
 
 def generate_image_seg_mask(WL,WR,WU,WD,img):
     WR_c=np.where(WR>0,0,1)
     WL_c=np.where(WR>0,0,1)
     WU_c=np.where(WR>0,0,1)
     WD_c=np.where(WR>0,0,1)
     
     Mask1=WR_c+WL_c
     Mask2=WU_c+WD_c
     Mask=Mask1+Mask2
     Mask=np.where(Mask>0,1,0)
     Mask=np.multiply(Mask,255)   # use white color for segmentation boudry
 
     img=img+Mask
     img=np.where(img>255,255,img)
     cv2.imwrite('seg.png',img)
 
 
 
 #test with small graph, wait, need to count active tiles
 if __name__ == "__main__":
 
     img, WEIGHTS_UP, WEIGHTS_DOWN, WEIGHTS_LEFT, WEIGHTS_RIGHT, EXCESS_FLOW, PF, PB=build_graph(GRID_SIZE, BLOCK_SIZE,img,WEIGHTS_UP,WEIGHTS_DOWN,WEIGHTS_LEFT,WEIGHTS_RIGHT,EXCESS_FLOW,PF,PB)
 
     ORG_WU = np.copy(WEIGHTS_UP)
     ORG_WD = np.copy(WEIGHTS_DOWN)
     ORG_WL = np.copy(WEIGHTS_LEFT)
     ORG_WR = np.copy(WEIGHTS_RIGHT)
     
     maxflow()
     generate_image_seg_mask(WEIGHTS_LEFT,WEIGHTS_RIGHT,WEIGHTS_UP,WEIGHTS_DOWN,img)
     #SEG = mincut(WEIGHTS_UP, WEIGHTS_DOWN, WEIGHTS_LEFT, WEIGHTS_RIGHT, ORG_WU, ORG_WD, ORG_WL, ORG_WR, SEG, GRID_SIZE, BLOCK_SIZE)
 
     wl=0
     wr=0
     wu=0
     wd=0
 
     for j in range(len(img)):
         for i in range(len(img[0])):
             if ORG_WU[j,i]!=0 and WEIGHTS_UP[j,i]==0:
                 wu=wu+1
             if ORG_WD[j,i]!=0 and WEIGHTS_DOWN[j,i]==0:
                 wr=wr+1
             if ORG_WL[j,i]!=0 and WEIGHTS_LEFT[j,i]==0:
                 wl=wl+1
             if ORG_WR[j,i]!=0 and WEIGHTS_RIGHT[j,i]==0:
                 wd=wd+1
     
     print("num of WL changed: %d" % wl)
     print("num of WR changed: %d" % wr)
     print("num of WU changed: %d" % wu)
     print("num of WD changed: %d" % wd)

PreviousC++数字化声音信号处理和数控振荡合成 NextPython和命令行脚本(C/C++)嗅探物联网络及短中远程无线电

Last updated 1 year ago

Was this helpful?

import numpy as np from numba import * from time import sleep from build_graph import * from min_cut import * import cv2 import sys import time from operator import * img = cv2.imread("rose.jpg", cv2.IMREAD_GRAYSCALE) GRAPH_SIZE = len(img) BLOCK_SIZE = 25 TILE_SIZE = BLOCK_SIZE GRID_SIZE = GRAPH_SIZE // BLOCK_SIZE WEIGHTS_UP = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) WEIGHTS_DOWN = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) WEIGHTS_LEFT = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) WEIGHTS_RIGHT= np.zeros((GRAPH_SIZE,GRAPH_SIZE)) #WEIGHTS_SINK = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) HEIGHTS= np.zeros((GRAPH_SIZE,GRAPH_SIZE),dtype=np.int32) HEIGHTS_BUFFER= np.zeros((GRAPH_SIZE, GRAPH_SIZE),dtype=np.uint32) EXCESS_FLOW = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) EXCESS_TILE = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) ACTIVE_NODES = np.zeros((GRAPH_SIZE,GRAPH_SIZE), dtype=bool) ACTIVE_TILES = np.ones((GRID_SIZE,GRID_SIZE), dtype=bool) COMPR_CAP = np.zeros((GRAPH_SIZE, GRAPH_SIZE), dtype=np.uint8) #FINAL OUTPUT SEG = np.zeros((GRAPH_SIZE,GRAPH_SIZE)) @cuda.jit(device=True) def isactive(height,excess): return (height < GRAPH_SIZE*GRAPH_SIZE) and (excess > 0) @cuda.jit(device=True) def right_wave(sH, sE, x, y, WEIGHTS_RIGHT, CC, EXCESS_FLOW, EXCESS_TILE, BORDER_H): offy=cuda.blockIdx.y*cuda.blockDim.x; offx=cuda.blockIdx.x*cuda.blockDim.x tile_len=cuda.blockDim.x; ef = 0; nei_h=0 for i in range(0,tile_len): if offx+i==GRAPH_SIZE-1: #hit right bound, can't push! sE[y, i] = ef break h=sH[y,i] e=sE[y,i] if isactive(h,e): #print("active, gonna push\n") #print("neigbor h for ", offx+i," ", offy+y, " is ",sH[y,i+1]) if i<tile_len-1: nei_h=sH[y,i+1] else: nei_h=BORDER_H #print("active, gonna push\n") #print("neigbor h for ", offx+i," ", offy+y, " is ", nei_h) if h==nei_h+1: #problem: out of bound # print("good h, gonna push\n") w = WEIGHTS_RIGHT[offy+y, offx+i] ef = ef + e flow = min(w, ef) cc = CC[offy+y,offx+i] if (w>flow): CC[offy+y,offx+i]=cc|1 else: CC[offy+y,offx+i]=cc&(~1) WEIGHTS_RIGHT[offy+y, offx+i] = w-flow sE[y, i] = ef - flow ef = flow if offx+tile_len < GRAPH_SIZE: EXCESS_TILE[offy+y][offx+tile_len] = ef @cuda.jit(device=True) def up_wave(sH, sE, x, y, WEIGHTS_UP, CC, EXCESS_FLOW, EXCESS_TILE, BORDER_H): # note: assume square image, and because dim.y is 1, we use dim.x for y length offy=cuda.blockIdx.y*cuda.blockDim.x;offx=cuda.blockIdx.x*cuda.blockDim.x tile_len=cuda.blockDim.x;ef = 0;nei_h=0 for i in range(1,tile_len+1): if offy+tile_len-i==0: #hit img top, don't push sE[tile_len-i, x] = ef break h=sH[tile_len-i,x] e=sE[tile_len-i,x] if isactive(h,e): if tile_len-i>0: nei_h=sH[tile_len-i-1,x] else: nei_h=BORDER_H if h==nei_h+1: w=WEIGHTS_UP[offy+tile_len-i,offx+x] ef=ef+e flow=min(w,ef) cc=CC[offy+tile_len-i,offx+x] if (w>flow): CC[offy+tile_len-i,offx+x]=cc|8 else: CC[offy+tile_len-i,offx+x]=cc&(~8) WEIGHTS_UP[offy+tile_len-i,offx+x] = w-flow sE[tile_len-i, x]=ef-flow ef=flow if (offy-1)>=0: EXCESS_TILE[offy-1][offx+x]=ef @cuda.jit(device=True) def down_wave(sH, sE, x, y, WEIGHTS_DOWN, CC, EXCESS_FLOW, EXCESS_TILE, BORDER_H): offy=cuda.blockIdx.y*cuda.blockDim.x;offx=cuda.blockIdx.x*cuda.blockDim.x tile_len=cuda.blockDim.x;ef = 0;nei_h=0 for i in range(0,tile_len): if offy+i==GRAPH_SIZE-1: #hit img bottom, keep excess flow, don't push sE[i,x] = ef break h=sH[i,x] e=sE[i,x] if isactive(h,e): if i<tile_len-1: nei_h=sH[i+1,x] else: nei_h=BORDER_H if h==nei_h+1: w=WEIGHTS_DOWN[offy+i,offx+x] ef=ef+e flow = min(w,ef) cc=CC[offy+i,offx+x] if (w>flow): CC[offy+i,offx+x]=cc|2 else: CC[offy+i,offx+x]=cc&(~2) WEIGHTS_DOWN[offy+i,offx+x]=w-flow sE[i,x]=ef-flow ef=flow if offy+tile_len<GRAPH_SIZE: EXCESS_TILE[offy+tile_len][offx+x] = ef #this will be updatedi, or deleted @cuda.jit() # for this kernel, use (gs,gs), (bs,bs) # before launching this kernel, copy HEIGHTS into HEIGHTS_BUFFER, and empty HEIGHTS for writing def local_relabel(EXCESS_FLOW, CC, H, HB, ACTIVE_NODES): # this is a seperarte kernel, each thread relabes one node offx=cuda.blockIdx.x*cuda.blockDim.x; offy=cuda.blockIdx.y*cuda.blockDim.y x=cuda.threadIdx.x; y=cuda.threadIdx.y; tile_len=cuda.blockDim.x e=EXCESS_FLOW[offy+y,offx+x] h=HB[offy+y,offx+x] c = CC[offy+y,offx+x] if isactive(h,e): min_h=GRAPH_SIZE*GRAPH_SIZE if offx+x+1<GRAPH_SIZE: if c == (c|1): #right edge available min_h=min(min_h, HB[offy+y, offx+x+1]+1) # print("right h for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x]) if offy+y-1>=0: if c == (c|8): #up edge available min_h=min(min_h, HB[offy+y-1,offx+x]+1) # print("up h for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x]) if offy+y+1<GRAPH_SIZE: if c == (c|2): #down edge available min_h=min(min_h, HB[offy+y+1,offx+x]+1) # print("down h for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x] ) if offx+x-1>=0: if c == (c|4): #left edge available min_h=min(min_h, HB[offy+y, offx+x-1]+1) # print("left h for ", offx+x, " ", offy+y, " is ",HB[offy+y, offx+x]) H[offy+y,offx+x]=min_h else: H[offy+y,offx+x]=h ACTIVE_NODES[offy+y, offx+x]=isactive(H[offy+y,offx+x],e) @cuda.jit(device=True) def build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, y, TYPE): offx=cuda.blockIdx.x * cuda.blockDim.x; offy=cuda.blockIdx.y * cuda.blockDim.x tile_len=cuda.blockDim.x for x in range(tile_len): sE[y, x] = EXCESS_FLOW[offy+y, offx+x] sH[y, x] = HEIGHTS[offy+y, offx+x] if TYPE==0: #after a right wave if offx>0: #left bound sE[y, 0] = sE[y, 0] + EXCESS_TILE[offy+y,offx] EXCESS_TILE[offy+y,offx]=0 if TYPE==2: #after a left wave if offx+tile_len-1<GRAPH_SIZE: #right bound sE[y, tile_len-1] = sE[y, tile_len-1] + EXCESS_TILE[offy+y,offx+tile_len-1] EXCESS_TILE[offy+y,offx+tile_len-1]=0 if TYPE==1: #aftaer down push if offy>0: #upper bound sE[0,y] = sE[0, y] + EXCESS_TILE[offy, offx+y] EXCESS_TILE[offy,offx+y]=0 if TYPE==3: #AFTER up push if offy+tile_len-1<GRAPH_SIZE: #lower bound sE[tile_len-1,y] = sE[tile_len-1, y] + EXCESS_TILE[offy+tile_len-1, offx+y] EXCESS_TILE[offy+tile_len-1,offx+y]=0 @cuda.jit() def push_down(WEIGHTS,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,TYPE): sH = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32) sE = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32) lx = cuda.threadIdx.x; tile_len = cuda.blockDim.x bx = cuda.blockIdx.x; by = cuda.blockIdx.y offx = bx*cuda.blockDim.x;offy = by*cuda.blockDim.x BORDER_H=HEIGHTS[offy+tile_len,offx+lx] bt_type=0 build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, lx, bt_type) cuda.syncthreads() if ACTIVE_TILES[by,bx]: down_wave(sH, sE, lx,0, WEIGHTS, COMPR_CAP, EXCESS_FLOW, EXCESS_TILE,BORDER_H) for i in range(tile_len): EXCESS_FLOW[offy+lx, offx+i]= sE[lx, i] @cuda.jit() def push_left(WEIGHTS,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,TYPE): sH = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32) sE = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32) lx = cuda.threadIdx.x; tile_len = cuda.blockDim.x bx = cuda.blockIdx.x; by = cuda.blockIdx.y offx = bx*cuda.blockDim.x;offy = by*cuda.blockDim.x BORDER_H=HEIGHTS[offy+lx,offx-1] bt_type=1 build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, lx, bt_type) cuda.syncthreads() if ACTIVE_TILES[by,bx]: left_wave(sH, sE, 0,lx, WEIGHTS, COMPR_CAP, EXCESS_FLOW, EXCESS_TILE, BORDER_H) for i in range(tile_len): EXCESS_FLOW[offy+lx, offx+i]= sE[lx, i] @cuda.jit() def push_up(WEIGHTS,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,TYPE): sH = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32) sE = cuda.shared.array(shape=(BLOCK_SIZE,BLOCK_SIZE), dtype=int32) lx = cuda.threadIdx.x; tile_len = cuda.blockDim.x bx = cuda.blockIdx.x; by = cuda.blockIdx.y offx = bx*cuda.blockDim.x;offy = by*cuda.blockDim.x BORDER_H=HEIGHTS[offy-1,offx+lx] bt_type=2 build_tile(sH, sE, HEIGHTS, EXCESS_FLOW, EXCESS_TILE, lx, bt_type) cuda.syncthreads() if ACTIVE_TILES[by,bx]: up_wave(sH, sE, lx,0, WEIGHTS, COMPR_CAP, EXCESS_FLOW, EXCESS_TILE, BORDER_H) for i in range(tile_len): EXCESS_FLOW[offy+lx, offx+i]= sE[lx, i] @cuda.jit() def global_bfs_t(H,EXCESS_FLOW,PF,PB,ASSIGNED_MASK): offx = cuda.blockIdx.x*cuda.blockDim.x; offy = cuda.blockIdx.y*cuda.blockDim.x x = cuda.threadIdx.x; y = cuda.threadIdx.y h=H[offy+y,offx+x];e=EXCESS_FLOW[offy+y,offx+x] if EXCESS_FLOW[offy+y,offx+x]<0: H[offy+y,offx+x]=1 ASSIGNED_MASK[offy+y,offx+x]=True else: H[offy+y,offx+x]=GRAPH_SIZE*GRAPH_SIZE def global_bfs(H, CC, iters, isover, ASSIGNED_MASK): offx = cuda.blockIdx.x*cuda.blockDim.x;offy = cuda.blockIdx.y*cuda.blockDim.x x = cuda.threadIdx.x; y = cuda.threadIdx.y c=CC[offy+y,offx+x]; h = H[offy+y,offx+x]; # if h<GRAPH_SIZE*GRAPH_SIZE and not ASSIGNED_MASK[offy+y,offx+x]: if not ASSIGNED_MASK[offy+y,offx+x]: if offx+x+1<GRAPH_SIZE: if (c==(c|1)) & H[offy+y,offx+x+1]==iters and ASSIGNED_MASK[offy+y,offx+x+1]: isover=True H[offy+y,offx+x]=iters+1 ASSIGNED_MASK[offy+y,offx+x]=1 if offx+x-1>=0: if (c==(c|4)) & H[offy+y,offx+x-1]==iters and ASSIGNED_MASK[offy+y,offx+x-1]: isover=True H[offy+y,offx+x]=iters+1 ASSIGNED_MASK[offy+y,offx+x]=1 if offy+y+1<GRAPH_SIZE: if (c==(c|2)) & H[offy+y+1,offx+x]==iters and ASSIGNED_MASK[offy+y+1,offx+x]: isover=True H[offy+y,offx+x]=iters+1 ASSIGNED_MASK[offy+y,offx+x]=1 if offy+y-1>=0: if (c==(c|8)) & H[offy+y-1,offx+x]==iters and ASSIGNED_MASK[offy+y-1,offx+x+1]: isover=True H[offy+y,offx+x]=iters+1 ASSIGNED_MASK[offy+y,offx+x]=1 # fix: update ACTIVE_NODES here ACTIVE_NODES[offy+y,offx+x]=isactive(H[offy+y,offx+x]) cuda.syncthreads() @cuda.jit() def reduce_active_nodes(ACTIVE_NODES, ACTIVE_TILES, L): offx = cuda.threadIdx.x*L offy = cuda.threadIdx.y*L x = cuda.threadIdx.x y = cuda.threadIdx.y for j in range(L): for i in range(L): if ACTIVE_NODES[offy+j,offx+i]==True: ACTIVE_TILES[y,x]=True return ACTIVE_TILES[y,x]=False @cuda.jit() def initialize_compressed_capacities(WL,WR,WU,WD,CC): # set initilial compressed capacity for first relabelling x=cuda.blockDim.x*cuda.blockIdx.x+cuda.threadIdx.x y=cuda.blockDim.y*cuda.blockIdx.y+cuda.threadIdx.y R=0;D=0;L=0;U=0; R=1&(WR[y,x]>0) D=2&(WD[y,x]>0) L=4&(WL[y,x]>0) U=8&(WU[y,x]>0) CC[y,x] = (R | D | L | U) def has_converged(): #return np.sum(ACTIVE_NODES)==0 return np.sum(ACTIVE_TILES)==0 def maxflow(): cuda.select_device(0) threadsperblock = (BLOCK_SIZE,1) blockspergrid = (GRID_SIZE,GRID_SIZE) iters=0 EXCESS_TILE=np.zeros((GRAPH_SIZE,GRAPH_SIZE)) ASSIGNED_MASK=np.zeros((GRAPH_SIZE,GRAPH_SIZE),dtype=bool) initialize_compressed_capacities[blockspergrid,(BLOCK_SIZE,BLOCK_SIZE)](WEIGHTS_LEFT,WEIGHTS_RIGHT,WEIGHTS_UP,WEIGHTS_DOWN,COMPR_CAP) start=cuda.event(timing=True) stop=cuda.event(timing=True) start.record() begin=time.time() while (not has_converged()) or iters<5: push_right[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_RIGHT,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,0) cuda.synchronize() push_down[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_DOWN,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,1) cuda.synchronize() push_left[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_LEFT,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,2) cuda.synchronize() push_up[blockspergrid, (BLOCK_SIZE,1)](WEIGHTS_UP,EXCESS_FLOW,EXCESS_TILE,HEIGHTS,ACTIVE_TILES,ACTIVE_NODES,COMPR_CAP,3) cuda.synchronize() HEIGHTS_BUFFER=np.copy(HEIGHTS) HEIGHTS.fill(0) local_relabel[blockspergrid, (BLOCK_SIZE,BLOCK_SIZE)](EXCESS_FLOW, COMPR_CAP, HEIGHTS, HEIGHTS_BUFFER, ACTIVE_NODES) cuda.synchronize() if iters%6==0: isover=True ASSIGNED_MASK.fill(0) global_bfs_t[(GRID_SIZE,GRID_SIZE),(BLOCK_SIZE,BLOCK_SIZE)](HEIGHTS,EXCESS_FLOW,PF,PB,ASSIGNED_MASK) cuda.synchronize() gl_count=1 while (isover): isover=False global_bfs[(GRID_SIZE,GRID_SIZE),(BLOCK_SIZE,BLOCK_SIZE)](HEIGHTS,COMPR_CAP,gl_count,isover,ASSIGNED_MASK) gl_count=gl_count+1 cuda.synchronize() reduce_active_nodes[1,(GRID_SIZE, GRID_SIZE)](ACTIVE_NODES,ACTIVE_TILES,BLOCK_SIZE) cuda.synchronize() iters=iters+1 print("This is iteration %d"%iters) end=time.time() print("-Done in "+str(end-begin)) stop.record() stop.synchronize() GPU_TIME=cuda.event_elapsed_time(start,stop) print("It took %d iterations to converge"%iters) print("Total GPU time: %dms"%(GPU_TIME)) def generate_image_seg_mask(WL,WR,WU,WD,img): WR_c=np.where(WR>0,0,1) WL_c=np.where(WR>0,0,1) WU_c=np.where(WR>0,0,1) WD_c=np.where(WR>0,0,1) Mask1=WR_c+WL_c Mask2=WU_c+WD_c Mask=Mask1+Mask2 Mask=np.where(Mask>0,1,0) Mask=np.multiply(Mask,255) # use white color for segmentation boudry img=img+Mask img=np.where(img>255,255,img) cv2.imwrite('seg.png',img) #test with small graph, wait, need to count active tiles if __name__ == "__main__": img, WEIGHTS_UP, WEIGHTS_DOWN, WEIGHTS_LEFT, WEIGHTS_RIGHT, EXCESS_FLOW, PF, PB=build_graph(GRID_SIZE, BLOCK_SIZE,img,WEIGHTS_UP,WEIGHTS_DOWN,WEIGHTS_LEFT,WEIGHTS_RIGHT,EXCESS_FLOW,PF,PB) ORG_WU = np.copy(WEIGHTS_UP) ORG_WD = np.copy(WEIGHTS_DOWN) ORG_WL = np.copy(WEIGHTS_LEFT) ORG_WR = np.copy(WEIGHTS_RIGHT) maxflow() generate_image_seg_mask(WEIGHTS_LEFT,WEIGHTS_RIGHT,WEIGHTS_UP,WEIGHTS_DOWN,img) #SEG = mincut(WEIGHTS_UP, WEIGHTS_DOWN, WEIGHTS_LEFT, WEIGHTS_RIGHT, ORG_WU, ORG_WD, ORG_WL, ORG_WR, SEG, GRID_SIZE, BLOCK_SIZE) wl=0 wr=0 wu=0 wd=0 for j in range(len(img)): for i in range(len(img[0])): if ORG_WU[j,i]!=0 and WEIGHTS_UP[j,i]==0: wu=wu+1 if ORG_WD[j,i]!=0 and WEIGHTS_DOWN[j,i]==0: wr=wr+1 if ORG_WL[j,i]!=0 and WEIGHTS_LEFT[j,i]==0: wl=wl+1 if ORG_WR[j,i]!=0 and WEIGHTS_RIGHT[j,i]==0: wd=wd+1 print("num of WL changed: %d" % wl) print("num of WR changed: %d" % wr) print("num of WU changed: %d" % wu) print("num of WD changed: %d" % wd)