Parallel Computing Architecture and Programming | Assignment 1: Performance Analysis on a Quad-Core CPU

Source: https://www.cnblogs.com/cilinmengye/p/18786580 (2025/3/22 14:57:00)



Assignment 1: Performance Analysis on a Quad-Core CPU

Environment Setup

  • CPU information

    Architecture:             x86_64
    CPU op-mode(s):           32-bit, 64-bit
    Address sizes:            46 bits physical, 57 bits virtual
    Byte Order:               Little Endian
    CPU(s):                   160
    On-line CPU(s) list:      0-159
    Vendor ID:                GenuineIntel
    Model name:               Intel(R) Xeon(R) Platinum 8383C CPU @ 2.70GHz
    CPU family:               6
    Model:                    106
    Thread(s) per core:       2
    Core(s) per socket:       40
    Socket(s):                2
    Stepping:                 6
    CPU max MHz:              3600.0000
    CPU min MHz:              800.0000
    BogoMIPS:                 5400.00
  • Install the Intel SPMD Program Compiler (ISPC), available here: http://ispc.github.io/

    wget https://github.com/ispc/ispc/releases/download/v1.26.0/ispc-v1.26.0-linux.tar.gz
    tar -xvf ispc-v1.26.0-linux.tar.gz && rm ispc-v1.26.0-linux.tar.gz
    # Add the ISPC bin directory to your system path.
    export ISPC_HOME=/home/cilinmengye/usr/ispc-v1.26.0-linux
    export PATH=$ISPC_HOME/bin:$PATH
  • The assignment starter code is available on https://github.com/stanford-cs149/asst1

Program 1: Parallel Fractal Generation Using Threads (20 points)

Is speedup linear in the number of threads used?

In your writeup hypothesize why this is (or is not) the case?

(you may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread datapoint.)



To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart()


VIEW 1

[mandelbrot thread 0]:          [333.949] ms
[mandelbrot thread 1]:          [356.226] ms

[mandelbrot thread 0]:          [131.044] ms
[mandelbrot thread 2]:          [154.097] ms
[mandelbrot thread 1]:          [428.861] ms

[mandelbrot thread 0]:          [62.530] ms
[mandelbrot thread 3]:          [83.256] ms
[mandelbrot thread 1]:          [291.673] ms
[mandelbrot thread 2]:          [292.539] ms

[mandelbrot thread 0]:          [28.683] ms
[mandelbrot thread 4]:          [50.483] ms
[mandelbrot thread 1]:          [193.241] ms
[mandelbrot thread 3]:          [193.720] ms
[mandelbrot thread 2]:          [288.939] ms

[mandelbrot thread 0]:          [17.357] ms
[mandelbrot thread 5]:          [37.675] ms
[mandelbrot thread 1]:          [134.275] ms
[mandelbrot thread 4]:          [134.862] ms
[mandelbrot thread 2]:          [223.723] ms
[mandelbrot thread 3]:          [224.468] ms

[mandelbrot thread 0]:          [13.236] ms
[mandelbrot thread 6]:          [32.660] ms
[mandelbrot thread 1]:          [95.954] ms
[mandelbrot thread 5]:          [97.984] ms
[mandelbrot thread 2]:          [166.271] ms
[mandelbrot thread 4]:          [167.878] ms
[mandelbrot thread 3]:          [214.565] ms

[mandelbrot thread 0]:          [9.987] ms
[mandelbrot thread 7]:          [28.890] ms
[mandelbrot thread 1]:          [75.118] ms
[mandelbrot thread 6]:          [75.412] ms
[mandelbrot thread 2]:          [130.862] ms
[mandelbrot thread 5]:          [131.121] ms
[mandelbrot thread 3]:          [185.480] ms
[mandelbrot thread 4]:          [186.073] ms


VIEW 2

[mandelbrot thread 1]:          [177.231] ms
[mandelbrot thread 0]:          [226.604] ms

[mandelbrot thread 2]:          [120.581] ms
[mandelbrot thread 1]:          [128.685] ms
[mandelbrot thread 0]:          [174.056] ms

[mandelbrot thread 3]:          [98.259] ms
[mandelbrot thread 1]:          [101.493] ms
[mandelbrot thread 2]:          [102.176] ms
[mandelbrot thread 0]:          [147.582] ms

[mandelbrot thread 5]:          [68.970] ms
[mandelbrot thread 4]:          [73.143] ms
[mandelbrot thread 2]:          [73.587] ms
[mandelbrot thread 3]:          [76.763] ms
[mandelbrot thread 1]:          [82.289] ms
[mandelbrot thread 0]:          [112.975] ms

[mandelbrot thread 6]:          [59.428] ms
[mandelbrot thread 2]:          [60.442] ms
[mandelbrot thread 4]:          [63.801] ms
[mandelbrot thread 5]:          [66.606] ms
[mandelbrot thread 3]:          [71.558] ms
[mandelbrot thread 1]:          [78.108] ms
[mandelbrot thread 0]:          [100.265] ms

[mandelbrot thread 7]:          [53.217] ms
[mandelbrot thread 5]:          [57.710] ms
[mandelbrot thread 2]:          [57.926] ms
[mandelbrot thread 3]:          [61.531] ms
[mandelbrot thread 4]:          [62.739] ms
[mandelbrot thread 6]:          [63.645] ms
[mandelbrot thread 1]:          [77.445] ms
[mandelbrot thread 0]:          [90.286] ms

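Comparing the per-thread run times of VIEW 1 and VIEW 2 across thread counts shows that VIEW 1 suffers from severe load imbalance. With 3 threads on VIEW 1 in particular, thread 1 takes 428.861 ms, roughly 3 times as long as the other two threads (131.044 ms and 154.097 ms).

As a result, VIEW 1 gets a lower speedup with 3 threads than with 2!

VIEW 2 shows comparatively good load balance.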
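The effect of the slowest thread can be checked with a little arithmetic on the VIEW 1 timings logged above. The sketch below is only a rough estimate: it assumes the total amount of work is approximately the sum of the per-thread times, so that the achievable speedup is about sum / max. The variable names are illustrative and not part of the assignment code.

```python
# Per-thread times (ms) copied from the VIEW 1 logs for 2 and 3 threads.
view1_ms = {
    2: [333.949, 356.226],
    3: [131.044, 154.097, 428.861],
}

imbalance = {}     # slowest thread time divided by the mean thread time
est_speedup = {}   # rough estimate: total work / critical path

for n, times in view1_ms.items():
    slowest = max(times)
    mean = sum(times) / n
    imbalance[n] = slowest / mean
    est_speedup[n] = sum(times) / slowest
    print(f"{n} threads: slowest={slowest:.3f} ms, "
          f"imbalance={imbalance[n]:.2f}, estimated speedup={est_speedup[n]:.2f}")
```

Under these assumptions the estimate for 3 threads comes out lower than for 2 threads, in line with the measured speedups of 1.62x and 1.98x.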



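The image above is the PPM generated for VIEW 1. Under my partitioning strategy, with 3 threads the three regions from bottom to top are handled by thread 0, 1, and 2 respectively.

Whether the point (x, y) belongs to the Mandelbrot set is computed by static inline int mandel(float c_re, float c_im, int count) in the code. The "closer" (x, y) is to the Mandelbrot set, the whiter the pixel at (x, y) is drawn.

The key point is that the "closer" (x, y) is to the Mandelbrot set, the longer mandel iterates (up to 256 iterations). In the image above, the region handled by thread 1 in VIEW 1 contains large white areas compared with those of thread 0 and thread 2, which means thread 1 has far more computation to do.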
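To make the link between white area and work concrete, here is a minimal Python sketch that counts mandel iterations per contiguous block of rows. It is not the assignment's C++ code: the iteration rule is the standard escape-time loop, the window x in [-2, 1], y in [-1, 1] is an assumption for VIEW 1, and the grid is deliberately much smaller than the real image so that it runs quickly.

```python
def mandel(c_re, c_im, count=256):
    """Escape-time iteration count for c = c_re + i*c_im, capped at `count`."""
    z_re, z_im = c_re, c_im
    for i in range(count):
        if z_re * z_re + z_im * z_im > 4.0:
            return i
        z_re, z_im = (c_re + z_re * z_re - z_im * z_im,
                      c_im + 2.0 * z_re * z_im)
    return count

# Assumed VIEW 1 window and a reduced grid (hypothetical sizes).
X0, X1, Y0, Y1 = -2.0, 1.0, -1.0, 1.0
WIDTH, HEIGHT, NUM_THREADS = 60, 48, 3

def row_work(j):
    """Total iterations needed for row j of the grid."""
    dx = (X1 - X0) / WIDTH
    dy = (Y1 - Y0) / HEIGHT
    y = Y0 + j * dy
    return sum(mandel(X0 + i * dx, y) for i in range(WIDTH))

# Contiguous blocks of rows: thread t gets rows [t*rows, (t+1)*rows).
rows = HEIGHT // NUM_THREADS
block_work = [
    sum(row_work(j) for j in range(t * rows, (t + 1) * rows))
    for t in range(NUM_THREADS)
]
print(block_work)
```

In this model the middle block, which covers the rows around y = 0, carries the largest iteration count, mirroring the long run time of thread 1.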


Now look at the image for VIEW 2: the white points are distributed much more evenly.


Modify the mapping of work to threads to improve speedup to about 7-8x on both views of the Mandelbrot set. In your writeup, describe your approach to parallelization and report the final 8-thread speedup obtained.

VIEW 1

[mandelbrot thread 0]:          [334.905] ms
[mandelbrot thread 1]:          [355.082] ms

[mandelbrot thread 0]:          [223.479] ms
[mandelbrot thread 1]:          [244.310] ms
[mandelbrot thread 2]:          [244.273] ms

[mandelbrot thread 0]:          [167.591] ms
[mandelbrot thread 1]:          [188.222] ms
[mandelbrot thread 3]:          [188.149] ms
[mandelbrot thread 2]:          [188.211] ms

[mandelbrot thread 0]:          [134.268] ms
[mandelbrot thread 1]:          [155.675] ms
[mandelbrot thread 4]:          [155.588] ms
[mandelbrot thread 3]:          [155.652] ms
[mandelbrot thread 2]:          [155.684] ms

[mandelbrot thread 0]:          [111.937] ms
[mandelbrot thread 2]:          [132.946] ms
[mandelbrot thread 4]:          [132.864] ms
[mandelbrot thread 1]:          [132.969] ms
[mandelbrot thread 3]:          [132.941] ms
[mandelbrot thread 5]:          [132.888] ms

[mandelbrot thread 0]:          [95.648] ms
[mandelbrot thread 1]:          [116.998] ms
[mandelbrot thread 3]:          [116.925] ms
[mandelbrot thread 2]:          [116.974] ms
[mandelbrot thread 4]:          [116.892] ms
[mandelbrot thread 6]:          [116.812] ms
[mandelbrot thread 5]:          [117.228] ms

[mandelbrot thread 0]:          [85.144] ms
[mandelbrot thread 1]:          [104.262] ms
[mandelbrot thread 4]:          [104.145] ms
[mandelbrot thread 2]:          [104.286] ms
[mandelbrot thread 3]:          [104.250] ms
[mandelbrot thread 7]:          [106.611] ms
[mandelbrot thread 5]:          [106.744] ms
[mandelbrot thread 6]:          [106.666] ms

VIEW 2

[mandelbrot thread 0]:          [191.501] ms
[mandelbrot thread 1]:          [212.256] ms

[mandelbrot thread 0]:          [127.668] ms
[mandelbrot thread 2]:          [149.055] ms
[mandelbrot thread 1]:          [149.279] ms

[mandelbrot thread 0]:          [95.970] ms
[mandelbrot thread 1]:          [115.653] ms
[mandelbrot thread 3]:          [115.783] ms
[mandelbrot thread 2]:          [115.902] ms

[mandelbrot thread 0]:          [76.880] ms
[mandelbrot thread 2]:          [97.456] ms
[mandelbrot thread 1]:          [97.590] ms
[mandelbrot thread 4]:          [97.547] ms
[mandelbrot thread 3]:          [97.708] ms

[mandelbrot thread 0]:          [64.118] ms
[mandelbrot thread 3]:          [83.671] ms
[mandelbrot thread 4]:          [83.687] ms
[mandelbrot thread 2]:          [83.868] ms
[mandelbrot thread 1]:          [84.021] ms
[mandelbrot thread 5]:          [83.885] ms

[mandelbrot thread 0]:          [55.046] ms
[mandelbrot thread 6]:          [75.713] ms
[mandelbrot thread 5]:          [75.799] ms
[mandelbrot thread 4]:          [75.939] ms
[mandelbrot thread 3]:          [76.110] ms
[mandelbrot thread 2]:          [76.357] ms
[mandelbrot thread 1]:          [76.464] ms

[mandelbrot thread 0]:          [48.182] ms
[mandelbrot thread 7]:          [68.148] ms
[mandelbrot thread 6]:          [68.219] ms
[mandelbrot thread 5]:          [68.308] ms
[mandelbrot thread 4]:          [68.495] ms
[mandelbrot thread 3]:          [68.535] ms
[mandelbrot thread 2]:          [68.653] ms
[mandelbrot thread 1]:          [68.736] ms


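  1. Partition the image by rows and assign rows to threads round-robin, as follows:

  2. Partition the image by rows and assign rows to threads round-robin, but not in a fixed order, as follows:

  3. Partition the image by points (pixels) and assign points to threads round-robin, as follows: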

    0 1 2 0 1 2
    0 1 2 0 1 2
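The round-robin mappings in strategies 1 and 3 can be written as simple index formulas. The sketch below is an illustration only: the function names row_owner and pixel_owner are hypothetical, and the actual indexing in the author's C++ code is not shown in the post. Strategy 2 is left out because its ordering is not specified.

```python
def row_owner(j, num_threads):
    """Strategy 1: row j goes to thread j mod num_threads."""
    return j % num_threads

def pixel_owner(i, j, width, num_threads):
    """Strategy 3: pixels are numbered row by row and dealt out in turn."""
    return (j * width + i) % num_threads

width, height, num_threads = 6, 2, 3

# Owner of each pixel in a 2 x 6 grid with 3 threads.
pixel_grid = [
    [pixel_owner(i, j, width, num_threads) for i in range(width)]
    for j in range(height)
]
for row in pixel_grid:
    print(*row)

# Owner of each of the first six rows under row interleaving.
row_list = [row_owner(j, num_threads) for j in range(6)]
print(row_list)
```

For a 6-pixel-wide grid and 3 threads, the pixel mapping prints the same `0 1 2 0 1 2` pattern as the illustration above. Interleaving at either granularity gives each thread a sample of both cheap and expensive regions, which is what evens out the load.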


import numpy as np
import matplotlib.pyplot as plt

# Speedup for 2..8 threads. method1 = original mapping, method2 = improved mapping.
threadNum = np.arange(2, 9)
speedUpV1 = np.array([1.98, 1.62, 2.42, 2.46, 3.12, 3.26, 3.79])
speedUpV2 = np.array([1.89, 2.43, 2.87, 3.29, 3.74, 4.19, 4.68])
speedUpV1_V1 = np.array([2.00, 2.89, 3.76, 4.55, 5.32, 6.02, 6.67])
speedUpV1_V2 = np.array([1.97, 2.82, 3.58, 4.31, 4.96, 5.54, 6.04])
plt.plot(threadNum, speedUpV1, marker='o', label='method1_V1')
plt.plot(threadNum, speedUpV2, marker='o', label='method1_V2')
plt.plot(threadNum, speedUpV1_V1, marker='o', label='method2_V1')
plt.plot(threadNum, speedUpV1_V2, marker='o', label='method2_V2')
plt.xlabel('Number of threads')
plt.ylabel('Speedup')
plt.legend()
plt.show()



Now run your improved code with 16 threads. Is performance noticeably greater than when running with eight threads? Why or why not?


6.73x speedup from 8 threads
11.16x speedup from 16 threads



BUG1: Floating-point precision



    for (int j = startRow; j < endRow; j++) {
        for (int i = 0; i < width; ++i) {
            float x = x0 + i * dx;
            float y = y0 + j * dy;
            int index = (j * width + i);
            output[index] = mandel(x, y, maxIterations);
        }
    }

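dx and dy have the same values in the serial and threaded versions. However, the y computed in the serial version with y0 = -1, j = 601 differs from the y computed in the threaded version with y0 = -1 + 600 * dy, j = 1. Each float operation rounds its result, so y0 + 601 * dy and (y0 + 600 * dy) + 1 * dy go through different roundings and need not agree bit for bit.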
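The mismatch can be reproduced outside the C++ code by emulating single-precision arithmetic. The following is a sketch under stated assumptions: the helper f32 (a hypothetical name) rounds every intermediate result to an IEEE-754 float via the struct module, the image height is taken to be 1200, and y1 is taken to be 1, so that dy = 2 / 1200. It also ignores any fused multiply-add the compiler might emit.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

height = 1200                      # assumed image height
y0 = -1.0
dy = f32(f32(1.0 - y0) / height)   # assumes y1 = 1

# Serial version: original y0, global row index 601.
y_serial = f32(y0 + f32(601 * dy))

# Threaded version described in the text: shifted origin, local row index 1.
y0_shifted = f32(y0 + f32(600 * dy))
y_thread = f32(y0_shifted + f32(1 * dy))

# One possible remedy (an assumption, not necessarily the author's fix):
# keep the original y0 and index rows globally, as the serial code does.
y_fixed = f32(y0 + f32((600 + 1) * dy))

print(y_serial, y_thread, y_serial == y_thread, y_fixed == y_serial)
```

In this emulation the two results differ by an amount on the order of 1e-9, which is enough for the threaded output to fail a bit-exact comparison against the serial image.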


Program 2: Vectorizing Code Using SIMD Intrinsics (20 points)

Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the #define VECTOR_WIDTH value in CS149intrin.h. Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?


The utilization appears to decrease. First, we need to be clear about how Vector Utilization is computed:

\(Vector \ Utilization = \frac{Utilized \ Vector \ Lanes}{Total \ Vector \ Lanes}\)

\(Total \ Vector \ Lanes = Total \ Vector \ Instructions * Vector \ Width\)

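Several factors also prevent the vector lanes from being fully used by a single vector instruction:

  1. Branches (if)
  2. Loops (while)

These statements always leave some lanes masked off and waiting. When Vector Width grows by some factor, Total Vector Instructions does not fall by the same factor, and Utilized Vector Lanes does not rise by that factor either, so Total Vector Lanes grows faster than Utilized Vector Lanes and the utilization drops.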
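The trend can be illustrated with a simplified model of a masked while loop. This is a sketch, not the CS149intrin logger: it assumes each element needs a random number of loop iterations (0 to 9), that a vector group keeps issuing one instruction per iteration until its slowest lane finishes, and that finished lanes are masked off. All names are illustrative.

```python
import random

def loop_utilization(exponents, vector_width):
    """Utilized lanes / total lanes for a masked loop over `exponents`."""
    utilized = 0
    total = 0
    for start in range(0, len(exponents), vector_width):
        group = exponents[start:start + vector_width]
        iterations = max(group)          # the group runs until its slowest lane is done
        utilized += sum(group)           # lane k is active for group[k] iterations
        total += iterations * vector_width
    return utilized / total

random.seed(0)
exponents = [random.randint(0, 9) for _ in range(10000)]

utilization = {w: loop_utilization(exponents, w) for w in (2, 4, 8, 16)}
for w, u in utilization.items():
    print(f"VECTOR_WIDTH={w:2d}: utilization={u:.3f}")
```

In this model, merging two groups of width W into one of width 2W costs 2W * max(a, b) lanes instead of W * a + W * b, which is never smaller, while the useful work stays the same. Utilization therefore cannot increase as the width doubles, matching the observed decrease.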



