CODES Pitfall Notes: Pits, Pits, and More Pits (ongoing)

Introduction

CODES is a large-scale parallel network simulator that can be used to study many network topologies, routing algorithms, scheduling policies, and communication patterns. In practice, though, I stepped into a great many pits (facepalm), because its documentation is really not great (or maybe I just never found it). After reading a lot of source code I finally figured some things out, and this post describes how to use the CODES simulator.

Postscript: while using it I then kept finding more bugs, and spent ages fixing them... a grown man in tears.

Disclaimer

I have not fully figured this thing out either. Some of the usage and explanations below may be wrong, and places I call pits may just be me misunderstanding or misusing the tool. I make no guarantee about the correctness of anything in this post; please bear with me.

Preface

If you plan to use this simulator for fat tree topologies, you may as well close this page now: after a long stretch of fumbling I still do not know how to simulate a fat tree with it gracefully. In my view it has these problems: (1) the LP-to-node mapping is very unreasonable, which makes scalability atrocious; (2) you can only build a fat tree in exactly the shape it expects, so anything unusual, say a 4-layer fat tree, will not work without changes so large they amount to a rewrite; (3) I forgot, will add when I remember. //TODO

Beyond that, I have only used this simulator for the Dragonfly topology; I have no idea how it behaves for other topologies.

Background

  1. CODES is a simulator built on ROSS. If you do not know what ROSS is, first read my ROSS primer.
  2. This post mainly uses the Dragonfly topology as its running example. If you do not know what a Dragonfly topology is, first read my introduction to the Dragonfly topology.

References

Most of the reference material comes from https://github.com/codes-org/codes/wiki. Note that although CODES also has a GitLab repository, that one has not been maintained since June 2019, so do not knock on the wrong door.

One more caveat: some wiki pages on GitHub are missing their images, but they can be seen in the GitLab repository: https://xgitlab.cels.anl.gov/codes/codes

Official site: https://www.mcs.anl.gov/projects/codes/downloads/

The original CODES paper is "CODES: Co-Design of Multi-layer Exascale Storage Architecture".

The 3 test cases used in the quick-start tutorial all come from the paper "Enabling parallel simulation of large-scale HPC network systems".

Work built on CODES can be found here: https://www.mcs.anl.gov/projects/codes/publications/

Installation

Environment

module load IMPI/2018.1.163-icc-18.0.1 cmake/3.14.3-gcc-4.8.5

Directory layout

- ROSS
	- install
- dumpi-7.1.0
	- install
- dumpi-cortex-master
	- install
- codes
	- build
	- workspace # working directory
- testcase
	- AMG
	- conf # configuration files

All CODES commands below are run from codes/workspace.

Installing DUMPI

Reference: https://github.com/sstsimulator/sst-dumpi

tar -xvf dumpi-7.1.0.tar.gz
cd dumpi-7.1.0
mkdir build
cd build
../configure \
  --prefix=`pwd`/../install \
  --enable-libdumpi \
  --enable-libundumpi \
  CC=mpiicc \
  CFLAGS="-O3 -xhost -ipo" \
  CXX=mpiicpc \
  CXXFLAGS="-O3 -xhost -ipo"
make -j
make -j check
make -j install

Note! This thing has a bug.

You need to modify lines 426-469 of dumpi/common/iodefs.h and remove all the 16-bit restrictions; otherwise ascii2dumpi cannot handle timestamps larger than 65535. The patched routines look like this:

/** Utility routine write timestamps to the stream */
static inline void put_times(dumpi_profile *profile,
	       const dumpi_time *cpu, const dumpi_time *wall,
	       uint8_t config_mask)
{
  if(DO_TIME_CPU(config_mask)) {
    put32(profile, (cpu->start.sec - profile->cpu_time_offset));
    put32(profile, cpu->start.nsec);
    put32(profile, (cpu->stop.sec - profile->cpu_time_offset));
    put32(profile, cpu->stop.nsec);
  }
  if(DO_TIME_WALL(config_mask)) {
    put32(profile, (wall->start.sec - profile->wall_time_offset));
    put32(profile, wall->start.nsec);
    put32(profile, (wall->stop.sec - profile->wall_time_offset));
    put32(profile, wall->stop.nsec);
  }
}

/* Utility routine to read timestamps from the stream */
static inline void get_times(dumpi_profile *profile, dumpi_time *cpu,
	       dumpi_time *wall, uint8_t config_mask)
{
  if(DO_TIME_CPU(config_mask)) {
    cpu->start.sec  = get32(profile) + profile->cpu_time_offset; /* was get16 */
    cpu->start.nsec = get32(profile);
    cpu->stop.sec   = get32(profile) + profile->cpu_time_offset; /* was get16 */
    cpu->stop.nsec  = get32(profile);
  }
  else {
    cpu->start.sec = cpu->start.nsec = 0;
    cpu->stop.sec  = cpu->stop.nsec  = 0;
  }
  if(DO_TIME_WALL(config_mask)) {
    wall->start.sec  = get32(profile) + profile->wall_time_offset;
    wall->start.nsec = get32(profile);
    wall->stop.sec   = get32(profile) + profile->wall_time_offset;
    wall->stop.nsec  = get32(profile);
  }
  else {
    wall->start.sec = wall->start.nsec = 0;
    wall->stop.sec  = wall->stop.nsec  = 0;
  }
}
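Note that widening these fields changes the binary layout of the trace stream: a trace written by a patched libdumpi can only be read back by tools built against the same patched headers, so rebuild libdumpi, libundumpi, and ascii2dumpi together.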

Installing Cortex

Official GitLab repository: https://xgitlab.cels.anl.gov/mdorier/dumpi-cortex

tar -xvf dumpi-cortex-master-2020.10.18.tar.gz
cd dumpi-cortex-master
mkdir build
cd build
cmake .. \
  -G "Unix Makefiles" \
  -DMPICH_FORWARD:BOOL=TRUE \
  -DCMAKE_INSTALL_PREFIX=`pwd`/../install \
  -DDUMPI_ROOT=`pwd`/../../dumpi-7.1.0/install \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_C_FLAGS="-O3 -xhost" \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_CXX_FLAGS="-O3 -xhost" \
  -LH
make -j
make -j install

Installing CODES

Reference: https://github.com/codes-org/codes/wiki/tutorial-installation

Also, what I installed is the June 7, 2021 version on the develop branch, https://github.com/codes-org/codes/tree/6ac258d2bc214bc0406fe18ede4bdc8b6436a496. I did not install the latest release `codes-1.2` because the develop branch seems to have some new features.

A really nasty gotcha here: starting with codes-1.2, Cortex support seems to have been removed from the dragonfly network, so if you want to use Cortex with a dragonfly topology you need codes-1.1.1.

When building CODES-1.2 with Cortex, you must add #undef ENABLE_CORTEX near line 34 of src/networks/model-net/dragonfly-dally.C, otherwise it will not compile; see here for the exact change.

cd codes
bash prepare.sh
mkdir build
cd build
../configure \
  --prefix=`pwd`/../install \
  PKG_CONFIG_PATH=`pwd`/../../ROSS/install/lib/pkgconfig \
  --with-dumpi=`pwd`/../../dumpi-7.1.0/install \
  --with-cortex=`pwd`/../../dumpi-cortex-master/install \
  CC=mpiicc `# remember to change this` \
  CXX=mpiicpc `# remember to change this` \
  CFLAGS="-O3 -xhost" \
  CXXFLAGS="-O3 -xhost" # do not add -ipo here, it breaks the build
make -j # may need to be re-run a few times
make -j check
make -j install

Bugs I found

I ran into the following bugs while using CODES. I suggest applying these fixes first, otherwise the code may fail in all sorts of strange ways.

  1. dragonfly-dally.C needs #undef ENABLE_CORTEX added, because the Cortex path there is not actually implemented; only with it disabled does everything compile. Details here.
  2. The message_size in some of the stock network topology config files does not match what the code expects; those configs must be fixed before they will run. Details here.
  3. model-net-lp.c needs NULL checks on end_ev and end_ev_rc, because they may not be implemented; without the checks the Dragonfly topology cannot be simulated properly. Details here.
  4. Some output in model-net-mpi-replay.c floods the logs at large parallel scales. Details here.
  5. assert(other_id != jid.job); must be commented out, otherwise simulating multiple jobs breaks whenever the first job is of type synthetic. Details here.
  6. max_elapsed_time_per_job[i] only covers the jobs on the current rank; a manual Reduce needs to be added. Details here.
  7. The computation of dragonfly_total_time in src/networks/model-net/dragonfly-custom.C suffers from large floating-point accumulation error and needs to be restructured. Details here.
  8. Around line 3446 of src/networks/model-net/dragonfly-custom.C, if(msg->packet_size < s->params->chunk_size) should be if(cur_entry->msg.packet_size < s->params->chunk_size); otherwise repeated runs give inconsistent results.
  9. The computation of num_chunks in src/networks/model-net/dragonfly-custom.C looks wrong to me. Details here and here.
  10. Abnormal memory consumption; see the "Memory usage" section of this post.
  11. tw_eventid in ROSS has type unsigned int, so once a simulation runs long enough it wraps and everything falls over; changing it to unsigned long long fixes this (see the sketch after this list).
  12. NUM_OUT_MESG in ROSS is 2000, which overflows if a lot is printed in a short time; increase it.
  13. The way time is advanced during MPI simulation is unreasonable. Details here.
  14. The rollback logic of update_completed_queue_rc() is broken. Details here.
  15. The implementation of MPI_ANY_SOURCE is broken. Details here.
  16. The data types of the various per-route packet counters are too small. Details here.
  17. The congestion control logic is broken. Details here.
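For bug 11, the change is a one-line typedef widening in ROSS. A minimal sketch, assuming tw_eventid is declared as a plain typedef (the exact header and line vary between ROSS versions):

/* Widen ROSS event ids from 32 to 64 bits so they no longer wrap
 * during long-running simulations. */
typedef unsigned long long tw_eventid; /* was: typedef unsigned int tw_eventid; */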

Examples from the official tutorial

Exercise 0

Reference: https://github.com/codes-org/codes/wiki/tutorial-codes-synthetic-workload-development

This example seems to ship only with codes-1.2; it is not in codes-1.1.1.

cd codes/workspace
mpirun -n 4 ../build/doc/example/tutorial-synthetic-ping-pong \
	--synch=3 \
	-- ../doc/example/tutorial-ping-pong.conf

This example only needs to run to completion; it verifies that the installation is basically sound.

The examples that follow come from https://github.com/codes-org/codes/wiki/quick-start-interconnects and take you through CODES step by step.

Exercise 1

This exercise simulates a single job with point-to-point communication only.

Dragonfly Network Model

3.5K node dragonfly, Theta Configuration

For this example, edit line 44 of the network config file ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-theta.conf:

cd codes/workspace
vim ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-theta.conf
# ROSS message size
   message_size="736";

Otherwise you get the error below. If you hit a similar error elsewhere, the fix is analogous. The change is harmless; it only alters the size of the messages ROSS passes around.

node: 0: error: ../src/networks/model-net/core/model-net.c:322: Error: model_net trying to transmit an event of size 736 but ROSS is configured for events of size 624

Then run:

cd codes/workspace
../build/src/network-workloads/model-net-mpi-replay \
  --sync=1 \
  --num_net_traces=216 \
  --workload_file=../../testcase/AMG/df_AMG_n216_dumpi/dumpi-2014.03.03.14.55.23- \
  --workload_type=dumpi \
  -- ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-theta.conf

Fat Tree Network Model

3.5K node fat tree, based on Summit configuration

This example likewise requires editing line 15 of the network config file ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf:

cd codes/workspace
vim ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf
# ROSS message size
message_size="736";

Otherwise you get the following error:

node: 0: error: ../src/networks/model-net/core/model-net.c:322: Error: model_net trying to transmit an event of size 736 but ROSS is configured for events of size 592

Then run:

cd codes/workspace
../build/src/network-workloads/model-net-mpi-replay \
  --disable_compute=1 \
  --sync=1 \
  --num_net_traces=216 \
  --workload_file=../../testcase/AMG/df_AMG_n216_dumpi/dumpi-2014.03.03.14.55.23- \
  --workload_type=dumpi \
  -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf

Exercise 2

This exercise simulates a single job that includes collective communication.

cd codes/workspace
../build/src/network-workloads/model-net-mpi-replay \
  --disable_compute=1 \
  --sync=1 --debug_cols=1 \
  --num_net_traces=125 \
  --workload_file=../../testcase/MultiGrid_C/MultiGrid_C_n125_dumpi/dumpi-2014.03.06.23.48.13- \
  --workload_type=dumpi \
  -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf

Exercise 3

This exercise simulates several jobs at once, with each job occupying a contiguous block of nodes.

cd codes/workspace
cp ../src/network-workloads/conf/workloads.conf ./
cp ../src/network-workloads/conf/allocation-cont.conf ./
vim workloads.conf
vim allocation-cont.conf

In workloads.conf, edit the path and the other parameters of the test case on the second line, as shown below. As you can see, 216 ranks here simulate background communication noise from other applications in the network, while 125 ranks simulate our job MultiGrid_C_n125_dumpi. On the develop branch you must also supply two extra parameters per line: the first is the QoS class, the second is the interval (in ns) between communication rounds for synthetic jobs, so each line reads <ranks> <trace prefix or synthetic> <QoS class> <interval>. Without these two parameters the program dies with no error message at all.

216 synthetic 0 1000
125 ../../testcase/MultiGrid_C/MultiGrid_C_n125_dumpi/dumpi-2014.03.06.23.48.13- 0 1000

In allocation-cont.conf, delete the third line, since workloads.conf contains only 2 jobs; otherwise you get the following error:

tmp is NULL, rank=82, app_id = 2model-net-mpi-replay: ../src/workload/codes-workload.c:237: codes_workload_get_next: Assertion `tmp' failed.
Aborted

Run:

cd codes/workspace
../build/src/network-workloads/model-net-mpi-replay \
  --disable_compute=1 \
  --sync=1 \
  --workload_conf_file=./workloads.conf \
  --alloc_file=./allocation-cont.conf \
  --workload_type=dumpi \
  -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf

Exercise 4

This exercise simulates several jobs at once, with each job's nodes allocated randomly.

Keep the workloads.conf from before, and switch to randomly allocated nodes:

cd codes/workspace
cp ../src/network-workloads/conf/allocation-random.conf ./
vim allocation-random.conf

In allocation-random.conf, delete the third line, since workloads.conf contains only 2 jobs; otherwise you get the following error:

tmp is NULL, rank=79, app_id = 2model-net-mpi-replay: ../src/workload/codes-workload.c:237: codes_workload_get_next: Assertion `tmp' failed.
Aborted

Run:

cd codes/workspace
../build/src/network-workloads/model-net-mpi-replay \
  --disable_compute=1 \
  --sync=1 \
  --workload_conf_file=./workloads.conf \
  --alloc_file=./allocation-random.conf \
  --workload_type=dumpi \
  -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf

Exercise 5

Take the Exercise 4 run, add the --synch=3 parameter, and launch it with mpirun; this gives you a parallel simulation, as sketched below.
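Concretely, a sketch based on the Exercise 4 command, with --sync=1 replaced by --synch=3 (the process count 4 is arbitrary):

cd codes/workspace
mpirun -np 4 ../build/src/network-workloads/model-net-mpi-replay \
  --disable_compute=1 \
  --synch=3 \
  --workload_conf_file=./workloads.conf \
  --alloc_file=./allocation-random.conf \
  --workload_type=dumpi \
  -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf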

Exercise 6

This exercise covers the mean_interval parameter, but from reading the code that parameter no longer appears to do anything.

Dragonfly topology configuration

CODES contains several Dragonfly models:

  • dragonfly_dally: models the plain dragonfly topology, i.e. the one from the paper Technology-Driven, Highly-Scalable Dragonfly Topology. "Dally" is the surname of one of that paper's authors. If you do not know what a Dragonfly is, go read an introduction to the Dragonfly topology first.
  • dragonfly_custom: models the Dragonfly variant of the Cray XC series.
  • dragonfly_plus: TODO, I have not figured this one out yet.

This section covers the 3 models in turn. Note that starting with CODES 1.2, only dragonfly_custom supports Cortex, whereas dragonfly_dally and dragonfly_plus support congestion control.

dragonfly_dally

Reference: https://github.com/codes-org/codes/wiki/codes-dragonfly-old

Step 1

Determine the topology parameters:

  • router_radix: the number of ports per router
  • num_conn_between_groups: the number of links between each pair of groups

Setting up a dragonfly_dally topology later requires the bundled script scripts/dragonfly-dally/dragonfly-dally-topo-gen.py to generate the wiring. Normally a dragonfly's a, p, and h can be chosen independently, but this script forces a = 2p = 2h; modifying the script could probably lift this restriction, though I have not tried. From the two parameters above you can derive the following (a worked example follows the list):

  • routers per group: a = (router_radix + 1) / 2
  • nodes per router: p = a / 2
  • global links per router: h = a / 2
  • number of groups: g = a * h / num_conn_between_groups + 1
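For example, the 16K configuration generated below uses router_radix=31 and num_conn_between_groups=1, which gives:

a = (31 + 1) / 2 = 16 routers per group
p = 16 / 2 = 8 nodes per router
h = 16 / 2 = 8 global links per router
g = 16 * 8 / 1 + 1 = 129 groups, for 129 * 16 * 8 = 16512 nodes in total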

Step 2

Generate the topology files with the script scripts/dragonfly-dally/dragonfly-dally-topo-gen.py:

python3 dragonfly-dally-topo-gen.py 31 1 ../../src/network-workloads/conf/dragonfly-dally/dfdally_16k_intra ../../src/network-workloads/conf/dragonfly-dally/dfdally_16k_inter

Step 3

Edit the network config file; I used src/network-workloads/conf/dragonfly-dally/dfdally_8k.conf as a template.

Here is the config file I ended up with:

LPGROUPS
{
    MODELNET_GRP
    {
        # total nodes = 129groups * 16routers * 8nodes = 16512nodes
        repetitions="2064";
        # name of this lp changes according to the model
        nw-lp="8";
        # these lp names will be the same for dragonfly-custom model
        modelnet_dragonfly_dally="8";
        modelnet_dragonfly_dally_router="1";
    }
}
PARAMS
{
# packet size in the network
   packet_size="4096";
   modelnet_order=( "dragonfly_dally","dragonfly_dally_router" );
   # scheduler options
   modelnet_scheduler="fcfs";
# chunk size in the network (when chunk size = packet size, packets will not be
# divided into chunks)
   chunk_size="512";
   # modelnet_scheduler="round-robin";
   num_routers="16";
   num_groups="129";
# buffer size in bytes for local virtual channels
   local_vc_size="16384";
#buffer size in bytes for global virtual channels
   global_vc_size="16384";
#buffer size in bytes for compute node virtual channels
   cn_vc_size="32768";
#bandwidth in GiB/s for local channels
   local_bandwidth="14";
# bandwidth in GiB/s for global channels
   global_bandwidth="14";
# bandwidth in GiB/s for compute node-router channels
   cn_bandwidth="14";
# ROSS message size
   message_size="768";
# number of compute nodes connected to router, dictated by dragonfly config
# file
   num_cns_per_router="8";
# number of global channels per router
   num_global_channels="8";
# network config file for intra-group connections
   intra-group-connections="../src/network-workloads/conf/dragonfly-dally/dfdally_16k_intra";
# network config file for inter-group connections
   inter-group-connections="../src/network-workloads/conf/dragonfly-dally/dfdally_16k_inter";
# routing protocol to be used
   routing="prog-adaptive";
   adaptive_threshold="16384";
   route_scoring_metric="delta";
}

Important notes

  • num_global_channels here is the total number of links a router has going to other groups, not the number of links between two routers of different groups.
  • cn_radix, in the code and in the console output, is the number of links a router has going to compute nodes, computed as p->cn_radix = (p->num_cn * p->num_rails) / p->num_planes; in most cases this is simply the number of nodes attached to a router, i.e. num_cns_per_router in the config file (num_cn in the code). See the worked example after this list.
  • num_routers is the number of routers per group.
  • num_qos_levels sets the number of QoS rate classes; read the code for the details, it is quite interesting.
  • total_groups and num_groups generally mean the same thing, unless there are multiple num_planes.
  • virtual radix in the output simply means the router port count.
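A worked example for the cn_radix formula, using the 16K configuration above and assuming the defaults num_rails=1 and num_planes=1 (my assumption; check your own config): cn_radix = (8 * 1) / 1 = 8, which equals num_cns_per_router.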

Routing

minimal

Plain shortest-path routing; when several shortest paths exist, one is chosen at random.

nonminimal

Colloquially: "Valiant Group Routing". This follows the randomized indirect routing algorithm detailed in "Cost-Efficient Dragonfly Topology for Large-Scale Systems" and "Technology-Driven, Highly-Scalable Dragonfly Topology" by Kim, Dally, Scott, and Abts. They sourced it from "A scheme for fast parallel communication" by L.G. Valiant. It differs from true Valiant routing in that it randomly selects a GROUP and routes to it, not a random intermediate router.

adaptive

The standard adaptive algorithm is not implemented.

prog-adaptive

Uses the PAR (progressive adaptive routing) algorithm, implemented in dfdally_prog_adaptive_routing() in src/networks/model-net/dragonfly-dally.C.

Note: the comments in the code are not necessarily correct!

smart-*

Algorithms for routing around local link failures; performance is somewhat lower.

SMART ROUTING: Smart routing is my term for failure aware routing. Leverages the network manager's ability to filter out invalid connections and to return only valid paths between endpoints. It's significantly slower in runtime, typically, and is very experimental at this point. Use at own risk. When using this function, you should assume that the self router is NOT the destination. That should be handled elsewhere.

prog-adaptive-legacy

Reportedly buggy; do not use.

LEGACY DRAGONFLY DALLY ROUTING - Sept. 2019, the state of dragonfly routing became difficult to maintain. There is little documentation to justify decisions in the code and consequently it is near impossible to tell the difference between intended behavior and a bug. This is not sustainable and lowers confidence in the model. Getting the routing functions all up to date with the latest updates to the model was very challenging and the decision was made to rewrite the routing from the ground up. We didn't want to completely scrap what had been done previously as we wanted to still be able to recreate experiments with the same routing strategy as before. So the old routing functions have been preserved with minimal modifications to work with the latest data structures used by the model. Specify: routing="prog-adaptive-legacy" in the configuration file to use. USE AT OWN RISK. One should consider support for this routing algorithm ended. From thorough analysis, I believe that there are bugs and unintended behavior in this code but I do not have proper documentation to say for certain. Thanks, -NM

dragonfly_custom

Reference: https://github.com/codes-org/codes/wiki/codes-dragonfly

This is the Dragonfly variant used in the Cray XC series. If you do not know what that is, go read an introduction to the Dragonfly topology first.

Step 1

Determine the topology parameters:

  • g: number of groups in the network
  • r: number of router rows within a group
  • c: number of router columns within a group
  • can: connections across groups (number of redundant channels between groups)
  • cir: connections in row (number of channels between two routers in same row)
  • cic: connections in column ( number of channels between two routers in same column)

Pay special attention to the meanings of cir and cic: they are the opposite of what intuition suggests.

Step 2

Generate the topology wiring files with connections_general_patched, located at scripts/gen-cray-topo/connections_general_patched.c; it must be compiled by hand before use.

It is invoked as follows:

./connections_general_patched g r c can cir cic intra-file inter-file 

The first 6 parameters are described in Step 1; intra-file is the output config file describing intra-group links, and inter-file the one describing inter-group links.

To generate a fully populated Cray XC network:

icc connections_general_patched.c -O3 -std=c99 -o connections_general_patched

./connections_general_patched 241 6 16 4 1 3 ../../src/network-workloads/conf/dragonfly-custom/intra-full ../../src/network-workloads/conf/dragonfly-custom/inter-full
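As a sanity check on these parameters: 241 groups * 6 rows * 16 columns = 23136 routers; with the 4 nodes per router set in the Step 3 config below, that is 23136 * 4 = 92544 nodes, matching repetitions="23136" there.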

Step 3

Edit the network config file; I used src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-theta.conf as a template.

Here is the config file I ended up with:

LPGROUPS
{
   MODELNET_GRP
   {
      # Total nodes = repetitions * 4 = 92544
      # repetitions = 6 * 16 * 241
      repetitions="23136";
      # nw-lp = number of nodes connected to a router
      nw-lp="4";
      # modelnet_dragonfly_custom = number of nodes connected to a router
      modelnet_dragonfly_custom="4";
      # keep it 1
      modelnet_dragonfly_custom_router="1";
   }
}
PARAMS
{
   # packet size in the network 
   packet_size="2048";
   modelnet_order=( "dragonfly_custom","dragonfly_custom_router" );
   # scheduler options
   modelnet_scheduler="fcfs";
   # chunk size in the network (when chunk size = packet size, packets will not be divided into chunks)
   chunk_size="2048";
   # number of router in a rows 
   num_router_rows="6";
   # number of router in a columns 
   num_router_cols="16";
   # number of groups in the network
   num_groups="241";
   # buffer size in bytes for local virtual channels 
   local_vc_size="8192";
   #buffer size in bytes for global virtual channels 
   global_vc_size="16384";
   #buffer size in bytes for compute node virtual channels 
   cn_vc_size="8192";
   # bandwidth in GiB/s for local channels 
   local_bandwidth="5.25";
   # bandwidth in GiB/s for global channels 
   global_bandwidth="1.5";
   # bandwidth in GiB/s for compute node-router channels 
   cn_bandwidth="16.0";
   # ROSS message size 
   message_size="768";
   # number of compute nodes connected to router, dictated by dragonfly config
   # file
   num_cns_per_router="4";
   # number of global channels per router 
   num_global_channels="10";
   # network config file for intra-group connections 
   intra-group-connections="../src/network-workloads/conf/dragonfly-custom/intra-full";
   # network config file for inter-group connections
   inter-group-connections="../src/network-workloads/conf/dragonfly-custom/inter-full";
   # routing protocol to be used 
   routing="adaptive";
}

Important note: num_global_channels here is the total number of links a router has going to other groups, not the number of links between two routers in different groups.

dragonfly_plus

TODO, I have not figured this one out yet.

Synthetic traffic configuration

Traffic patterns

Synthetic traffic can be understood as communication from other applications in the network, generated automatically by CODES. CODES supports the following 6 synthetic traffic patterns:

/* type of synthetic traffic */
enum TRAFFIC
{
    UNIFORM = 1, /* sends message to a randomly selected node */
    NEAREST_NEIGHBOR = 2, /* sends message to the next node (potentially connected to the same router) */
    ALLTOALL = 3, /* sends message to all other nodes */
    STENCIL = 4, /* sends message to 4 nearby neighbors */
    PERMUTATION = 5,
    BISECTION = 6
};

For example, to use the STENCIL pattern, simply change the job name in workload.conf to synthetic4, as shown below.
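For example, reusing the line format from Exercise 3's workloads.conf:

216 synthetic4 0 1000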

Message sizes

The command-line parameter --payload_sz=n sets the amount of data, n bytes, that each communicating pair sends per round. In the code this is the variable payload_sz.

The 4th parameter on each line of workload.conf sets how often a burst of traffic is sent, in ns. In the code this is the variable mean_interval_of_job.

The command-line parameter --max_gen_data=n caps the total traffic generated by each simulated process at n bytes, as in the sketch below.
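Putting these flags together, a sketch of a synthetic run (paths reused from Exercises 4 and 5; the flag values here are arbitrary examples, not recommendations):

cd codes/workspace
mpirun -np 4 ../build/src/network-workloads/model-net-mpi-replay \
  --disable_compute=1 \
  --synch=3 \
  --workload_conf_file=./workloads.conf \
  --alloc_file=./allocation-random.conf \
  --workload_type=dumpi \
  --payload_sz=2048 \
  --max_gen_data=1000000 \
  -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf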

Speedup tips

Reference: https://github.com/codes-org/codes/wiki/Optimistic-Performance-Tuning-Tips

The really frustrating part: at first I tuned things by following those Tuning Tips, and scalability was still awful. Only after studying ROSS in depth and tuning based on my own understanding did performance jump.

  • --synch=5 --gvt-interval=100: synchronize once every 100 ms of wall-clock time.
  • Spread jobs evenly or randomly across the topology, because nodes and routers are mapped linearly onto the ranks; concentrating the job on a subset of nodes also slows the simulation down.

Memory usage

This thing turned out to be a huge memory hog, so I dug through the source to figure out where the memory actually goes.

Caching MPI operations

dumpi_insert_next_op() in src/workload/methods/codes-dumpi-trace-nw-wrkld.c (line 162) caches every MPI operation in memory. Each MPI operation occupies 72 bytes, allocated in blocks of 32768 operations; every logical process that replays MPI operations allocates one block up front and appends more as needed. In other words, simulating a 4096-process MPI job needs at least 32768 * 4096 * 72 bytes = 9 GiB just to cache MPI operations.

During initialization you can clearly watch free memory shrink while these allocations happen.

Sampling memory usage

nw_test_init() in src/network-workloads/model-net-mpi-replay.c (line 2308) contains the following line, which records sampling data:

s->mpi_wkld_samples = (struct mpi_workload_sample*)calloc(MAX_STATS, sizeof(struct mpi_workload_sample));

Even when sampling is disabled this occupies a sizable chunk of memory; change it to:

if(enable_sampling) {
    s->mpi_wkld_samples = (struct mpi_workload_sample*)calloc(MAX_STATS, sizeof(struct mpi_workload_sample));
}

Honestly, a lot of this sampling code looks wrong to me, but since I do not use the sampling feature I left it alone.

Packet generation triggering

The two call sites of model_net_method_idle_event() in src/networks/model-net/dragonfly-custom.C also look wrong to me; in practice they let packets pile up massively when sending large messages, consuming a lot of memory.

Originally they were:

nic_ts = g_tw_lookahead + (num_chunks * cn_delay) + tw_rand_unif(lp->rng);
// ...
if(s->terminal_length < 2 * s->params->cn_vc_size) { //TODO This hardcoded 2 * s->params->cn_vc_size seems dubious
    model_net_method_idle_event(nic_ts, 0, lp);
} else {
    bf->c11 = 1;
    s->issueIdle = 1;
// ...
if(s->issueIdle) {
    bf->c5 = 1;
    s->issueIdle = 0;
    ts += tw_rand_unif(lp->rng);
    model_net_method_idle_event(ts, 0, lp);
// ...

I changed them to the following, so that another idle event is only issued once the terminal's queue has drained to at most one chunk:

nic_ts = tw_rand_unif(lp->rng); // changed trigger delay
// ...
if(s->terminal_length <= s->params->chunk_size) { // changed trigger condition
    model_net_method_idle_event(nic_ts, 0, lp);
} else {
    bf->c11 = 1;
    s->issueIdle = 1;
// ...
if(s->issueIdle && s->terminal_length <= s->params->chunk_size) { // changed trigger condition
    bf->c5 = 1;
    s->issueIdle = 0;
    ts = tw_rand_unif(lp->rng); // changed trigger delay
    model_net_method_idle_event(ts, 0, lp);
// ...

Freeing memory

src/networks/model-net/dragonfly-dally.C forgets to free the memory behind s->cc_st; details here.

Out-of-memory kills

If the following error appears during initialization (no matter which test case you run):

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 230290 RUNNING AT cpn261
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

then most likely --extramem= is set too large and should be reduced.

Tests

Fat tree topology scalability test

for i in {1..36}
do
    echo "-----------------------------------------------------------------------------------------------------------------------------------------"
    echo $i
    mpirun -np $i \
    ../build/src/network-workloads/model-net-mpi-replay \
    `# ROSS parameter` \
    --synch=5 \
    --nkp=32 \
    --gvt-interval=100 \
    --clock-rate=2700000000 \
    --report-interval=0.1 \
    --max-opt-lookahead=15 \
    --extramem=1000000 \
    `# CODES parameter` \
    --workload_type="dumpi" \
    --workload_file=../../testcase/AMG/df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- \
    --num_net_traces=1728 \
    --disable_compute=1 \
    -- ../src/network-workloads/conf/modelnet-mpi-fattree-summit-k36-n3564.conf
done

Results

: Running Time = 717.5523 seconds
: Running Time = 4278.5345 seconds
: Running Time = 645.7763 seconds
: Running Time = 3012.0889 seconds
: Running Time = 498.2999 seconds
: Running Time = 2423.0130 seconds
: Running Time = 396.3798 seconds
: Running Time = 1920.7409 seconds
: Running Time = 381.0496 seconds
: Running Time = 1667.2916 seconds
: Running Time = 260.2268 seconds
: Running Time = 1325.1419 seconds
: Running Time = 1346.8052 seconds
: Running Time = 1153.2524 seconds
: Running Time = 1239.9438 seconds
: Running Time = 884.3635 seconds
: Running Time = 1056.7081 seconds
: Running Time = 674.0260 seconds
: Running Time = 899.2270 seconds
: Running Time = 588.0981 seconds
: Running Time = 807.9117 seconds
: Running Time = 307.6261 seconds
: Running Time = 705.6639 seconds
: Running Time = 866.5192 seconds
: Running Time = 693.8394 seconds
: Running Time = 907.6869 seconds
: Running Time = 528.8084 seconds
: Running Time = 858.9517 seconds
: Running Time = 545.1424 seconds
: Running Time = 706.0114 seconds
: Running Time = 447.5965 seconds
: Running Time = 718.6728 seconds
: Running Time = 454.8968 seconds
: Running Time = 522.9457 seconds
: Running Time = 763.8724 seconds
: Running Time = 562.4138 seconds

As you can see, the scalability of the fat tree simulation is pretty dismal.

Dragonfly scalability test

for i in {1..36}
do
    echo "-----------------------------------------------------------------------------------------------------------------------------------------"
    echo $i
mpirun -n $i \
    ../build/src/network-workloads/model-net-mpi-replay \
    `# ROSS parameter` \
    --synch=5 \
    --nkp=32 \
    --gvt-interval=100 \
    --clock-rate=2700000000 \
    --report-interval=0.1 \
    --max-opt-lookahead=15 \
    --extramem=1000000 \
    `# CODES parameter` \
    --workload_type="dumpi" \
    --workload_file=../../testcase/AMG/df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- \
    --num_net_traces=1728 \
    --disable_compute=1 \
    -- ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-edison.conf
done

Results

: Running Time = 172.1530 seconds
: Running Time = 179.5826 seconds
: Running Time = 197.6603 seconds
: Running Time = 148.8481 seconds
: Running Time = 103.7745 seconds
: Running Time = 97.3927 seconds
: Running Time = 90.8467 seconds
: Running Time = 77.6732 seconds
: Running Time = 72.7120 seconds
: Running Time = 71.8818 seconds
: Running Time = 63.1687 seconds
: Running Time = 60.0188 seconds
: Running Time = 61.8930 seconds
: Running Time = 54.7425 seconds
: Running Time = 49.0221 seconds
: Running Time = 45.6705 seconds
: Running Time = 44.8724 seconds
: Running Time = 41.8980 seconds
: Running Time = 41.3749 seconds
: Running Time = 39.8443 seconds
: Running Time = 37.0088 seconds
: Running Time = 37.6463 seconds
: Running Time = 36.9844 seconds
: Running Time = 35.2657 seconds
: Running Time = 35.2321 seconds
: Running Time = 34.8534 seconds
: Running Time = 34.0156 seconds
: Running Time = 32.5350 seconds
: Running Time = 32.3108 seconds
: Running Time = 31.0300 seconds
: Running Time = 31.3153 seconds
: Running Time = 31.7502 seconds
: Running Time = 31.8126 seconds
: Running Time = 31.1104 seconds
: Running Time = 32.7328 seconds
: Running Time = 32.0223 seconds

The odd speedup numbers here have a cause: the job is packed onto the first nodes, and since logical nodes are mapped linearly onto the ranks this creates load imbalance; effectively only the front ranks do useful simulation (nodes actually running the job) while the later ranks spin idle (simulating empty nodes).

Fully populated Dragonfly scalability test

This simulates a fully populated Cray XC (92544 nodes) running one 1728-node job, with background network interference, and measures the speed.

for i in {1..36}
do
    echo "-----------------------------------------------------------------------------"
    echo $i
    mpirun -n $i \
        ../build/src/network-workloads/model-net-mpi-replay \
        `# ROSS parameter` \
        --synch=5 \
        --nkp=16 \
        --gvt-interval=100 \
        --clock-rate=2700000000 \
        --report-interval=0.1 \
        --max-opt-lookahead=15 \
        --extramem=200000 \
        `# CODES parameter` \
        --workload_type="dumpi" \
        --workload_conf_file=../../testcase/conf/workloads.conf \
        --alloc_file=../../testcase/conf/random_allocation.conf \
        --disable_compute=1 \
        -- ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-crayxc-full.conf

    sleep 3
    pkill -9 pmi_proxy
done

where ../../testcase/conf/workloads.conf is:

90816 synthetic 0 10000
1728 ../../testcase/AMG/df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- 0 1

and ../../testcase/conf/random_allocation.conf contains a generated random node allocation.

Results

: Running Time = 1407.5478 seconds
: Running Time = 773.4938 seconds
: Running Time = 626.1771 seconds
: Running Time = 456.3341 seconds
: Running Time = 393.5740 seconds
: Running Time = 336.1034 seconds
: Running Time = 298.6062 seconds
: Running Time = 284.0699 seconds
: Running Time = 249.4666 seconds
: Running Time = 236.4178 seconds
: Running Time = 222.9969 seconds
: Running Time = 217.6360 seconds
: Running Time = 204.2523 seconds
: Running Time = 196.8871 seconds
: Running Time = 190.6230 seconds
: Running Time = 166.5858 seconds
: Running Time = 160.5734 seconds
: Running Time = 157.4772 seconds
: Running Time = 153.9088 seconds
: Running Time = 148.5632 seconds
: Running Time = 145.8427 seconds
: Running Time = 142.1933 seconds
: Running Time = 141.3902 seconds
: Running Time = 135.1340 seconds
: Running Time = 132.2011 seconds
: Running Time = 130.1229 seconds
: Running Time = 128.7008 seconds
: Running Time = 125.8727 seconds
: Running Time = 122.3352 seconds
: Running Time = 125.8925 seconds
: Running Time = 122.2209 seconds
: Running Time = 121.4563 seconds
: Running Time = 120.9277 seconds
: Running Time = 117.9274 seconds
: Running Time = 119.0675 seconds
: Running Time = 123.3676 seconds

Scalability now looks much more reasonable. Running on 2 nodes gives:

        : Running Time = 100.3780 seconds

TW Library Statistics:
        Total Events Processed                               277210872
        Events Aborted (part of RBs)                                 0
        Events Rolled Back                                    21763058
        Event Ties Detected in PE Queues                             0
        Efficiency                                               91.48 %
        Total Remote (shared mem) Events Processed                   0
        Percent Remote Events                                     0.00 %
        Total Remote (network) Events Processed               24337515
        Percent Remote Events                                     9.53 %

        Total Roll Backs                                       4423448
        Primary Roll Backs                                     1960034
        Secondary Roll Backs                                   2463414
        Fossil Collect Attempts                                5510448
        Total GVT Computations                                   76534

        Net Events Processed                                 255447814
        Event Rate (events/sec)                              2544859.3
        Total Events Scheduled Past End Time                         0

It looks like 2 nodes gives no obvious speedup here, which is due to insufficient computation. If the synthetic job's traffic interval is shortened to 1000, i.e. with the configuration below:

90816 synthetic 0 1000
1728 ../../testcase/AMG/df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- 0 1

then the single-node running time becomes 516.9810 seconds and the two-node running time 293.2926 seconds.