fedml-1. Configuration, Running & Customization
Clone the FedML repository
```bash
git clone https://github.com/FedML-AI/FedML.git
```
- Browse the source code
- Use the examples to test the environment:
  - `FedML/iot/anomaly_detection_for_cybersecurity`
  - `FedML/python/setup.py`
WSL + CUDA
Install the NVIDIA driver with WSL support
Download and install CUDA
- Do not download the latest version (11.7); PyTorch does not support it yet
```bash
sudo apt update
sudo apt install build-essential  # C++ compiler, make, etc.
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda_11.6.2_510.47.03_linux.run
sudo bash cuda_11.6.2_510.47.03_linux.run
```
Environment variables
- Append the following to `~/.bashrc`:

```bash
export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

- Apply the changes, then install the required libraries:

```bash
source ~/.bashrc
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
```

- Check that the installation succeeded:
```bash
nvcc -V
```
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```
Install cuDNN
Download and install
```bash
sudo apt-get install zlib1g
# download the archive from https://developer.nvidia.com/rdp/cudnn-download
tar -xvf cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive.tar.xz
sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
```

It later turned out that running a CNN with this setup fails with an error that the cuDNN CNN shared libraries cannot be found, so the following method was used instead:
```bash
conda install -c nvidia cudnn
```
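To confirm that the conda-installed cuDNN is actually picked up, a quick check through PyTorch works (a minimal sketch, assuming PyTorch is already installed in the same environment):

```python
import torch

# all three calls are part of PyTorch's public API
print(torch.cuda.is_available())            # True if CUDA works at all
print(torch.backends.cudnn.is_available())  # True if a usable cuDNN was found
print(torch.backends.cudnn.version())       # e.g. 8401 for cuDNN 8.4.1
```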
| cuDNN Package | Supported NVIDIA Hardware | CUDA Toolkit Version | CUDA Compute Capability | Supports static linking? |
|---|---|---|---|---|
| cuDNN 8.4.1 for CUDA 11.x | | 11.7, 11.6, 11.5, 11.4, 11.3 | SM 3.5 and later | Yes |
| | | 11.1, 11.0 | | No |
| cuDNN 8.4.1 for CUDA 10.2 | | 10.2 | SM 3.0 and later | Yes |
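To see where your own card falls in the compute-capability column, PyTorch can report the SM version (assuming a CUDA-enabled PyTorch build):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: SM {major}.{minor}")
```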
Install and configure FedML
Install Miniconda
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh  # run the installer
```
Add a Miniconda mirror channel
```bash
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
```
Install FedML
```bash
conda create --name fedml python=3.7
```
Notes

```
Common pip mirrors:
```

- Python package: `wasabi`, a console printing and formatting toolkit
Set up the FedML environment
- Go into the FedML repository, under `FedML/python`, and run `setup.py` (command below)
- PyTorch install page: choose the command for your CUDA version, and remove `-c forge` so it is not downloaded from the official channel
```bash
python3 setup.py install
```
- Uninstall PyTorch, then reinstall the build that matches your CUDA version
```bash
conda uninstall *torch* cudatoolkit
```
Run the demos
iot
server
```bash
conda activate fedml
```
client-1
```bash
conda activate fedml
```
client-2
```bash
conda activate fedml
```
- It later turned out that this demo targets Raspberry Pi and Jetson Nano devices and needs some additional configuration
- This example contains a custom data loader and trainer, which makes it a valuable reference
mpi_torch_fedopt_mnist_lr_example
- In its config file, this example is set to run in simulation mode
- Run it in simulation mode as a single process:

```bash
bash run_step_by_step_example.sh 2
```
- With an argument of 2, training completes; if the argument (i.e., the number of workers) is too large, the following message appears, presumably due to running out of memory:
```
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5218 RUNNING AT tt
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
```

- After training finishes, a model file named `fedml` is generated under `./tmp/`; `cat`-ing it prints:

```bash
cat fedml
```
```
training is finished!
<fedml.arguments.Arguments object at 0x7ff03c13b210>
```
Train with the GPU
Modify the config file
- Edit the content under the `device_args` key in `config/fedml_config.yaml`:

```yaml
device_args:
  worker_num: 3
  using_gpu: true
  gpu_mapping_file: config/gpu_mapping.yaml  # location of the mapping file
  gpu_mapping_key: mapping_tt                # which mapping entry to use
```
Modify the mapping file
- Add a `mapping_tt` entry to `config/gpu_mapping.yaml`, e.g. four processes on GPU 0 of host `tt` (matching the run below):

```yaml
mapping_tt:
  tt: [4]
```
An entry has the following format: for each hostname, it specifies how many processes to place on each GPU.
```yaml
## config_cluster0:
##   host_name_node0: [num_of_processes_on_GPU0, num_of_processes_on_GPU1, num_of_processes_on_GPU2, num_of_processes_on_GPU3, ..., num_of_processes_on_GPU_n]
##   host_name_node1: [num_of_processes_on_GPU0, num_of_processes_on_GPU1, num_of_processes_on_GPU2, num_of_processes_on_GPU3, ..., num_of_processes_on_GPU_n]
##   ......
##   host_name_node_m: [num_of_processes_on_GPU0, num_of_processes_on_GPU1, num_of_processes_on_GPU2, num_of_processes_on_GPU3, ..., num_of_processes_on_GPU_n]
```

Run it. Four processes were configured above and `worker_num` is set to 3, so the script argument here is 3: 3 workers + 1 server.

```bash
bash run_step_by_step_example.sh 3
```
During the run, output like the following appears in the shell:

```
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 20:54:06] [INFO] [gpu_mapping_mpi.py:51:mapping_processes_to_gpu_device_from_yaml_file_mpi] process_id = 2, GPU device = cuda:0
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 20:54:06] [INFO] [device.py:78:get_device] device = cuda:0
```

Checking the discrete GPU's utilization in Task Manager confirms that the GPU is indeed being used.
Running cross_silo
- Not a single-process simulation; multiple devices participate
config/fedml_config.yaml
- Configure the `comm_args` section:
```yaml
comm_args:
  backend: "GRPC"
  grpc_ipconfig_path: config/grpc_ipconfig.csv
```
grpc_ipconfig.csv
- Install gRPC with `pip install grpcio`; communication uses the gRPC protocol
- Create this file under `/config` and fill in id-to-IP pairs: `0` is the server, `1...n` are the workers
| receiver_id | ip |
|---|---|
| 0 | 127.0.0.1 |
| 1 | 127.0.0.1 |
| 2 | 127.0.0.1 |
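Since the file is plain CSV, it can also be generated with a short script; a sketch that writes exactly the table above (the `config/` path is assumed relative to the example's working directory):

```python
import csv
import os

# receiver_id 0 is the server; 1..n are the workers (all on localhost here)
rows = [
    ("receiver_id", "ip"),
    (0, "127.0.0.1"),
    (1, "127.0.0.1"),
    (2, "127.0.0.1"),
]
os.makedirs("config", exist_ok=True)
with open("config/grpc_ipconfig.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```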
Results
- No output model file was found; only the following output appeared after the final round:
```
[FedML-Server(0) @device-id-0] [Fri, 29 Jul 2022 22:54:08] [INFO] [fedml_aggregator.py:195:test_on_server_for_all_clients] ################test_on_server_for_all_clients : 49
[FedML-Server(0) @device-id-0] [Fri, 29 Jul 2022 22:54:12] [INFO] [fedml_aggregator.py:225:test_on_server_for_all_clients] {'training_acc': 0.796526336274001, 'training_loss': 1.8660167525693983}
[FedML-Server(0) @device-id-0] [Fri, 29 Jul 2022 22:54:13] [INFO] [fedml_aggregator.py:257:test_on_server_for_all_clients] {'test_acc': 0.8005698005698005, 'test_loss': 1.8635211240936371}
```
FedML customization
FedML runtime arguments
- FedML requires a few arguments at run time:
| Argument | Meaning | Possible values |
|---|---|---|
| `--cf` | config file | / |
| `--rank` | index; the server is 0 | 0, 1, 2, 3 |
| `--role` | server or client | server, client |
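For illustration, a standalone script could accept the same flags as in the table; note that `fedml.init()` already parses these internally, and the defaults below are hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cf", type=str, default="config/fedml_config.yaml",
                    help="path to the config file")
parser.add_argument("--rank", type=int, default=0,
                    help="process index; the server is 0")
parser.add_argument("--role", type=str, choices=["server", "client"],
                    default="server", help="run as server or client")
args, _ = parser.parse_known_args()
print(args.cf, args.rank, args.role)
```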
FedML execution flow
```python
args = fedml.init()
```
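For context, the entry scripts in FedML's examples continue roughly like this after `fedml.init()` (a sketch based on the public examples; exact APIs may differ between FedML versions):

```python
import fedml
from fedml import FedMLRunner

if __name__ == "__main__":
    args = fedml.init()                           # parses --cf/--rank/--role plus the YAML config
    device = fedml.device.get_device(args)        # honours device_args / the GPU mapping
    dataset, output_dim = fedml.data.load(args)   # built-in or custom data loader
    model = fedml.model.create(args, output_dim)  # built-in or custom model
    FedMLRunner(args, device, dataset, model).run()
```

Given the argument table above, a server would then be started with something like `python main.py --cf config/fedml_config.yaml --rank 0 --role server` (the script name is illustrative).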
- The DataLoader, the model, and the trainer can all be customized
- Reference
Customizing the DataLoader
- MNN and PyTorch DataLoaders are supported
- It returns the dataset and the output dimension (see the sketch below)
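A hedged sketch of a custom loader, following the 8-element `dataset` tuple layout used by FedML's MNIST examples; `my_partition_function` is a hypothetical stand-in for your own partitioning logic:

```python
# hypothetical custom data loader; the dataset tuple layout follows
# FedML's MNIST examples
def load_data(args):
    # my_partition_function (hypothetical) splits the raw data across clients
    (train_data_num, test_data_num,
     train_data_global, test_data_global,
     train_data_local_num_dict, train_data_local_dict,
     test_data_local_dict, class_num) = my_partition_function(args)

    dataset = (train_data_num, test_data_num,
               train_data_global, test_data_global,
               train_data_local_num_dict, train_data_local_dict,
               test_data_local_dict, class_num)
    return dataset, class_num  # the dataset and the output dimension
```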
Customizing the model
- PyTorch neural network models (`torch.nn`) are supported; a minimal example follows
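For instance, the logistic-regression model used in FedML's MNIST examples is a plain `torch.nn.Module` (reproduced here as a sketch):

```python
import torch

class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))
```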
Customizing the Trainer

```python
from fedml.core import ClientTrainer
```
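A custom trainer subclasses `ClientTrainer` and overrides at least the three methods below. This sketch follows the pattern of FedML's example trainers, with hyperparameters read from the usual `train_args` fields (`learning_rate`, `epochs`):

```python
import torch
from fedml.core import ClientTrainer

class MyTrainer(ClientTrainer):
    def get_model_params(self):
        # the parameters the server collects for aggregation
        return self.model.cpu().state_dict()

    def set_model_params(self, model_parameters):
        # applies the global model received from the server
        self.model.load_state_dict(model_parameters)

    def train(self, train_data, device, args):
        model = self.model.to(device)
        model.train()
        criterion = torch.nn.CrossEntropyLoss().to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=args.learning_rate)
        for _ in range(args.epochs):
            for x, labels in train_data:
                x, labels = x.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), labels)
                loss.backward()
                optimizer.step()
```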
Simulating multiple machines with QEMU VMs (unfinished)
Installation
- Reference link
Installing Ninja

```bash
sudo apt install ninja-build
```
pkg-config

```bash
sudo apt install pkg-config
```
```bash
wget https://download.qemu.org/qemu-7.1.0-rc0.tar.xz
```
- To use the VM environment, QEMU must be configured with GPU passthrough