fedml-1. Configuration, Running & Customization
Clone the FedML repository
```bash
git clone https://github.com/FedML-AI/FedML.git
```
- Browse the source code
- Use the examples to test the environment:
  - `FedML/iot/anomaly_detection_for_cybersecurity`
  - `FedML/python/setup.py`
WSL + CUDA
Install the NVIDIA driver with WSL support
Download and install CUDA
- Do not download the latest version (11.7); PyTorch does not support it yet
```bash
sudo apt update
sudo apt install build-essential  # C++ compiler, make, etc.
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda_11.6.2_510.47.03_linux.run
sudo bash cuda_11.6.2_510.47.03_linux.run
```
Environment variables
- Append the following to `~/.bashrc`:

```bash
export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

- Apply the changes, then install the required libraries:

```bash
source ~/.bashrc
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
```

- Check that the installation succeeded:
```bash
nvcc -V
```
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```
Install cuDNN
Download and install
```bash
sudo apt-get install zlib1g
# download the archive from https://developer.nvidia.com/rdp/cudnn-download
tar -xvf cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive.tar.xz
sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
```

It later turned out that running a CNN with this setup fails with an error that the cuDNN CNN shared libraries cannot be found, so the following method was used instead:
```bash
conda install -c nvidia cudnn
```
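To confirm that the conda-installed cuDNN is actually picked up, a quick check through PyTorch works (a minimal sketch, assuming PyTorch is already installed in the same environment):

```python
import torch

# all three calls are part of PyTorch's public API
print(torch.cuda.is_available())            # True if CUDA works at all
print(torch.backends.cudnn.is_available())  # True if a usable cuDNN was found
print(torch.backends.cudnn.version())       # e.g. 8401 for cuDNN 8.4.1
```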
| cuDNN Package | Supported NVIDIA Hardware | CUDA Toolkit Version | CUDA Compute Capability | Supports static linking? |
|---|---|---|---|---|
| cuDNN 8.4.1 for CUDA 11.x | | 11.7, 11.6, 11.5, 11.4, 11.3 | SM 3.5 and later | Yes |
| | | 11.1, 11.0 | | No |
| cuDNN 8.4.1 for CUDA 10.2 | | 10.2 | SM 3.0 and later | Yes |
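To see where your own card falls in the compute-capability column, PyTorch can report the SM version (assuming a CUDA-enabled PyTorch build):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: SM {major}.{minor}")
```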
Install and configure FedML
Install Miniconda
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh  # run the installer
```
Add a Miniconda mirror channel
```bash
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
```
Install FedML
```bash
conda create --name fedml python=3.7
```
Notes

```
Common pip mirrors:
```

- Python package: `wasabi`, a console printing and formatting toolkit
Set up the FedML environment
- Go into the FedML repository, under `FedML/python`, and run `setup.py` (command below)
- PyTorch install page: choose the command for your CUDA version, and remove `-c forge` so it is not downloaded from the official channel
```bash
python3 setup.py install
```
- Uninstall PyTorch, then reinstall the build that matches your CUDA version
```bash
conda uninstall *torch* cudatoolkit
```
Run the demos
iot
server
```bash
conda activate fedml
```
client-1
```bash
conda activate fedml
```
client-2
```bash
conda activate fedml
```
- It later turned out that this demo targets Raspberry Pi and Jetson Nano devices and needs some additional configuration
- This example contains a custom data loader and trainer, which makes it a valuable reference
mpi_torch_fedopt_mnist_lr_example
- In its config file, this example is set to run in simulation mode
- Run it in simulation mode as a single process:

```bash
bash run_step_by_step_example.sh 2
```
- With an argument of 2, training completes; if the argument (i.e., the number of workers) is too large, the following message appears, presumably due to running out of memory:
```
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5218 RUNNING AT tt
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
```

- After training finishes, a model file named `fedml` is generated under `./tmp/`; `cat`-ing it prints:

```bash
cat fedml
```
```
training is finished!
<fedml.arguments.Arguments object at 0x7ff03c13b210>
```
Train with the GPU
Modify the config file
- Edit the content under the `device_args` key in `config/fedml_config.yaml`:

```yaml
device_args:
  worker_num: 3
  using_gpu: true
  gpu_mapping_file: config/gpu_mapping.yaml  # location of the mapping file
  gpu_mapping_key: mapping_tt                # which mapping entry to use
```
Modify the mapping file
- Add a `mapping_tt` entry to `config/gpu_mapping.yaml`, e.g. four processes on GPU 0 of host `tt` (matching the run below):

```yaml
mapping_tt:
  tt: [4]
```
An entry has the following format: for each hostname, it specifies how many processes to place on each GPU.
```yaml
## config_cluster0:
##   host_name_node0: [num_of_processes_on_GPU0, num_of_processes_on_GPU1, num_of_processes_on_GPU2, num_of_processes_on_GPU3, ..., num_of_processes_on_GPU_n]
##   host_name_node1: [num_of_processes_on_GPU0, num_of_processes_on_GPU1, num_of_processes_on_GPU2, num_of_processes_on_GPU3, ..., num_of_processes_on_GPU_n]
##   ......
##   host_name_node_m: [num_of_processes_on_GPU0, num_of_processes_on_GPU1, num_of_processes_on_GPU2, num_of_processes_on_GPU3, ..., num_of_processes_on_GPU_n]
```

Run it. Four processes were configured above and `worker_num` is set to 3, so the script argument here is 3: 3 workers + 1 server.

```bash
bash run_step_by_step_example.sh 3
```
During the run, output like the following appears in the shell:

```
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 20:54:06] [INFO] [gpu_mapping_mpi.py:51:mapping_processes_to_gpu_device_from_yaml_file_mpi] process_id = 2, GPU device = cuda:0
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 20:54:06] [INFO] [device.py:78:get_device] device = cuda:0
```

Checking the discrete GPU's utilization in Task Manager confirms that the GPU is indeed being used.
Running cross_silo
- Not a single-process simulation; multiple devices participate
config/fedml_config.yaml
- Configure the `comm_args` section:
```yaml
comm_args:
  backend: "GRPC"
  grpc_ipconfig_path: config/grpc_ipconfig.csv
```
grpc_ipconfig.csv
- Install gRPC with `pip install grpcio`; communication uses the gRPC protocol
- Create this file under `/config` and fill in id-to-IP pairs: `0` is the server, `1...n` are the workers
| receiver_id | ip |
|---|---|
| 0 | 127.0.0.1 |
| 1 | 127.0.0.1 |
| 2 | 127.0.0.1 |
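Since the file is plain CSV, it can also be generated with a short script; a sketch that writes exactly the table above (the `config/` path is assumed relative to the example's working directory):

```python
import csv
import os

# receiver_id 0 is the server; 1..n are the workers (all on localhost here)
rows = [
    ("receiver_id", "ip"),
    (0, "127.0.0.1"),
    (1, "127.0.0.1"),
    (2, "127.0.0.1"),
]
os.makedirs("config", exist_ok=True)
with open("config/grpc_ipconfig.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```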
Results
- No output model file was found; only the following output appeared after the final round:
```
[FedML-Server(0) @device-id-0] [Fri, 29 Jul 2022 22:54:08] [INFO] [fedml_aggregator.py:195:test_on_server_for_all_clients] ################test_on_server_for_all_clients : 49
[FedML-Server(0) @device-id-0] [Fri, 29 Jul 2022 22:54:12] [INFO] [fedml_aggregator.py:225:test_on_server_for_all_clients] {'training_acc': 0.796526336274001, 'training_loss': 1.8660167525693983}
[FedML-Server(0) @device-id-0] [Fri, 29 Jul 2022 22:54:13] [INFO] [fedml_aggregator.py:257:test_on_server_for_all_clients] {'test_acc': 0.8005698005698005, 'test_loss': 1.8635211240936371}
```
FedML customization
FedML runtime arguments
- FedML requires a few arguments at run time:
| Argument | Meaning | Possible values |
|---|---|---|
| `--cf` | config file | / |
| `--rank` | index; the server is 0 | 0, 1, 2, 3 |
| `--role` | server or client | server, client |
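For illustration, a standalone script could accept the same flags as in the table; note that `fedml.init()` already parses these internally, and the defaults below are hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cf", type=str, default="config/fedml_config.yaml",
                    help="path to the config file")
parser.add_argument("--rank", type=int, default=0,
                    help="process index; the server is 0")
parser.add_argument("--role", type=str, choices=["server", "client"],
                    default="server", help="run as server or client")
args, _ = parser.parse_known_args()
print(args.cf, args.rank, args.role)
```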
FedML execution flow
```python
args = fedml.init()
```
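For context, the entry scripts in FedML's examples continue roughly like this after `fedml.init()` (a sketch based on the public examples; exact APIs may differ between FedML versions):

```python
import fedml
from fedml import FedMLRunner

if __name__ == "__main__":
    args = fedml.init()                           # parses --cf/--rank/--role plus the YAML config
    device = fedml.device.get_device(args)        # honours device_args / the GPU mapping
    dataset, output_dim = fedml.data.load(args)   # built-in or custom data loader
    model = fedml.model.create(args, output_dim)  # built-in or custom model
    FedMLRunner(args, device, dataset, model).run()
```

Given the argument table above, a server would then be started with something like `python main.py --cf config/fedml_config.yaml --rank 0 --role server` (the script name is illustrative).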
- The DataLoader, the model, and the trainer can all be customized
- Reference
Customizing the DataLoader
- MNN and PyTorch DataLoaders are supported
- It returns the dataset and the output dimension (see the sketch below)
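A hedged sketch of a custom loader, following the 8-element `dataset` tuple layout used by FedML's MNIST examples; `my_partition_function` is a hypothetical stand-in for your own partitioning logic:

```python
# hypothetical custom data loader; the dataset tuple layout follows
# FedML's MNIST examples
def load_data(args):
    # my_partition_function (hypothetical) splits the raw data across clients
    (train_data_num, test_data_num,
     train_data_global, test_data_global,
     train_data_local_num_dict, train_data_local_dict,
     test_data_local_dict, class_num) = my_partition_function(args)

    dataset = (train_data_num, test_data_num,
               train_data_global, test_data_global,
               train_data_local_num_dict, train_data_local_dict,
               test_data_local_dict, class_num)
    return dataset, class_num  # the dataset and the output dimension
```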
Customizing the model
- PyTorch neural network models (`torch.nn`) are supported; a minimal example follows
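For instance, the logistic-regression model used in FedML's MNIST examples is a plain `torch.nn.Module` (reproduced here as a sketch):

```python
import torch

class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))
```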
Customizing the Trainer

```python
from fedml.core import ClientTrainer
```
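A custom trainer subclasses `ClientTrainer` and overrides at least the three methods below. This sketch follows the pattern of FedML's example trainers, with hyperparameters read from the usual `train_args` fields (`learning_rate`, `epochs`):

```python
import torch
from fedml.core import ClientTrainer

class MyTrainer(ClientTrainer):
    def get_model_params(self):
        # the parameters the server collects for aggregation
        return self.model.cpu().state_dict()

    def set_model_params(self, model_parameters):
        # applies the global model received from the server
        self.model.load_state_dict(model_parameters)

    def train(self, train_data, device, args):
        model = self.model.to(device)
        model.train()
        criterion = torch.nn.CrossEntropyLoss().to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=args.learning_rate)
        for _ in range(args.epochs):
            for x, labels in train_data:
                x, labels = x.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), labels)
                loss.backward()
                optimizer.step()
```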
Simulating multiple machines with QEMU VMs (unfinished)
Installation
- Reference link
Installing Ninja

```bash
sudo apt install ninja-build
```
pkg-config

```bash
sudo apt install pkg-config
```
```bash
wget https://download.qemu.org/qemu-7.1.0-rc0.tar.xz
```
- To use the VM environment, QEMU must be configured with GPU passthrough