Deploying Kubeflow, a One-Stop Machine Learning Toolkit, on K8S

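Kubeflow is a machine learning toolkit that covers common machine learning operations such as data exploration, model training, and model serving, all of which run on K8S.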

Below are the official Kubeflow architecture diagrams.

[Images: Kubeflow architecture diagrams]

Below is a typical workflow covering the experimentation and model serving phases.

[Image: Example of a specific ML workflow]

Prerequisites for installing Kubeflow

Software and versions

If the cluster already has Istio installed, it is recommended to use the Istio bundled with the official installation scripts instead.

Image access

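During installation, Kubeflow pulls images from k8s.gcr.io (docker pull), which is not reachable by default. You can configure an HTTP proxy for Docker to work around this.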
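One common way to give the Docker daemon a proxy is a systemd drop-in file. The following is a minimal sketch, assuming a systemd-managed Docker daemon; the helper name write_docker_proxy_conf and the proxy address proxy.example.com:3128 are hypothetical placeholders, not values from the original setup.

```shell
#!/bin/sh
# Write a systemd drop-in that sets proxy environment variables for dockerd.
# $1 = drop-in directory, $2 = proxy URL (hypothetical, replace with your own).
write_docker_proxy_conf() {
    conf_dir="$1"
    proxy_url="$2"
    mkdir -p "$conf_dir"
    cat > "$conf_dir/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$proxy_url"
Environment="HTTPS_PROXY=$proxy_url"
Environment="NO_PROXY=localhost,127.0.0.1"
EOF
}

# Usage on each node (as root), followed by a Docker restart:
#   write_docker_proxy_conf /etc/systemd/system/docker.service.d http://proxy.example.com:3128
#   systemctl daemon-reload && systemctl restart docker
```

The function only writes the file; reloading systemd and restarting Docker are left as explicit steps so nothing is restarted by accident.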

Installing Kubeflow

Follow the official Kubeflow installation documentation.

First, git clone the Kubeflow installation repository, then run the following command in its root directory.

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
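Once the apply loop exits, it can help to check whether any pod is still stuck. The following is a minimal sketch, assuming kubectl points at the target cluster; the helper name pending_pods is hypothetical, and the namespace list in the usage comment is an assumption about the default manifests.

```shell
#!/bin/sh
# Print "namespace/pod STATUS" for every pod whose STATUS column is neither
# Running nor Completed, for each namespace passed as an argument.
pending_pods() {
    for ns in "$@"; do
        kubectl get pods -n "$ns" --no-headers 2>/dev/null \
            | awk -v ns="$ns" '$3 != "Running" && $3 != "Completed" { print ns "/" $1 " " $3 }'
    done
}

# Usage (namespace names are assumptions; adjust to your installation):
#   pending_pods istio-system kubeflow kubeflow-user-example-com
```

An empty result means every listed pod reports Running or Completed. It does not check readiness, so a Running pod with failing probes will not show up here.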

Accessing Kubeflow

Use kubectl port-forward to forward a local port for access.

kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80

You can also use a NodePort for access (modify the istio-ingressgateway Service in the istio-system namespace).
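The NodePort change can be made with kubectl patch instead of editing the Service by hand. The following is a minimal sketch, assuming the Service is not already of type NodePort; both function names are hypothetical.

```shell
#!/bin/sh
# Switch the istio-ingressgateway Service to type NodePort.
expose_ingress_nodeport() {
    kubectl patch svc istio-ingressgateway -n istio-system \
        --type merge -p '{"spec":{"type":"NodePort"}}'
}

# Print the node port that maps to the gateway's port 80.
ingress_http_nodeport() {
    kubectl get svc istio-ingressgateway -n istio-system \
        -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'
}

# Usage:
#   expose_ingress_nodeport
#   ingress_http_nodeport    # then browse to http://<node-ip>:<printed-port>
```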

Visit http://localhost:8080. The default account is user@example.com and the password is 12341234.

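Below are screenshots of the Kubeflow product.

[Screenshot: kubeflow-index]

  • notebook: notebooks can be created directly from the web page [Screenshot: kubeflow-notebook]

  • automl: model training with AutoML features [Screenshot: kubeflow-automl]

  • pipeline [Screenshots: kubeflow-pipelines, kubeflow-pipelines-demo, kubeflow-pipelines-yaml]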

FAQ

Set the environment variable APP_SECURE_COOKIES=false in the corresponding services.


New Volume [403] Could not find CSRF cookie XSRF-TOKEN in the request. http://xx.xx.xx..:30776/volumes/api/namespaces/kubeflow-user-example-com/pvcs

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

  • jupyter-web-app-deployment
  • volumes-web-app-deployment
  • tensorboards-web-app-deployment
  • katib-ui
  • ml-pipeline-ui-artifact
  • ml-pipeline-ui
  • kfserving-models-web-app
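The APP_SECURE_COOKIES variable can be set with kubectl set env rather than by editing each Deployment manually. The following is a minimal sketch; the helper name disable_secure_cookies is hypothetical, and the kubeflow namespace in the usage comment is an assumption, since some of the services listed above may live in other namespaces.

```shell
#!/bin/sh
# Set APP_SECURE_COOKIES=false on each named Deployment in one namespace.
# $1 = namespace, remaining arguments = Deployment names.
disable_secure_cookies() {
    ns="$1"
    shift
    for deploy in "$@"; do
        kubectl set env "deployment/$deploy" -n "$ns" APP_SECURE_COOKIES=false
    done
}

# Usage (namespace is an assumption; verify with `kubectl get deploy -A`):
#   disable_secure_cookies kubeflow \
#       jupyter-web-app-deployment volumes-web-app-deployment tensorboards-web-app-deployment
```

Changing the environment triggers a rolling restart of each Deployment. Disabling secure cookies is only suitable for test setups reached over plain HTTP.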

The cluster-local-gateway pod in istio-system fails to deploy

The error message is: MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the API server does not have TokenRequest endpoints enabled

According to the official reply in a Kubeflow issue, kube-apiserver needs two additional flags: --service-account-issuer=kubernetes.default.svc and --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key. Alternatively, you can simply use K8S 1.20.

I was using a managed cluster on Tencent Cloud, where kube-apiserver cannot be modified, so I solved the problem by creating a new cluster on version 1.20.

Below is the original text from the official issue.

The kube-apiserver needs additional arguments: --service-account-issuer=kubernetes.default.svc and --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key for Istio to work. I can suggest using Kubernetes 1.20 as these flags are set by default as of this version, so you will not run into this problem.
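On a self-managed cluster, it is possible to check whether the two flags are already present before changing anything. The following is a minimal sketch, assuming kube-apiserver runs as a static pod with a manifest file on the control-plane node; the helper name missing_sa_flags is hypothetical, and the path in the usage comment is the kubeadm default, which may not match your installation.

```shell
#!/bin/sh
# Print each of the two service-account flags that does NOT appear in the
# given kube-apiserver manifest file. Prints nothing if both are present.
missing_sa_flags() {
    manifest="$1"
    for flag in --service-account-issuer --service-account-signing-key-file; do
        grep -q -- "$flag=" "$manifest" || echo "$flag"
    done
}

# Usage (path is an assumption based on kubeadm defaults):
#   missing_sa_flags /etc/kubernetes/manifests/kube-apiserver.yaml
```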

failed to warm certificate: failed to generate workload certificate: create certificate: rpc error

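I originally intended to use Tencent Cloud's built-in service mesh (Istio 1.10), but it turned out to be incompatible. Switching to the Istio 1.9 bundled with Kubeflow solved the problem.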

2021-09-16T03:48:08.339153155Z 2021-09-16T03:48:08.339019Z  warn  Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 0 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2021-09-16T03:48:08.354660227Z 2021-09-16T03:48:08.354542Z  warn  ca  ca request failed, starting attempt 3 in 402.682469ms
2021-09-16T03:48:08.757614554Z 2021-09-16T03:48:08.757468Z  warn  ca  ca request failed, starting attempt 4 in 851.643716ms
2021-09-16T03:48:09.609510460Z 2021-09-16T03:48:09.609370Z  warn  sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 172.18.255.44:53: no such host"
2021-09-16T03:48:09.729535116Z 2021-09-16T03:48:09.729410Z  warn  ca  ca request failed, starting attempt 1 in 106.762296ms
2021-09-16T03:48:09.836435578Z 2021-09-16T03:48:09.836288Z  warn  ca  ca request failed, starting attempt 2 in 210.968515ms
2021-09-16T03:48:10.047573803Z 2021-09-16T03:48:10.047467Z  warn  ca  ca request failed, starting attempt 3 in 405.124132ms

Error opening bolt store: open /var/lib/authservice/data.db: permission denied

The authservice pod reports the error above.

In the end, I copied the data.db file from a cluster where Kubeflow was deployed correctly into the authservice-pvc of the failing cluster, which solved the problem.

reference