Deploying Kubeflow, an All-in-One Machine Learning Suite, on K8S
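Kubeflow is a machine learning suite that covers common machine learning tasks such as data exploration, model training, and model serving, all of which run on K8S.
Below is the official Kubeflow architecture diagram.
The next diagram shows a typical workflow for the experimentation and model-serving phases (Example of a specific ML workflow).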
Prerequisites for installing Kubeflow
Software and versions
- Kubernetes (1.19) with a default StorageClass; this walkthrough was tested on 1.20
- kustomize (version 3.2.0) (download link)
- kubectl
If the cluster already has Istio installed, it is still recommended to use the Istio bundled with the official installation manifests.
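Before installing, it can help to compare the versions you have against the ones listed above. The following is a minimal sketch of a version comparison helper; `version_ge` is a name made up for this example, and it assumes `sort -V` from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper for illustration; relies on `sort -V` (GNU coreutils).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with hard-coded values; in practice the left-hand value would come
# from the version reported by your own cluster or tool.
if version_ge "1.20" "1.19"; then
  echo "Kubernetes version OK"
fi
```

A plain string comparison would rank 1.9 above 1.19, which is why the sketch uses a version-aware sort.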
Image access
During installation, Kubeflow runs docker pull against k8s.gcr.io, which is unreachable by default. Configuring an HTTP proxy for Docker works around this.
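One common way to give the Docker daemon a proxy is a systemd drop-in file. The sketch below only prints the drop-in content. The function name `render_docker_proxy_dropin` and the proxy address `http://proxy.example.com:3128` are placeholders for this example, and the approach assumes the nodes run Docker under systemd.

```shell
#!/usr/bin/env bash
# render_docker_proxy_dropin PROXY_URL: print a systemd drop-in that makes the
# Docker daemon use PROXY_URL when pulling images.
# Hypothetical helper; the proxy URL is supplied by the caller.
render_docker_proxy_dropin() {
  local proxy="$1"
  cat <<EOF
[Service]
Environment="HTTP_PROXY=${proxy}"
Environment="HTTPS_PROXY=${proxy}"
Environment="NO_PROXY=localhost,127.0.0.1"
EOF
}

# Usage (placeholder proxy address; requires root on each node):
#   sudo mkdir -p /etc/systemd/system/docker.service.d
#   render_docker_proxy_dropin http://proxy.example.com:3128 \
#     | sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart docker
```

The proxy has to be configured on every node that pulls images, not only on the machine where kubectl runs.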
Installing Kubeflow
Follow the official Kubeflow installation documentation.
First, git clone the Kubeflow installation repository, then run the following command from its root directory.
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
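The loop above retries forever, which is awkward when something is permanently broken (for example an unreachable image registry). Below is a sketch of the same idea with an upper bound on attempts. The `retry` and `apply_manifests` functions are names invented for this example, and the limits of 30 attempts and 10 seconds are arbitrary.

```shell
#!/usr/bin/env bash
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "Retrying to apply resources (attempt ${attempt} of ${max})"
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# apply_manifests wraps the pipeline so it can be passed to retry as one command.
apply_manifests() {
  kustomize build example | kubectl apply -f -
}

# Usage, from the root of the cloned repository:
#   retry 30 10 apply_manifests
```

Retrying is expected here: some resources depend on CRDs that are created in the same apply, so the first passes usually fail partway through.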
Accessing Kubeflow
Use kubectl port-forward to forward a local port for access.
kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80
You can also use a NodePort for access (modify the istio-ingressgateway Service in the istio-system namespace).
Open http://localhost:8080. The default account is user@example.com and the password is 12341234.
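For the NodePort option mentioned above, one possible approach is to patch the Service type and then read back the assigned port. This is a sketch rather than the only way to do it. The function names are invented for this example, and the `KUBECTL` variable exists only so the commands can be previewed with `KUBECTL=echo`.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Switch the ingress gateway Service to type NodePort.
expose_ingress_nodeport() {
  $KUBECTL patch svc istio-ingressgateway -n istio-system \
    -p '{"spec":{"type":"NodePort"}}'
}

# Print the node port assigned to the Service's port 80.
ingress_nodeport() {
  $KUBECTL get svc istio-ingressgateway -n istio-system \
    -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'
}

# Usage:
#   expose_ingress_nodeport
#   ingress_nodeport   # then browse to http://<node-ip>:<printed port>
```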
Below are screenshots of the Kubeflow UI.
notebook: notebooks can be created directly from the page
automl: model training with AutoML features
pipeline
FAQ
log: "Could not find CSRF cookie XSRF-TOKEN in the request."
Set the environment variable APP_SECURE_COOKIES=false in the affected services.
New Volume [403] Could not find CSRF cookie XSRF-TOKEN in the request. http://xx.xx.xx..:30776/volumes/api/namespaces/kubeflow-user-example-com/pvcs
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
jupyter-web-app-deployment, volumes-web-app-deployment, tensorboards-web-app-deployment, katib-ui, ml-pipeline-ui-artifact, ml-pipeline-ui, kfserving-models-web-app
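Setting the environment variable can be done with kubectl set env. The sketch below assumes the web apps are Deployments in the kubeflow namespace, which may not match your installation. `disable_secure_cookies` is a name invented for this example, and `KUBECTL=echo` previews the commands without touching the cluster.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"
# Assumption: the web apps are Deployments in the "kubeflow" namespace.
NAMESPACE="${NAMESPACE:-kubeflow}"

# disable_secure_cookies DEPLOYMENT...: set APP_SECURE_COOKIES=false on each
# named Deployment, which triggers a rolling restart of its pods.
disable_secure_cookies() {
  local deploy
  for deploy in "$@"; do
    $KUBECTL set env "deployment/${deploy}" -n "$NAMESPACE" APP_SECURE_COOKIES=false
  done
}

# Usage:
#   disable_secure_cookies jupyter-web-app-deployment \
#     volumes-web-app-deployment tensorboards-web-app-deployment
```

Turning off secure cookies is a workaround for plain HTTP access. Serving the gateway over HTTPS avoids the need for it.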
The cluster-local-gateway pod in istio-system fails to deploy
The error message is: MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the API server does not have TokenRequest endpoints enabled
According to the official reply in a Kubeflow issue, kube-apiserver needs two additional flags, --service-account-issuer=kubernetes.default.svc and --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key. Alternatively, you can simply use K8S 1.20.
I was using a managed cluster on Tencent Cloud, where kube-apiserver cannot be modified, so I solved the problem by creating a new cluster on version 1.20.
Below is the original text of the official issue reply.
The kube-apiserver needs additional arguments: --service-account-issuer=kubernetes.default.svc and --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key for Istio to work. I can suggest using Kubernetes 1.20 as these flags are set by default as of this version, so you will not run into this problem.
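On a self-managed cluster where the API server manifest is readable, a quick check for the two flags might look like the sketch below. `missing_apiserver_flags` is a name invented for this example, and the manifest path in the usage comment is the kubeadm default, which is an assumption and may differ on your cluster.

```shell
#!/usr/bin/env bash
# missing_apiserver_flags FILE: print each required flag that does not appear
# in FILE; print nothing when both flags are present.
missing_apiserver_flags() {
  local manifest="$1" flag
  for flag in --service-account-issuer --service-account-signing-key-file; do
    # "--" stops option parsing so the pattern may start with dashes.
    grep -q -- "${flag}=" "$manifest" || echo "$flag"
  done
}

# Usage (path is an assumption based on the kubeadm default layout):
#   missing_apiserver_flags /etc/kubernetes/manifests/kube-apiserver.yaml
```

On managed clusters the manifest is usually not accessible, in which case moving to a version where the flags are set by default is the practical route, as described above.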
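failed to warm certificate: failed to generate workload certificate: create certificate: rpc error
I originally intended to use Tencent Cloud's built-in service mesh (Istio 1.10), but it turned out to be incompatible. Switching to the Istio 1.9 bundled with Kubeflow solved the problem.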
2021-09-16T03:48:08.339153155Z 2021-09-16T03:48:08.339019Z warn Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 0 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2021-09-16T03:48:08.354660227Z 2021-09-16T03:48:08.354542Z warn ca ca request failed, starting attempt 3 in 402.682469ms
2021-09-16T03:48:08.757614554Z 2021-09-16T03:48:08.757468Z warn ca ca request failed, starting attempt 4 in 851.643716ms
2021-09-16T03:48:09.609510460Z 2021-09-16T03:48:09.609370Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 172.18.255.44:53: no such host"
2021-09-16T03:48:09.729535116Z 2021-09-16T03:48:09.729410Z warn ca ca request failed, starting attempt 1 in 106.762296ms
2021-09-16T03:48:09.836435578Z 2021-09-16T03:48:09.836288Z warn ca ca request failed, starting attempt 2 in 210.968515ms
2021-09-16T03:48:10.047573803Z 2021-09-16T03:48:10.047467Z warn ca ca request failed, starting attempt 3 in 405.124132ms
Error opening bolt store: open /var/lib/authservice/data.db: permission denied
The authservice pod reports the error above.
In the end, copying data.db from a cluster with a working Kubeflow deployment into the authservice-pvc of the failing cluster resolved it.
Reference
- [1] kubeflow. Kubeflow Manifests