Join our presentation to learn how to build a cluster for a machine learning business. Machine learning and AI are a major recent technology trend, and NTT, a large telecommunications company, has its own AI brand, "Corevo". This presentation shares our experience building and managing a cloud-like computing infrastructure for our company's use case, including how we manage a fully open source computing cluster environment built on OpenStack components and container technologies.
In this talk, we introduce our case study: a fully open source reference cluster model automated with Ansible and a container orchestrator. The environment is built on GPU computation and high-speed storage. We run the Chainer and ChainerMN learning frameworks on many NVIDIA GPU nodes, and attach highly scalable OpenStack Swift object storage with file system APIs as the high-speed data storage.
Attendees will learn basic strategies for building their own machine learning cluster for their use case. We will share our software stack and hardware stack considerations, in particular modern machine learning frameworks such as Chainer and ChainerMN, Ansible and Docker container orchestration, and OpenStack Swift storage with a file system API for AI/HPC. We will also summarize the cluster's performance and operational efficiency.
In our architecture design, we bring together operators and users (UsersOps) rather than following DevOps, because our machine learning researchers have joined the operations team to build the cluster. Attendees will learn why this perspective matters when building their own cluster, and will be able to connect with us to discuss how cluster management can be improved.