While adoption of the cloud and Kubernetes has made it exceptionally easy to scale compute, the increasing spread of data across different systems and clouds has created new challenges for data engineers. Accessing data in AWS S3 or on-premises HDFS efficiently becomes harder, and data locality is lost. How do you move data to compute workers efficiently? How do you unify data across multiple or remote clouds? These are only some of the questions that arise.
The open source project Alluxio takes a new approach to this problem. It helps elastic compute workloads realize the true benefits of the cloud, while bringing data locality and data accessibility to workloads orchestrated by Kubernetes. Alluxio can orchestrate data locality for data held in any persistent storage, including object stores such as Ceph and cloud storage such as AWS S3 or GCS, and make that data accessible to compute running in Kubernetes pods. As a stateless data access layer, Alluxio runs as a native service, making data-intensive compute workloads Kubernetes-friendly.
In this webinar, Adit will present this new approach to bringing data locality to data-intensive compute workloads in Kubernetes environments, and demonstrate how to set up and run Apache Spark and Alluxio in Kubernetes.