Interest in using analytics platforms such as Hadoop and Spark to process highly sensitive personal data (e.g., health care data, financial records) is on the rise. Platforms that process this type of data must conform to numerous regulations intended to ensure data privacy, integrity, and access control, which makes their deployment time-consuming and error-prone. In this talk, we share our experiences using Sahara to automate the provisioning of a HIPAA-enabled Spark-as-a-Service platform. We detail the enhancements to Sahara needed to: (i) automate cluster security enablement (e.g., authentication, key management, encryption); (ii) support multi-user clusters with strong isolation; (iii) enable cluster deployment on SoftLayer through Heat; and (iv) submit jobs of multiple types (e.g., Hadoop and Spark) to YARN using the Sahara API. Finally, we discuss the lessons learned from our experience with Sahara and share directions for further improvements.
We discuss how various HIPAA requirements (e.g., isolation, data encryption) map to Spark and YARN, and detail the enhancements to Sahara needed to: (i) automate the configuration of cluster security features (such as configuring Kerberos for authentication, setting up SSL certificates, enabling HDFS encryption, and managing keys); (ii) support safe multi-user clusters in which data from one user cannot leak to another; (iii) enable cluster deployment on SoftLayer through Heat; (iv) submit multiple job types (e.g., Hadoop, Spark) to YARN using the Sahara API; and (v) support IPython and Spark Job Server as part of the Spark cluster services. Finally, we discuss the lessons learned from our experience, along with initial performance considerations, and highlight the gaps yet to be addressed.