Event Details

Please note: All times listed below are in the Central Time Zone.


Automating the Deployment of a Secure, Multi-User HIPAA Enabled Spark Cluster Using Sahara

Big Data

Interest in using analytics platforms such as Hadoop and Spark to process highly sensitive personal data (e.g., health care data, financial records) is on the rise. Platforms that process this type of data must conform to numerous regulations intended to ensure data privacy, integrity, and access control, which makes their deployment time-consuming and error-prone. In this talk, we share our experiences using Sahara to automate the provisioning of a HIPAA-enabled Spark-as-a-Service platform. We detail the enhancements to Sahara needed to: (i) automate cluster security enablement (e.g., authentication, key management, encryption); (ii) support multi-user clusters with strong isolation; (iii) enable cluster deployment on SoftLayer through Heat; and (iv) submit jobs of multiple types (e.g., Hadoop and Spark) to YARN using the Sahara API. Finally, we discuss the lessons learned from our experience with Sahara and share directions for further improvements.

What can I expect to learn?

We discuss how various HIPAA requirements (e.g., isolation, data encryption) map to Spark and YARN, and detail the enhancements to Sahara needed to: (i) automate the configuration of security features in the cluster (such as configuring Kerberos for authentication, setting up SSL certificates, enabling HDFS encryption, and managing keys); (ii) support safe multi-user clusters so that data from one user cannot leak to another; (iii) enable cluster deployment on SoftLayer through Heat; (iv) submit multiple job types (e.g., Hadoop, Spark) to YARN using the Sahara API; and (v) support IPython and Spark Job Server as part of the Spark cluster services. Finally, we discuss the lessons learned from our experience, along with initial performance considerations, and highlight the gaps yet to be addressed.

Wednesday, April 27, 2:40pm - 3:20pm (7:40pm - 8:20pm UTC)

Austin Convention Center - Level 4 - Ballroom D


Difficulty Level: Beginner

Tags: Ops Enterprise Community User Talk Upstream Dev Heat Horizon Sahara

Michael Le

IBM

Michael Le is a research staff member at the IBM T. J. Watson Research Center. His current research focuses on cloud infrastructure and cloud platform management. Michael has previously worked on automating the deployment of secure and compliant data analytics platforms using OpenStack Sahara and is currently working on how to better secure and isolate applications deployed on the cloud.

Shu Tao

Research Staff Member, Manager

Shu Tao is the manager of the cloud infrastructure and data services department at the IBM T. J. Watson Research Center. He has been working on various OpenStack-related projects since 2012. His current OpenStack interests include control plane performance and scalability, Neutron plugins, and support for data-intensive applications on OpenStack clouds.

Jayaram Radhakrishnan

Research scientist

Jayaram KR is a Research Scientist at the IBM Thomas J. Watson Research Center in Yorktown Heights, NY. His research interests span distributed systems and software engineering. Specific topics of interest include elasticity, complex-event processing and publish/subscribe systems, GPU-accelerated analytics platforms, and cloud security. He holds MS and PhD degrees in Computer Science from Purdue...

Daniel Dean

Research scientist

Daniel Dean is a Research Staff Member at IBM, where he is part of the team developing the Watson Health Cloud. His general research interests are in computing systems, with a focus on performance anomaly management for production systems, distributed systems, and cloud environments.
