Accelerate your AI projects with Spectrum Scale
Combining Spectrum Discover and Spectrum Scale will accelerate your AI projects.
In an earlier blog post, I explained how metadata and Spectrum Discover can take your Machine Learning project to the next level. You can read it here.
IBM Spectrum Scale is well known as the high-performance, distributed, parallel file system at the basis of many industrial systems, including IBM Summit and Sierra, the two fastest supercomputers in the world. It provides high-performance access to data within a single namespace via different protocols (including NFS, SMB, Object, HDFS, and a POSIX interface), transparently uses different types of back-end storage tiers (NVMe, HDD, tape and cloud) according to user needs, and offers many enterprise features such as file audit, encryption, WORM and others.
Supporting hybrid and multi-cloud deployments to enable global collaboration, IBM Spectrum Scale is a “combat-proven” solution for building a solid, enterprise-wide data lake. It brings high I/O performance, data resilience, simplicity and flexibility to any distributed application. In the context of AI & Analytics, its single namespace breaks down traditional data silos, allowing ingestion streams, preparation, training, and inference applications to access data in place.
Given the value of metadata presented previously, here is why coupling Scale and Discover makes perfect sense to accelerate the different stages of your AI projects.
- Data Ingestion & preparation: accelerate data labeling, improve dataset quality
One of the most powerful aspects of IBM Spectrum Discover is its ability to accommodate custom metadata created by users of the associated data or by the Spectrum Discover system itself. Metadata can be collected by creating software agents that extract domain-specific metadata, unique to the business, on the fly whenever new data is ingested into the Spectrum Scale data lake.
Take the healthcare industry, where Spectrum Scale is widely deployed, as one example. DICOM (Digital Imaging and Communications in Medicine) is today the international standard for communicating and managing medical images and data. It ensures the interoperability of systems used to produce, store, share, display, send, query, process, retrieve and print medical images, as well as to manage related workflows.
Paired with a ‘DICOM Reader plugin’, Spectrum Discover can extract the DICOM metadata records (including patient PHI/PII, pathology, scan type, resolution and other characteristics) and register them in its own catalog in real time as each DICOM file is ingested. This provides additional metadata for each DICOM file, which helps data scientists build custom rules to curate outliers and normalize fields, easing the data cleaning step of the data preparation phase.
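To make the curation idea concrete, here is a minimal sketch of such custom rules, written against plain Python dictionaries. The record layout and field names (path, modality, rows, cols) and the 50% threshold are assumptions for illustration only; they are not the Spectrum Discover catalog schema or API.

```python
from statistics import median

# Hypothetical metadata records, shaped like what a DICOM reader agent
# might register. Field names are illustrative, not a real catalog schema.
records = [
    {"path": "/lake/scans/a.dcm", "modality": "ct ", "rows": 512, "cols": 512},
    {"path": "/lake/scans/b.dcm", "modality": "CT", "rows": 512, "cols": 512},
    {"path": "/lake/scans/c.dcm", "modality": "Mr", "rows": 256, "cols": 256},
    {"path": "/lake/scans/d.dcm", "modality": "CT", "rows": 64, "cols": 64},
]

def normalize(record):
    """Return a copy with the modality field trimmed and upper-cased."""
    cleaned = dict(record)
    cleaned["modality"] = record["modality"].strip().upper()
    return cleaned

def find_outliers(records, min_fraction=0.5):
    """List paths whose pixel count is below min_fraction of the
    median pixel count for their modality (an assumed, simple rule)."""
    by_modality = {}
    for r in records:
        by_modality.setdefault(r["modality"], []).append(r)
    outliers = []
    for group in by_modality.values():
        typical = median(r["rows"] * r["cols"] for r in group)
        outliers += [
            r["path"] for r in group
            if r["rows"] * r["cols"] < min_fraction * typical
        ]
    return outliers

cleaned = [normalize(r) for r in records]
outliers = find_outliers(cleaned)
print(outliers)
```

Normalizing first matters here: without it, "ct " and "CT" would land in different groups and the low-resolution scan would be compared against the wrong median.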
- Data Analysis & Model Training: Feed GPUs with data while mitigating the costs
Today, state-of-the-art analytics and Machine Learning applications are mostly powered by GPUs, which analyze more data faster and support better business decisions. The biggest performance advantage of modern GPU computing also creates its biggest challenge: GPUs have an amazing appetite for data in the training phase. Current GPUs can process up to 16GB/s of data, and NVIDIA’s latest DGX-2 system has as many as 16 GPUs.
So starving the GPUs with slow storage, or spending time copying data, wastes expensive GPU resources and hurts the overall ROI of AI and ML use cases.
In that context, Spectrum Scale demonstrates performance leadership in AI, with 2.5TB/s of demonstrated throughput in a scale-out deployment architecture and 40GB/s in the 2U Spectrum Scale NVMe appliance (ESS3000). This high-performance tier is therefore available to feed GPUs with datasets, but given the costs incurred, only during the intensive training phase.
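As a rough back-of-the-envelope check, the figures quoted above can be combined to estimate the peak data demand of a DGX-2 and the number of ESS3000 appliances that would cover it. This sketch assumes every GPU ingests at the quoted 16GB/s simultaneously and that appliance throughput adds up linearly, both of which are simplifications; real training jobs rarely sustain the peak rate.

```python
import math

GPU_INGEST_GBPS = 16   # per-GPU upper bound quoted in the text
GPUS_PER_DGX2 = 16     # GPUs in one DGX-2, as quoted in the text
ESS3000_GBPS = 40      # throughput of one 2U ESS3000, as quoted in the text

def peak_demand_gbps(num_systems):
    """Theoretical peak ingest rate (GB/s) for num_systems DGX-2 servers."""
    return num_systems * GPUS_PER_DGX2 * GPU_INGEST_GBPS

def appliances_needed(num_systems):
    """ESS3000 units needed to cover the peak, assuming linear scaling."""
    return math.ceil(peak_demand_gbps(num_systems) / ESS3000_GBPS)

print(peak_demand_gbps(1))   # 16 * 16 = 256
print(appliances_needed(1))  # ceil(256 / 40) = 7
```

Even as an upper bound, the estimate shows why the NVMe tier is costly to size for peak demand and is best reserved for the training phase.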
Adding IBM Spectrum Discover to Spectrum Scale is key to automating the selection and prefetching of the right dataset from the ingestion/capacity tier (traditional NAS, Object Storage) into this high-performance storage tier based on NVMe. Discover drives the Spectrum Scale AFM policy engine to optimize the use of multiple storage tiers, from flash to disk to cloud to tape, and transparently positions the training dataset at the right place, at the right time and at the right cost.
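The selection step described above can be pictured with a small sketch: pick the files tagged for a given project that are not yet on the fast tier, and stop when the performance tier budget is used up. The catalog layout, the tag and tier names, and the smallest-first ordering are all assumptions for illustration; the actual prefetch would be carried out by Spectrum Scale itself, which this sketch does not call.

```python
# Hypothetical catalog entries; sizes are in GB. This is not the
# Spectrum Discover API, only an illustration of metadata-driven selection.
catalog = [
    {"path": "/lake/demo/v1.mp4", "project": "demo", "tier": "object", "size": 400},
    {"path": "/lake/demo/v2.mp4", "project": "demo", "tier": "nas", "size": 300},
    {"path": "/lake/demo/v3.mp4", "project": "demo", "tier": "nvme", "size": 500},
    {"path": "/lake/other/x.mp4", "project": "other", "tier": "object", "size": 100},
    {"path": "/lake/demo/v4.mp4", "project": "demo", "tier": "object", "size": 900},
]

def select_for_prefetch(catalog, project, budget_gb):
    """Return (paths, used_gb) for files tagged with `project` that are
    not already on the NVMe tier, taken smallest first within budget_gb."""
    candidates = [
        f for f in catalog
        if f["project"] == project and f["tier"] != "nvme"
    ]
    selected, used = [], 0
    for f in sorted(candidates, key=lambda f: f["size"]):
        if used + f["size"] > budget_gb:
            break  # every remaining file is at least this large
        selected.append(f["path"])
        used += f["size"]
    return selected, used

paths, used_gb = select_for_prefetch(catalog, "demo", budget_gb=1000)
print(paths, used_gb)
```

The resulting path list is the kind of input a tiering or caching policy could consume; a real deployment would likely rank files by training relevance rather than by size.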
One example of efficient data caching using the Spectrum Scale Data Accelerator for AI & Analytics is the IBM Smoke Detection in Video Imagery for Prescribed Fire Management: https://www.ibm.com/downloads/cas/GGWQ40KE
The combination of Spectrum Discover and IBM Spectrum Scale is really powerful for fully exploiting the available performance while optimizing the overall economics of the data lake.
- Data Governance and traceability for Dataset and AI models lifecycles
Spectrum Discover helps you improve data governance for your data lake. That means mitigating risk, remaining compliant with regulatory requirements, and improving data quality by identifying data that needs to be retained to meet internal policies. It also means identifying data that presents a risk to the organization and should be removed from the storage environment (PHI/PII information, for instance, in the context of GDPR). IBM Spectrum Discover’s rapid content search capabilities across large-scale environments enable faster responses to requests for legal discovery and regulatory audit.
Also, in the context of AI and Analytics, IBM Spectrum Discover offers better traceability for datasets and models. For instance, it’s possible to associate custom metadata with each iteration of a model, specifying which dataset, which hyperparameters and which preprocessing functions were used to train it, along with the inference results (performance and accuracy). This helps data scientists easily compare the performance of different models, and also supports reverse engineering to investigate why a model is failing.
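As an illustration of what such per-iteration metadata could look like, the sketch below records the dataset, hyperparameters, preprocessing steps and accuracy of each model iteration, then compares iterations. The ModelRun class, its field names and the sample values are hypothetical; in practice these would be custom tags stored in the Spectrum Discover catalog rather than Python objects.

```python
from dataclasses import dataclass, asdict

@dataclass
class ModelRun:
    """Hypothetical lineage record for one training iteration."""
    model: str
    iteration: int
    dataset: str
    hyperparameters: dict
    preprocessing: list
    accuracy: float

# Made-up sample values, for illustration only.
runs = [
    ModelRun("classifier", 1, "scans-v1", {"lr": 0.01, "epochs": 10},
             ["resize"], 0.81),
    ModelRun("classifier", 2, "scans-v2", {"lr": 0.01, "epochs": 10},
             ["resize", "normalize"], 0.88),
    ModelRun("classifier", 3, "scans-v2", {"lr": 0.1, "epochs": 10},
             ["resize", "normalize"], 0.74),
]

def best_run(runs):
    """Return the run with the highest recorded accuracy."""
    return max(runs, key=lambda r: r.accuracy)

def diff_runs(a, b):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}

print(best_run(runs).iteration)
print(diff_runs(runs[1], runs[2]))
```

Comparing iteration 2 with iteration 3 shows that the dataset and preprocessing are unchanged and only the hyperparameters differ, which is the kind of lead a data scientist needs when investigating a regression.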
Leveraging this type of information, Spectrum Discover can drive key enterprise features of Spectrum Scale (i.e. archiving to tape, activating file immutability, file encryption and audit logging) to support specific data governance requirements.