Take ML/DL to the next level with Metadata
Why metadata matters
To set the scene, let me start with a definition of metadata. According to NISO (link), “Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information.” In other words, metadata provides the information needed to understand and effectively use data.
This definition points out that metadata is a huge source of value for all “data players”, whether you are a data scientist, an analyst, a BI professional, or part of a storage IT team. This value becomes even more important in the context of data lakes, due to the size of the data stores (scaling from petabytes to exabytes) and the nature of the data stored.
As the foundation of most companies’ data platforms, data lake solutions are widely implemented, offering scalable and flexible data repositories that support different types of data analysis such as AI/ML algorithms, statistics, and data visualization. Key capabilities are massive ingestion of raw data from various sources, storage of data in its native format, and visualization and processing of that data only upon usage.
However, the absence of an explicit data schema can easily turn a large data lake into a swamp where data becomes invisible, inaccessible, and unreliable to data scientists. In that context, metadata is the most efficient way to let them find data that matches their needs, accelerate data access, verify data origin and processing history, and identify relevant data to enrich their analyses.
By creating structure in an unstructured data repository, a good metadata management solution is your best friend for fully exploiting the deluge of data captured every day. By extension, the quality of metadata is as critical as the quality of the data itself, as it directly affects the discoverability, use, and reuse of the resource.
Why Spectrum Discover excels as a metadata management solution
IBM Spectrum Discover is an on-premises interactive data management platform providing data insight for exabyte-scale unstructured storage. This software solution is non-disruptive to storage and applications: it extracts metadata into a centralized database and creates custom indexes that can then be queried with complex requests to derive data insights from large-scale data sources, without affecting data lake performance.
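To make the idea of a queryable, centralized metadata index concrete, here is a minimal sketch. It does not use the actual Spectrum Discover API; a small SQLite table stands in for the metadata database, and the paths, owners, and tags are invented purely for illustration. The point is that complex questions get answered against the lightweight index, never against the data lake itself.

```python
import sqlite3

# Stand-in for a centralized metadata database (hypothetical schema, not
# the real Spectrum Discover catalog). Each row indexes one file or object.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_metadata (
        path TEXT, size_bytes INTEGER, owner TEXT,
        modified TEXT, tag TEXT
    )
""")
conn.executemany(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
    [
        ("/lake/raw/genome_001.bam", 9_500_000_000, "alice", "2017-03-02", "genomics"),
        ("/lake/raw/scan_042.dcm",      12_000_000, "bob",   "2021-07-19", "imaging"),
        ("/lake/raw/log_2016.txt",       4_000_000, "carol", "2016-01-11", "untagged"),
    ],
)

# A typical insight query: "which owners hold the most cold (pre-2018)
# data, and how much?" -- answered from the index alone.
rows = conn.execute("""
    SELECT owner, SUM(size_bytes) AS total
    FROM file_metadata
    WHERE modified < '2018-01-01'
    GROUP BY owner
    ORDER BY total DESC
""").fetchall()
for owner, total in rows:
    print(owner, total)
```

Because only metadata rows are scanned, queries like this stay fast even when the underlying store holds billions of objects.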
It provides direct answers to questions such as: Where does any particular data asset reside? What, specifically, is contained in each of those thousands or millions of files? How old is it? Who has access to it, and should they? Is the data compliant with the latest governmental and international regulations?
What makes IBM Spectrum Discover different from other solutions on the market:
- Ability to connect to heterogeneous file and object data sources from both IBM and non-IBM storage systems, including Dell EMC Isilon, NetApp, Amazon S3, Ceph, and other S3-, NFSv3-, and SMB-compatible data sources.
- Real-time update of the metadata repository, with automatic capture and indexing of metadata when new data is ingested or existing data is modified on IBM storage solutions.
- Policy-based custom tagging that enables users to automatically create custom metadata to classify and categorize data in line with their business needs.
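The last bullet, policy-based tagging, can be sketched in a few lines. The rules and the `apply_policies` function below are hypothetical illustrations of the concept, not Spectrum Discover's actual policy syntax: each policy matches a path pattern and attaches business-aligned tags automatically.

```python
import re

# Hypothetical tagging policies (illustrative only, not the real product
# syntax): a path pattern paired with the custom tags it should attach.
TAG_POLICIES = [
    (re.compile(r"\.(bam|fastq|vcf)$", re.IGNORECASE), {"project": "genomics"}),
    (re.compile(r"\.(dcm|nii)$", re.IGNORECASE),       {"project": "imaging"}),
    (re.compile(r"^/lake/finance/"),                   {"retention": "7y"}),
]

def apply_policies(path):
    """Return the merged custom tags produced by every matching policy."""
    tags = {}
    for pattern, policy_tags in TAG_POLICIES:
        if pattern.search(path):
            tags.update(policy_tags)
    return tags

print(apply_policies("/lake/raw/sample_17.BAM"))      # {'project': 'genomics'}
print(apply_policies("/lake/finance/q3_report.csv"))  # {'retention': '7y'}
```

Running such rules at ingestion time is what keeps the metadata catalog aligned with business vocabulary without any manual tagging effort.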
Finally, it is also important to highlight the Software Development Kit (SDK), which lets users enrich and exploit their metadata according to their own priorities. With Spectrum Discover, users can build Action Agents that extract metadata from both file headers and content, automate data movement (e.g., deleting data or moving it to a cheaper, colder tier), and integrate with open-source software such as Apache Spark, Apache Tika, PyTorch, Caffe, and TensorFlow, which facilitates data identification (content-based data classification and tagging) and speeds up large-scale data processing.
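To give a feel for the kind of header-based classification an Action Agent might delegate to a tool like Apache Tika, here is a self-contained sketch. The signature table and function are assumptions for illustration, not SDK code: the file's real type is identified from its leading magic bytes rather than its (possibly wrong) extension.

```python
# Hypothetical content-based classifier: map well-known leading byte
# signatures ("magic bytes") to MIME types. Real agents would use a
# full-featured detector such as Apache Tika instead of this toy table.
MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-":             "application/pdf",
    b"\x1f\x8b":          "application/gzip",
}

def classify_header(header: bytes) -> str:
    """Match the first bytes of a file against known signatures."""
    for signature, mime in MAGIC_BYTES.items():
        if header.startswith(signature):
            return mime
    return "application/octet-stream"  # unknown: fall back to generic type

print(classify_header(b"%PDF-1.7 ..."))          # application/pdf
print(classify_header(b"\x89PNG\r\n\x1a\nxyz"))  # image/png
```

A tag produced this way ("this object is really a PDF") then flows back into the catalog, where it can drive the policy-based tagging and data-movement automation described above.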
As one example, on April 28, 2020 IBM announced new Spectrum Discover capabilities that use IBM Watson® solutions and IBM Cloud Pak for Data to make data “smarter” and more valuable. Now, with a single click, IBM Spectrum Discover can export data information to IBM Watson Knowledge Catalog and leverage the other IBM Watson AI solutions integrated with the Catalog. One available use case is automatically inspecting and classifying over 1,000 unstructured data types, including genomics- and imaging-specific file formats.
Spectrum Discover enables clients to create an open, transparent, and evolving data ecosystem. In the context of AI and analytics, this metadata repository enables data scientists to efficiently manage and classify data, and to gain insights that accelerate large-scale analytics over billions of files and objects.