MetaLake: A Layer Above Data Lakes

Channa Wijerathna
September 26, 2023

How Crucial Is Metadata?

If you're a machine learning engineer, you probably work with a vast number of images. Have you ever tried to find that one special photo among millions? Let's say you need to find images of bicycles that have a misty background and were captured in Los Angeles. It can feel like searching for a needle in a haystack. In the past, you might have spent hours or even days sifting through countless folders. Such a process is not only slow but also exhausting. However, there's now a smarter way to handle this.

So, let's dive back into that intricate challenge I mentioned earlier. You need to find images of bicycles with misty backgrounds, all captured in Los Angeles. Traditional storage systems are downright frustrating for this level of detective work. Sure, you could create a makeshift 'captured in Los Angeles' folder or try to manage a separate annotation file to identify those elusive bicycles, but let's be honest—these are band-aid solutions that still leave you in the lurch.

Now, envision a world where you could easily set a custom metadata tag like 'captured location' upon uploading your image collection. If such metadata were readily searchable, you'd instantly whittle down your quest to a more manageable scale. But wait, you're not just looking for any images taken in Los Angeles; you're looking for bicycles. Here's where annotation data shines. If every image had attached annotation data, you could sift through those tagged as 'bicycles' in no time.

Fantastic, you've managed to narrow it down to bicycle images in Los Angeles. But hold on, you've still got thousands of images to sort through. How do you isolate those with a misty backdrop? Traditional storage systems leave you hanging, with no easy path to perform this fine-grained search. Unless, of course, you could simply click one perfect example and have the system pull up all similar images. Sounds like a utopian dream, doesn't it? Imagine if there was a system that utilized feature embedding to understand the essence of each image. You could then click on one image with the misty backdrop and bicycles you're searching for, and instantly, the system would pull up other similar images. Now, that would be revolutionary, wouldn't it?

Well, prepare to be amazed. For the precise, advanced searching I've been describing, what we need is a tool that handles file metadata with flexibility. In today's data-driven landscape, where the ability to manage, interpret, and act upon data sets innovative companies apart, traditional solutions like data lakes and data warehouses are falling increasingly short. As businesses scale and data complexity rises, what's missing is a tool capable of flexible metadata management that goes beyond basic 'file names' and 'dates'. Think richer, deeper layers like annotations, embeddings, and other forms of advanced metadata that offer the precise, advanced searching and customization you've been craving. MetaLake represents a groundbreaking innovation, addressing the limitations of traditional data management systems. It actively enriches your data, making metadata not just an add-on but a first-class citizen that is both searchable and manageable. MetaLake isn't just an improvement; it's a revolutionary concept that is redefining the very essence of modern data management.

The Future of Data Management: MetaLake

In MetaLake, metadata isn't just an add-on; it’s the core. Metadata, often called 'data about data', is your gateway to more effective data management. We're not just talking about simple file names and dates, but sophisticated metadata like annotations, embeddings, and custom tags. MetaLake isn't just another place to store your data; it adds a sophisticated meta-layer over your existing data lakes or data warehouses. This added layer brings enhanced searchability, understandability, and usability to your data, something conventional data lakes simply cannot offer. MetaLake focuses intently on better metadata management, lineage tracking, and data governance features.

By concentrating on metadata as a cornerstone of data management, MetaLake doesn't just improve upon existing systems—it redefines what’s possible, setting new benchmarks for efficiency, comprehensibility, and governance. MetaLake is the forward-thinking solution driving the future trends of data management today.

Key Concepts

1. Organize your Data Logically

The MetaLake system distinguishes between metadata and the actual content of files. While the actual file content is stored in large-scale cloud storage solutions, MetaLake retains its metadata. Impressively, MetaLake operates independently of the storage provider where the actual data resides. This independence allows MetaLake to seamlessly integrate with various storage solutions such as Amazon S3, Azure Blob, Google Cloud Storage, and even servers on a local network. Unlike traditional systems that rely on a folder structure to organize data, MetaLake introduces a revolutionary concept called 'logical collections'. These collections aren't bound by folders or subfolders in the actual data storage. Instead, they can be created based on different criteria such as dates, tags, project names, or any other logical groupings like search results that suit an organization's needs. This innovative approach allows users to organize their data into any number of logical groupings while keeping a single, physical data file in the actual data storage.
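To make the idea of logical collections concrete, here is a minimal sketch of how one physical file can belong to several collections without being copied. The `FileRecord` and `Catalog` names, the storage URIs, and the in-memory layout are all assumptions made for illustration; they are not MetaLake's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata entry for one physical object; the bytes stay in storage."""
    file_id: str
    storage_uri: str  # hypothetical URI: S3, Azure Blob, GCS, or a local path
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Hypothetical in-memory catalog, used only to illustrate the concept."""

    def __init__(self):
        self.files = {}        # file_id -> FileRecord
        self.collections = {}  # collection name -> set of file_ids

    def register(self, record):
        self.files[record.file_id] = record

    def add_to_collection(self, name, file_ids):
        # A collection only stores references, never a second copy of a file.
        self.collections.setdefault(name, set()).update(file_ids)

    def collection_from_query(self, name, predicate):
        """Turn a search result into a collection."""
        matches = [fid for fid, rec in self.files.items() if predicate(rec)]
        self.add_to_collection(name, matches)
        return sorted(matches)

    def collections_of(self, file_id):
        return sorted(n for n, ids in self.collections.items() if file_id in ids)


catalog = Catalog()
catalog.register(FileRecord("f1", "s3://bucket-a/raw/0001.jpg", {"city": "Los Angeles"}))
catalog.register(FileRecord("f2", "gs://bucket-b/img/0002.jpg", {"city": "Colombo"}))

# Group by project, then group again by a search result.
catalog.add_to_collection("project-bikes", ["f1", "f2"])
la_files = catalog.collection_from_query(
    "los-angeles", lambda rec: rec.metadata.get("city") == "Los Angeles"
)
```

In this sketch, file `f1` ends up in both `project-bikes` and `los-angeles`, yet the catalog still holds a single record pointing at a single object in storage.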

2. Flexibility of Managing Metadata

One of the standout features of MetaLake is its exceptional flexibility when it comes to metadata management. Our system is designed to accommodate metadata of any data structure, be it simple or complex. Whether you need to store straightforward tags, fields, detailed annotations, or even sophisticated embeddings, MetaLake has got you covered. This unparalleled flexibility allows you to save and search metadata in diverse forms, including text, numerical values, and even nested structures such as objects within objects or arrays within objects. This capability extends to accommodating even specialized formats like DICOM data.
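As a rough illustration of what searching nested metadata can look like, the sketch below stores metadata as plain Python dictionaries and looks values up by a dotted path. The `get_path` and `search` helpers, the dotted-path syntax, and the sample fields are assumptions made for this example, not MetaLake's query language.

```python
def get_path(doc, path):
    """Return every value found at a dotted path, descending into lists."""
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            # An array of objects is searched element by element.
            candidates = value if isinstance(value, list) else [value]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    # Flatten a trailing array so that ["bicycle", "mist"] matches "mist".
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(value)
        else:
            flat.append(value)
    return flat


def search(files, path, expected):
    """Return the names of files whose metadata holds `expected` at `path`."""
    return sorted(name for name, md in files.items() if expected in get_path(md, path))


# Hypothetical metadata: text, numbers, objects within objects, arrays within objects.
files = {
    "img_001.jpg": {
        "state": "California",
        "camera": {"model": "X100", "lens": {"focal_mm": 35}},
        "tags": ["bicycle", "mist"],
    },
    "img_002.jpg": {
        "state": "Nevada",
        "camera": {"model": "Z6", "lens": {"focal_mm": 50}},
        "tags": ["car"],
    },
}

by_state = search(files, "state", "California")
by_lens = search(files, "camera.lens.focal_mm", 35)
by_tag = search(files, "tags", "mist")
```

All three queries return only `img_001.jpg`, whether the value sits at the top level, two objects deep, or inside an array.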

Sources of Metadata

  • Inherent Properties of Files:
MetaLake takes metadata management a step further by automatically generating essential metadata for your files. This includes basic details like file size, resolution, file name, frame rate, and frame count. This functionality is automatic, requiring no extra effort on your part, and ensures you're not starting from scratch when you want to undertake tasks like customized searches or data analytics.
  • User Defined Metadata:
While MetaLake efficiently generates essential metadata, it recognizes that every user or organization might have unique metadata requirements that go beyond the basics. That's where custom metadata comes into play. Imagine you need to find all images taken within California. With MetaLake's custom metadata feature, you can simply add a 'state' field to your files. Then, searching for all images associated with 'California' in the 'state' field becomes a straightforward task. Think of custom metadata as the fine-tuned details that can immensely enhance data categorization and retrieval. Examples include 'captured location', 'camera specifications', 'weather conditions', and so forth. The beauty of MetaLake's custom metadata feature lies in its adaptability. Regardless of data type, whether simple text or complex nested structures like objects within objects or arrays within objects, MetaLake seamlessly stores it all. Even specialized formats such as DICOM data can be effortlessly incorporated as custom metadata. This user-centric approach empowers organizations to structure their data storage in a way that best fits their unique needs and workflows.
  • Annotations:
Annotation metadata, particularly in the realm of machine vision, holds immense value. While annotations for images are fundamentally metadata, MetaLake treats them with special importance. Annotations essentially provide a detailed description or highlight specific features within images, playing a pivotal role in tasks like object detection, image segmentation, and more. Recognizing this, MetaLake has carved out a dedicated space for annotation metadata, ensuring it's both easily accessible and effectively organized. This focus ensures that machine vision tasks, whether it's training AI models or extracting insights from visual data, are streamlined and efficient.
  • Feature Embeddings:
In addition to the variety of metadata types supported, MetaLake introduces a game-changing capability: the storage of feature embeddings for each image file. Unlike traditional metadata, feature embeddings capture the essence of the image in a mathematical form, which can then be used to compare the similarity between multiple images. This opens the door for incredibly precise similarity searches, enhancing the system's ability to retrieve exactly the content you're looking for, right when you need it. Whether you're looking for pictures with similar themes, backgrounds, or objects, MetaLake's feature embeddings make the search process not just faster, but also far more accurate.
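To show how feature embeddings can drive a "find more like this one" search, here is a minimal sketch using cosine similarity. The three-dimensional vectors and file names are made up for readability (real embeddings come from a vision model and have hundreds of dimensions), and cosine similarity is one common choice assumed here rather than a confirmed detail of MetaLake.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Rank every other file by similarity to the clicked example."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), file_id)
        for file_id, vector in embeddings.items()
        if file_id != query_id
    ]
    scored.sort(reverse=True)
    return [file_id for _, file_id in scored[:top_k]]


# Hypothetical toy embeddings, one vector per image file.
embeddings = {
    "misty_bike_1.jpg": [0.9, 0.1, 0.8],
    "misty_bike_2.jpg": [0.8, 0.2, 0.9],
    "sunny_bike.jpg": [0.9, 0.1, 0.0],
    "misty_street.jpg": [0.1, 0.9, 0.8],
}

results = most_similar("misty_bike_1.jpg", embeddings, top_k=2)
```

Here the user "clicks" `misty_bike_1.jpg` and the other misty bicycle image ranks first. A linear scan like this is fine for a demo; at the scale of millions of images, an approximate nearest-neighbor index would normally replace it.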

[Figure: Data Types Supported by MetaLake]

3. Data Lineage

One of MetaLake's standout features is its capability for Metadata Propagation. Imagine you have multiple videos, each captured in different locations. From these videos, you've extracted individual frames to create a dataset. You run a machine learning model on this dataset and find that it performs well on some images but not on others. Naturally, you're curious to explore why, and one hypothesis could be that the location where the video was captured influences the model's performance.

Here's where MetaLake's Metadata Propagation comes into its own. You can add metadata like "captured location" to the original video files, and this metadata will automatically propagate to all the extracted image frames that were derived from that video. This eliminates the cumbersome task of manually adding or updating metadata for each individual frame, streamlining your troubleshooting process and giving you rapid insights into the factors affecting your model's performance. In other words, MetaLake makes it incredibly simple to trace back and understand your data lineage, allowing you to make data-driven decisions more effectively.
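The propagation idea can be sketched with a simple parent link between a frame and the video it was extracted from. The `Node` class and the read-time merge below are assumptions for illustration; MetaLake may instead copy values to derived files when the parent is updated.

```python
class Node:
    """A file in the lineage graph; frames point back to their source video."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.own_metadata = {}

    def effective_metadata(self):
        """Merge metadata inherited from ancestors; a node's own values win."""
        inherited = self.parent.effective_metadata() if self.parent else {}
        return {**inherited, **self.own_metadata}


video = Node("ride_la.mp4")
frames = [Node(f"ride_la_frame_{i:04d}.jpg", parent=video) for i in range(3)]

# Tag the video once, after the frames were already extracted.
video.own_metadata["captured location"] = "Los Angeles"
# A frame can still carry metadata of its own.
frames[1].own_metadata["weather"] = "mist"

locations = [frame.effective_metadata()["captured location"] for frame in frames]
```

Every frame now reports "Los Angeles" without being edited individually, and walking the `parent` links answers the lineage question of which video a given frame came from.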

Advanced Concepts and Features

Multi-Interface Metadata Management

In MetaLake, updating and retrieving file metadata becomes an effortless and streamlined process, thanks to its user-friendly web frontend and robust Python SDK. Whether you're a non-technical user or a seasoned developer, MetaLake has you covered.

For those who prefer a graphical interface, the web frontend offers intuitive options to add, modify, or delete metadata. It features a straightforward user interface where you can easily navigate through your data collections, select files, and input or update metadata in just a few clicks. This makes it perfect for users who may not be familiar with programming but still require advanced data management capabilities.

On the other hand, if you're looking for more programmatic control, MetaLake's Python SDK serves as a powerful tool for developers. With just a few lines of code, you can not only upload or download files but also modify their metadata properties. The SDK allows for bulk operations, enabling you to manage large datasets efficiently. It supports a variety of data types for metadata, making it highly flexible and compatible with your specific needs. This can be particularly useful for automating workflows or integrating MetaLake into existing systems.

Metadata Consistency

When working on a team-based project, the uniformity of metadata can often be a challenge. Different team members might update the same metadata field in varying ways. For instance, one team member might input the field as "captured location," while another enters it as "captured_location." Similarly, when populating these fields, one could enter the value as "Los Angeles," while another inputs "los_angeles."

To mitigate such discrepancies and prevent human errors, MetaLake introduces a robust feature for Metadata Consistency. Whenever a user attempts to input a new metadata field, MetaLake actively informs them if a similarly named field already exists in the system. Additionally, fields can be predefined at the system level with specific data types. This ensures that users not only adhere to consistent naming conventions but also enter data in the correct format. As a result, MetaLake’s Metadata Consistency feature standardizes metadata across the organization, leading to more accurate and efficient data management, and ultimately facilitating seamless team collaboration.
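A check along these lines could be sketched as follows. The `FieldRegistry` class, the normalization rule, and the 0.8 similarity cutoff are assumptions chosen for the example; only `difflib.get_close_matches` is a real standard-library call, and nothing here reflects MetaLake's internal implementation.

```python
import difflib
import re


def normalize(name):
    """Lowercase and unify spaces, hyphens, and underscores."""
    return re.sub(r"[\s_\-]+", "_", name.strip().lower())


class FieldRegistry:
    """Hypothetical registry of predefined fields and their data types."""

    def __init__(self):
        self.fields = {}  # field name -> expected Python type

    def define(self, name, expected_type):
        self.fields[name] = expected_type

    def similar_fields(self, name, cutoff=0.8):
        normalized = {normalize(existing): existing for existing in self.fields}
        close = difflib.get_close_matches(
            normalize(name), list(normalized), n=3, cutoff=cutoff
        )
        return [normalized[match] for match in close]

    def check(self, name, value):
        """Return a list of warnings; an empty list means the input looks fine."""
        problems = []
        if name not in self.fields:
            similar = self.similar_fields(name)
            if similar:
                problems.append(
                    f"field '{name}' looks like existing field '{similar[0]}'"
                )
            return problems
        expected = self.fields[name]
        if not isinstance(value, expected):
            problems.append(
                f"field '{name}' expects {expected.__name__}, got {type(value).__name__}"
            )
        return problems


registry = FieldRegistry()
registry.define("captured_location", str)
registry.define("frame_count", int)

name_warning = registry.check("captured location", "Los Angeles")
type_warning = registry.check("frame_count", "12")
```

The first call flags "captured location" as a near-duplicate of "captured_location", and the second flags a string where an integer is expected. Normalizing the values themselves ("Los Angeles" versus "los_angeles") would need a similar rule and is left out of this sketch.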

We would love to engage with anyone working on computer vision projects who is struggling to work with a large amount of vision data. Please join our Slack channel or reach out to us to discuss further.
