
The complete guide to AI Data Platforms

Buddhika Madduma
September 26, 2023

Introduction

You may be wondering how big enterprises like Tesla and OpenAI make significant progress on large-scale AI projects like Full Self-Driving (FSD) vehicle systems and DALL-E image generation. The secret is that they understand the fundamental principles of building AI models very well and have created powerful workflow processes around them.

The key to AI is data: no data, no AI. The AI research community, including Andrew Ng, believes that AI is fundamentally a data problem. If current state-of-the-art algorithms are fed sufficient high-quality data during training, they can produce high-performing models that can be deployed into production systems.

Additionally, the Tesla AI team includes ML engineers, infrastructure engineers, data labelers and data collectors to operate its Data Engine and the whole AI model development process. Each group has its own dedicated responsibilities, and this collaborative structure makes the team more efficient and reduces errors.

As the Tesla example shows, purpose-built data tools and data infrastructure are key to developing production-deployable machine learning models. Large enterprises like Tesla, Google and OpenAI have the resources and expertise to build Data Engine solutions in-house. However, startups and other non-tech companies working on AI projects are left with lackluster tools, as they lack the expertise or resources to build such solutions themselves.

Let’s take a deep dive into how most AI teams operate today without a proper Data Engine. Imagine a health tech company that wants to track activities in the Operating Room (OR) using computer vision and AI, and starts collecting video data from its hospital partners. Any health-related data by default triggers a number of HIPAA requirements. The AI team stores the unstructured video data collected from the OR in a HIPAA-compliant AWS S3 bucket, then realizes it can’t even view video stored in S3 cloud buckets without building HIPAA-compliant custom tools.

Their next problem is how to find or search for data in S3. General-purpose storage buckets like S3 don't hold any metadata about stored files: although S3 supports searching by file name, there is no way to search based on content, recorded location, date or other metadata. As a workaround, the team starts using a Google spreadsheet to keep file references and some metadata about the content of each file. It solves the problem to some extent, but the solution is error-prone, inefficient and suboptimal.
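
To make this limitation concrete, here is a minimal boto3 sketch (the bucket name and prefix are illustrative placeholders): the only server-side filter S3 offers is a key prefix.

```python
# A minimal sketch of the S3 search limitation, using boto3.
# Bucket name and prefix below are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

# The only server-side "search" S3 offers is filtering by key prefix:
response = s3.list_objects_v2(Bucket="or-video-data", Prefix="hospital-a/2023/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# There is no way to ask S3 "find all videos recorded in OR-3 last June":
# content, location and recording-date filters require an external
# metadata index maintained by the team.
```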

After finding data in the S3 bucket, the AI team wants to send the dataset out for labeling. The team creates another folder inside the S3 bucket and copies the curated data into it, which duplicates the data and wastes resources. Then they share a link to the newly created folder with the annotation team.
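
A hedged sketch of this copy-based hand-off, with placeholder bucket and folder names: every curated file is physically duplicated inside the bucket just to share it.

```python
# Illustrative sketch of the copy-based workflow described above.
# Bucket, folder and file names are placeholders.
import boto3

s3 = boto3.client("s3")
curated_keys = ["hospital-a/2023/or3_cam1_0612.mp4"]  # output of the manual search

for key in curated_keys:
    # Copy each curated object into a "to-annotate" folder for sharing.
    s3.copy_object(
        Bucket="or-video-data",
        CopySource={"Bucket": "or-video-data", "Key": key},
        Key=f"to-annotate/batch-01/{key.split('/')[-1]}",
    )
# Storage cost doubles for every batch, and the two copies can drift apart.
```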

Without purpose-built tools, the team struggles to collaborate and to manage this pipeline of collecting, searching, curating, labeling and training. Most of an ML engineer's valuable time ends up being spent on data preparation and data management tasks.

Key Highlights

  1. It is difficult to track and search unstructured vision data (images and videos) stored in cloud buckets like S3 or other storage systems without a proper metadata layer.
  2. Data duplication is inevitable at each stage of the pipeline without purpose-built data tools.
  3. Collaboration between the AI team, the labeling team and other stakeholders is a must for an efficient model development process.
  4. Data security should not be compromised while sharing data between teams; data governance and control are important.
  5. AI teams spend around 70% of their time on data operations because they lack access to purpose-built data tools and infrastructure.

How can a purpose-built, best-in-class Data Platform help?

Any team in a large enterprise or a startup that works on serious computer vision and AI projects can benefit significantly from a purpose-built data platform. It allows AI teams to focus on AI solutions and algorithms rather than on data problems. Additionally, it gives organization leaders peace of mind and confidence about data security.

Understand and Visualize your data

It is extremely important for any organization to have a good understanding of its available data. Knowing what's in raw datasets helps organizations turn that data into a competitive advantage by building great products or improving efficiency.

Companies are collecting and accumulating enormous amounts of data; however, the majority of unstructured data such as videos and images sits in multiple cloud buckets or storage systems without being tracked. Once the volume reaches terabytes, these buckets become data junkyards that are impossible to utilize. No one knows what they have and what they don't have.

The first step is to unify all data sources (all cloud buckets) into one infrastructure and allow the team to easily visualize and search data as needed. To facilitate search and visualization, we have to track metadata for every object stored in the cloud buckets from the outset. Once you have one unified data infrastructure (a MetaLake), everything else can be streamlined: data consumers such as curation, annotation and other applications can search and access data efficiently.
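
As a toy illustration of the idea (not LayerNext's implementation), the sketch below builds a miniature metadata index with SQLite; the schema and field names are assumptions.

```python
# A toy sketch of the metadata-layer idea (a "MetaLake" in miniature):
# track metadata for every object at ingest time, then search the index
# instead of the bucket. Schema and field names are illustrative.
import sqlite3

db = sqlite3.connect("metalake.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS objects (
        s3_key TEXT PRIMARY KEY,
        source_bucket TEXT,
        recorded_at TEXT,
        location TEXT,
        labels TEXT
    )
""")

# Register metadata when an object lands in any bucket...
db.execute(
    "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?, ?)",
    ("hospital-a/2023/or3_cam1_0612.mp4", "or-video-data",
     "2023-06-12", "OR-3", "surgery,scrub-nurse"),
)
db.commit()

# ...and later search by content metadata rather than by file name:
rows = db.execute(
    "SELECT s3_key FROM objects WHERE location = ? AND labels LIKE ?",
    ("OR-3", "%surgery%"),
).fetchall()
print(rows)
```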

Benefits of One Unified Data Infrastructure

  1. Search and visualization become much easier with a metadata layer over the data.
  2. Know your data: everyone on the team stays up to date on the datasets, with a good understanding of what they have and what data they might still need to obtain.
  3. Reduce or completely eliminate data duplication.
  4. Facilitate data lineage and provenance.
  5. Collaboration between AI, labeling and data collection teams becomes efficient and reduces costly mistakes.

LayerNext MetaLake provides a holistic view of all your enterprise's data assets by effectively managing unstructured data. It extracts and stores labels and metadata for unstructured data in cloud buckets to facilitate fast search and visualization.

Features

  1. Fast search with dynamic queries over labels and metadata.
  2. Data inflow and outflow via SDK and REST API.
  3. Easy integration with AI pipeline processes or third-party apps (see the sketch below).
  4. Model observability through comparison of model run output with ground truth.
  5. No data duplication, while still facilitating data lineage and provenance.
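
To illustrate what SDK/REST integration can look like, here is a hypothetical sketch using Python's requests library; the endpoint paths, parameters and base URL are illustrative assumptions, not LayerNext's documented API.

```python
# Hypothetical sketch of pipeline integration via a REST API.
# Endpoint paths, parameters and the base URL below are illustrative
# assumptions, not LayerNext's documented API.
import requests

BASE = "https://api.example-metalake.com/v1"   # placeholder URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Inflow: register a newly collected file and its metadata.
requests.post(f"{BASE}/objects", headers=HEADERS, json={
    "s3_key": "hospital-a/2023/or3_cam1_0612.mp4",
    "metadata": {"location": "OR-3", "recorded_at": "2023-06-12"},
})

# Outflow: a downstream training job queries for data by metadata.
resp = requests.get(f"{BASE}/objects", headers=HEADERS,
                    params={"label": "surgery", "location": "OR-3"})
for item in resp.json().get("items", []):
    print(item["s3_key"])
```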

Label data quickly and efficiently

Once you identify raw data in the MetaLake that can improve your model's performance, the next step is to label that raw data and create a training dataset. Data labeling is an expensive and time-consuming process. Some companies outsource labeling tasks to labeling service providers; in other cases, for security reasons, internal labeling teams must do the work. Either way, it is critically important to have efficient data labeling tools that allow multiple teams to collaborate during the labeling process.

LayerNext Annotation Studio is designed to support labeling at scale with a large workforce and flexible, customizable workflows. The Studio can directly access data in the MetaLake to create annotation projects, and it returns all annotated ground truth data to the MetaLake.

Features

  1. Image and Video annotation.
  2. Model assisted annotation.
  3. Attribute Level annotation.
  4. Bounding Box, Line, Point, Polygon annotation.
  5. Semantic Segmentation.
  6. Customizable workflows.
  7. Team collaboration.

Manage Training Data with versioning

The next important pipeline stage is creating training datasets from the annotated data for machine learning model training. Training datasets are generated based on model input requirements: for example, segmentation models like U-Net require a segmented image mask along with the original image, while YOLO-like object detection models require the original images plus annotations in a text file. Usually AI teams maintain their own scripts to convert annotated data into a training dataset for their chosen model (a sketch of such a conversion follows below). As already mentioned, AI model development is iterative until the required accuracy is achieved, so the training dataset evolves over time. It is important to keep each dataset version along with the model's resulting accuracy for comparison. Usually teams use Google Sheets or Excel files to track training dataset versions.
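
As an example of the kind of conversion script teams maintain, the sketch below turns absolute-pixel bounding boxes into YOLO's normalized text format; the frame size and box values are illustrative.

```python
# A minimal sketch of a typical hand-rolled conversion script:
# turning absolute-pixel bounding boxes into YOLO's normalized
# "class x_center y_center width height" text format.

def to_yolo_line(class_id, box, img_w, img_h):
    """box = (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w       # normalized box center
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w             # normalized box size
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# One annotated box on a 1920x1080 OR video frame (illustrative values):
print(to_yolo_line(0, (480, 270, 960, 810), 1920, 1080))
# -> "0 0.375000 0.500000 0.250000 0.500000"
```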

LayerNext Dataset Manager eliminates the custom scripts and Google Sheets that teams would otherwise need to maintain. It has built-in functions to generate annotated data for many model types. Automatic dataset versioning allows teams to seamlessly track changes to datasets along with their use in model training and the resulting accuracy. Furthermore, the tight integration between Dataset Manager and MetaLake means there is no duplication of data.

Features

  1. Dataset version control.
  2. Built-in data augmentation.
  3. Conversion and formatting of annotations based on the model.
  4. Auto-balanced training, validation and testing splits (see the sketch below).
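
As a toy illustration of balanced splitting (not LayerNext's implementation), the sketch below uses scikit-learn to produce a stratified 80/10/10 split so each class keeps the same proportion in every subset.

```python
# A toy sketch of auto-balanced splitting: a stratified train/val/test
# split so each class appears in the same proportion in every subset.
# Uses scikit-learn; the file names, labels and 80/10/10 ratios are
# illustrative.
from sklearn.model_selection import train_test_split

items  = [f"frame_{i:04d}.jpg" for i in range(100)]
labels = ["surgery"] * 70 + ["idle"] * 30

# First carve off 20% as a holdout, stratified by label...
train, holdout, y_train, y_holdout = train_test_split(
    items, labels, test_size=0.2, stratify=labels, random_state=42)

# ...then split the holdout evenly into validation and test sets.
val, test, _, _ = train_test_split(
    holdout, y_holdout, test_size=0.5, stratify=y_holdout, random_state=42)

print(len(train), len(val), len(test))   # 80 10 10
```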

Integrate easily with your tech stack

Who benefits from a best-in-class AI data platform?

To build high-performing AI models and applications, organizations must equip themselves with the tools and processes needed to collect, curate, annotate and manage training data at scale. A purpose-built data platform can yield significant benefits for teams at companies of any size seeking to deploy AI models to production faster.

Within an organization, different stakeholders see different benefits from using a best-in-class data platform.

Technical leaders like heads of engineering or AI directors can manage all AI data and processes in one unified platform. An all-in-one, enterprise-grade, compliance-ready platform gives confidence and peace of mind about data security.

ML engineers can quickly find raw or labeled data with metadata search to create annotation projects or training datasets. They can also integrate and automate data pipelines as part of AI model development.

Data scientists can start new AI experiments quickly and analyze model behavior by comparing ground truth with model run data. Seamless data access and visualization make it easy to analyze experiments.

Labeling teams can label data efficiently with easy-to-use tools. The workflows and task queues are designed to let a large labeling workforce operate with minimal supervision.

Business leaders can map areas of AI investment to key business priorities to better correlate ML projects with business value, and can accelerate the development and maintenance of high-performing production AI to gain and keep a competitive edge.

We would love to engage with anyone working on computer vision projects who is struggling with large volumes of vision data. Please join our Slack channel or reach out to us (buddhika@layernext.ai) to discuss further.
