There is a lot of buzz around AI right now with the recent releases of ChatGPT and generative models such as Stable Diffusion, and it looks like AI is going to eat the world soon. While NLP and generative AI are solving important problems, visual AI, or computer vision, is going to play an even greater role in our lives. In the near future, computers and cameras may take over many tasks that currently require human eyes and brains. Even today, often without realizing it, we use products built on computer vision and AI. YouTube and Facebook use computer vision to moderate content on their platforms for safety. ADAS (advanced driver assistance systems) use computer vision to detect lanes on the road. Meanwhile, startups and established companies alike are tackling complex problems with computer vision, such as self-driving cars and checkout-free shops.
The combination of convolutional neural networks (CNNs) and deep learning has brought computer vision to its current state over the last decade. In some cases, models like YOLO and Fast R-CNN have performed better than humans. This is possible thanks to the combination of better algorithms and better training data. Today, supervised learning algorithms are at their peak, but their accuracy depends heavily on how much training data they receive. For example, Tesla's and Waymo's self-driving models have been trained on millions of images just to detect traffic lights, because traffic lights vary in appearance across cities and are affected by environmental conditions such as daylight, nighttime, rain, and snow. Covering all these cases requires a vast amount of data. As deep learning pioneer Andrew Ng puts it, computer vision today is primarily a data problem.
The other factor behind this remarkable success is the improvement in hardware. Applications such as self-driving cars and checkout-free shops require certain frame rates to produce the desired outcome, and GPU-enabled edge computers like the NVIDIA Jetson have made such real-time applications possible.
It is hard to imagine how many images and videos are captured every day around the world. An estimated 1.7 trillion images are created each year by more than 6 billion smartphone users, and roughly 1 billion surveillance cameras are installed worldwide, far more than humans could ever review. In addition, most modern devices, from cars to refrigerators, come with one or more image-capturing sensors. The huge volume of visual data generated by these devices will enable new applications and possibilities.
For example, your refrigerator might notify you to buy milk on your way home from work when you are running low. Computer vision and artificial intelligence are going to make the future incredibly exciting.
The current state of supervised learning forces us to take a data-centric approach to computer vision, so companies collect enormous amounts of visual data. In many cases, computer vision teams store this data in cloud storage buckets like Amazon S3 or Google Cloud Storage, and raw captures are simply dumped into these buckets without proper organization. As datasets grow, finding the required data becomes increasingly difficult, because someone has to manually search through the files.
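To illustrate the problem, once a bucket has been used as a raw dump, the object key is the only search handle left. A minimal sketch (plain Python, with a made-up key listing standing in for what an S3 object listing would return) shows why finding, say, night-time captures means scanning every key, and still fails whenever the information is not encoded in the filename:

```python
# Hypothetical flat listing of raw captures dumped into a bucket.
# In practice this would come from paginating an S3 object listing.
keys = [
    "captures/cam01/2023-01-14T22-31-05.jpg",
    "captures/cam01/2023-01-15T09-02-41.jpg",
    "captures/cam02/img_000123.jpg",  # no timestamp in the name
    "captures/cam02/img_000124.jpg",
]

def find_night_captures(keys):
    """Best effort: parse the hour out of the key, if it is even there."""
    night = []
    for key in keys:
        stem = key.rsplit("/", 1)[-1]
        parts = stem.split("T")
        if len(parts) == 2:  # timestamped name like 2023-01-14T22-31-05.jpg
            hour = int(parts[1].split("-")[0])
            if hour >= 20 or hour < 6:
                night.append(key)
        # keys like img_000123.jpg are simply invisible to this search
    return night

print(find_night_captures(keys))  # only the one key with a night hour in its name
```

Filenames cannot carry weather, location, sensor, or labeling status, which is exactly the metadata teams end up needing to query.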
Because of this, teams sometimes spend 70% of their time on data preparation, leaving little time for understanding and tuning the model. Large tech companies like Tesla, Waymo, and Etsy tackle this with dedicated data and infrastructure teams. Smaller startups and non-tech companies, on the other hand, struggle to manage their data effectively without proper tools and infrastructure, often resorting to building their own suboptimal solutions.
We are a passionate team of AI enthusiasts who fell in love with the field during our undergraduate studies in 2002-2006, when AI was not nearly as popular as it is today. Our first project was a mobile robot that responded to voice commands using a simple feed-forward neural network. The robot was taught to recognize the words "forward," "reverse," "left," "right," and "stop." It worked to a certain extent, though not reliably. That experience marked the beginning of our AI journey. Through the deep learning revolution, we have been involved in many computer vision projects across sectors such as healthcare, automotive, and retail.
We recently developed a computer vision solution for an OEM agricultural machinery company with two separate teams working on computer vision projects. The challenge in this use case is that the same machine must function in vastly different environments, from fields in northern Canada to the desert around Phoenix, Arizona, and must support various crop types such as soybeans, corn, and strawberries. These conditions present unique computer vision challenges and require a large amount of data to be collected in order to build a model that can be deployed in the field.
In a similar project in the healthcare space, we used computer vision to automate event tracking in operating rooms. The data was collected from cameras placed in different hospitals and stored in a HIPAA-compliant S3 bucket. However, viewing the data was difficult, since downloading it to local machines was prohibited under HIPAA rules, so we had to build custom HIPAA-compliant tools to preview the data directly in S3. The underlying issue is that cloud storage solutions like S3 are designed as general-purpose storage containers; to use them for computer vision, another layer must be built on top of them.
Since there is no solution on the market for this problem, we decided to build one. Our first step is to connect all data storage sources together into a single unified infrastructure we call MetaLake. It is a middleware layer that sits on top of cloud storage buckets and holds the metadata of each file. Its basic function is to provide powerful yet simple search capabilities and to control the input and output streams into the cloud buckets, so every file moving in or out of the cloud is automatically tracked in the MetaLake. Data collection and ingestion processes can be integrated directly through our Python SDK and REST API upstream, as can other data operations tools, such as annotation and curation, downstream.
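As a rough illustration of the metadata-layer idea (this is a toy sketch, not the actual MetaLake implementation; the class and field names below are invented for the example), the core of such a middleware is an index that records metadata for every object entering the bucket and answers queries without ever touching the objects themselves:

```python
from dataclasses import dataclass, field

@dataclass
class MetaLakeSketch:
    """Toy metadata index over a storage bucket (illustrative only)."""
    index: dict = field(default_factory=dict)  # object key -> metadata dict

    def track_upload(self, key, **metadata):
        # In a real system, this hook would run on every PUT to the bucket.
        self.index[key] = metadata

    def search(self, **criteria):
        # Return keys whose metadata matches all of the given criteria.
        return [
            key for key, meta in self.index.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        ]

lake = MetaLakeSketch()
lake.track_upload("s3://captures/0001.jpg",
                  sensor="cam01", weather="rain", time_of_day="night")
lake.track_upload("s3://captures/0002.jpg",
                  sensor="cam02", weather="clear", time_of_day="day")

# Queries run against the index, not the bucket, so they stay fast and cheap.
print(lake.search(weather="rain", time_of_day="night"))
```

The key design point this sketch captures is that search moves from scanning object keys to querying structured metadata, which is what makes questions like "all rainy night captures from cam01" answerable at all.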
We would love to engage with anyone working on computer vision projects who is struggling with large amounts of vision data. Please join our Slack channel or reach out to us (firstname.lastname@example.org) to discuss further.