I’ve been watching the evolution of computer vision closely, and things are getting VERY interesting right now.
We’ve always wondered when computers will think like humans, and the answer has always remained that elusive “20 years away”. To really interface with us, they need some KEY things:
#1 TO REMEMBER (STORAGE & RECALL)
#2 TO LISTEN & COMPREHEND (MICROPHONES & COMPUTE)
#3 TO SPEAK (SPEECH GENERATION)
#4 TO SEE
#5 TO THINK (COGNITION, EMOTION & CREATIVITY)
Of those 5 things, the first three are pretty much nailed, and #4 is next.
The evidence that #4 is still elusive is that Siri on the iPhone and Microsoft’s Cortana don’t ask to see what you’re talking about. Siri should say “Can you show me that?”.
When people are given the chance to show a seeing computer something, the data reveals that the number one thing they ask about is problems on their body. They worry about rashes etc. It’s interesting because it shows how most companies are working on the wrong problem. Learning Starbucks, Mercedes and Nike logos from all angles won’t get you there.
Looking at a simple coffee cup… Algorithms today focus on the Starbucks logo, and respond with offers of Starbucks products like “Starbucks iPhone Case”. Huh? At least identify it as “a white Starbucks Ceramic Mug”. Why are you just showing me iPhone cases?
The step that all the companies fail at is when you break the mug… Even a kid would say “A broken mug”, but after countless millions of dollars every research tech fails to nail it.
I’ve seen multi-multi-million dollar systems analyze this kind of image and return the word “Creamy”. What?
I have friends at CloudSight.ai who have avoided the typical “buy data sets and crunch them” model, as they knew they needed cognition (understanding & comprehension) of image concepts. They have an open API that is being used in numerous 3rd party applications today, processing countless millions of images from real people.
Most companies are working on straight “recognition”, and I get it. I’m a programmer and I also love to think that programming alone can get us there, but it can’t. It’s like the visual researchers are following the old path of the audio researchers, who tried to recognize individual words that have no context.
I remember Bill Gates talking about voice recognition once. He explained just how difficult it is to understand “Do you recognize speech?” vs “Did you wreck a nice beach?” Even if a computer gets the words right, getting it to understand the question was a massive problem. So it was only when researchers focused on understanding context that things started to leap forward in getting the cloud to listen.
The reason the CloudSight.ai solution is interesting is that they’ve spent years working on the parental teaching loop that human brains require to grow. The reason the kid can understand the broken mug is that they understand the concept BROKEN: they have broken things and seen how they break. They will see more and more evidence of this idea growing up and can recognize BROKEN in any form. There’s lots to learn… “Those glasses are broken”, “Those glasses are old”, “That person looks ill” etc.
Every single major corporation (Apple, Google, Microsoft, Facebook, Pinterest etc.) will need the cloud to either see or understand the billions of images and videos they are handling. It’s a certain future and it’s fun watching the progress.
I just gave it a fun image to try… It nailed it.
Here’s someone testing all the top solutions:
So keep an eye on this space, it’s about to get very interesting.