They’re watching, but are they seeing?
By Gautam Shroff
Notwithstanding the many privacy concerns it raises, the role of video surveillance footage in cracking the Boston terror attack case in a matter of days is well known. Such footage played an equally critical role in tracking down the bombers of the 2005 London attacks. However, in 2005 investigators took weeks to manually sift through about two thousand hours of video footage. This time around, thousands of hours of video were analyzed in barely 48 hours.
The city of Boston is smaller than London; still, it has thousands of surveillance cameras, much like the London of 2005. What has changed is technology: video analysis has become significantly more sophisticated in the years since 2005. For example, pre-processing tools are able to filter out hours of video footage of, say, an empty subway station at night. Investigators are then able to focus only on periods of activity rather than patiently watch footage of an empty platform for hours on end.
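To make the idea of filtering out idle footage concrete, here is a minimal sketch of one simple approach, frame differencing, in which a frame counts as ‘active’ if enough pixels differ from the frame before it. This is only an illustration under stated assumptions: the function names, the thresholds, and the representation of a frame as a grid of grayscale values are all hypothetical, and commercial tools are likely to use far more robust methods.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a grid of grayscale pixel values in the range 0-255.
Frame = Sequence[Sequence[int]]


def changed_fraction(prev: Frame, curr: Frame, pixel_delta: int = 25) -> float:
    """Fraction of pixels whose brightness changed by more than pixel_delta."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def active_segments(frames: List[Frame],
                    min_fraction: float = 0.01) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame-index ranges in which something moved."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(1, len(frames)):
        moving = changed_fraction(frames[i - 1], frames[i]) >= min_fraction
        if moving and start is None:
            start = i                       # activity begins at this frame
        elif not moving and start is not None:
            segments.append((start, i - 1))  # activity ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frames) - 1))
    return segments


def frame_with_dot(col: int, size: int = 4) -> List[List[int]]:
    """Toy frame: a dark platform with one bright pixel standing in for a person."""
    frame = [[0] * size for _ in range(size)]
    frame[1][col] = 200
    return frame


empty = [[0] * 4 for _ in range(4)]
footage = [empty, empty, frame_with_dot(0), frame_with_dot(1),
           frame_with_dot(2), empty, empty]
print(active_segments(footage))  # [(2, 5)]
```

An investigator would then review only the frames in the reported ranges. Note that this toy rule detects change rather than presence, so a person standing perfectly still would go unnoticed.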
Of course, more crowded scenes, especially those as packed as the sidewalks alongside the marathon route, require far more sophisticated technology, much of which is still in its infancy. Today there are many commercial video analytics tools that claim to be able to detect a person leaving a bag or backpack and walking away. Such tools are certainly very useful in narrowing down portions of video footage to be analyzed manually during post-incident investigations. But can they reliably alert us in real time without generating too many false positives? For example, you lay down a briefcase and move behind a pillar to find a quiet place to make a phone call. A video surveillance system might well conclude that you have left the scene and that your bag is a potential threat. Hundreds of such warnings might be generated every minute; who is to monitor them and decide which ones to follow up on?
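The false-positive problem can be seen even in a toy rule. The sketch below assumes a hypothetical tracker that records when a bag’s owner was last seen near it, and raises a flag once the owner has been out of view for longer than a grace period. The class, function, and parameter names are invented for illustration and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class BagObservation:
    """Hypothetical record a tracker might keep for one stationary bag."""
    bag_id: str
    owner_last_seen_nearby: float  # timestamp in seconds


def is_flagged(obs: BagObservation, now: float, grace_seconds: float) -> bool:
    """Flag the bag once its owner has been out of view longer than the grace period."""
    return (now - obs.owner_last_seen_nearby) > grace_seconds


# The owner steps behind a pillar for a 90-second phone call.
briefcase = BagObservation("briefcase", owner_last_seen_nearby=0.0)
for grace in (10.0, 120.0):
    print(grace, is_flagged(briefcase, now=90.0, grace_seconds=grace))
# 10.0 True
# 120.0 False
```

A short grace period flags the innocent caller, while a long one stays quiet but would also delay the warning about a bag that really has been abandoned. No single threshold resolves that tension, which is why the question of who reviews the alerts matters.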
Another technique that has seen significant advances in recent years is tracking moving objects in videos, especially human beings. Further, it is now possible (though only barely) to track the same person across large distances as he moves in and out of the field of view of multiple cameras. So, in principle, a hypothetical ‘big brother’ central server that processes feeds from multiple cameras should be able to track anyone suspected in a ‘left bag’ event and verify whether they rapidly walk away from the scene or not. Of course, bandwidth remains a limitation, which is why many video analytics solutions rely on local ‘event detection’ at the camera level so as to minimize the amount of data transferred across a network. Further, in such situations, different cameras need to be ‘told’ to track a ‘particular’ person seen by another camera, and that too in a bandwidth-efficient manner. So, much work remains to be done for efficient large-scale multi-camera tracking.
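One way to picture a bandwidth-efficient hand-off is for a camera to send a small appearance summary of the person rather than the video itself. The sketch below assumes that summary is a coarse colour histogram and that a second camera compares it with the people it can currently see using histogram intersection. The descriptor, the similarity threshold, and all names are assumptions chosen for illustration; real multi-camera tracking is considerably harder than this.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def colour_histogram(pixels: Sequence[Pixel], bins: int = 2) -> List[float]:
    """Normalised colour histogram with bins**3 entries (8 numbers by default)."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each channel value 0-255 onto a bin index 0..bins-1.
        rb, gb, bb = r * bins // 256, g * bins // 256, b * bins // 256
        hist[(rb * bins + gb) * bins + bb] += 1.0
    n = len(pixels)
    return [h / n for h in hist] if n else hist


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(x, y) for x, y in zip(a, b))


def best_match(target: Sequence[float],
               candidates: Dict[str, List[float]],
               min_similarity: float = 0.7) -> Optional[str]:
    """Return the id of the most similar candidate, or None if none is close enough."""
    best_id, best_score = None, min_similarity
    for candidate_id, hist in candidates.items():
        score = similarity(target, hist)
        if score >= best_score:
            best_id, best_score = candidate_id, score
    return best_id


# Camera A sees a person in a red coat and sends only this 8-number descriptor.
target = colour_histogram([(200, 20, 20)] * 50 + [(30, 30, 30)] * 50)

# Camera B summarises the people currently in its own field of view.
seen_by_b = {
    "person-1": colour_histogram([(20, 20, 200)] * 50 + [(30, 30, 30)] * 50),
    "person-2": colour_histogram([(210, 10, 30)] * 60 + [(20, 40, 40)] * 40),
}
print(best_match(target, seen_by_b))  # person-2
```

Eight numbers cross the network instead of a video stream, but the price is ambiguity: two people in similar clothing would be indistinguishable to such a descriptor.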
But there is more. Many recent terror attacks, especially in India, share a similar modus operandi: the terrorist leaves his dangerous cargo on a bicycle that he parks in a crowded market and walks away, seemingly on an innocent shopping errand. Should our central server raise an alarm? After all, many people genuinely shop while their two-wheeled vehicle, bicycle or motorbike, lies parked nearby, perhaps also loaded with their recent purchases. Do we warn citizens of dire consequences if they leave packets on their bikes?
Clearly our central server needs to work harder and track more people, for longer. Most importantly, it needs to reason. However ubiquitous video cameras might be, they still cannot be everywhere, and certainly not in every store, restaurant, or loo! The central server would need to explain away the actions of most of the people it tracked, and narrow down to only a few, such as someone entering a subway station, leaving a bag and then boarding a train. (Such ‘explaining away’ to home in on the right answer is an example of ‘abductive reasoning’. If it appears difficult for a machine to mimic, take note that just such reasoning has in fact already been used by IBM’s Watson program, which won the Jeopardy! competition in 2011.)
Moreover, how might the video surveillance servers of the future come to know what is normal behavior and what is not? Certainly it would be impossible to catalogue every instance of normality for the machine to ‘look up’ and compare against. Instead, the machine would need to learn, using massive amounts of ‘normal’ video footage. Difficult, but by no means impossible any more. Consider this: each year over 15 million hours of video are uploaded onto YouTube. In contrast, a human being is exposed to barely half a million hours of ‘video experience’ over a lifetime (90 years × 365 days × 16 hours/day). Yet we learn, and rather early on, the difference between normal and abnormal, be it suspicious or merely eccentric. Granted, eccentricity is not entirely absent from YouTube videos; still, there is more than enough ‘normal’ video available today for machines to learn from, if only they knew how.
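The comparison above can be checked with a few lines of arithmetic. The figures are the ones quoted in the text (a 90-year lifetime, 16 waking hours a day, and 15 million hours uploaded per year); the variable names are merely illustrative.

```python
# Figures as quoted in the text above.
YEARS = 90
DAYS_PER_YEAR = 365
WAKING_HOURS_PER_DAY = 16
YOUTUBE_HOURS_PER_YEAR = 15_000_000

lifetime_hours = YEARS * DAYS_PER_YEAR * WAKING_HOURS_PER_DAY
lifetimes_per_year = YOUTUBE_HOURS_PER_YEAR / lifetime_hours

print(lifetime_hours)                # 525600
print(round(lifetimes_per_year, 1))  # 28.5
```

In other words, on these figures a single year of uploads amounts to more than 28 human lifetimes of waking ‘video experience’.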
Intelligent systems such as our hypothetical central video-surveillance server need to go beyond merely looking at the world while watching us. They also need to continuously learn from the data they experience, so as to see and focus on what is actually important. Only then can they connect the dots and make reasonably accurate predictions, so corrective action can be taken in time, and not only after a tragedy has occurred.
Finally, the cycle just described (Look, Listen, and Learn, so as to Connect, Predict, and finally Correct) will be a common feature of the highly connected ‘web-intelligent’ systems of the not-too-distant future, be they for video surveillance, self-driving cars, or even the smart grid.
Gautam Shroff is Vice President & Chief Scientist, Tata Consultancy Services and head of the TCS Innovation Lab in Delhi, India. He occasionally teaches in an adjunct capacity at the IIT Delhi and IIIT Delhi, as well as online via Coursera. He is the author of The Intelligent Web: Search, smart algorithms, and big data.
Image credit: High tech overhead security camera at a government owned building. Photo © trekandshoot via iStockphoto.