Where’s the API that can tell me that this photo contains a puppy and a can of Coke?


Photo by Ahmad van der Breggen on Flickr.

We’ve gotten very good at extracting and disambiguating entities from text data. You can license a commodity system, and there are APIs and even open-source tools that work fairly well.

However, a large percentage of content that people share is not primarily text (a back-of-the-envelope guess says around 18%), and we currently have very little automated insight into that content.

I know this is a very hard problem, but I’m continuously surprised by how few people seem to be working on it. Any ideas?


18 Comments on “Where’s the API that can tell me that this photo contains a puppy and a can of Coke?”

  1. It’s not the same, but it might be an interesting read for you: these Stack Overflow/Mathematica posts were really interesting in terms of quite focused pattern/feature identification – worth a skim.

    http://stackoverflow.com/questions/8479058/how-do-i-find-waldo-with-mathematica
    http://mathematica.stackexchange.com/questions/11819/help-find-a-bright-object-on-mars

  2. Jason Bell says:

    Apart from crowdsourcing the info manually, the next step would be to mine EXIF data from the image itself. Fine if a professional photographer took it, marked it, and populated the caption data. So something like Apache Tika would fit the bill if the metadata was stored somewhere.

    After that, well, it’s a question I’ve asked myself for a long time. The APIs for Flickr and Instagram would be the ones to watch. But when it comes to tagged information, it’s down to the basic “garbage in, garbage out”: it’s only as good as the person who put it there.

  3. Anonymous says:

    It’s been mentioned down here in the comments already, but this seems to me like the type of problem that’s better suited to humans than it is to algorithms. I’m sure one could write some sort of recognition algorithm tuned to recognize Coke cans (or puppies), e.g.: http://www.slate.com/blogs/future_tense/2012/06/27/google_computers_learn_to_identify_cats_on_youtube_in_artificial_intelligence_study.html

    But personally, writing individual algorithms designed to identify every single possible thing in a photo that you’d want to identify seems like a lot of work and tweaking when you could just, well, ask someone.
    http://voxilate.blogspot.com/2009/10/batching-mechanical-turk-jobs-at.html

    • Hilary Mason says:

      Unfortunately, we’re at a scale where even Mechanical Turk is unaffordable. I would love to find a hybrid system that uses human tagging when algorithms fail and then uses that data to retrain new classifiers, but I haven’t found it yet!
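
      In pseudocode terms, the hybrid loop described above might look something like the sketch below. This is a hypothetical illustration, not any existing system: the classifier, the confidence cutoff, and the `human_tag` call (standing in for a crowdsourcing request, e.g. a Mechanical Turk HIT) are all assumed.

      ```python
      # A minimal sketch of a hybrid labeling loop: trust the classifier when
      # it is confident, fall back to human taggers otherwise, and collect the
      # human labels so a new classifier can be trained on them later.

      CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for "the algorithm failed"

      def classify(image, model):
          """Toy stand-in for a real image classifier: (label, confidence)."""
          return model.get(image, ("unknown", 0.0))

      def human_tag(image):
          """Stand-in for a crowdsourcing call (e.g. a Mechanical Turk HIT)."""
          return f"human-label-for-{image}"

      def hybrid_label(images, model, training_set):
          labels = {}
          for img in images:
              label, conf = classify(img, model)
              if conf < CONFIDENCE_THRESHOLD:
                  label = human_tag(img)             # algorithm failed: ask a person
                  training_set.append((img, label))  # keep it to retrain later
              labels[img] = label
          return labels

      # Toy "model": per-image predictions with confidences.
      model = {"photo1.jpg": ("puppy", 0.95), "photo2.jpg": ("coke can", 0.4)}
      training_set = []
      result = hybrid_label(["photo1.jpg", "photo2.jpg"], model, training_set)
      # photo1 is confidently labeled by the model; photo2 falls back to a
      # human tag, which also lands in training_set for the next retrain.
      ```

      The key design point is that the expensive human calls happen only below the confidence threshold, and every human answer becomes training data, so the threshold should trip less often over time.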

      • Emery Berger says:

        Hi Hilary,

        In principle, this sounds like a great fit for AutoMan (quite refined now, just appeared at OOPSLA 2012) – automan-lang.org. My student is busy on something else right now, but wrapping it with a REST API is on his list of things to do. The beauty of AutoMan is that it automatically handles quality control, so you get high-confidence ground truth, which you could then use to train up classifiers.

        — emery

  4. John says:

    I point Google Goggles at the picture, and I get “Yorkie Dogs” and “Coca-Cola”.  Now just convince Google to make that available as an API. :)

  5. […] In addition to building great open-source and enterprise software, Continuum also consults with data science companies and organizations. One company we recently contracted with is Stipple. Many of those that attended PyData last month met Stipple’s Chief Science Officer, Davin Potts. In Davin’s talks, he described in detail the enormously difficult challenge Stipple is trying to solve. Simply put, Stipple wants to index every image on the web, extract features from those images, and identify them. A feature could be anything from a pair of pants or a shirt sold by a large clothing company to the latest electronic device or a picture of a recent cooking creation tagged by a user. Using Stipple, you could even potentially disambiguate a can of Coke from a puppy. […]

  6. George Purkins says:

    Curalate does this for brands: http://www.curalate.com/

  7. Matt Mcknight says:

    We’ve done some work with piXlogic in this arena. http://www.pixlogic.com/

  8. Dave says:

    Mobileworks, Crowdflower, Mechanical Turk, and Samasource all have APIs that can handle image tagging work. 

  9. Matthew Krieger says:

    Mechanical Turk is that API for now.

  10. Doug says:

    In a perfect world, are you looking for an answer to, “Does this picture have a dog in it?” or “What’s in this picture?” Assuming the latter, are you looking for some sort of importance measurement too? e.g. “What are the 3 most prominent features of this picture?” Or the whole kit and caboodle: “What are the 1000 words that this picture is worth?”

    In retrospect, probably a silly question considering the “data scientist” facet, but I’ve already typed this, and am now in mid-mind-churn.

  11. Sergey Vershinin says:

    Google’s search by image gives a best guess of “coke can puppy”.  Don’t know if the results can be scraped, but I don’t see why not.

  12. Siderite Zackwehdex says:

    There are all sorts of possible problems, and I don’t mean technical. Remember Facebook with their idea to tag pictures with the names of the people in them based on face recognition, and the privacy backlash that ensued.
    I doubt people would look for a dog and Cola more than for naked chicks (you know, the just-hatched and featherless kind).

  13. Jason Bliss says:

    Amazon’s Mechanical Turk API.