Lucene Revolution Keynote: Search is Not a Solved Problem

The wonderful folks at LucidWorks have posted the video of my recent Lucene Revolution keynote.

The brief idea behind this talk is that search is not a solved problem: there is still a big opportunity to build search (and finding?) capabilities for the kinds of questions that current products fail to answer. For example, why do search engines just return a sorted list of URLs, but give me no information about the themes that are consistent across them?

The audience was technical, specifically Lucene and Solr devs, so I spent some time talking about how we use those technologies at bitly.


Et tu, Google?

In 2008, cuil, a search engine startup, displayed my bio alongside a photo of deceased actress Hilary Mason. In January 2013, Bing confused us, this time putting my photo next to her bio (they fixed it after a suitable amount of mocking on Twitter).

Today, Google did the same thing. (live search link)

Today I win the internet?

Screen Shot 2013-04-14 at 4.59.24 PM

If you zoom in on the bio section, you can clearly see that it’s her bio with a photo of me (originally from Crain’s New York 40 under Forty). Further, if you go into her filmography, you continue to see my photo.

I’m most proud of my starring role in the amazing film Robot Jox. (bottom right of the image below)

robot_jox

I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!

Note: It’s also been pointed out to me that there’s a slim possibility that Google’s confusion stems from my own post about Bing’s error, in which case, this post will certainly make the confusion worse. To that I say — bring it on, technofuture irony!

 


Conference: Search and Social Media 2010

I recently attended the Third Annual Workshop on Search and Social Media, an academic workshop with very strong industry participation. The workshop was packed, and had some of the most informative and interesting panel discussions I’ve seen (not counting the one I spoke on!).

Daniel Tunkelang did a great job of writing up the specific presentations on his site and on the ACM blog, so I won’t attempt to re-create the presentations line by line at this late date. Rather, I’d like to highlight a few open problems and research questions that came out of the discussions that I hope to see developed in the next year.

Social search consists of a set of problems including (but hardly limited to) search of social content like status updates, real-time search, generating, labeling, and finding user-generated content, ‘long-tail’ events and interests, finding vs re-finding, and trend identification.

What data is available to social search? There are many kinds of social data, from e-mail (private) to blogs (public) and tweets (mostly public) — what is and should be searchable? How do we handle issues of privacy and identity management?

How do we compute relevance, taking into account freshness, accuracy, and degrees of social separation?

Will the architecture of these search engines look like the search engines we’re currently familiar with?

How do we evaluate accuracy and truthiness of social data?

How do we characterize social connections, around concepts like strong vs weak ties, and friend-of-a-friend vs friend-of-a-friend’s-friend? Can we converge on a single social graph representation?

How do we best filter social data to lead to accurate recommendations for content discovery? How do we accommodate the fact that as we move beyond static factual data, two people using the same query may be looking for very different results?

Finally, how do we deal with the chasm between the industry participants (who have LOTS of data) and the academic participants, who suffer from a lack of public (and publishable) data?

Thanks again to the organizers – Eugene Agichtein, Marti Hearst, Ian Soboroff, and Daniel Tunkelang – who put together a fantastic event.

For more on this and a cool demo, check out Gene Golovchinsky’s look at the SSM2010 Twitter coverage.


Tip: How to Search Google for Ideas

Will someone please invent a way to search for ideas?

Short on inspiration? Harvest ideas from the web by searching Google for “someone please invent” and see what people are wishing for. Putting quotes around “someone please invent” ensures that Google searches for that exact phrase only.

You can further refine the query by adding related terms at the end of the query. For example, try searching for “someone please invent” game* to see game-related results.
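If you want to generate these searches for many topics at once, the query can be assembled in a few lines of Python. This is only a minimal sketch: the helper names idea_query and google_search_url are made up for illustration, and it assumes the usual https://www.google.com/search?q=... URL form, with the standard library doing the URL encoding.

```python
from urllib.parse import quote_plus


def idea_query(phrase, *extra_terms):
    """Wrap the phrase in double quotes (exact match) and append any extra terms.

    Hypothetical helper, named here only for illustration.
    """
    return " ".join([f'"{phrase}"', *extra_terms])


def google_search_url(query):
    """URL-encode the query and attach it to the assumed Google search URL."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = idea_query("someone please invent", "game*")
url = google_search_url(query)
print(query)
print(url)
```

Swapping "game*" for another term (or passing several terms) gives a different refinement of the same exact-phrase search.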


Teaching Search Techniques with Google Games

Educators routinely discuss how students have trouble evaluating and using the results of their Google searches. There are really two parts to this problem, though, and while it’s true that students may struggle to identify reliable sources, before we can address that, we need to teach them how to write good queries.

It’s that old computer science maxim: Garbage In, Garbage Out.

I like to teach students how to write interesting queries by playing games. These games force students to think about the queries they are writing, and not the results. I have no scientific proof of the results, but I do know that it keeps them entertained and thinking for a while!

My favorite games:

  • Google Whack – The classic! Find a two-word query, with no punctuation, that returns one and only one result. The Google Whacks on the site would make great spam subject lines.
  • Google Image Labeler – A game that Google created. You are matched up with a random partner and, together, presented with images. You guess labels for the image, and when you and your partner match, you get points and move on to the next image. Each round lasts two minutes. In addition to providing a tool for procrastination, this is one way for Google to automatically provide appropriate text labels to images. It gets the group thinking about how the search engines work!
  • What’s more popular? With Google Trends! – Create small teams. They each get to pick a term, and compare the popularity using Google Trends. Teams pick words at the same time, and the team with the best two of three wins. You can optionally restrict the domain; for example, all guesses must be vegetables.
  • Finally, Googlenope – Proposed by Gene Weingarten in today’s Washington Post, a Googlenope is a search term or phrase that does not exist on the web. Until you find it and write about it, that is!

If anyone has a favorite that I’ve missed, please comment!


The Best Time to Search for Academic Jobs

It’s common knowledge that academic job announcements are seasonal. In general, hiring committees are formed in the fall; they announce positions, wait a month or two for applications, then spend weeks interviewing candidates before making a decision in March or April for positions that will begin the following September. I found some data that supports this, and that may help guide those engaged in an academic job search.

I was playing with Indeed.com’s job trends feature when I realized that you could search not only for particular skills and specializations, but for job categories. A search for “professor” reveals some nice peaks right around mid-October.

professor trends

A search for “postdoc”, by contrast, isn’t quite so periodic.

postdoc trends

I can only guess that this is attributable to trends in funding for postdoctoral positions. It would be interesting to see if there is a correlation with NSF funding for science research.

We know it’s good to look for professorships in October, but does this hold true across all fields? Searching by a generic field name (“physics”, for example) doesn’t do much good, as it picks up all of the job posts that are looking for majors in that area. But we can look for specializations. For example, in the graph for “superconductivity”, we see:

superconductivity trends

This shows that the majority of positions are announced between October and January, and that the summer is the worst time to find one.

I tried to think up specializations in the humanities, to see if the pattern held. A search for “egyptology” gives us this graph:

egyptology trends

I’ll allow you to draw your own conclusions.