Yesterday I asked people on Twitter for recommendations for things to read to improve as a programmer. I’m looking mainly for things on the philosophy side of software engineering. I do realize that practice is the most important thing, but sometimes you run into a design question and it’s always helpful to realize that very smart people have, indeed, thought about these things before.
If you see something that you think should be included, please do let me know in the comments and I’ll add it to the list.
The talks are a wide perspective on the interesting work happening around data in New York, and I believe you’ll enjoy all of them!
The New York Times has a story this morning on the growing use of mugshot data for, essentially, extortion. These sites scrape mugshots off of public records databases, use SEO techniques to rank highly in Google searches for people’s names, and then charge those featured in the image to have the pages removed. Many of the people featured were never even convicted of a crime.
What the mugshot story demonstrates but never says explicitly is that data is no longer just private or public, but often exists in an in-between state, where the public-ness of the data is a function of how much work is required to find it.
Let’s say you’re actually doing a background check on someone you are going on a date with (one of the use cases the operators of these sites claim is common). Before online systems, you could physically go to the various records offices, sometimes in each town, to request information about them. Given that there are ~20,000 municipalities in the United States, just doing a check would take the unreasonable investment of days.
Before mugshot sites, you had to actually visit each state’s database, figure out how to query it, and assemble the results. Now we’re looking at an investment of hours, instead of days. It’s possible, but you must be highly motivated.
Now you just search, and this information is there. It is just as public as it was before, but the cost to access has become a matter of seconds, not hours or days, and we could imagine that you might be googling your date to find something else about him and instead stumble on the mugshot image. The cost for accessing the data is so trivial that it can come up as part of an adjacent task.
The debate around fixing this problem has focused on whether the data should be removed from the public entirely. I’d like to see this conversation reframed around how we maintain the friction and cost to access technically public data such that it is no longer economically feasible to run these sorts of aggregated extortion sites while still maintaining the ability of journalists and concerned citizens to explore the records as necessary for their work.
Everyone does realize that it's not about teaching people to CODE as much as it is about teaching people to THINK … right?
— Hilary Mason (@hmason) September 17, 2013
I’m a huge fan of the movement to teach people, especially kids, to code.
When you learn to code, you’re learning to think precisely and analytically about a quirky world. It doesn’t really matter which particular technology you learn, as long as you are learning to solve the underlying logical problems. If a student becomes a professional engineer, their programming ability will rise above the details of the language, anyway. And if they don’t, they will have learned to reason logically, a skill that’s invaluable no matter what they end up doing.
That you can apparently complete a three month Ruby bootcamp and get a job today is an artifact of a bizarre employment market, and likely unsustainable. But by dedicating three months to learning to think in a logical framework, you’ll also gain an ability that will open opportunities for you for the rest of your life.
Registration is open for DataGotham 2013, our second annual New York data community conference, September 12th and 13th. The core of the conference is a series of brilliant data practitioners telling the stories about what they work on. The content is technically-oriented but not all deeply technical, and we really welcome anyone curious about how New York companies and institutions are pushing the boundaries on data to attend.
We have two goals for the conference. The primary goal is to connect people in the greater New York data community who are working on interesting things. If our community is strong and supportive, we will all do better work.
Our second goal is to highlight the amazing work happening here, so that people near and far will realize that New York is the best place in the world to do data science.
Come join us to hear these stories firsthand and meet fellow data-minded practitioners! Register now:
(Readers of this blog can use discount code “IheartNYC” for 10% off, and I hope to see you there!)
In 2008, cuil, a search engine startup, displayed my bio alongside a photo of deceased actress Hilary Mason. In January 2013, Bing confused us, this time putting my photo next to her bio (they fixed it after a suitable amount of mocking on Twitter).
Today I win the internet?
If you zoom in on the bio section, you can clearly see that it’s her bio with a photo of me (originally from Crain’s New York 40 under Forty). Further, if you go into her filmography, you continue to see my photo.
I’m most proud of my starring role in the amazing film Robot Jox. (bottom right of the image below)
I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!
Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system.
It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.
Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system (and perhaps even requiring an active experimental process where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when they ignore this research requirement.
Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on GitHub).
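To make the idea of a distribution that forgets a bit more concrete, here is a minimal sketch in Python. To be clear, this is not how Forget Table itself is implemented; it simply applies a plain exponential decay to in-memory counts. The class name `DecayingCounts`, the `rate` parameter, and the assumption that timestamps only move forward are all mine, for illustration.

```python
import math


class DecayingCounts:
    """Hypothetical helper: category counts that fade exponentially over time."""

    def __init__(self, rate):
        # Decay rate per unit of time; a higher rate forgets old data faster.
        self.rate = rate
        self.counts = {}
        self.last_update = 0.0

    def _decay_to(self, now):
        # Shrink every count by exp(-rate * elapsed). Assumes now >= last_update.
        factor = math.exp(-self.rate * (now - self.last_update))
        for key in self.counts:
            self.counts[key] *= factor
        self.last_update = now

    def observe(self, category, now):
        # Age the existing counts first, then record the new observation.
        self._decay_to(now)
        self.counts[category] = self.counts.get(category, 0.0) + 1.0

    def distribution(self, now):
        # Normalize the decayed counts into probabilities.
        self._decay_to(now)
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {key: value / total for key, value in self.counts.items()}
```

With `rate = math.log(2)`, a count halves every time unit, so an observation that is one unit old carries half the weight of a fresh one. Tuning `rate` is the knob that corresponds to how quickly your particular data drifts.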
Google Now is an extension to Google’s Android search app that uses all of the data that Google has about you along with what it can guess about your current context to present the information it thinks you need when it thinks you need it.
It’ll tell you to leave a bit early to make your next calendar event because of heavy traffic, or that it’s a friend’s birthday, or that there’s a cool cafe nearby where you are.
I think it’s amazing.
It’s amazing because this is the first Google product that takes ALL OF THE DATA that they have about us and actually makes it useful for us. Not for advertisers.
I’m gathering a bundle of data science blogs to share. I’m looking to include blogs that update regularly and aren’t either personal opinion and project blogs (like this one) or primarily about marketing any particular company. Let me know if you have a favorite that I’ve forgotten.
If you’re just looking for one place to start, hop on over to Simply Statistics.
I ended up at NYC Resistor on Sunday, and decided to experiment with physical visualization of some data. I grabbed the clicks per second on keyphrases including my name (“hilary mason”) over the last six months, aggregated them by day, and made this graph:
This is easy enough to construct for any phrase using the clickrate data that we’re calculating at bitly. I exported it from matplotlib as SVG, added a label, and used the laser-cutter to create this out of plywood:
…which will shortly be adorning my desk at work. This is very simple, but there’s a lot of fun to be had with the physical manifestation of patterns we see in large amounts of ephemeral data.
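If you want to try something similar, the aggregate-by-day step might look roughly like the sketch below. The input format is an assumption on my part (pairs of Unix timestamp and click count, bucketed in UTC), and `daily_totals` is a made-up name, not part of any bitly tooling.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_totals(samples):
    """Sum click counts per calendar day (UTC).

    samples: iterable of (unix_timestamp_seconds, clicks) pairs (assumed format).
    Returns a dict mapping date -> total clicks, ordered by date.
    """
    totals = defaultdict(int)
    for timestamp, clicks in samples:
        # Convert each timestamp to its UTC calendar day and add to that bucket.
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        totals[day] += clicks
    return dict(sorted(totals.items()))
```

From there, plotting the dates against the totals in matplotlib and calling `savefig` with a filename ending in `.svg` gives you a vector file that a laser cutter's software can usually work with.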