Part 2: The Lucky Break Scoreboard
Last week, Infochimps CTO Flip Kromer introduced his truth on the failures that led to the successful acquisition by CSC in his blog post, Part 1: The Truth – We Failed, We Made Mistakes. Flip continues his blog series with Part 2, his love letter – the real Infochimps story.
7 years ago, having switched majors from Computer Science in college to Physics in grad school, and failing twice to successfully execute a plan of research in Physics, I decided to switch to Education – my favorite part of grad school was teaching. A year before, my ever-patient advisor, physics professor Mike Marder, had started a wildly successful alternative program for a public-school teaching certification. It replaced a full general education curriculum with frequent in-classroom experience and focused education classes – and it let me reuse the scientific coursework I already had way too much of.
A year later, I was near the end of the program and preparing my teaching portfolio, which led me to spend a lot of time thinking about what I wanted my students to learn, and why. For many of them, my course would be their last formal chance to acquire the skill of quantitatively understanding their universe. As I started to write (less bluntly), I had no interest in burdening them with three different forms of the quadratic equation, or pretending that as a practicing physicist I’d ever used the formula for the perimeter of a trapezoid.
What they should be learning was the ability to make use of a complex information stream, understand sophisticated information displays, and extract straightforward insight using tools such as … … ‽‽
I paused, struck, mid-sentence. Those tools do not exist. Not for a high school student, not for a domain expert in another field, and only after years of study, for me. That’s what I was supposed to be working on: democratizing the ability to see, explore and organize rich information streams.
So as a lapsed computer scientist and failed physicist, I decided to abandon education as well and start yet a different new thing, one that was none of those and all of those together.
I asked Mike Marder if I could come back to his research group and work on tools to visualize data; we could figure out along the way how to tie it into a research plan. I had some savings (thanks largely to my Grandmother, who was just your typical successful 1940’s woman entrepreneur), so I wouldn’t cost him any money. Mike reasoned that although I didn’t know how to solve my own problems, I was frequently useful in helping others solve theirs — and who knows, I seemed really fired up about this new idea whatever it was. So all in all it was an easy decision to hide me away in a shared office and let me get to work.
Building the visualization tool required demonstration data sets to prove the concept, and there are few better than the ocean of numbers around Major League Baseball.
In addition to the retrosheet project — the history of every major-league baseball game back to the 1890s — MLB.com was publishing one of the most remarkable data sets I knew of. For the past seven years, it gives every single game, every single at-bat, every single play, down to the actual trajectory of every single pitch. I first started playing with the retrosheet data, and found some scattered errors — things like a game-time wind speed of 60mph.
(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support.)
Weekend Project Gone Awry
Well, the NOAA has weather data. Lots of weather data. The hour-by-hour global weather going back 50 years and more, hundreds of atmospheric measurements for every country in the world, free for the asking. And the Keyhole (now Google Earth) community published map files giving the geolocation of every current and historical baseball stadium.
So if you’re following, we have:
- A full characterization of every game event
- … including the time of the game and the stadium it was played in,
- … and so using the stadium map files, the event’s latitude and longitude
- … and using that lat/long, all the nearby weather stations
- … and using the game date and time, the atmospheric conditions governing that event
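The chain of lookups above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code actually used: the stadium, station, and observation records are made up, the names `STADIUMS`, `STATIONS`, `OBSERVATIONS`, `nearest_station`, and `conditions_for_event` are hypothetical, and real weather files would need far more cleaning than this.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

# All records below are invented for illustration; they are not real MLB or NOAA data.
STADIUMS = {"PARK_A": (30.0, -97.0)}                      # stadium id -> (lat, lon)
STATIONS = {"STN_NEAR": (30.1, -97.1), "STN_FAR": (35.0, -90.0)}
OBSERVATIONS = {                                          # (station, hour) -> readings
    ("STN_NEAR", datetime(2007, 6, 1, 19)): {"wind_mph": 8.0, "temp_f": 88.0},
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_station(point):
    """Name of the weather station closest to a (lat, lon) point."""
    return min(STATIONS, key=lambda name: haversine_km(point, STATIONS[name]))


def conditions_for_event(stadium_id, event_time):
    """Join one game event to the hourly reading at the nearest station."""
    location = STADIUMS[stadium_id]                       # stadium -> lat/long
    station = nearest_station(location)                   # lat/long -> station
    hour = event_time.replace(minute=0, second=0, microsecond=0)
    return station, OBSERVATIONS.get((station, hour))     # None if no reading


print(conditions_for_event("PARK_A", datetime(2007, 6, 1, 19, 42)))
```

The join itself is short. In this sketch the effort would go into building the three lookup tables so that they agree on identifiers, time zones, and units.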
I connected the data sets looking to correct and fill in the weather data, and found out I accidentally wired up a wind tunnel. There’s no laboratory with the budget to have every major league pitcher throw thousands of pitches for later research purposes — none, except the data set I described.
What’s screwy (and here’s where every practicing data scientist groans and shakes their head) is that the hard part wasn’t performing the analysis. The hard parts were a) making that data useful, and b) connecting the data sets, making them use the same concepts and measurement scales.
But all that work — the mundane, generic work anybody would have to do — just sat there on my hard disk. If I created a useful program, or improved an existing public project, I knew right where to go: open-source collaboration hubs like sourceforge or github. But no such thing existed for data. I had to spend weeks transforming the MLB game data into a form that you could load into a database. If we could avoid that repetition of labor, we would solve the problem of every practicing data scientist.
On Christmas Day 2007, I bought a book on how to build websites using the “Ruby on Rails” framework, and figured I’d knock something useful out in, y’know, a week or so. By sometime that Spring, I had something useful: a few interesting data sets and a website to generically host and describe any further data sets. The initial version of the site was read-only, because I didn’t know how to do join models or form inputs in Ruby on Rails, but I could add new data sets directly to the database. And just like that, Infochimps was born.
One of the individuals who emailed to encourage us was Jeff Hammerbacher, founder of the data team at Facebook. Chatting on the phone with him, he told me about a new data analysis tool that Facebook was using, called Hadoop. I looked into it, but couldn’t see how I would ever need to use it. Still, it was really exciting that big names in data were taking interest.
On a trip to San Francisco a few weeks later, I went to a meetup at Freebase. @skud, their community manager, recognized that Infochimps was the perfect raw-data complement to Freebase. She asked me to come back the next month and give a meetup talk. Kurt Bollacker, head of their data team (and future teammate and profoundly valuable mentor), asked me to come back the next day and give an internal lunch lecture. I stayed up all night using google docs on my uncle’s powerpoint-less computer, and gave some hot mess of a presentation to their internal group. Kirrily didn’t uninvite me, so it wasn’t too bad.
It was clear that the lack of a collaboration hub was a problem many people were feeling.
So as a lapsed computer scientist, failed physicist, and no-show educator, I decided to abandon working on a visualization tool and make a collaboration hub instead. Yup.
(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it.)
One of the new faces on Mike’s research team when I returned was Dhruv Bansal, who was working on a fascinating problem bridging Mike’s two interests: physics and education. They used a freedom-of-information request to acquire a fascinating data set: the anonymized test scores for every student, on every question, for the yearly exam taken by every schoolchild in Texas.
They used the physics equations for fluid flow to model the year-on-year change in student test scores, highlighting patterns that demanded immediate action within the education community.
As you can guess again, the costliest part of that project was not performing the analytics; or applying the Fokker-Planck equation for fluid-flow; or working the paper through peer review. No, the costliest part of the project was the 3-month process of acquiring the data and cleaning it for use. For the random researcher who discovered and requested the data, Dhruv would spend a few hours burning the data to a DVD and physically mail a copy. For reasons I still don’t understand, while researchers in Sociology, Psychology, other “soft” sciences immediately latched on to the usefulness of Infochimps from the very start, Physicists and Computer Scientists almost never understood what we were doing or why it might be valuable. Dhruv and Mike’s split focus meant they got it immediately.
This is probably the most unlikely lucky break, and most crucial development, of this adventure: sitting a few offices away from where I worked was one of the most talented programmers I’ve ever worked with, possessed with a mountainous drive to change the world, the laconic cool to keep me level, and a furious anger at the same exact problem I was working to solve.
(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv.)
At around this time Twitter was blowing up in popularity, though still a tool largely used by nerds to tell each other about what they had for lunch. We couldn’t explain, any more than most, the appeal of Twitter as a social service.
But to two physicists with a background in the theory of random network graphs, Twitter as a data set was more than a social service; it was a scientific breakthrough. It implemented a revolutionary new measurement device, giving us an unprecedented ability to quantify relationships among people and conversations within communities. Just as the microscope changed biology, and the X-ray transformed medicine, we knew that seeing into a new realm placed us on the cusp of a new understanding of the human condition. Making this data available for analysis and collaboration was the best way to provide value and draw attention to the Infochimps site. We emailed Alex Payne, engineering lead at Twitter, for permission to pull in that data and share it with others. He gave me a ready thumbs-up: better that scientists download the data from us, than that they pound it out of his servers.
We wrote a program to ‘crawl’ the user graph: download a user, list their followers, download those users, list their followers, repeat. That was the easy part. Sure, each hundred followers had hundreds of followers themselves, but we could make thousands of requests per hour, millions of requests per week.
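The crawl loop described above is a breadth-first traversal of the follower graph. Here is a minimal Python sketch of that idea, not the original crawler: `crawl_user_graph`, `fetch_followers`, `max_users`, and the `FAKE_GRAPH` data are all hypothetical, and a small in-memory dictionary stands in for the real Twitter API so that no real endpoint or rate limit is assumed.

```python
from collections import deque


def crawl_user_graph(seed, fetch_followers, max_users=1000):
    """Breadth-first crawl: download a user, list their followers, repeat.

    fetch_followers is a caller-supplied function (hypothetical here) that
    returns the follower names for one user.
    """
    seen = {seed}             # users already queued, so nobody is fetched twice
    queue = deque([seed])     # users whose follower lists are still to be fetched
    edges = []                # (follower, followed) pairs discovered so far
    while queue:
        user = queue.popleft()
        for follower in fetch_followers(user):
            edges.append((follower, user))
            if follower not in seen and len(seen) < max_users:
                seen.add(follower)
                queue.append(follower)
    return seen, edges


# Made-up follower lists standing in for API responses.
FAKE_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "alice"],
    "carol": ["dave"],
    "dave": [],
}

users, edges = crawl_user_graph("alice", lambda u: FAKE_GRAPH.get(u, []))
print(sorted(users), len(edges))
```

The `seen` set is what keeps the crawl from looping forever on mutual follows, and the `max_users` cap is an assumed safety valve for a graph that fans out this quickly.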
The hard part came over the next few weeks as we realized that none of our tools were remotely capable of managing, let alone analyzing, the scale of data we so easily pulled in. As quickly as we could learn MySQL, the data set outgrew it. Sure, Dhruv and I could request supercomputer time for research, but supercomputers weren’t actually a good match: they’d be more like a rocketship when what we needed was a fleet of dump trucks. We realized what we needed was Hadoop, the tool Jeff Hammerbacher had mentioned to me a few months earlier.
But where could we set up Hadoop? The physics department’s computers were scattered all over and largely locked down. But I also had an account on the UT Math department’s computers. Their sysadmin, Patrick Goetz, was singularly passionate about enabling researchers with the tools they needed to make breakthroughs. He took the much more courageous (and time-consuming for him) route of allowing expert users to install new software across departmental machines.
What’s more, the Math department had just installed a 70-machine educational lab. During the day, it was filled with frustrated freshmen fighting Matlab and math majors making their integrals converge. From evening to 6am, however, the machines were just sitting there… running… inviting someone to put them to good use.
So that’s what we did; put them to good use. We set up Hadoop on each of the machines, modifying their configuration for the comparatively wussy undergrad-lab hardware, and set about using this samizdat supercluster on the Twitter user graph.
(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv; the explosion of social media data; the invention of Hadoop.)
All through 2006-2009, people walking different paths — social media, bioinformatics, web log analysis, graphic design, physics, open government, computational linguistics — were arriving in this wide-open space, forming communities around open data and Big Data.
On Twitter, we were finally seeing what all the people in our favorite data set knew: a novel communication medium that enabled frictionless exchange of ideas and visible community. I’ll call out people like @medriscoll (CEO of Metamarkets), @peteskomoroch (Principal Data Scientist at LinkedIn), @mndoci (Product Manager of Amazon EC2), @hackingdata (Founder of Cloudera, now professor at Mt Sinai School of Medicine), @dpatil (everything), @neilkod and @datajunkie (Facebook data team), @wattsteve (Head of Big Data at Red Hat), among dozens more. It didn’t matter if someone was a random academic, a bored database engineer, a consultant escaping one field into this new one, or a big name building the core technology. When you saw a person you respected talking to a person with a good idea, you hit “follow”, and you learned. And when you heard that someone in the Big Data space wasn’t on Twitter, you harangued them until they joined. (Hi, Tom!)
Meanwhile, Aaron Swartz had started the Get.theinfo Google Group. This most minor of his contributions had a larger impact than most know, and was typical of why he’s so missed. He recognized a problem (no conversation space for open-data enthusiasts), built just enough infrastructure to solve it (a google group and a website), then galvanized the community to take over (gifting enthusiastic members with the white elephant of moderator permissions), and offered guidance to make it grow.
The relationships we built and communities we joined became critical catalysts for our growth.
We spent the next several months building out the site during the day and running analysis on the growing hundreds of gigabytes by night (does that seem quaintly small now?). Right before Christmas break, we did a set of runs producing data suitable for people in the community to find useful. Hours before hopping on the plane to visit my family, I finished compressing and uploading them, wrote up a minimal readme file, and posted a note to the Get.Theinfo mailing list. I knew the folks there wouldn’t mind the rough cut version, so I figured I’d mention it quietly there, but wait to do a proper release after break — after all, there was no internet where I’d be staying.
Well, two predictable things happened: 1) a huge response, far more than expected, flowing up the chain to large tech blogs and twitter-ers, and 2) a polite but forceful email from Ev Williams (Twitter’s CEO) asking us to take the data files down while they figured out a data terms-of-service. We reluctantly removed the data.
Sure, the experience was a partial success. It brought great publicity, and of course you probably caught the foreshadowing of how important Hadoop was about to become for us. But we failed at the important goal, sharing this immensely valuable data we invested months to release.
Minister of Simplicity
Now to introduce Joe Kelly into the story. Our research center decided to hire someone to build our new website, and one of the respondents to our Craigslist ad was Joe, a former UT business school student who had been working with his roommate to get their general contracting firm off the ground. He didn’t really know how to design websites, but he absolutely loved reading about the science our center was doing, so he applied.
His interview was amazing. He had the design sense of a paper bag compared to the other candidates, but every one of us left the room saying, “wow, that guy was awesome, the kind of person you just want to work with on a project”. Only Dhruv was smart enough to take the face-slappingly obvious next step: replying 1-to-1 to a later email from Joe to say, “well, hey, we also have this other project going on; we don’t really need your help on the website, but there’s a lot of work to do”. Within days, Joe had set up a bank account and PO box, organized the papers to make us an official partnership, and generally turned this ramshackle project into an infant company. It was an easy decision for Dhruv and me to make him a co-founder.
An easy decision until a few days later, when I read some cautionary article about how the #1 mistake companies make is choosing co-founders hastily. Well, hell. We just made this guy we randomly met a couple weeks ago a co-founder, handing him a huge chunk of the company. I didn’t know if we just made a huge mistake or not.
So the next day, we were hanging out at the Posse East bar (our “office” for the first several months of the company), and Joe introduced us to the idea of an Elevator Pitch. “If we’re going to be at the South by Southwest (SXSW) Conference, we need to be able to explain Infochimps”. I replied with some kind of rambling high-concept noodle. Dhruv rang in with his version — more scientific, more charm and cool, but no more useful than mine.
Joe replied, “No. What Infochimps is, is this: ‘A website to find or share any data set in the world’”.
I rocked back in my chair and knew Dhruv and I had made one of the best decisions of our lives. His version said everything essential, and nothing more. In one week, he understood what we were doing better than we did after a year. Joe’s role emerged as our “Minister of Simplicity”. He removed all complications, handled all necessary details, and smoothed all lines of communications, making it possible for our team to Just Hack. Everything essential, and nothing more.
With the decision to move forward as a company, not an academic project, we applied to the starting class of Capital Factory (Austin’s startup accelerator). It was an amazing experience, and we went hard at it: we hit all the meetings, spent hours working on our pitch, tried to make contact with every mentor, and made an epic application video. (One of Dhruv’s housemates was a professional filmmaker. Friends in high places.)
We got great feedback and obvious interest from the mentors, and were chosen as finalists. We were confident that we had the right combination of team and big idea to merit acceptance.
They rejected us.
After the acquisition, Bryan Menell, one of the Capital Factory founders, posted a graciously bold blog post explaining what happened. As we later heard from several mentors, they each individually loved our company. Once in the same room though, they found that none of them loved the same company. This mentor loved Infochimps, a company that would monetize social media data. This other one loved Infochimps, a set of brilliant scientists who could help businesses understand their data. Some of them just knew we worked our asses off and were incredibly passionate about whatever the hell it was we were doing but couldn’t explain. A few of the mentors loved Infochimps because we were building something so cool and potentially huge that surely some business value would later emerge. Whichever idea a mentor did like, they generally didn’t like the others.
I can’t overstate how difficult it was to explain what we were doing back then. After two years, we can now crisply state what we had in mind: “A platform connecting every public and commercially available database in the world. We will capture value by bringing existing commercial data to new markets, and creating new data sets from their connections.” It’s easy(er) now, partly because of the time we spent to crystallize an explanation of the idea. Even more so, people now have had years of direct experience and background buzz preparing them to hear the idea. For example, the concept that “sports data” or “twitter data” might have commercial value was barely defensible then, but is increasingly obvious now.
Above all that though, the Capital Factory mentors were right: we were all those ideas, and all of those ideas were (as we’d find out) mostly terrible. And working on the combination of all of them was a beyond-terrible idea. On that point, Capital Factory was right to reject us.
We worked hard, had the perfect opportunity, and failed.
For good reasons and bad, we failed to get in. Or, well, we mostly failed to get in. Some of the mentors liked what they heard enough to stay in touch: meeting for beers and advice, making introductions, and being generous with their time and contacts in many other ways. The Austin startup scene was about to explode, led by Joshua Baer, Jason Cohen, Damon Clinkscales, Alex Jones and others. The energy that the Capital Factory mentors and these other leaders put into mentoring startups like ours ricocheted and multiplied within the community, in the kind of “liquid network” that Steven Johnson writes about. Although the companies within the first CapFac class benefited the most, it was like every startup in Austin was admitted.
On the one hand, we had a bunch of fans in blog land, some website code, and a good team. But we had no idea how to make money and a finite runway. Our most notable validation as a project was a failed effort to share data, and our most notable validation as a business was an honorable mention ribbon.
Are you seeing it?
We were experiencing success after success after success.
Every time we failed, a smaller opportunity opened: one that was sharper; one that was more real; one that brought us closer to the right leverage point for changing the world.
These opportunities were smaller, but the energy behind them was the same. We were following what inspired people — to use data sets from Infochimps, to post a data set, to join our pied-piper team, to tweet about us, to make an intro, to have coffee and teach us something. All our ideas were useless crap, except in one essential way: to gather and inspire the people who would help us uncover a few ideas that were good, and execute on them.
(Lucky break scoreboard: most patient graduate advisor ever; financial safety and family support; incipient critical mass of public data sets; new breakthroughs in the world; big names taking interest in the project and deciding to market it; sharing the same advisor as Dhruv; the explosion of social media data; the invention of Hadoop; the completely random intersection with Joe; starting Infochimps just as the Austin startup scene exploded.)
The 3rd part of this blog series will highlight the journey from “project that inspired people” to “business that solved a real problem” — powered by individuals who made sizable investments of time, energy, money and kindness to produce repeated successes from repeated failures, and by the early customers of Infochimps who believed in us.
As we go, that “lucky break scoreboard” will get more and more improbable, enough to make that word “lucky” ludicrously inapplicable.
Philip (Flip) Kromer is co-founder and CTO of Infochimps where he built scalable architecture that allows app programmers and statisticians to quickly and confidently manipulate data streams at arbitrary scale. He holds a B.S. in Physics and Computer Science from Cornell University and attended graduate school in Physics at the University of Texas at Austin. He authored the O’Reilly book on data science in practice, and has spoken at South by Southwest, Hadoop World, Strata, and CloudCon. Email Flip at email@example.com or follow him on Twitter at @mrflip.