Friday, August 26, 2011

The Startup Genome Compass - Behind the Scenes

Today we are announcing the launch of the Startup Genome Compass, a web application designed to help startups assess their performance and benchmark themselves against firms with similar properties.

In addition to the Startup Genome Compass, we are releasing a second report that goes deeper into the analysis of startups, namely the implications of prematurely scaling their operations.

The Startup Genome Compass is the result of further research into the properties of startups and what makes them tick. Several months ago, we published the first Startup Genome Report. The report received tremendous attention and wide coverage. Over 3,200 firms filled out our assessment, and we received excellent feedback and support from the startup ecosystem.

This blog post describes how the Startup Genome Compass was built, and some of the magic behind the curtain. Before I begin, it may be worthwhile to remind the readers about the goal of the project:

The Startup Genome project aims to increase the success rate of startups by deciphering how startups are built, and give entrepreneurs data, tools and recommendations to improve their performance. We try to put quantitative measures on almost all aspects of startup performance in order to improve decision making during the lifetime of firms.

The "heart" of our technology lies in the machine learning process used to identify a startup's type and stage. The credit for building and fine-tuning the machine learning engine goes to Ertan Dogrultan, who became an indispensable member of the Startup Genome team in June. I describe the machine learning process later in this post; before that, here's a short introduction to machine learning.

The idea behind machine learning is simple - given a set of examples, called the training sample, an algorithm "can learn" the common patterns in these examples and then try to find those patterns in new data. For example, we can give an algorithm a set of people's heights and genders. The algorithm can then learn that men are, on average, taller than women. When the algorithm is later given new data about people, it can predict their average height from their gender.
What machine learning does best is to identify the relevant parameters that can describe a pattern in the data succinctly, learn those patterns and then predict them. In other words, it is good at identifying correlations and interpolating from data.
What machine learning does not do, typically, is explain the process that causes these correlations, or what causes what. I touch a little bit on that later, in the part titled "the usual suspects", but in general, readers of our report should always remember that correlation does not imply causation.
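The height-and-gender example can be made concrete in a few lines of Python. This is a toy illustration, not our actual engine: "training" here just means memorizing the average height per gender, and "prediction" means looking it up.

```python
from statistics import mean

# Toy training sample: (gender, height in cm) pairs.
training = [
    ("male", 180), ("male", 175), ("male", 183),
    ("female", 165), ("female", 160), ("female", 170),
]

# "Training": learn the average height per gender.
model = {}
for gender in {g for g, _ in training}:
    model[gender] = mean(h for g, h in training if g == gender)

# "Prediction": given only a gender, return the learned average height.
def predict(gender):
    return model[gender]

print(predict("male"))  # average of the male heights seen in training
```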

In the next few paragraphs, I will describe the analysis process in the order it was performed, leading to the development of the Startup Genome Compass.

Data Collection
Given the overwhelming response to our report, we were able to collect a significant amount of data. Some researchers have estimated that the Startup Genome now has the largest collection of startup data in the world. The role of our data collection in this phase was less exploratory and more targeted. The major change from version 1 of the survey was the addition of specific questions that could help identify startup type and stage. In addition, we collected more "generic" information (such as age and education information) to be able to compare our findings with previous research on the startup phenomenon. Another improvement over the previous survey was making it dynamic - during the survey, the questions changed on the fly according to a firm's answers, allowing us to ask more questions.
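To illustrate what a dynamic survey means in practice, here is a minimal sketch of a branching questionnaire; the question ids and branching rules below are hypothetical, not the actual survey's.

```python
# A minimal sketch of a branching survey: each question names the next
# question to ask, possibly depending on the answer given.
QUESTIONS = {
    "has_revenue": {
        "text": "Does your startup have revenue?",
        "next": lambda answer: "monthly_revenue" if answer == "yes" else "funding_stage",
    },
    "monthly_revenue": {
        "text": "What is your monthly revenue?",
        "next": lambda answer: "funding_stage",
    },
    "funding_stage": {
        "text": "What funding stage are you at?",
        "next": lambda answer: None,  # end of survey
    },
}

def run_survey(answers, start="has_revenue"):
    """Walk the question graph using canned answers; return the ordered
    list of question ids that were actually asked."""
    asked, qid = [], start
    while qid is not None:
        asked.append(qid)
        qid = QUESTIONS[qid]["next"](answers.get(qid, "no"))
    return asked

# A firm with revenue gets asked the extra revenue question:
print(run_survey({"has_revenue": "yes"}))
```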

Machine Learning for Classification
In order to analyze a firm, we need to classify it along two dimensions - what type of firm it is, and which stage it is in. It is important to understand the difference between a behavioral stage and an actual stage of a firm.
A behavioral stage is determined by the firm's actions, e.g., how much money it spends on user acquisition or how many employees it has hired. An actual stage is determined by the firm's market success factors, such as its user growth and its total number of users. In total, several dozen parameters are used for each classification.

A classic classification process has two parts. In the first part, called training, an algorithm is given examples of exogenously classified data, which are used to teach the algorithm to analyze non-training data. In the second part, classification is run on the unclassified data, and the results are tested for accuracy.
Our training data was created by our team. We manually classified several hundred firms by type, behavioral stage and actual stage. It was a lot of work, but the more firms we classified, the more accurate our algorithm became. Pandora works similarly, actually: every song you hear has been manually classified by a trained "musicologist", in a process that takes almost 20 minutes.
Once we had a training sample, we needed to tell our algorithm which attributes of the firm to use for training and classification. Every answer to a question in the survey is an attribute of a firm in our data. As many of the attributes are correlated, and many are not informative about the type and stage of the firm, we performed feature selection using the gain ratio metric.
We tested several classification algorithms and chose the one that was most accurate and predicted best on our data. Accuracy was determined using cross-validation on our training data.
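For readers curious about the gain ratio metric, here is a small self-contained Python sketch of how it scores a discrete survey answer against a label. The actual analysis was done in Weka; this only shows the underlying formula, and the example data is made up.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of `feature` for predicting `labels`, normalized
    by the feature's own entropy (its 'split information')."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    split_info = entropy(feature)
    return gain / split_info if split_info else 0.0

# Hypothetical example: a binary survey answer vs. a stage label.
answers = ["yes", "yes", "no", "no"]
stages = ["late", "late", "early", "early"]
print(gain_ratio(answers, stages))  # 1.0: the answer fully determines the stage
```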

Identifying inconsistency
Using our training data, we checked what happens when the actual stage of a firm does not match its behavioral stage. The results (described in the report) show strong evidence that premature scaling is a major cause of lackluster startup performance.
Premature scaling happens when a firm advances in one (or more) of its operational dimensions without keeping it in sync with the rest of its operations. This behavior results in a myriad of issues, such as an inability to raise money in later stages, or spending too much on marketing too early and causing customer dissatisfaction.
Our classification algorithm uses the predicted actual and behavioral stages to identify inconsistent firms. The data for these firms is then compared to other firms of the same stage and type.
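Conceptually, the inconsistency check reduces to comparing the two predicted stages. A minimal sketch, assuming stages are encoded as ordered integers; the encoding and labels here are illustrative, not the project's actual scheme:

```python
# Assume stages are ordered integers, e.g. 1 = earliest, 4 = latest.
def consistency(behavioral_stage, actual_stage):
    """Compare a firm's predicted behavioral and actual stages."""
    if behavioral_stage > actual_stage:
        return "premature scaling"  # operations ahead of market traction
    if behavioral_stage < actual_stage:
        return "behind"             # traction ahead of operations
    return "consistent"

print(consistency(behavioral_stage=3, actual_stage=1))  # "premature scaling"
```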

Transforming math into code
Most of the analysis was done using Weka, with final touches in Excel. Once we had our classification methodology working, our developers transferred the algorithm into a web application that is able to perform both training and classification for the firms filling out the survey.

Visualization
A major investment on our part was letting firms analyze and benchmark themselves against other firms. We display 23 measures to each firm, on top of its type, stage, and whether it is inconsistent, along with personalized research from the Startup Genome Report.
It was therefore very important that our presentation be both visually appealing and informative, allowing a firm to easily separate the wheat from the chaff.
For each measure in the report, we compare a firm to the mean of the other firms in its group, and display its location relative to the majority of the group, defined as within one standard deviation of the mean.
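The benchmarking rule can be sketched in a few lines of Python; the peer values below are made up for illustration.

```python
from statistics import mean, stdev

def benchmark(firm_value, peer_values):
    """Place a firm's measure relative to peers of the same type and
    stage: 'within majority' if within one standard deviation of the
    peer mean, otherwise 'above' or 'below'."""
    mu, sigma = mean(peer_values), stdev(peer_values)
    if firm_value > mu + sigma:
        return "above majority"
    if firm_value < mu - sigma:
        return "below majority"
    return "within majority"

# Hypothetical peer group for one measure, e.g. monthly active users:
peers = [1000, 1200, 900, 1100, 1050]
print(benchmark(2000, peers))  # well above one standard deviation
```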

The usual suspects
Some caveats about our process require mentioning:
  1. Applicability to a firm - Reporting average results does not necessarily mean that firms should aspire to change their behavior to match the mean. To put it in other words, you don't tell children who get an A+ on a test to try to get a B next time. Our application helps point out where firms deviate significantly from the behavior of the majority. We recommend that entrepreneurs use this as a tool to identify potential issues with their firm.
  2. Classification error - Since we are using statistics to classify firms, our algorithm may make classification errors. This means that some firms may be classified into an incorrect type or stage. In our tests, the error rate is very small, given the large number of variables we are using. The solution to this issue is to identify the causes of these errors, and improve our survey and analysis to handle them. More data is also needed to help with this issue. Therefore, each firm that fills out the survey also contributes to the future enhancement of its own analysis.
  3. Survivorship Bias - As mentioned in Jason's excellent blog post, our data is static. Firms that have ceased to exist do not fill out our survey and tell us why they failed. Our stages model partially solves this problem, as we can put a measure on firms that is better than "dead" or "alive". We are striving to move from a static analysis of the firm to a more dynamic model of firm performance, which is where our focus will lie in future versions.


What's next for us?
We are very excited about launching the Startup Genome Compass, and we hope it will be useful to the startup community. Our current focus is on collecting more data and improving our surveying mechanism to improve the accuracy of our analysis. The next step is introducing dynamic analysis of startups, which we believe will disrupt the field of startup analysis.

Stay tuned...

Saturday, August 13, 2011

Wealth Concentration - Yes. Central Planning? Not so much.

My Facebook feed was filled today with links to a post by Tim O'Reilly, which itself references this post by John Robb. John theorizes that centralized planning is to blame for the recent hurdles the US economy is facing.

The summary of John's hypothesis is as follows:

  1. The USSR failed because of central economic decision making.
  2. Decision making in such a complex system can only be made efficiently by letting the players in the market make small parallel decisions - that is, letting the "market" work.
  3. The US was successful because it was very good at letting markets work. However:
  4. Recently, however, too much centralized government planning has led to misallocation of resources.
  5. But more importantly, the increasing concentration of U.S. wealth among a handful of rich people leads to even more centralized decision making - only those wealthy enough make the decisions, and they control most of the assets to decide on.
I liked the theory, as it is novel, and gives a new angle to the causes of the recent economic crisis.

I am quite positive, however, that wealth concentration is neither evidence, nor a cause for central planning and its failure in the U.S.

The reason is that wealth concentration hasn't changed tremendously in the last 90 years in the U.S.

And here's why.

As support for his claim about wealth concentration in the U.S., John shows the US Income distribution graph from the NYTimes Economix blog:


If you don't like reading graphs, the one-line summary is: "The richest 0.1% make much, much more than all the rest". This indeed shows that income is very unequal in the U.S.

To verify John's hypothesis, we can do three things:
  1. Focus on wealth, and not on income. If the claim is about wealth concentration, why look at income and not total assets owned by people?
  2. See if it changed over time - suppose the U.S. always had unequal wealth distribution; in that case, the recent economic issues are probably not a result of that inequality.
  3. Compare it to other countries - If there are other countries, with unequal wealth distribution, and even central planning, do they also face similar problems?
To answer these questions, we need data. Luckily, this page by Sociologist Bill Domhoff from UCSC has all the information required to answer these questions.

So let's start:
  • Is wealth concentrated by a handful of people in the U.S.?
The probable answer is Yes, as can be seen from this NYTimes Economix Graph:


This graph shows that the top 1% of the U.S. population controls between 20% and 30% of the wealth in the U.S.

  • Did wealth concentration change drastically in the last 80+ years?

The probable answer is No.

The following graph, with data going back to 1922, shows that wealth concentration in the U.S. has not changed dramatically over the last 25 years or so. Moreover, the top 1% holds less of the wealth today than it did during the 1990s, and roughly the same share as in the 1960s and 1930s. One caveat is that the data only goes up to 2007, but it probably hasn't changed much since:

Source: Adapted from Wealth, Income, and Power

This means the U.S. was always unequal in its wealth distribution. 

It's not a recent phenomenon.

  • So how does the U.S. compare to other countries?

The data I found is for the year 2000 (I didn't look too hard, though), but let's assume no dramatic changes in wealth concentration have happened since 2000. Even if changes did happen, they probably went in the same direction for most countries. The data is for the top 10% of the population, not the top 1%:



  Country          Wealth owned by top 10%
  Switzerland      71.3%
  United States    69.8%
  Denmark          65.0%
  France           61.0%
  Sweden           58.6%
  UK               56.0%
  Canada           53.0%
  Norway           50.5%
  Germany          44.4%
  Finland          42.3%

The table clearly shows that although the U.S. has most of its wealth concentrated in the top 10% of the population, this is also the case for Switzerland, Sweden, France, and Denmark. 
All of these countries have very centralized planning. 
Many of them have very high tax rates compared to the U.S.

Some of them are in bad shape.

But Switzerland seems to be doing well, no matter what. Especially given its wealth concentration.

Conclusion: We can't really conclude anything from this short exposition of data, but it looks probable that wealth concentration did not cause more centralized planning, and that centralized planning did not cause the recent difficulties the U.S. economy is facing.

Wednesday, May 25, 2011

Deciphering the genome of ... startups!

Over the last three months or so I had the pleasure of working with the talented team of blackbox.vc (http://blackbox.vc/) on the startup genome project (SG).

The goal of the project is extremely ambitious - to map, model and analyze what makes startups tick, what helps them succeed and why many of them fail. We are hoping that the insights generated through this project will create useful tools for increasing the success rates of startups during their initial growth periods, as well as shed light on interesting phenomena, pitfalls to avoid and much more.

My part in this project was to aid with the technical and scientific analysis of the data, anywhere from initial definitions of the questions to be answered up to giving input on what conclusions can be drawn using different statistical techniques. An academic marketing background is excellent for this purpose, as it brings in tools from survey design, econometrics, some psychology and more.

[Source: 12manage.com]
The process was fun (and is still going on), and initial results will appear any day now in the first report we publish. One interesting conclusion is that many firms today, especially young startups, own hoards of data they are not sure how to handle. This is similar to striking an oil field without any refining technology in sight to turn the oil into usable fuel. These firms sit on gold (or oil) in the form of data, but cannot tap into this resource because of uncertainty about how to attack such a problem.

I think there is a lot to be learned from how we analyzed the data and the methodological process that happened while writing our report, and this post is aimed at telling the story. If there is any specific part or topic which you are interested in, leave a comment in the comments section, or contact me, and I'll do my best to elaborate more. If you're interested in future updates, just follow me on Twitter.

As in any genome project (as if I was ever part of one), there are three main parts to this project:

  1. Mapping - Initially, we were tasked with a simple question: we had lots of data, some of it good, some bad, some accurate, some not - what in there is useful?
  2. Modeling and Hypothesizing - Once we had a better grasp of the data we had in our hands, we needed a way to think about the questions we would like to answer. In order to ask smart questions, we needed a simple way to describe the process startups go through, if they go through a standard one at all.
  3. Analysis and Reporting - Given a model and the data, we set upon checking our hypotheses and drawing conclusions from the validated ones. We ended up with tons of numbers, tables, graphs, equations and what not. The final goal was to somehow synthesize it all into readable content, which is also (hopefully) actionable.

Friday, March 4, 2011

How the industry of innovation makes itself obsolete

There's a lot of buzz lately around Silicon Valley about angel investors, startup accelerators, crowdfunding, business model development and much more.

A typical claim is that the VC business model is broken. Another one is that VCs have just not adjusted to the modern needs of their companies.

As I am not in the valley, I watch the changes somewhat from afar, but not too far. This gives (IMO) an excellent perspective on things, and one of the interesting phenomena is how the VC industry is funding its own demise.

If you're reading this blog and know nothing about VCs (venture capitalists), here's a short explanation: VCs take money from investors, and invest it in startups. Their (claimed) expertise is allocating investments smartly in a way that funds groundbreaking innovations that bring huge returns to their investors.

The industry started in the early 1960s (late '50s, even), and has funded probably every large and innovative technology company you have heard of (or not), including DEC, Apple, Cisco, Google and many more.

In the last 5 years or so, a big shift in the industry can be observed - much smaller investments are needed to start a company, and VCs often stay outside the game of small-company creation and exits.

Friday, December 24, 2010

Self inflicted paranoia - The "secret" of insurance deductibles

It is a well known fact (or at least strong belief) that most people make very wrong choices with respect to buying insurance products. The majority of people over-insure, that is, pay much more than they should, or just neglect to purchase insurance altogether.

This post turned out to be a bit long and somewhat technical, but here's my promise - if you can stick with it until the end, chances are you will feel fooled by insurance companies, but you will also be able to save lots of money in the future.


As someone who studies consumer decisions, buying insurance is one of the most fascinating phenomena to study, for several reasons:

  1. Insurance is a complex product, containing many details, but at the end, there is just one price (premium) to pay.
  2. Consumers buying insurance need to make many decisions about many "parameters" of the product they buy.
  3. Insurance sells something in the future (coverage against negative events) that might or might not happen - who can predict their own future with accuracy?
  4. The product sold is very emotional - it is related to negative events with big and bad impact. Most people have a hard time entangling their emotion about the event itself and the decision of how much coverage to buy.
  5. The insurance industry (in the US) is rather competitive, so the products are abundant, advanced and should be fairly priced.
  6. People make the decision to buy insurance over and over again, many times annually for a period of 30-40 years. This means there is ample time and information for learning from past mistakes.
I am fascinated by this phenomenon, since it shows how firms produce a complex, emotional product and profit from consumers' inability to understand too many details, or from their plain fear of negative events.

Wednesday, December 22, 2010

Are VCs the new recruitment agencies?

Roughly a month ago I had a lovely brunch with two friends, both currently in the online advertising industry.
Our chat touched on many topics, from crazy Halloween parties to China's media agency.

If you ask yourself "Huh?", the answer is Yes - they are connected in some twisted way.

We finally settled on talking about the recent burst of small startups, small investments and multiple small exits of companies occurring all around the bay area. Are these signs of a new "tech bubble"?
Some attributes of the current frenzy are similar to previous tech-crazes, but one is unique - the exits are not IPOs but rather many small acquisitions by larger firms, and the investors are not "regular" people doing it through the stock market - they are "sophisticated" angels and acquiring firms.

At one point, I raised my hypothesis that investors have recently been "exploited" as recruitment agencies by entrepreneurs and acquiring firms.

And it goes like this.

Once upon a time, perhaps 5-10 years ago (or more), founding a startup required significant initial capital. Getting the business to grow and become profitable, or even gain decent traction, required on the order of $10M-$20M, if not more, except in certain stellar cases.

This caused two phenomena - entrepreneurs had to either raise considerable amounts of money to start a company, or join a large company with a strong financial backing to create their dream products. It was not possible to just "create it and see if it works".

Wednesday, November 17, 2010

Talent Acquisition vs. Business Acquisition

Over the years I have been lucky to meet many smart people. I consider myself even luckier by the fact that many of them have become my friends.

If you are an engineer, or have worked at a startup or in a similar environment before, you are probably aware that the 10:1 employee is not a myth. A 10:1 employee is an employee who performs like ten others. They code faster, make fewer mistakes, design smarter solutions, solve problems better, the customers love them, and they are as flexible and artistic as can be. In short, they are a super-persona of sorts. For the mathematically inclined: if we assume an average person has an output of 10 units, and that 95% of the population produces at most 50 units (5:1), then the probability of a person being a 10:1 employee is less than 1 in 10,000. (I cheated a little on the math, don't kill me.) When I say "smart", I mean 10:1 "smart".
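For the curious, the back-of-the-envelope calculation can be reproduced in Python, under the (admittedly cheated) simplification that output is normally distributed:

```python
from math import erfc, sqrt

# Assume output is roughly normal with mean 10 units. If 95% of people
# produce at most 50 units, the 95th percentile (z ~ 1.645) pins down
# the standard deviation:
mu = 10
sigma = (50 - mu) / 1.645     # about 24.3 units

# Probability of producing 100+ units (a 10:1 employee):
z = (100 - mu) / sigma        # about 3.7 standard deviations
p = 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal
print(f"about 1 in {1 / p:,.0f}")  # on the order of 1 in 10,000
```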

The interesting thing about smart people, at least in the business world, is that the fact they are rare yields a high premium for their services. In other words, companies are willing to pay a lot to hire them or receive their services.