2014 data science survey out now

The annual data science skills and salary survey from O’Reilly is now freely available from their website. The survey uses responses from 800 participants from over 50 countries.

Inside are comparisons of the different tools used by data science practitioners and the corresponding salary they can expect to earn. The data is also cut by geographic location, career level, academic record, and industry type amongst others.

A lot of the key findings are expected: R, Python, and SQL are the most widely used tools; top USA salaries are in California. But some results are more surprising: Spark has emerged as a popular tool in 2014; the ‘Entertainment’ industry boasts the highest median salary for data scientists.

Highlights in this edition include a cluster analysis of the tools used, which showed the emergence of a new cluster around Mac OS X, MySQL, and D3. There is also a salary regression model which puts a dollar weight on geographic, demographic, and company predictors to give an in-sample R² of over 50%.

It's a shame the number of respondents is so low, but all in all it is a good read that gives a directional sense of the state of play in 2014 and what might be up and coming in 2015.

When to wait for flight prices to drop

Bing Flights Price Predictor

Kayak Flights Price Predictor

I’ve often heard people talk about the best time to book flights (apparently it’s Tuesday nights). There has also been a rise in airfare blogs such as Airfare Watchdog and CheapAir’s Blog.

Even online flight booking platforms such as Bing and Kayak are starting to offer advice on whether prices are trending up or down and whether now is the best time to buy.

Model Parameters Values Over Time

Recently, I came across a dataset with about six months’ worth of price data for US domestic flights. For about 100 popular routes, the dataset recorded the observation time and the current price of each future flight. I wanted to see whether we could actually predict directional changes in price with any confidence.

I built a model to try to predict whether the price would drop by at least 10% in the next 7 days. Using only historical price returns and weekly updating of the model parameters, I calculated the daily out-of-sample performance. The results were much better than I expected.
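As a rough illustration of the prediction target, here is a minimal Python sketch of how a "drops by at least 10% in the next 7 days" label could be built from one flight's daily price series. The function name `drop_labels` and the choice to count a drop at any point within the window (rather than only on day 7) are assumptions made for illustration, not details of the model described above.

```python
def drop_labels(prices, horizon=7, threshold=0.10):
    """Label each day of a daily price series (hypothetical helper).

    1    -> the price falls by at least `threshold` at some point
            in the next `horizon` days
    0    -> it does not
    None -> fewer than `horizon` future days are available
    """
    labels = []
    for i, price in enumerate(prices):
        future = prices[i + 1 : i + 1 + horizon]
        if len(future) < horizon:
            labels.append(None)  # cannot be labelled without peeking past the data
        else:
            labels.append(int(min(future) <= price * (1 - threshold)))
    return labels


# Toy series with a 3-day horizon to keep the example short.
example_prices = [100, 100, 95, 89, 100, 100, 100, 100, 100, 100]
example_labels = drop_labels(example_prices, horizon=3)
print(example_labels)
```

Days near the end of the series are left unlabelled, which matters for out-of-sample testing: a model updated weekly should only be fitted on labels whose 7-day window had already closed at the time of the update.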

Model R2 In Test Data Over Time

Firstly, the 2 parameters in my model were reasonably stable over time, a key property of a well-defined model. Secondly, the out-of-sample R² (a measure of performance) was consistently positive, at around 5%.
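For reference, a minimal sketch of the standard R² calculation, assuming the usual definition of one minus the ratio of residual to total sum of squares and using the mean of the evaluated data as the baseline. The function name `r_squared` and the toy numbers are illustrative only. Under this definition a model that always predicts the average outcome scores 0, and anything worse scores below 0, which is why a consistently positive out-of-sample value is meaningful even when it is small.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with the mean of y_true as the baseline."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Toy test set: 1 = price dropped, 0 = it did not.
outcomes = [1, 0, 0, 0]

baseline = r_squared(outcomes, [0.25, 0.25, 0.25, 0.25])  # always predict the mean
perfect = r_squared(outcomes, [1, 0, 0, 0])               # every prediction exact
worse = r_squared(outcomes, [0, 1, 0, 0])                 # worse than the mean

print(baseline, perfect, worse)
```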

More concretely, and more actionably: for the dataset I was looking at, the price actually dropped by at least 10% in the following week 18% of the time; the model predicted a drop 13% of the time, and it was correct in 73% of these predictions.
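These three figures also pin down how many real drops the model caught. The short Python calculation below works that out, reading the 73% as precision (the share of predicted drops that were correct); the variable names are illustrative.

```python
# Figures quoted in the post.
base_rate = 0.18       # share of observations where the price really dropped
predicted_rate = 0.13  # share of observations where the model predicted a drop
precision = 0.73       # share of predicted drops that turned out correct

# Shares of all observations.
correct_calls = predicted_rate * precision     # predicted a drop, and it dropped
false_alarms = predicted_rate - correct_calls  # predicted a drop, but it did not
missed_drops = base_rate - correct_calls       # dropped, but no drop was predicted

# Share of real drops that the model caught.
recall = correct_calls / base_rate

print(f"correct drop calls: {correct_calls:.1%} of observations")
print(f"false alarms:       {false_alarms:.1%} of observations")
print(f"missed drops:       {missed_drops:.1%} of observations")
print(f"recall:             {recall:.1%}")
```

So the model flagged a little over half of the real price drops (about 53%), while roughly 8.5% of all observations were drops it did not predict.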

With more feature data such as flight duration, number of changes, oil prices, and seasonality, I’m confident that the 13% could get closer to 18% and the 73% could be pushed even higher, maybe to 95%.