Check out my new Machine Learning blog post on Airbnb


While almost all members of the Airbnb community interact in good faith, there is an ever-shrinking group of bad actors that seeks to take advantage of the platform for profit. This problem is not unique to Airbnb: social networks battle attempts to spam or phish users for their details; ecommerce sites try to prevent the use of stolen credit cards. The Trust and Safety team at Airbnb works tirelessly to remove bad actors from the Airbnb community and to help make the platform a safer and more trustworthy place to experience belonging.

Missing Values In A Random Forest

We can train machine learning models to identify new bad actors (for more details see the previous blog post Architecting a Machine Learning System for Risk). One particular family of models we use is Random Forest Classifiers (RFCs). An RFC is a collection of trees, each independently grown using labeled and complete input training data. By complete we explicitly mean that there are no missing values, i.e. no NULL or NaN values. In practice, however, the data often has many missing values. In particular, very predictive features do not always have values available, so they must be imputed before a random forest can be trained.
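To make the imputation step concrete, here is a minimal sketch of one common baseline, column-wise median imputation, using only the Python standard library. It is not necessarily the method used in the post; the function name impute_median, the toy feature matrix, and the 0.0 fallback for columns with no observed values are all assumptions made for illustration.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and float NaN as missing (stand-ins for NULL and NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_median(rows):
    """Return a copy of rows with missing entries replaced by the column median."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        # Assumption: fall back to 0.0 when a column has no observed values.
        medians.append(median(observed) if observed else 0.0)
    return [
        [medians[j] if is_missing(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy feature matrix with two missing entries.
X = [
    [1.0, 10.0],
    [float("nan"), 20.0],
    [3.0, None],
    [5.0, 40.0],
]
X_complete = impute_median(X)
```

The completed matrix contains no missing values, so it could then be handed to a random forest implementation, for example the fit method of scikit-learn's RandomForestClassifier. The medians should be computed on the training data only and reused at prediction time.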

Read more…


2014 data science survey out now

The annual data science skills and salary survey from O’Reilly is now freely available from their website. The survey uses responses from 800 participants from over 50 countries.

Inside are comparisons of the different tools used by data science practitioners and the corresponding salary they can expect to earn. The data is also cut by geographic location, career level, academic record, and industry type amongst others.

A lot of the key findings are expected: R, Python, and SQL are the most widely used tools; top USA salaries are in California. But some results are more surprising: Spark has emerged as a popular tool in 2014; the ‘Entertainment’ industry boasts the highest median salary for data scientists.

Highlights in this edition include a cluster analysis of the tools used, which showed the emergence of a new cluster around Mac OS X, MySQL, and D3. There is also a salary regression model which puts a dollar weight on geographic, demographic, and company predictors to give an in-sample R² of over 50%.

It is a shame the number of respondents is so low, but all in all it is a good read that gives a directional sense of the state of play in 2014 and what might be up and coming in 2015.