Data Science - a Journey to Adoption
By Nikolay Novozhilov, Chief Data Scientist, Wego
It all started with the hype around big data, Data Science and Machine Learning. Businesses took their cue from public success stories in competitions like Kaggle. In a competition, the data is already prepared, and the result comes back as a score on a private leaderboard. Participants spend all their time doing data science: feature engineering, training models, cross-validating, and more. Seeing this, many organizations started hiring data scientists.
By now everybody has realized that getting enough clean data is the real problem and that solving it takes 95 percent of the effort. Most data scientists end up doing data engineering work: data collection and data cleaning. If you are still hiring data scientists, stop and think! Chances are you need more data engineers.
I feel that even getting clean data is just the tip of the iceberg. Execution is the base of the iceberg. I see three main issues data scientists encounter while trying to bring real value to the company.
Business is about making decisions. You make decisions either with your gut feeling or with data. We now believe that making decisions with data is more efficient. The investment in Big Data and Data Science pays off only if day-to-day decisions in your business start to rely more on data and less on expert opinions.
Following the paradigm of academia and competitions, data scientists focus on prediction, but companies don’t need predictions. A user comes to my website, and my machine learning model predicts that he is going to leave. I don’t need to know that! I need to know what I can do to make him stay and buy.
It is useful to bring a domain expert into the research early. Not because they can help with predictive models; generic algorithms are doing just fine. The domain expert knows better which levers can be pulled and will help to make the predictive model actionable.
Decisions require trust. I often see colleagues sharing the same frustration: “business people must trust the data,” and yet they don’t. I agree that decisions made with data are better. However, I also understand the problem that business people have with our data and models. I doubt that there is a single data scientist who can honestly say that their data is 100 percent clean and accurate. Let’s admit it: such a thing doesn’t exist! And it doesn’t get any easier when you put a “magical” machine learning algorithm on top, one that is a black box even for data scientists themselves.
Very often data scientists ship their models or dashboards “as is” and leave it to users to explore them. Any business person will browse the model looking for something surprising. And after they find it, the first question is: “Is the data correct?” Even if it is a data issue only 10 percent of the time and a serious business insight the other 90 percent, the first impression will still be “the data is unreliable.” If you built the model, analyze the data yourself, as a business person would. Find the surprising facts, check them to rule out data issues, and only then share the model and the key insights from it.
Model interpretation also contributes to building trust. Make sure that you develop tools that help to interpret your model’s decisions. For example, if you have built a recommendation engine, create an internal tool that explains which factors push each product up the list. It will also help in debugging.
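As a rough illustration of such an internal tool, here is a minimal Python sketch. It assumes the recommendation engine ranks products with a simple linear score, so that each factor's contribution is just its weight times its value. The weights, factor names, and products are all hypothetical, and a real engine with a non-linear model would need a different attribution method.

```python
from typing import Dict, List, Tuple

# Hypothetical weights of a linear ranking model (an assumption for illustration).
WEIGHTS: Dict[str, float] = {"price_match": 2.0, "popularity": 1.0, "recency": 0.5}


def explain(features: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (factor, contribution) pairs, largest absolute contribution first."""
    contributions = [
        (name, weight * features.get(name, 0.0)) for name, weight in WEIGHTS.items()
    ]
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)


def score(features: Dict[str, float]) -> float:
    """Total score is the sum of the per-factor contributions."""
    return sum(value for _, value in explain(features))


def rank_with_explanations(
    products: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float, List[Tuple[str, float]]]]:
    """Rank products by score and attach the explanation for each one."""
    rows = [(name, score(f), explain(f)) for name, f in products.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up example products, used only to show the shape of the output.
EXAMPLE_PRODUCTS = {
    "hotel_a": {"price_match": 0.9, "popularity": 0.2, "recency": 0.1},
    "hotel_b": {"price_match": 0.1, "popularity": 0.8, "recency": 0.4},
}

ranking = rank_with_explanations(EXAMPLE_PRODUCTS)
for product, total, factors in ranking:
    print(product, round(total, 2), factors)
```

A business user reading this output can see not only that one product outranks another, but also which factor is responsible, which is usually the first thing they ask about a surprising result.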
Very often data scientists build a predictive model in Python or R and get remarkable accuracy, but this code can’t run in production at real scale. In the end, everything gets rewritten in Java or Scala, and half of the model gets scrapped due to complexity.
Different problems arise when you are optimizing a manual process. For example, marketing campaign optimization may still be done manually in your organization. You might think that a proper dashboard with bidding advice will be enough for deployment. In practice, most of your model improvements will come from using more features for targeting and from faster updates of your bids. This will make manual optimization impossible, and you will end up automating the whole process. It is very likely that developing a proper automation tool that manages your AdWords or Facebook ads through their APIs will require more time and effort than building the model.
The importance of execution is becoming more and more apparent, especially now that both data cleaning and machine learning are getting excellent automated tools. What does it mean for the big picture? Many specialists are predicting that data science will be automated and the profession will go away. I think this is only partially true. There is no place for a siloed Data Science department in future organizations. However, data scientists will not go extinct. They will march over the fence into domain-specific roles. There is a huge need for marketing managers who understand data; product managers who understand data; developers who understand data; HR managers who understand data; and everybody else who understands data. The Data Science department will disappear by taking over the whole organization.