In this series, I will post my views on topics related to Data Science - probably the most fascinating and least understood field in IT. It's like an elephant in a dark room: since most of the tenants have never seen the animal in daylight, they tend to believe that the tail is the elephant, or the tusk is the elephant, or whatever. Fortunately, developments in this area have knocked down the "Big Data is hype" rhetoric, and we are ready to industrialize data science. I have no doubt that Data Science will become the foundation of the technological revolution we are about to witness.
So what is data science? To me, Data Science is an interdisciplinary field that employs sophisticated tools and techniques to extract knowledge and actionable insights from structured or unstructured data in order to optimize business objectives. I will not insist on a definition, though; it simply does not matter. What matters is understanding the building blocks, how they fit together, and the possibilities beyond the obvious. In this article I will outline the four major components of Data Science: Data, Science, Technology, and Business. Subsequent articles will discuss each topic in further detail.
Data is the food of data science. It can be small or big. The term "big" is relative anyway, so I won't get into that debate. What matters is the data itself or, to use a more sensible term, "smart data". While the four famous V's (Volume, Velocity, Variety, and Veracity) describe the underlying landscape, it's the Value that matters in the end. Feature engineering, i.e. creating meaningful and useful attributes from raw data, is a key trend. Another key trend is to deal with unstructured data by embedding the feature engineering in powerful Machine Learning models such as Deep Neural Networks, which learn the features directly from the raw data.
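To make the idea of feature engineering concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the purchase records, the `engineer_features` helper, and the four attributes it derives. It only shows the general pattern of turning raw rows into one meaningful feature row per customer, not a recipe for any particular project.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw data: one row per purchase (customer_id, purchase_date, amount).
raw_purchases = [
    ("c1", date(2017, 1, 5), 20.0),
    ("c1", date(2017, 3, 1), 35.0),
    ("c2", date(2017, 2, 10), 120.0),
    ("c1", date(2017, 3, 20), 5.0),
    ("c2", date(2017, 2, 11), 80.0),
]


def engineer_features(purchases, as_of):
    """Aggregate raw purchase rows into one feature row per customer."""
    by_customer = defaultdict(list)
    for customer_id, day, amount in purchases:
        by_customer[customer_id].append((day, amount))

    features = {}
    for customer_id, rows in by_customer.items():
        amounts = [amount for _, amount in rows]
        last_purchase = max(day for day, _ in rows)
        features[customer_id] = {
            "n_purchases": len(rows),                         # how often they buy
            "total_spend": sum(amounts),                      # how much they spend overall
            "avg_basket": sum(amounts) / len(amounts),        # typical purchase size
            "days_since_last": (as_of - last_purchase).days,  # recency
        }
    return features


features = engineer_features(raw_purchases, as_of=date(2017, 4, 1))
for customer_id, row in sorted(features.items()):
    print(customer_id, row)
```

The raw rows say little on their own, while attributes such as recency and average basket size are the kind of input a churn or forecasting model can actually use.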
Science, or the algorithms, a.k.a. Machine Learning, is the backbone of Data Science. A data scientist follows a rigorous process (such as CRISP-DM) to explore and analyze the data while building Machine Learning models. A machine learning model addresses a specific problem, such as predicting customer churn and identifying the most influential factors. Starting with neural networks in the 1950s, then providing sophisticated algorithms such as Support Vector Machines and Random Forests, and deepening the neural networks in recent years, Machine Learning has not disappointed its practitioners. What is most fascinating is the immediate feedback a model gives through the train-validate-test process. If done properly, this exploration always adds value, even when the final model does not reach the desired goal.
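The train-validate-test process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the single feature and the churn label are made up, the 60/20/20 split is an arbitrary choice, and a tiny k-nearest-neighbours classifier stands in for whatever model you would really use. The training part is what the model learns from, the validation part is used to choose a setting (here `k`), and the test part is touched only once, at the end, for an honest estimate.

```python
import random


def split(rows, train=0.6, validate=0.2, seed=0):
    """Shuffle the rows and cut them into train / validation / test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    a = int(len(rows) * train)
    b = a + int(len(rows) * validate)
    return rows[:a], rows[a:b], rows[b:]


def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train_rows, eval_rows, k):
    """Share of eval_rows that the k-NN model predicts correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)


# Synthetic, made-up data: x = months of inactivity, y = 1 if the customer churned.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.uniform(0, 12)
    y = 1 if x + rng.gauss(0, 2) > 6 else 0  # noisy rule, so no model can be perfect
    data.append((x, y))

train_rows, val_rows, test_rows = split(data)

# Validation: pick the k that works best on data the model has not learned from.
candidate_ks = [1, 3, 5, 9, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train_rows, val_rows, k))

# Test: one final, untouched check of the chosen model.
test_accuracy = accuracy(train_rows, test_rows, best_k)
print("chosen k:", best_k, "test accuracy:", round(test_accuracy, 3))
```

Keeping the test rows out of every decision is what makes the final number trustworthy; if the test accuracy falls well below the validation accuracy, that gap is itself useful feedback.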
Technology, or the tools, is what brings those models to life. While conventional spreadsheets and SQL remain major tools, there is a lot more on the landscape - especially when scale and rapid development matter. Who would have thought a few years ago that Python and NoSQL would be competing with Java and SQL, respectively? We have seen rapid progress in, and adoption of, open-source tools, cloud platforms, SaaS, and APIs. Distributed computing is being democratized and has become the norm, e.g. Apache Spark and blockchain. Building large-scale, compute-intensive, real-world applications has become much less difficult - thanks to smart, low-cost sensors, powerful GPUs such as NVIDIA's Tesla P100, and compute environments such as IBM's PowerAI. Have a look at the work of Matt Turck.
Business is the most important aspect, and the one most underrated by new entrants. Every now and then, I meet data science enthusiasts, new graduates, and researchers with bright eyes (I used to have such a pair) who believe that being a data scientist means beating some benchmark. No! It's about meeting some objective - a business objective in 99% of the cases. Yes, there are situations where the underlying problem will challenge you and you will have to work some magic, but that is not the starting point. Most traditional businesses are still in a transition phase, or even in the digitization phase, so a great many problems can be solved through automation, data analysis, and predictive modeling. In my short career, I have witnessed success stories across a range of applications: volume forecasting, churn prediction, routing optimization, real-time bidding, fine-grained image recognition, crop optimization, web analysis, insurance estimation, and vehicle control optimization - to name a few.