Data Science in Practice

Data Science - The Ingredients

Published on October 1, 2017


Dr Shahzad

In the previous article, I listed the four building blocks of Practical Data Science, i.e. Data, Science, Technology, and the Business Application. In this article we dig one level deeper and discuss some of the ingredients across these quadrants, as shown in the figure below.

Disclaimer

  • The set of ingredients is not meant to be comprehensive (otherwise, not mentioning the KNN algorithm alone would be almost an intellectual crime).
  • The technology stack is biased by personal experience and also Python-biased; Python is arguably the most popular language among Data Scientists with an IT background.
  • Some elements are mentioned at an abstract level (such as Cloud) while others are concrete (e.g. Python libraries such as Pandas). Moreover, some elements are exemplary of their class, e.g. TensorFlow representing deep learning libraries.
  • Each circular region can be read clockwise (from the author's point of view).

What is apparent from the figure above is that Data Science is a vast and challenging field, and so is the set of tools and techniques. Based on their level of complexity and applicability, I have divided them into three layers, represented by three circles. The innermost circle lists the tools that belong to the core of practical data science and can be mastered quickly. This set could be a starting point for an entrant into the field. Obviously, it is difficult to master everything on the landscape; a full-stack data scientist will come across many of these elements over time. I will discuss each quadrant briefly in the following subsections.

Data

Data is collected in many forms, as we know. We also know that more than 80% of data is unstructured. Still, for most business cases, the most valuable and readily available data is in the form of comma-separated files (CSVs) and spreadsheets (e.g. XLSX) - no shame in that. Often, silos of data sources and manual processes lead to a data set that needs a lot of pre-processing by the data scientist. At a higher level, SQL (and NewSQL) provides connections to databases, and a best practice is to pull as concise a data set as possible. I was once handed a several-gigabyte XML file with a nested schema; all I had to do was ask the provider whether he could save some of my time by providing a CSV or a connection to the database - it worked. Nevertheless, one should be ready to deal with such semi-structured data. NoSQL is the new reality for its scalability and flexibility. A Data Lake, an enterprise storage of raw and transformed data, is a goldmine for a data scientist - and for the enterprise as well.
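As a minimal sketch of these entry points, assuming pandas and SQLAlchemy are available (the file names, connection string, and query are hypothetical placeholders):

```python
# A minimal sketch of pulling tabular data into pandas; file names,
# connection string, and query are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# CSVs and spreadsheets: still the most common starting point
df_csv = pd.read_csv("sales.csv")
df_xlsx = pd.read_excel("sales.xlsx", sheet_name="2017")

# SQL: push the filtering into the database and pull only the
# concise slice you actually need
engine = create_engine("postgresql://user:password@host:5432/shop")
df_sql = pd.read_sql(
    "SELECT customer_id, order_date, amount "
    "FROM orders WHERE order_date >= '2017-01-01'",
    engine,
)
```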

Science

Luckily, the science of data science is data-driven. That is, with a few exceptions, a data scientist can always calculate and visualize how an algorithm has converged to an optimal solution. The core in this case comprises concepts from basic linear algebra (e.g. matrix operations), statistics (such as linear regression), and probability. The next layer is the whole bunch of popular machine learning algorithms, often further classified into regression, classification, clustering, and dimensionality reduction algorithms. Most classification algorithms have regression-type cousins as well, including Classification and Regression Trees (CART) and Support Vector Machines (SVM), as sketched below. This layer is probably the most popular one among data scientists, and rightly so: when applied properly and exhaustively, machine learning transformations (e.g. clustering and feature selection) and approximations (e.g. regression) lead to exciting results. The machine learning cookbook does not contain all the recipes; there are situations where customized, sub-optimal, greedy algorithms and basic (non-)linear optimization will serve the purpose. In fact, most machine learning algorithms boil down to some optimization anyway.
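The classifier/regressor "cousin" pattern shows up directly in scikit-learn's CART implementation; here is a small sketch on synthetic data:

```python
# A small sketch of the classification/regression "cousin" pattern
# using scikit-learn's CART implementation; the data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)

# Classification: predict a discrete label
y_class = (X[:, 0] + X[:, 1] > 1).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)

# Regression: the same tree-growing idea, continuous target
y_reg = X[:, 0] * 2 + X[:, 2]
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)

print(clf.predict(X[:5]), reg.predict(X[:5]))
```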

Similar to the common practice of early fusion, i.e. combining data from multiple sources, late fusion of multiple predictive models (over different data, model types, and parameter settings) can also yield improved performance on complex problems. Deep learning - the re-emerged science of neural networks - is the state of the art, especially on unstructured data. I would like to save further discussion of machine learning algorithms for another article. The concluding point is that there will be situations where an out-of-the-box ML model with hyper-parameter tuning will work fine, but there will also be situations where a data scientist needs to innovate and come up with something more sophisticated or customized to the use case.
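As one hedged illustration of late fusion, scikit-learn's VotingClassifier averages the predictions of several model types; the models and data below are just examples:

```python
# A sketch of "late fusion": combining predictions from several
# model types with scikit-learn's VotingClassifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("cart", DecisionTreeClassifier(max_depth=4)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```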

Technology

One of the most important lessons I have learned over the years is never to underestimate the power of basic tools such as MS Excel. This is especially applicable in the exploratory analysis phase, for getting a feel of the raw data, transforming it and fixing errors, and then verifying the results. The job of a data scientist is to solve a problem in the most suitable way and not to over-complicate it - there will be natural occasions for exhibiting some magic. For common reporting and analysis there are a number of BI tools such as Power BI, Pentaho, Tableau, and Cognos, to name a few. Then there is a range of drag-and-drop machine learning tools which are easy to use and quick to integrate; some examples are SPSS, RapidMiner, and Weka. Data scientists often have to dig deeper or customize ML algorithms for their needs. This is where a programming language such as Python (or R) becomes essential. Open-source libraries such as Pandas (with its data frames) and Sklearn (with a range of ML algorithms and techniques) are powerful enough to write rapid solutions for many applications. Along the visualization curve are matplotlib, Bokeh, and many other libraries - all the way to D3. OpenCV, NLTK, and TensorFlow are comprehensive libraries for vision, NLP, and deep learning, respectively.
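A minimal exploratory-plot sketch with Pandas and matplotlib; the data and column names are made up:

```python
# A minimal exploratory plot with pandas and matplotlib;
# the data and column names are illustrative only.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "month": range(1, 13),
    "sales": [12, 14, 13, 17, 19, 25, 30, 29, 22, 18, 15, 20],
})

df.plot(x="month", y="sales", kind="line", marker="o", title="Monthly sales")
plt.xlabel("Month")
plt.ylabel("Sales (k units)")
plt.show()
```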

For distributed in-memory computing in large-scale applications, Apache Spark is the tool. Spark also implements the concept of data frames, making the transition easy for pandas users. Cloud-based data science platforms such as IBM's Data Science Experience and Databricks come with a ready-to-use Spark environment. Cloud platforms such as IBM Bluemix now provide a broad set of ready-to-use cognitive capabilities as a service. In such a setup, a data scientist can start working on the end product right away, i.e. the translations from Mockup to Minimum Viable Product to Pilot to Solution are seamless. Here is a delightful video of the 13-year-old genius Tanmay Bakshi.
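A minimal PySpark sketch of the pandas-like DataFrame API, assuming a local Spark installation (the CSV file and column names are hypothetical):

```python
# A minimal PySpark sketch of the pandas-like DataFrame API;
# assumes a local Spark setup; the CSV and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sketch").getOrCreate()

df = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Familiar data-frame-style transformations, executed in a
# distributed fashion across the cluster
summary = (
    df.groupBy("customer_id")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy(F.desc("total_amount"))
)
summary.show(10)

spark.stop()
```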

Applications

Business applications of data science are everywhere and cannot be covered in this short note. At the core, every business needs to do some sort of reporting and has some tool for that purpose. Every other day there is demand for an insight of a new kind or the integration of a new dimension. Then there is a whole lot of unstructured data that is often untapped. A data scientist needs to be ready to solve these problems, often through automation. While proper reporting and data analysis can provide answers to the questions of what has happened and why, it is more important for the business to know what will happen given the current situation, historic patterns, or a possible course of action. In other words, building reliable prediction models (e.g. of the sales volume in the next month) is what differentiates a data scientist from an analyst (so to say).

The amount and nature of the data (recall the four Vs), the objective, and the tolerance level are some of the factors differentiating one prediction problem from another. For instance, the ratio of positive and negative classes in customer churn (say 90/10) is much different from that of credit fraud (say 99/1), as illustrated below. In practice, a long history and more variety help. For instance, forecasting sales volumes on an online store could make use of not only historic sales, but also data on seasonality, weather, vacations, the geography of the customer base, the brand's social media perception, etc. I would particularly like to mention image processing as an area which has benefited from the rise of both science (deep learning) and technology (GPUs). Digital businesses such as Google and Facebook mine vast amounts of data to improve and monetize their products and services. The Internet of Things and other AI systems such as service robots and autonomous vehicles are knocking at the door - powered by data and data science.
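One common way to deal with such class imbalance, sketched here on synthetic data with scikit-learn, is class weighting (the 99/1 split and all parameters are illustrative only):

```python
# A sketch of handling class imbalance (e.g. a 99/1 fraud split)
# via class weights in scikit-learn; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Simulate a 99/1 class imbalance, as in the fraud example
X, y = make_classification(
    n_samples=5000, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" re-weights errors on the rare class
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

With balanced weighting, recall on the rare class typically improves at the cost of some precision - exactly the kind of tolerance trade-off mentioned above.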