Saturday, December 1, 2018

Machine Learning and Data Science

We've used Apache Spark to extract insights from Kaggle's survey "The State of Data Science & Machine Learning", for which Kaggle gathered more than 16,000 responses.

For us, the main objective of this small project was to learn more about the state of Data Science as an interdisciplinary field, which is why we used the publicly available dataset from Kaggle. In addition, we wanted to experiment with Apache Spark, so we used it to process the data and extract the insights presented below.
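
For reference, below is a minimal sketch of the kind of Spark job we used to compute such breakdowns. The file name (multipleChoiceResponses.csv) and the column name (GenderSelect) are assumptions about the dataset layout, so treat this as an illustration rather than the exact code we ran.

// Load the survey CSV and compute the percentage of respondents per gender.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class SurveyInsights {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("KaggleSurveyInsights")
                .master("local[*]")
                .getOrCreate();

        // Read the survey responses, letting Spark infer the schema from the header row.
        Dataset<Row> responses = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("multipleChoiceResponses.csv");

        long total = responses.count();

        // Percentage of respondents per gender, sorted by frequency.
        responses.groupBy("GenderSelect")
                .count()
                .withColumn("percent", round(col("count").multiply(100.0).divide(total), 1))
                .orderBy(col("count").desc())
                .show(false);

        spark.stop();
    }
}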

Gathered insights

  1. Unfortunately, the gender imbalance present in the technology field is also present in data science. Only 16.9% of respondents are female, which suggests a large gap between the number of female and male data scientists.
  2. Python is the most recommended language to learn first for people aspiring to become a data scientist.
  3. Kaggle is one of the most popular learning platforms for data scientists.

Who are the data scientists?

Looking at gender, out of the 16,000+ respondents 82.6% are male data scientists and only 16.9% are female data scientists, while 0.5% chose not to answer the gender-related question.

Most of the data scientists who answered are under 35 years old; the distribution can be seen in the bar chart below. In addition, 10% of the respondents are 60 years old or over.
Looking at the countries of the data scientists who responded to the survey, we can see that 25.1% of them are from the United States, 16.2% from India, 3.5% from Russia and 3.2% from the United Kingdom. As seen below, the remaining 52% of the data scientists are from other countries such as Germany, France, Spain and many others.

Most of the data scientists who answered the survey are already employed (96.2%). Looking at their highest level of education, 28.7% hold a Bachelor's degree, 37.5% a Master's degree, 14% a Doctoral degree, and only 10% have no formal degree.


Survey Results


When data scientists were asked which technology they are most excited about learning in the next year, the top five choices were: TensorFlow, Python, the R programming language, Spark/MLlib and Hadoop/Hive/Pig.

Python is without doubt the first programming language to learn for a person aspiring to become a data scientist: it was the language most recommended by the survey respondents for new data scientists.

Data scientists were asked where they get the datasets they use to practice their data science skills. As seen below, most of them use some kind of dataset aggregator such as Kaggle, Socrata or data.world. Here are the results:
  1. Dataset aggregator/platform (e.g. Socrata, Kaggle Datasets, data.world, etc.) - 26%
  2. Google search - 13%
  3. University/Non-profit research group websites - 11%
  4. I collect my own data (e.g. web-scraping) - 10%
  5. Github - 9%
  6. Government website - 8%
  7. None or other - 23%

Out of the 25% of data scientists who answered the question "How long have you been learning data science?", about 85% have been learning data science for less than 2 years.

The survey respondents were asked to choose one or more learning platforms or resources they use to improve their data science knowledge. The three most popular choices are: Kaggle, online courses and YouTube videos.


Out of the 23.6% of respondents who answered the question of whether Big Data knowledge is important for getting a data science job, 57.4% consider it nice to have, 37.9% consider it necessary and 5% consider it unnecessary.

The full list of results for the survey can be found on Kaggle's website here.

References


  1. Kaggle ML and Data Science Survey, 2017 - A big picture view of the state of data science and machine learning.
  2. Kaggle Insights - The State of Data Science & Machine Learning, 2017


Sunday, January 1, 2017

Book review - Java 8 in Action


Java 8 was released in the first half of 2014, and it's awesome! The most notable Java 8 features are lambda expressions, the Stream API and the new Date/Time API. There are also a lot of other improvements, like default methods, which allow adding methods to interfaces without breaking existing implementations and enable multiple inheritance of behaviour, but not state. Annotations on Java types and repeating annotations were also introduced in Java 8.
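
As a small illustration of my own (not a snippet from the book), here is how a lambda expression, the Stream API and the new Date/Time API look in practice:

import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Java8Features {
    public static void main(String[] args) {
        List<String> languages = Arrays.asList("Java", "Scala", "Kotlin", "Groovy");

        // Lambda expression passed into a Stream pipeline: keep names shorter
        // than six characters, upper-case them and collect into a list.
        List<String> shortNames = languages.stream()
                .filter(name -> name.length() < 6)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println(shortNames); // [JAVA, SCALA]

        // New Date/Time API: immutable, fluent date arithmetic.
        LocalDate releaseDate = LocalDate.of(2014, 3, 18);
        System.out.println("Java 8 turned two on " + releaseDate.plusYears(2));
    }
}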

Even though I had been using some of the new features for a while, I felt it was time to get a deeper knowledge of Java 8. There are several books already published on this topic. I chose "Java 8 in Action: Lambdas, streams, and functional-style programming" based on its good ratings and the high number of positive reviews at the time I was searching for a book, and let me say it: I don't regret it! The book was amazing!

Java 8 in Action is easy to read and it keeps you engaged throughout the whole book. The authors have done a really good job explaining the new features of Java 8. Throughout the book, new concepts are followed by small exercises which help you validate your understanding. Another benefit is that each chapter ends with a good summary of what was discussed.

The book is split into four parts. The first part, "Fundamentals", contains a short introduction to why these new features are relevant to you as a developer; here you will learn about "Passing code with behavior parameterization" and get introduced to lambda expressions. The second part, "Functional-style data processing", introduces, explains and exemplifies Streams; you will be taught how to "collect data with streams" and you will also learn a lot about parallel data processing and performance. The third part, "Effective Java 8 programming", introduces default methods, Optional as a better alternative to null, some new concurrency utilities and the new Date and Time API. The fourth part, "Beyond Java 8", goes deeper into functional programming and contains an interesting chapter called "Blending OOP and FP: comparing Java 8 and Scala"; you'll like it!
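
As a taste of the third part, here is a small example of my own (not taken from the book) of how Optional makes the possible absence of a value explicit instead of returning null:

import java.util.Optional;

public class OptionalExample {
    // Returning Optional makes the possible absence of a value explicit,
    // instead of returning null and hoping callers remember to check.
    static Optional<String> findNickname(String user) {
        if ("duke".equals(user)) {
            return Optional.of("The Java Mascot");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String nickname = findNickname("robert")
                .map(String::toUpperCase)
                .orElse("no nickname found");
        System.out.println(nickname); // no nickname found
    }
}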

As you read the book you will realize that it is not just about Java 8 features; you will learn a lot of other things as well. I strongly believe that after reading this book you will not only be able to use the Java 8 features, but you will also become a better programmer.

I really enjoyed reading this book and I strongly recommend it to you as well. Enjoy it!

Saturday, August 27, 2016

Book review: The Pragmatic Programmer - From Journeyman to Master


These pages, published in 1999, contain a tremendous amount of wisdom for all programmers who wish to become software craftsmen. The book contains a collection of ideas, observations and recommendations for programmers who aspire to raise the bar of professional software development.

While reading this book you will feel like you are having a conversation with two very skilled software craftsmen; you might agree or disagree with their ideas, but you cannot argue with their judgment.

The authors did a nice thing by introducing each chapter with a bit of context, and whenever a new concept or section is introduced it includes references to related topics. There are also interesting exercises at the end of each section.

When reading about portability concerns you'll find references to dead technologies like CORBA or operating systems like Windows 2000; this will definitely make you reflect on how fast technology has evolved. You'll also appreciate that, while sharing knowledge about various topics, the authors shared some good jokes as well.

The book has eight chapters. Some of the most interesting topics discussed are estimation (there are amazingly useful tips about estimating), debugging, things to do before a project starts, and pragmatic projects and pragmatic teams. At some point in the book you'll also find explanations of the Liskov Substitution Principle and the Law of Demeter.

As a software developer you'll find that even if this book was written many years ago, the shared wisdom is still relevant today.

Monday, February 29, 2016

Microservice and Monolithic architectures exemplified

We'll use an example application to illustrate the differences between Microservice architecture and Monolithic architecture. Let us take an e-commerce application whose purpose is selling products online. In general, all applications in this category require functionality for browsing through available products, purchasing them and placing orders, which are later managed by administrators. And of course most e-commerce websites have some content which must be easily editable at any time from an administration dashboard.

Monolithic architecture


The diagram below presents an e-commerce web application built following a monolithic architecture. As shown, all of the application's data is stored in a single database: products, website content, orders placed by customers and inventory information all live in the same place. The web application has multiple purposes: first and most important, allowing customers to browse through products and buy them; secondly, managing orders and content, and possibly offering customers the possibility to create an account for managing different settings and subscriptions.
In this case the application built on a monolithic architecture offers both a web interface, written using HTML, CSS and JavaScript, and an Application Programming Interface which can be used by clients such as Android or iOS smartphone apps.

E-commerce web application built using Monolithic architecture

Scaling for this kind of application can be done both horizontally and vertically. The latter means that the production machine gets its hardware improved by adding RAM, disk storage or a better CPU. Scaling horizontally is achieved by installing the same version of the application on multiple nodes and putting a load balancer between the nodes and the browser or API clients.

Microservice architecture


The diagram below shows a possible way to build the described e-commerce web application using a microservice architecture. As seen below, the application is split into multiple small services, each with its own purpose: managing orders, managing products and managing content.
Each small service has its own database which contains the data generated or used by that particular service, so instead of one big database we have multiple small databases. This is one of the big advantages of microservices, because scaling the database horizontally becomes possible. Of course, a data consistency issue appears when a complex flow that passes through multiple services fails somewhere in the middle of the processing. Ensuring data consistency now requires a lot of extra effort, something which was previously achieved easily using database transactions.
The API Gateway shown in the diagram stands in front of the clients and exposes the API offered by all microservices, so that the clients know only about that API and are not aware of the services behind it. This can be compared to the façade design pattern, in which a single entry point is offered to the clients; a minimal gateway sketch is shown after the diagram below. Following this architecture makes it easier to later change the implementation or the API of the individual services, as long as the Gateway's API remains the same.
Using a microservice architecture offers more options for scaling the application: for example, one can install half of the services on one node and the other half on another node, install each service on a separate node, or even install each service twice on two different nodes, in which case a load balancer is most probably needed.

E-commerce web application built using Microservice architecture
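
To make the façade idea more concrete, here is a minimal gateway sketch in plain Java, with no framework involved. The service addresses and path prefixes are hypothetical, and only happy-path GET forwarding is handled; a real gateway would also deal with errors, other HTTP methods, headers and timeouts.

import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.Map;

public class ApiGateway {
    // Route table: path prefix -> backing microservice (hypothetical addresses).
    private static final Map<String, String> ROUTES = Map.of(
            "/products", "http://localhost:8081",
            "/orders",   "http://localhost:8082",
            "/content",  "http://localhost:8083");

    public static void main(String[] args) throws Exception {
        HttpServer gateway = HttpServer.create(new InetSocketAddress(8080), 0);
        gateway.createContext("/", exchange -> {
            String path = exchange.getRequestURI().getPath();
            // Find the service responsible for this path prefix.
            String target = ROUTES.entrySet().stream()
                    .filter(e -> path.startsWith(e.getKey()))
                    .map(Map.Entry::getValue)
                    .findFirst().orElse(null);
            if (target == null) {
                exchange.sendResponseHeaders(404, -1);
                exchange.close();
                return;
            }
            // Forward the GET request to the owning service and relay its body.
            HttpURLConnection conn = (HttpURLConnection) new URL(target + path).openConnection();
            byte[] body;
            try (InputStream in = conn.getInputStream()) {
                body = in.readAllBytes();
            }
            exchange.sendResponseHeaders(conn.getResponseCode(), body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        gateway.start();
        System.out.println("Gateway listening on http://localhost:8080");
    }
}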

In future posts we'll have a look at the advantages and disadvantages of Microservice architecture and at its use in practice.


Wednesday, February 24, 2016

Introduction into Microservice architecture

The purpose of this post and future related ones is to understand the advantages and disadvantages of Microservice architecture. This blog post is meant to get you familiar with the topic and represents a first step towards reaching that goal.
To understand what microservice architecture is, and then identify the advantages and disadvantages of this architecture, it is first necessary to define some concepts like software architecture and architectural pattern.
Software architecture refers to the high level structures of a software system, the discipline of creating such structures, and the documentation of these structures. These structures are needed to reason about the software system. Each structure comprises software elements, relations among them, and properties of both elements and relations. The architecture of a software system is a metaphor, analogous to the architecture of a building. Wiki
As microservice architecture is an architectural pattern, let us have a look at its definition. An architectural pattern is a general, reusable solution to a commonly occurring problem in software architecture within a given context. Architectural patterns are similar to software design patterns but have a broader scope. Architectural patterns address various issues in software engineering, such as computer hardware performance limitations, high availability and minimization of business risk. Some architectural patterns have been implemented within software frameworks. Wiki
Because the microservice architecture stands in contrast to the monolithic architecture, let us briefly review what a monolithic architecture is.
In software engineering, a monolithic architecture describes a single-tiered software application in which the user interface and data access code are combined into a single program on a single platform. A monolithic application is self-contained and independent from other computing applications. The design philosophy is that the application is responsible not just for a particular task, but can perform every step needed to complete a particular function. Martin Fowler article
Today, some personal finance applications are monolithic in the sense that they help the user carry out a complete task, end to end, and are “private data silos” rather than parts of a larger system of applications that work together. Martin Fowler article
Microservice architecture can be defined as an architectural pattern or style in which a software system is developed as a group of smaller services, each running in its own process and communicating with the others using lightweight mechanisms such as HTTP. The services should be built around business capabilities and should be independently deployable. The services should be small, highly decoupled and focused on doing one small task, facilitating a modular approach to building a system.

Motivation


The development of a monolithic application becomes slower as the application grows, and so does the frustration of the developers. Large applications are hard to manage, and a small change often takes days to assess the impact and hours to write the code; afterwards it might take a few more days to pass review sessions and run the automated test suite. These issues are some of the reasons developers have welcomed the microservice architecture, which breaks things down into manageable pieces.
Following this architecture means that a big team can be split into smaller teams organized around microservices; these small teams become autonomous and fully responsible for the service they develop. Fear of change no longer prevents developers from fixing issues or creating new features, and efficiency and development speed visibly increase.
Developers feel more comfortable managing smaller code bases than managing a monolithic one, which means that creativity is stimulated and development frustration is less likely to appear.
Another great thing achieved with this architecture is that developers are more or less forced to develop the application in a modular way; it also allows reusing services across multiple applications.

More about microservices...

In future posts we'll discuss more about Microservice architecture vs Monolithic architecture, the advantages and disadvantages of microservices, and putting this architecture into practice.

Saturday, January 30, 2016

StreamInsight components

Streams of data


Just as Microsoft SQL Server was designed to allow developers to manage static data, StreamInsight was designed to work with streams of data. But what does a stream of data mean? Well, a stream of data is a sequence of pieces of information, and each such piece of information has a certain time associated with it. Usually the associated time is the date-time of its creation.

Such streams of data can be produced by countless devices which vary from smoke sensors, temperature sensors to smartphones, robots, web applications, hosting servers or trading applications. 

Event


An event can be defined as the basic unit of data processed by the StreamInsight server. Each event encapsulates a piece of information, so we can say that a stream contains a sequence of events. Each event consists of two parts: the header and the payload.

The header defines the event kind and the temporal properties of the event. All the temporal properties are application-based and supplied by the data source, rather than being system times supplied by the StreamInsight server. All timestamps use the .NET DateTimeOffset data type, and StreamInsight automatically normalizes all times to UTC.

The payload is a .NET data structure which holds the data associated with the event; the fields of the data structure can be defined by the developer. Each field can have a .NET data type, e.g. int, float, string, etc.
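
As a conceptual illustration, an event could be modelled as below. Note that this is plain Java, not the actual StreamInsight API (which is .NET based), and the sensor fields are made up for the example.

import java.time.OffsetDateTime;

public class SensorEvent {
    enum EventKind { INSERT, CTI } // CTI = current time increment, the two kinds StreamInsight uses

    // Header: event kind and application-supplied temporal properties.
    final EventKind kind;
    final OffsetDateTime startTime;

    // Payload: developer-defined fields carrying the actual data.
    final String sensorId;
    final double temperature;

    SensorEvent(EventKind kind, OffsetDateTime startTime, String sensorId, double temperature) {
        this.kind = kind;
        this.startTime = startTime;
        this.sensorId = sensorId;
        this.temperature = temperature;
    }
}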

Query


In the same way that a fisherman uses a fishing net to catch fish from a river, we can use a StreamInsight query to retrieve relevant information from a stream of data. The results of the query are received incrementally, for as long as we need them.

One can define a myriad of queries, starting from simple ones, like selecting all events which fulfill a certain condition, to more complex ones, like selecting events which appear within a window of 3 minutes.

The main difference between StreamInsight and a database is that StreamInsight never stores the data; a query is kept active at all times while the server is running. Every time a new event arrives it triggers a new computation and generates a new result. Of course, if we are interested, we can store the results of the queries.
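
To illustrate what a windowed query conceptually computes, here is a plain-Java sketch (again, not StreamInsight code) that groups timestamped events into 3-minute tumbling windows and counts the events in each window; the sample events are invented for the example.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TumblingWindowCount {
    record Event(Instant timestamp, double temperature) {}

    public static void main(String[] args) {
        long windowMillis = Duration.ofMinutes(3).toMillis();
        List<Event> events = List.of(
                new Event(Instant.parse("2016-01-30T10:00:30Z"), 21.5),
                new Event(Instant.parse("2016-01-30T10:02:10Z"), 22.0),
                new Event(Instant.parse("2016-01-30T10:04:45Z"), 23.1));

        // Map each event to the start of its 3-minute window, then count events per window.
        Map<Instant, Long> countsPerWindow = events.stream().collect(Collectors.groupingBy(
                e -> Instant.ofEpochMilli(e.timestamp().toEpochMilli() / windowMillis * windowMillis),
                TreeMap::new,
                Collectors.counting()));

        countsPerWindow.forEach((windowStart, count) ->
                System.out.println(windowStart + " -> " + count + " events"));
    }
}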

Source



Devices which produce data become sources of data for the StreamInsight application. One can define multiple sources of data which feed into the StreamInsight server and against which queries are executed.

Sink


We have sources of data which become streams of data, and queries which are executed against them, but how can we get our hands on the query results? Well, we can define a sink to which StreamInsight will send the results of the defined queries. Here too we can define multiple custom sinks; one might be a conventional database, another a user interface in which users can see the information immediately.

StreamInsight components working together




As seen in the architecture diagram above, one can define sources of events like smartphones, fire sensors, smoke sensors, temperature sensors, server logs or even historical data. The platform allows developers to aggregate all these events by defining LINQ queries, and the results of these queries are then passed to any developer-defined sink, such as monitoring devices, monitoring applications or even data warehouses.

Saturday, January 2, 2016

StreamInsight

StreamInsight is a platform developed by Microsoft which allows developers to create and deploy complex event processing (CEP) applications. The platform is based on the existing Microsoft .NET platform and enables developers to implement robust and highly efficient CEP applications. There are a lot of possible event sources; some of the most relevant are:

  • Financial trading applications
  • Web analytics
  • Manufacturing applications
  • Server monitoring applications

One can use this platform to easily create tools that monitor data from multiple sources for meaningful patterns, trends, exceptions and opportunities. Analysis and correlation can be done incrementally, while the data is being produced (in real time) and without storing it first, which translates into a low-latency application. Historical data can also be used as a source of events.

Key Benefits


In the following I will try to talk about the most important features and advantages offered by this platform.

Highly optimized performance and data throughput


StreamInsight supports highly parallel execution of continuous queries over high-speed data because it implements a lightweight streaming architecture. The use of an in-memory cache and incremental result computation provides excellent performance with high data throughput and low latency. In StreamInsight all processing is automatically triggered by incoming events based on the defined queries. The platform also provides functionality for handling out-of-order events, and in addition static reference or historical data can be accessed and included in the low-latency analysis.

.NET development environment


Microsoft created the .NET development environment, in which programming languages like C#, tools like Visual Studio and services like SQL Server can be easily integrated and used for application development while still keeping loose coupling between them. StreamInsight is part of this environment, in which one can easily develop fast and robust applications. Developers can write their CEP applications using C#, leveraging LINQ (Language Integrated Query) to create queries.

Given that there is a large community of developers already familiar with these Microsoft technologies, the cost and time of developing a CEP application are significantly reduced.

Flexible deployment capability


The StreamInsight platform provides two deployment scenarios. The first is full integration into the developed application as a hosted (embedded) DLL. The second is deploying StreamInsight as a stand-alone server, with multiple applications and users sharing the server. This means that one can develop multiple independent applications which use the same StreamInsight instance. The CEP server runs in a wrapper such as an executable, or it can be packaged as a Windows Service.

Extensibility


StreamInsight allows developers to extend its functionality by giving them the possibility to define their own operators, functions and aggregates to be used in queries and define specific event types against which to run the defined queries.

One of the great things about StreamInsight is that it was designed to seamlessly integrate with any domain-specific business logic. This means that the platform does not come with any built-in functionality for specific business sectors, but it allows developers to plug in any specific business logic.

CEP Query Visualization and Analysis


Microsoft StreamInsight provides a stand-alone Event Flow Debugger which is a powerful GUI tool that enables visual inspection of a continuous query. One can use this graphical tool to quickly inspect the query tree, replay data processing and perform analysis.


Latest version


Currently the latest version of StreamInsight is 2.3, which was released together with SQL Server 2014 on the first of April 2014. Release 2.3 contains only a licensing update, so any code written against the previous version, 2.1, will still work.


In future posts about StreamInsight I will present some of its most important components.

Robert Rusu