Saturday, December 1, 2018

Machine Learning and Data Science

We used Apache Spark to extract insights from Kaggle's survey "The State of Data Science & Machine Learning", for which Kaggle gathered more than 16,000 responses.

For us, the main objective of this small project was to learn more about the state of data science as an interdisciplinary field, which is why we used the publicly available dataset from Kaggle. We also wanted to experiment with Apache Spark, so we used it to process the data and extract the insights presented below.

Gathered insights

  1. Unfortunately, the gender imbalance that exists in the technology field is also present in data science. Only 16.9% of respondents are female, which suggests that female data scientists are heavily outnumbered by male data scientists.
  2. Python is the most recommended first language for people aspiring to become data scientists.
  3. Kaggle is one of the most popular learning platforms for data scientists.

Who are the data scientists?

Of the 16,000+ respondents, 82.6% are male and only 16.9% are female. The remaining 0.5% chose not to answer the gender question.
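
The gender split above is a share-of-total calculation. Here is a minimal sketch of it in plain Python rather than the Spark job used for this post, assuming the answers are available as a list of strings. The `percentage_shares` helper and the sample data are hypothetical. In Spark the same aggregation would typically be a group-by followed by a count.

```python
from collections import Counter


def percentage_shares(answers):
    """Return each distinct answer's share of all answers, in percent."""
    counts = Counter(answers)
    total = sum(counts.values())
    if total == 0:
        return {}  # no answers, nothing to report
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}


# Made-up sample (an assumption), sized to mirror the reported proportions.
sample = ["Male"] * 826 + ["Female"] * 169 + ["No answer"] * 5
shares = percentage_shares(sample)
print(shares)
```

Rounding to one decimal matches the precision of the figures quoted in this post.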

Most of the data scientists who answered are under 35 years old. The distribution can be seen in the bar chart below. In addition, 10% of the data scientists who answered are 60 years old or over.

Looking at the countries of the data scientists who responded to the survey, we can see that 25.1% of them are from the United States, 16.2% are from India, 3.5% are from Russia and 3.2% are from the United Kingdom. As seen below, the remaining 52% of the data scientists are from other countries such as Germany, France, Spain and many others.
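
The 52% figure for "other countries" follows from the four shares quoted above. A short sketch of the arithmetic, using only the numbers from this post (the variable names are illustrative):

```python
# Shares reported above for the four most represented countries, in percent.
top_countries = {
    "United States": 25.1,
    "India": 16.2,
    "Russia": 3.5,
    "United Kingdom": 3.2,
}

# Round to one decimal to avoid floating-point noise in the sum.
named_share = round(sum(top_countries.values()), 1)
other_share = round(100 - named_share, 1)
print(f"named: {named_share}%, other: {other_share}%")
```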

Most of the data scientists who answered the survey are already employed (96.2%). Looking at the highest level of education, 28.7% hold a Bachelor's degree, 37.5% a Master's degree, 14% a doctoral degree, and only 10% have no formal studies.


Survey Results


When data scientists were asked which technology they are most excited to learn in the coming year, the top five choices were TensorFlow, Python, the R programming language, Spark / MLlib and Hadoop/Hive/Pig.

Python is without doubt the first programming language to learn for a person aspiring to become a data scientist: it was the language that survey respondents most often recommended to new data scientists.

Data scientists were asked where they get the datasets they use to practice their data science skills. As seen below, the most common source is a dataset aggregator such as Kaggle, Socrata or data.world. Here are the results:
  1. Dataset aggregator/platform (i.e. Socrata/Kaggle Datasets/data.world/etc.) - 26%
  2. Google search - 13%
  3. University/Non-profit research group websites - 11%
  4. I collect my own data (e.g. web-scraping) - 10%
  5. GitHub - 9%
  6. Government website - 8%
  7. None or other - 23%

Only 25% of the data scientists answered the question "How long have you been learning data science?". Of those, about 85% have been learning data science for less than 2 years.

The survey respondents were asked to choose one or more learning platforms or methods that they use to improve their data science knowledge. The three most popular choices are Kaggle, online courses and YouTube videos.
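
Because each respondent could pick several options, the answers have to be split before counting. The sketch below assumes each respondent's selections arrive as one comma-separated string. That is an assumption about the export format, and the `count_multi_select` helper and sample answers are hypothetical.

```python
from collections import Counter


def count_multi_select(responses, separator=","):
    """Count how many respondents picked each option in a multi-select question."""
    counts = Counter()
    for response in responses:
        if not response:
            continue  # respondent skipped the question
        # A set keeps an option from being counted twice for one respondent.
        options = {part.strip() for part in response.split(separator) if part.strip()}
        counts.update(options)
    return counts


# Made-up answers, for illustration only.
responses = [
    "Kaggle,Online courses",
    "Kaggle,YouTube Videos",
    "Online courses",
    "",
    "Kaggle, Online courses ,YouTube Videos",
    "Kaggle",
]
top = count_multi_select(responses).most_common(3)
print(top)
```

Note that the per-option shares of a multi-select question can add up to more than 100%, since one respondent contributes to several options.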


Of the 23.6% of respondents who answered the question of whether Big Data knowledge is important for getting a data science job, 57.4% consider it nice to have, 37.9% consider it necessary and 5% consider it unnecessary.
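
The shares above apply only to the subgroup that answered the question, so it can help to translate them into rough headcounts. The sketch below uses 16,000 as the total, which is the lower bound quoted in this post rather than the exact number of respondents, so the results are estimates. The `subgroup_size` helper is hypothetical.

```python
def subgroup_size(total, *fractions):
    """Apply successive fractions to a total and return the rounded headcount."""
    size = total
    for fraction in fractions:
        size *= fraction
    return round(size)


TOTAL_RESPONDENTS = 16_000  # assumption: lower bound from the post, not the exact count

answered = subgroup_size(TOTAL_RESPONDENTS, 0.236)
nice_to_have = subgroup_size(TOTAL_RESPONDENTS, 0.236, 0.574)
necessary = subgroup_size(TOTAL_RESPONDENTS, 0.236, 0.379)
print(answered, nice_to_have, necessary)
```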

The full list of survey results can be found on Kaggle's website.

References


  1. Kaggle ML and Data Science Survey, 2017: A big picture view of the state of data science and machine learning.
  2. Kaggle Insights, 2017: The State of Data Science & Machine Learning.

