The main objective of this small project was to learn more about the state of data science as an interdisciplinary field, which is why we used the publicly available dataset from Kaggle. We also wanted to experiment with Apache Spark, so we used it to process the data and extract the insights presented below.
Gathered insights
- Unfortunately, the gender imbalance found in the technology field is also present in data science. Only 16.9% of respondents are female, which suggests that male data scientists far outnumber female data scientists.
- Python is the language most often recommended as the first one to learn for aspiring data scientists.
- Kaggle is one of the most popular learning platforms for data scientists.
Who are the data scientists?
Out of the 16,000+ respondents, 82.6% are male and only 16.9% are female. The remaining 0.5% chose not to answer the gender question.
Most of the data scientists who answered the survey are already employed (96.2%). As for the highest level of education, 28.7% hold a Bachelor's degree, 37.5% a Master's degree, 14% a Doctoral degree, and only 10% have no formal studies.
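The percentages above come from a simple count per answer divided by the total number of respondents. Here is a minimal sketch of that calculation in plain Python rather than Spark, where the same step would typically be a group-by followed by a count. The function name, the `"No answer"` label and the toy data are assumptions made for illustration and are not taken from the survey files.

```python
from collections import Counter

def share_by_category(answers, missing_label="No answer"):
    """Percentage of respondents per answer, computed over all respondents.

    Missing answers (None) are counted under `missing_label`, so the
    shares add up to roughly 100%.
    """
    if not answers:
        return {}
    # Count each answer, mapping missing values to a single label.
    counts = Counter(missing_label if a is None else a for a in answers)
    total = len(answers)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Toy data for illustration only, not the real survey responses.
toy_answers = ["Male", "Male", "Male", "Female", None]
print(share_by_category(toy_answers))
```

Counting unanswered rows as their own category keeps the denominator equal to the number of respondents, which matches how the gender figures are reported here.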
Survey Results
When data scientists were asked which technology they are most excited to learn in the next year, the top five choices were: TensorFlow, Python, R, Spark / MLlib and Hadoop/Hive/Pig.
Data scientists were also asked where they find the datasets they use to practice their data science skills. The most common answer was a dataset aggregator such as Kaggle, Socrata or data.world. Here are the results:
- Dataset aggregator/platform (i.e. Socrata/Kaggle Datasets/data.world/etc.) - 26%
- Google search - 13%
- University/Non-profit research group websites - 11%
- I collect my own data (e.g. web-scraping) - 10%
- Github - 9%
- Government website - 8%
- None or other - 23%
Of the 25% of data scientists who answered the question "How long have you been learning data science?", about 85% have been learning data science for less than 2 years.
The survey respondents were asked to choose one or more learning platforms or methods they use to improve their data science knowledge. The three most popular choices are Kaggle, online courses and YouTube videos.
Of the 23.6% of respondents who answered the question of whether Big Data knowledge is important for getting a data science job, 57.4% consider it nice to have, 37.9% consider it necessary and 5% consider it unnecessary.
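Some of the figures above are shares among the people who answered a given question, not shares of all respondents. The small sketch below shows how to convert one into the other. The function name is hypothetical, and the inputs are the rounded percentages quoted in this post, so the outputs are only approximate.

```python
def share_of_all_respondents(answered_pct, choice_pct):
    """Convert a share among those who answered into a share of all respondents.

    answered_pct: percentage of all respondents who answered the question
    choice_pct:   percentage of those answers that picked a given option
    """
    return answered_pct * choice_pct / 100

# Learning data science for less than 2 years: 85% of the 25% who answered.
print(round(share_of_all_respondents(25, 85), 2))

# Big Data knowledge is "nice to have": 57.4% of the 23.6% who answered.
print(round(share_of_all_respondents(23.6, 57.4), 2))
```

Read this way, the "less than 2 years" group is a bit over a fifth of all respondents, and the "nice to have" group is roughly 13.5% of all respondents.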
The full list of survey results can be found on Kaggle's website.
References
- Kaggle ML and Data Science Survey, 2017: A big picture view of the state of data science and machine learning.
- Kaggle Insights - 2017 The State of Data Science & Machine Learning