The first thing most people think about when they hear the term “data science” is usually “machine learning”. Obviously, to be a “complete” data scientist, you’ll have to eventually learn about machine learning concepts. But you’d be surprised at how far you can get without it.
So why shouldn’t you start with machine learning?
1. Machine learning is only one part of a data scientist’s job (and a very small part at that).
Machine learning is (a part of) data science but data science isn’t necessarily machine learning, similar to how a square is a rectangle but a rectangle isn’t necessarily a square.
In reality, machine learning modeling makes up only around 5–10% of a data scientist’s job; most of the time is spent elsewhere. By focusing on machine learning first, you’ll put in a lot of time and energy and get little in return.
2. Fully understanding machine learning requires preliminary knowledge of several other subjects.
At its core, machine learning is built on statistics, mathematics, and probability. The same way that you first learn English grammar, figurative language, and so forth before you can write a good essay, you need these building blocks firmly in place before you can learn machine learning.
To give some examples:
- Linear regression, the first “machine learning algorithm” that most bootcamps teach, is really a statistical method.
- Principal Component Analysis is only possible with the ideas of matrices and eigenvectors (linear algebra).
- Naive Bayes is a machine learning model that is based entirely on Bayes’ theorem (probability).
And so, I’ll conclude with two points. One, learning the fundamentals will make learning more advanced topics easier. Two, by learning the fundamentals, you will already have learned several machine learning concepts.
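To make two of these examples concrete, here is a minimal sketch using only the Python standard library. The function names `fit_line` and `posterior` and all the numbers are made up for illustration; the point is only that a one-variable linear regression reduces to a covariance divided by a variance, and that the core of Naive Bayes is a direct application of Bayes’ theorem.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: slope = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return slope, intercept


def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem for a binary event: P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) expanded over the two cases: A is true, A is false
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Toy data invented for illustration: the points lie exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Hypothetical numbers: 1% base rate, 90% hit rate, 5% false-positive rate
print(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05))
```

Neither function needs anything beyond introductory statistics and probability, which is the argument for learning those first.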
3. Machine learning is not the answer to every data scientist’s problem.
Many data scientists struggle with this. Similar to my initial point, most data scientists think that “data science” and “machine learning” go hand in hand. And so, when faced with a problem, the very first solution that they consider is a machine learning model.
But not every “data science” problem requires a machine learning model.
In some cases, a simple analysis with Excel or Pandas is more than enough to solve the problem at hand.
In other cases, the problem will be completely unrelated to machine learning. You may be required to clean and manipulate data using scripts, build data pipelines, or create interactive dashboards, none of which require machine learning.
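As a rough illustration of how far a simple analysis can go, the sketch below answers the question “which region brings in the most revenue?” with a single pandas group-by. The `sales` table, its column names, and its values are all invented for this example.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South", "West"],
        "revenue": [120.0, 80.0, 200.0, 100.0, 50.0],
    }
)

# Total and average revenue per region, largest total first
summary = (
    sales.groupby("region")["revenue"]
    .agg(["sum", "mean"])
    .sort_values("sum", ascending=False)
)

print(summary)
```

No model is trained here; a few lines of aggregation are enough to answer the question.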
What should you do instead?
If you would like some tangible next steps to start with instead, here are a few:
- Start with statistics: Of all the building blocks, statistics is the most important. And if you dread statistics, data science probably isn’t for you.
- Learn Python and SQL: The better you are at Python and SQL, the easier your life will be when it comes to data collection, manipulation, and implementation. (If you’re more of an R person, go for it.) Also get familiar with Python libraries like pandas, NumPy, and scikit-learn.
- Learn linear algebra fundamentals: Linear algebra becomes extremely important when you work with anything related to matrices. This is common in recommendation systems and deep learning applications. If these sound like things that you’ll want to learn about in the future, don’t skip this step.
- Learn data manipulation: This makes up at least 50% of a data scientist’s job. More specifically, learn more about feature engineering, exploratory data analysis, and data preparation.
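For a feel of what the data manipulation step looks like in practice, here is a small sketch in pandas. The `raw` table, its columns, and the derived feature `income_per_year_of_age` are hypothetical and chosen only to show one example each of exploratory analysis, data preparation, and feature engineering.

```python
import pandas as pd

# Hypothetical raw data with missing values, invented for illustration
raw = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "city": ["Paris", "Lyon", "Paris", None],
        "income": [30000.0, 42000.0, 58000.0, 36000.0],
    }
)

# Exploratory data analysis: count missing values per column
print(raw.isna().sum())

# Data preparation: fill missing ages with the median, missing cities with a label
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["city"] = clean["city"].fillna("unknown")

# Feature engineering: derive a new column, then one-hot encode the category
clean["income_per_year_of_age"] = clean["income"] / clean["age"]
features = pd.get_dummies(clean, columns=["city"])

print(features.columns.tolist())
```

Median imputation and one-hot encoding are just one reasonable set of choices here; the right ones depend on the data and the question being asked.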