The Biggest Challenge in Data Analytics
In college, I’d like to believe they set up the environment to be as similar to the real world as possible. But we all know that is not true. Here is a personal anecdote about how it didn’t prepare me for my biggest challenge in data science so far.
When studying economics, there are a lot of niche topics that you can dive into, from micro to macro or from public choice to game theory. The one that truly drew me in was econometrics, which is essentially the application of statistics to economic data. Much of it goes by other names today: data analysis, data science, machine learning, or ‘AI’. (Although today ‘AI’ is used to refer to anything that has to do with computers, which is unfortunate.)
The 4 Major Components of Econometrics
Econometrics supplies many of the building blocks behind modern machine learning algorithms. It has four major components:
- Descriptive
  - What happened?
  - This is a simple snapshot of your business that describes what happened, for example by looking at key performance indicators. We compile large volumes of data into a clear and simple overview of past performance. This is the most popular form of data analytics and can clearly show the ‘health’ of your business.
- Diagnostic
  - Why did it happen?
  - This type of analysis builds on descriptive analysis and seeks to understand why something happened. It is helpful for finding anomalies within your data, but it is also used to find the underlying drivers of positive results, like sales. There are numerous techniques for finding the ‘why’, but all are designed to locate the root cause of your outcomes.
- Predictive
  - What is likely to happen in the future?
  - This type of analytics allows you to predict likely future events. Using past trends, we can estimate the likelihood of a future event, which enables businesses to plan ahead. Predictive analytics is especially good for detecting seasonal trends, whether that be an increase in sales during the summer or an increase on weekends.
- Prescriptive
  - What is the best course of action?
  - This type of analysis looks at the three previous types and determines what actions should be taken. In other words, we look at what is most likely to happen in the future and take advantage of that knowledge by preparing for it.
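To make the four types a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sales numbers are made up, the function names are hypothetical, and the "forecast" is just a seasonal-naive average, not a real time series model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily sales: (weekday index 0=Mon ... 6=Sun, units sold).
# The numbers are made up purely for illustration.
sales = [
    (0, 100), (1, 98), (2, 105), (3, 102), (4, 110), (5, 150), (6, 145),
    (0, 104), (1, 101), (2, 99), (3, 108), (4, 112), (5, 158), (6, 149),
]


def describe(rows):
    """Descriptive: summarize what happened."""
    values = [v for _, v in rows]
    return {"total": sum(values), "mean": mean(values), "max": max(values)}


def weekday_profile(rows):
    """Diagnostic: average sales per weekday, to see where the peaks come from."""
    by_day = defaultdict(list)
    for day, value in rows:
        by_day[day].append(value)
    return {day: mean(vals) for day, vals in sorted(by_day.items())}


def naive_forecast(rows, day):
    """Predictive: seasonal-naive guess, the historical mean for that weekday."""
    return weekday_profile(rows)[day]


summary = describe(sales)
profile = weekday_profile(sales)
next_saturday = naive_forecast(sales, 5)
print(summary, profile, next_saturday)
```

The prescriptive step is the decision you make with those numbers, for example staffing or stocking more heavily for the weekend if the forecast says Saturday is the peak.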
Challenges
To ‘prepare’ us for the real world, we were told about some of the biggest traps data analysts fall into. The most common one is known to anyone who has taken a statistics class: correlation does not equal causation. Another is multicollinearity. Others include improperly cleaning data before analysis or regression, not collecting enough data (or at least enough meaningful data), choosing the wrong analytical tool, like linear versus logistic regression, mishandling categorical/dummy variables, not knowing when to take the natural log of a variable, using imprecise language to explain your models, using the wrong visualizations for the data type, and mishandling time series data.
I thought all of these challenges (and more) were going to be very common in the workplace after college. I thought I would be sniffing out multicollinearity and creating beautiful time series models to help solve business problems, and that the hardest part would be perfecting these models and not falling into common traps. But no. Instead, the hardest problem is something we never even touched on in college. That is business.
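As a rough illustration of three of these classroom checks (multicollinearity, dummy variables, and the natural log), here is a small sketch using only the Python standard library. The variable names and numbers are hypothetical. The VIF shortcut shown only holds when there are exactly two predictors; with more, you would regress each predictor on all the others.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when there are exactly two predictors.

    In that special case the R^2 of one predictor regressed on the other
    equals r^2, so VIF = 1 / (1 - r^2).
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


def dummies(values, drop_first=True):
    """One-hot encode a categorical variable.

    Dropping one level avoids the dummy variable trap (perfect
    collinearity with the intercept).
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# Made-up data: ad spend and impressions move almost in lockstep.
ad_spend = [10, 20, 30, 40, 50]
impressions = [105, 198, 310, 395, 502]
sales = [1200, 1900, 3100, 4100, 5200]

vif = vif_two_predictors(ad_spend, impressions)
log_sales = [math.log(s) for s in sales]  # natural log of a skewed variable
region_dummies = dummies(["east", "west", "north", "east", "west"])
print(round(vif, 1), region_dummies[0])
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that two predictors carry nearly the same information, which is what happens with the made-up spend and impressions columns here.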
The Business Problem
There seems to be a huge disconnect between data science and business. Almost every business claims to be data-driven or to use machine learning and AI, but in reality that is often a façade to impress stakeholders, because there is actually quite a bit of business politics centered around data.
One example is that companies hold your data hostage. That’s right. They will collect your data for you and store your data for you, and then they charge you extra if you want access to it. How do they do this?
Hostage Situation
Well, they do technically give you access to your data. But they make it very difficult to get at in a form where you can do what you want with it. For example, at my current job, in order to access some of our data from the company I need to run a PowerShell script within a two-hour window each day and then import the daily flat file into SQL Server. This is the only way the company allows access to our data. Well, not the only way. We can use their proprietary data visualization tool if we want easier access, and of course that costs money. And no, you can’t export the data from there; you can only use it within their visualization interface. It probably goes without saying, but their tools don’t stack up to the visualization tools of Power BI, Tableau, Python, or R.
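For a sense of what that daily chore looks like, here is a minimal Python sketch of the same idea. It is not my actual PowerShell script: the window hours, table name, and sample columns are all hypothetical, and an in-memory SQLite database stands in for SQL Server so the example runs anywhere.

```python
import csv
import io
import sqlite3
from datetime import time

# Hypothetical export window; the real hours are an assumption here.
WINDOW_START, WINDOW_END = time(2, 0), time(4, 0)


def in_window(now):
    """Return True if the given time of day falls inside the export window."""
    return WINDOW_START <= now < WINDOW_END


def load_flat_file(conn, lines, table="daily_extract"):
    """Load a comma-delimited flat file (header row first) into a table.

    Every column is stored as TEXT; typing and cleaning would happen later.
    """
    reader = csv.reader(lines)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Stand-in for the daily flat file; the columns are made up.
sample_file = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n")
conn = sqlite3.connect(":memory:")  # SQLite stands in for SQL Server
loaded = load_flat_file(conn, sample_file)
print(loaded, in_window(time(3, 0)))
```

In practice the fragile part is the window check: miss it and there is no file to load that day, so a scheduled job with alerting is worth more than a clever loader.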
Getting around that hurdle was a huge challenge in my data analytics career because we were never taught about data hostage situations. It is the business side of data analytics that wasn’t taught.
Database Engineer Who?
Another unforeseen challenge was the database situation. I was taught in school about proper database management. How to name tables and columns. Putting in primary and foreign keys. And, most importantly, keeping a data dictionary or key. This is crucial for database management so that anyone who accesses the database can know exactly what each table is for, what each column in each table means, and how the tables connect with one another through primary and foreign keys. Well, in the real world most businesses have none of that.
Instead, what you receive is a database with a couple hundred tables, each with five to a hundred columns, and no data dictionary to reference. You just have to figure it out as you go. Now, this is obviously specific to me, but I have come to find out that it is a common occurrence in the workplace.
On top of that, the columns have a horrible, inconsistent naming convention. Some tables use “_” between words, while others use camelCase. Some columns in two different tables have the same name but two totally different meanings. This is DB management 101 on what NOT to do when setting up a database.
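One way to start figuring it out is to generate a skeleton data dictionary from the schema itself and fill in the meanings as you learn them. Below is a minimal sketch of that idea. It uses SQLite from the Python standard library as a stand-in, and the tables are made up; on SQL Server the equivalent information would come from querying `INFORMATION_SCHEMA.COLUMNS` instead.

```python
import sqlite3


def draft_data_dictionary(conn):
    """Build one entry per column, with an empty description to fill in by hand."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    entries = []
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _cid, name, col_type, _notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            entries.append(
                {
                    "table": table,
                    "column": name,
                    "type": col_type,
                    "primary_key": bool(pk),
                    "description": "",  # to be filled in as you learn the data
                }
            )
    return entries


# Hypothetical tables, just to have something to document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE shipments (shipmentId INTEGER PRIMARY KEY, order_id INTEGER)")

dictionary = draft_data_dictionary(conn)
for entry in dictionary:
    print(entry)
```

The schema can only tell you names and types. The descriptions still have to come from asking people and poking at the data, but at least there is now one place to write the answers down.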
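A quick audit can at least tell you how bad the naming situation is. The sketch below is one possible approach, with a made-up schema and hypothetical helper names: it classifies each column name by style, flags tables that mix styles, and lists names that appear in more than one table so you know which ones to double-check for conflicting meanings.

```python
import re
from collections import defaultdict

SNAKE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")     # e.g. order_id
CAMEL = re.compile(r"^[a-z]+([A-Z][a-z0-9]*)+$")    # e.g. customerId


def style(name):
    """Classify a column name as snake_case, camelCase, or other."""
    if SNAKE.match(name):
        return "snake_case"
    if CAMEL.match(name):
        return "camelCase"
    return "other"  # single words, ALLCAPS, and anything else


def mixed_style_tables(schema):
    """Tables that contain both snake_case and camelCase columns."""
    mixed = []
    for table, columns in schema.items():
        styles = {style(c) for c in columns}
        if {"snake_case", "camelCase"} <= styles:
            mixed.append(table)
    return sorted(mixed)


def shared_column_names(schema):
    """Column names used in more than one table, mapped to those tables."""
    seen = defaultdict(list)
    for table, columns in schema.items():
        for column in columns:
            seen[column].append(table)
    return {name: sorted(tables) for name, tables in seen.items() if len(tables) > 1}


# Made-up schema: table name -> column names.
schema = {
    "orders": ["order_id", "customerId", "status"],
    "shipments": ["shipmentId", "order_id", "status"],
    "customers": ["customer_id", "full_name"],
}

print(mixed_style_tables(schema))
print(shared_column_names(schema))
```

A shared name is not a problem by itself (a foreign key should match), so the output is a list of things to investigate, not a list of errors.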
Basically there was either no database engineer to properly build the database or the one who did build the database had no clue what they were doing. All things we were never taught to handle going into data analytics. Which makes it one of the hardest problems that you will face in data analytics.