---
title: The long, meandering road to pd.read_csv()
draft: false
tags:
- tech
- data
date: 2020-10-02
---
A few days ago I saw a tweet that referred to this question on Reddit: [“What's part of the real job that's not part of the Kaggle workflow?”](https://www.reddit.com/r/MachineLearning/comments/j1ati3/d_whats_part_of_the_real_job_thats_not_part_of/).
There are many answers to this question, but one that I've had in mind for a long while is this: putting together a dataset. The following tweet echoes the same sentiment:
> One of the biggest failures I see in junior ML/CV engineers is a complete lack of interest in building data sets. While it is boring grunt work I think there is so much to be learned in putting together a dataset. It is like half the problem. — Katherine Scott (@kscottz), [February 1, 2019](https://twitter.com/kscottz/status/1091423467772162049?ref_src=twsrc%5Etfw)
Now let's go back to the Reddit post. The question describes the typical Kaggle workflow as a numbered sequence of steps, starting with loading a .csv dataset (Step 1).
The reality is, in real-world situations, Step (1) is rarely just that. In fact, I can't recall a single time someone handed me a .csv dataset while saying “you know, you just need to load _this_ & get started.” Nope. Never.
Step (6)… yeah, sure, that's usually part of the workflow. But models are not the only thing you can iterate on. You can also iterate on your data, which means sometimes you have to go back to Step (1) again. Speaking from personal experience, some of the most impactful performance gains come from iterating on the data.
These two things bring me to my point: we need to talk more about putting together a dataset, for two reasons: 1) outside of Kaggle, you often have to build your own dataset first; 2) you don't always do it just once per project; you may do it twice, or thrice, at various points of the project.
## Data and where to find them
Here's the cold, hard truth: most of the time, the dataset of your dreams, that mythical ready-to-use .csv file, might not exist.
You will often face a situation where you have a very limited dataset or, worse, your dataset does not exist yet. Below are some common challenges I can think of, though keep in mind that a) encountering more than one of these challenges in the same project is entirely possible, & b) whether you'll come across them may depend on your company's data maturity.
**Your company doesn't know it will need the data in the future, so it doesn't collect them.** This is less likely in large companies with the resources to maintain a data lake, but there's still a chance it happens. There's probably not much you can do other than make the case that you need to collect this data & justify the budget & time needed to do it. At this stage your persuasion skillz are probably more important than your SQL skillz.
**The data you need do exist in a data lake, but they need to be transformed before you can use them.** The transformation might land in someone's backlog & not get picked up for the next few sprints. Either way, you won't have the dataset you want immediately.
**You have your data, but they are not labeled.** This can be a problem if you are trying to do a supervised learning task. A quick solution might be to hire annotators from [Mechanical Turk](https://www.mturk.com/mturk/welcome), but this might not be possible if you have domain-specific data or sensitive data where masking makes the annotation task impossible. I've also seen companies list "Data Labeler" among their job openings, but you may have to think about whether it makes sense (& is possible) for your company to hire someone part- or full-time to label your data.
Once you have annotators, you may also want to strategize about the kinds of labels you need so that they remain useful for future cases, saving cost & time; you don't want to label the same data twice. For example, if you need to label a large set of tweets as “Positive” vs “Negative”, it may pay off to anticipate future needs with more granular labels instead (e.g. “Happy”, “Sad”, etc.).
You can always try other approaches, e.g. semi-supervised or unsupervised learning. But that's not always possible, & considering various constraints, sometimes you really need to weigh which option is more worth it, e.g. a semi-supervised approach you still need to explore vs a supervised approach you know better, done with the help of annotators. This may depend on various factors: time, budget, etc.
**You have your data, but they are weakly labeled.** These labels do not necessarily correspond to the target of your model, but you can use them as a proxy for it. For example, you may not have data on whether a user likes an item or not, but perhaps you can infer that from the number of times the user views the item. Of course, you need to think about whether it makes sense in your case to use this information as a proxy.
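To make the proxy idea concrete, here is a minimal pandas sketch. The `interactions` table, its columns, & the view-count threshold are all hypothetical; a real proxy would need to be validated against the behavior you actually care about.

```python
import pandas as pd

# Hypothetical interaction log: which user viewed which item, how many times.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 11, 10, 12],
    "view_count": [7, 1, 0, 3],
})

# Treat "viewed at least N times" as a weak proxy for "likes the item".
# The threshold below is made up; whether it is a sensible proxy is exactly
# the judgment call discussed above.
VIEW_THRESHOLD = 3
interactions["proxy_label"] = (interactions["view_count"] >= VIEW_THRESHOLD).astype(int)

print(interactions)
```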
**You have your data, but the target labels are still fuzzy.** Some problems are not as straightforward as “this image contains Cheetos” & “this image does not contain Cheetos”. Sometimes stakeholders come to you & say they want more granular predictions that they can tweak later on. At this point you may need to work very closely with your business stakeholders to figure out the target labels & how you can make the data you have work with such requests.
**You think you have your data, but you don't know where they are.** Say you work at a bank. You know you must have transaction data, & there is no way you don't have user data. However, you may not know where they are, what they look like, which filters you need to apply, or which keys you can use to join the two tables together (hint: it's not always that simple). Documentation may exist, but the details could still be fuzzy to you. You need to ask someone. But who? You need to find out. You ask them questions. They may or may not respond to your queries quickly, because they also have jobs to do. The data may or may not contain the fields you expect, & it may turn out that getting the dataset you want takes more than a simple join between two tables.
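As a rough illustration of the join that is “not always that simple”, here is a hedged pandas sketch. The file names & the `user_id` key are hypothetical; `validate=` is one cheap way to find out early that the join does not behave the way you assumed.

```python
import pandas as pd

# Hypothetical extracts of the two tables you finally tracked down.
users = pd.read_csv("users.csv")                # assumed to have a user_id column
transactions = pd.read_csv("transactions.csv")  # assumed to have a user_id column

# validate= makes pandas raise an error if the join cardinality is not what
# you expected (e.g. duplicate user_ids on the users side), instead of
# silently fanning out rows and inflating your dataset.
joined = transactions.merge(
    users,
    on="user_id",
    how="left",
    validate="many_to_one",
)
```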
## Don't trust your data right away
Okay, great, you have your data. Can we load it for training now? Not so fast. You may need to spend some time making sure that your dataset is reliable: that it _actually_ contains the things you expect it to contain. This is a tricky one, I'd say, because the definition of “reliable” differs from case to case, so you really have to define it yourself. Much of this comes down to how well you understand your problem, how well you understand your data, & how careful you are.
_Hold up, this is part of the data cleaning (Step 2), isn't it? We can just drop missing data etc. etc., no?_ Most data cleaning tutorials make it seem straightforward: drop the rows with missing values, or impute them with the mean of the column or something, _and then you can go to Step 3_. But in reality: a) you often get some very funky cases that these tutorials don't cover, b) these funky cases may be symptoms of a larger issue in the engineering pipeline, which makes you question the reliability of your entire dataset, not just that one particular column. [_Oh, take me back to the start_](https://www.youtube.com/watch?v=RB-RcX5DS5A), sings Coldplay, & off to the start (aka Step 1) you go.
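For contrast, this is roughly what the “tutorial” version of Step 2 looks like in pandas. The file & column names are hypothetical, & the last line is the kind of quick check that tends to surface the funky cases the tutorials skip.

```python
import pandas as pd

df = pd.read_csv("my_dataset.csv")  # hypothetical file

# The "tutorial" version of Step 2: drop or impute, then move on.
df = df.dropna(subset=["label"])                # drop rows with no label
df["age"] = df["age"].fillna(df["age"].mean())  # impute a numeric column with its mean

# In practice you first want to know *why* values are missing: a whole block
# of NaNs can point at a broken upstream pipeline rather than at rows you can
# safely drop. The share of missing values per column is a cheap starting point.
print(df.isnull().mean().sort_values(ascending=False))
```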
Some common pitfalls off the top of my head:
- **Erroneous labels.** This is especially common with human-annotated labels. Going through the data by hand can help you get a sense of this. Even if you don't manage to go through all of it (which is understandable; scrolling through data for 5 hours might not be the wisest use of your time anyway), you can at least get a sense of the human-level error rate (if you don't have one yet). Knowing the human-level error helps you calculate what [Andrew Ng refers to as the avoidable bias](https://media.nips.cc/Conferences/2016/Slides/6203-Slides.pdf), & knowing the avoidable bias can help you determine your next step.
- **Missing data.** Missing data does not only mean fields with empty values that you can find with `df[df["col_1"].isnull()]`. Say you have a column called “Province”. Out of all your data, there is no value that says “West Java”. Does that make sense? Can you trust your data? Sometimes it's not only about what's there, but also about what's _not_ there (see the sketch after this list for this check & the next two).
- **Duplicated data.** Sounds trivial: we can just call `df.drop_duplicates()`, no? It depends. If you have images, for example, you might also want to think about what kind of duplication you want to remove (only exact duplicates, or near-duplicates too?). _Fun fact:_ 3.3% and 10% of the images in the test sets of the oft-used CIFAR-10 and CIFAR-100 have duplicates in the training set[1](https://galuh.me/dataset/#fn-1), & this was only discovered fairly recently, in 2019.
- **Values that just don't make sense.** Again, this _really_ depends on the context of your data. Example: you have rows where the `registration_date` is somehow more recent than the `last_transaction_date`. Does that make sense? _Can we detect these strange values with outlier detection?_ you might ask. Well, what if _most_ of your values are strange values? You never really know until you look at the data yourself.
- **Assumptions.** It's easy (& dangerous!) to make assumptions about data: sometimes we just assume that a field is generated in a certain way, or that its values mean what we think they mean, & the list goes on. Maybe we have dealt with similar tables before, or with similarly named columns in a different table, so we carry these assumptions over to our next dataset or project. It's worth sparing some extra time to check the documentation or ask people who may know better before you go too far.
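Here is a rough pandas sketch of a few of the checks above: expected values that never appear, exact duplicates, & date pairs that cannot be right. The column names & the expected province list are purely illustrative, & near-duplicate detection for images would need something like perceptual hashing, which this does not cover.

```python
import pandas as pd

df = pd.read_csv(
    "my_dataset.csv",  # hypothetical file
    parse_dates=["registration_date", "last_transaction_date"],
)

# 1. Missing data: not just NaNs, but values you expected to see and don't.
expected_provinces = {"West Java", "East Java", "Central Java"}  # illustrative
observed_provinces = set(df["province"].dropna().unique())
print("Expected but absent:", expected_provinces - observed_provinces)

# 2. Duplicated data: exact duplicates only; near-duplicates need more work.
print("Exact duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# 3. Values that just don't make sense: registration newer than the last transaction.
impossible = df[df["registration_date"] > df["last_transaction_date"]]
print(f"{len(impossible)} rows where registration_date > last_transaction_date")
```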
There is no recipe here except to really get to know your data & be critical of it, which means you probably need to spend some time looking at it closely & slicing & dicing it in many different ways.
I usually set aside some time to manually scan through my data just to get a sense of what's going on. From such a simple exercise I can learn, for example, that humans typically misclassify certain classes, so the labels for these classes are probably not reliable, & it's probably understandable if my model makes a few mistakes on them. Andrej Karpathy did this exercise on the ImageNet data & wrote about what he learned on his [blog](http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/).
## You may have to revisit your dataset multiple times
When it turns out your models do not perform well, there are a few things you can do, & it's important to remember that they are not limited to tuning your hyperparameters.
One of them is revisiting your dataset. When you do, you may decide to acquire a bigger dataset. You may try various data augmentation strategies, & when you do you still need to be critical of your data & the methods you apply (does it make sense to apply rotation to images of numbers?). You may decide you need to add more (better) features, but they don't exist in the tables yet, so you need to talk to the peers who handle this & see if you can get them in time. It's Step (1) all over again, & that's fine. It happens. But knowing all this, at least you're more prepared now.
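As a small example of being critical of augmentation choices, here is a hedged sketch using torchvision (purely as an illustration; nothing above prescribes a particular library). Rotations & flips are usually harmless for natural images but can silently change the label for digits.

```python
from torchvision import transforms

# A plausible augmentation pipeline for natural images.
natural_image_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# For images of numbers, rotation and flipping can change the meaning:
# a rotated 6 starts to look like a 9, and a mirrored 3 is no longer a 3.
digit_aug = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # small shifts only
    transforms.ToTensor(),
])
```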
## How can I practice?
If you cannot learn all of this from Kaggle, how can you learn it on your own, without stakeholders & without access to company data that comes in all shapes & sizes with its own mishaps to practice on?
I think building your own side projects outside of Kaggle problems can be a great way to familiarize yourself with these challenges. **The most important thing is that you do not start with the data you are given: you start by defining your own problem statement & then search for datasets relevant to your problem, not the other way around.** If the perfect dataset for your problem doesn't exist (it most likely doesn't), then it's a good time to practice: fetch the data yourself (for example, using the Twitter API), join it with other data sources you can find in, say, [Google Dataset Search](https://datasetsearch.research.google.com/) or [Kaggle Datasets](https://www.kaggle.com/datasets), find ways to use weakly labeled datasets, & get creative with the imperfect data you have.
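As an illustrative sketch of the “fetch the data yourself” step, here is what pulling tweets & joining them with a second source might look like. This assumes the Twitter API v2 recent-search endpoint & a bearer token, & both the second CSV & the `author_id` join key are hypothetical; API access tiers & endpoints change, so check the current documentation.

```python
import pandas as pd
import requests

# Assumed: a bearer token with access to the v2 recent-search endpoint.
BEARER_TOKEN = "..."  # your own credential
resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={
        "query": "data science -is:retweet",
        "max_results": 100,
        "tweet.fields": "author_id,created_at",
    },
)
tweets = pd.DataFrame(resp.json().get("data", []))

# Join with another source you found elsewhere (e.g. on Kaggle Datasets or
# Google Dataset Search); the file and the author_id key are hypothetical.
other = pd.read_csv("some_other_source.csv")
combined = tweets.merge(other, on="author_id", how="left")
```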
## Further reading
Some work related to this topic that you might find interesting:
- [Real Scientists Make Their Own Data](https://seanjtaylor.com/2013/01/26/real-scientists-make-their-own-data.html) by Sean J. Taylor - argues that “your best chance to make a serious contribution as a business or academic researcher is to find, make and combine novel data”
- [Data Cleaning IS Analysis, Not Grunt Work](https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt) by Randy Au - proposes that data cleaning is not just menial work; “data cleaning is just reusable transforming”
- [Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon](https://www.danah.org/papers/2012/BigData-ICS-Draft.pdf) by danah boyd and Kate Crawford - touches on many issues related to the increased automation of data collection & data analysis.