Routine for Starting a Data Science Project in R
Routine is mostly a good thing: morning routine, gym routine, bedtime routine, and so on. Thanks to a routine or good habit, you don't spend much time and energy deciding what to do or how to do it, which saves energy for more important questions like "why".
Routine is mostly a good thing for data scientists, too. Here's my routine for starting a new data science project in R, large or small:
- Create a GitHub repo for the project with a sensible name: all lowercase with dashes, no underscores (~1 min)
- git clone it to my usual project directory (~/projects/) (30 sec)
- Write README.md describing what the project is about (~1 min)
- Fire up RStudio and create an RStudio project (.Rproj) in the directory (~1 min)
- Write the first R script, typically named initial-analysis.R
- The first few lines of the script are almost always the same, like:
  library(tidyverse)
  df <- read_csv("datafile")
  glimpse(df)
  df %>% ggplot(aes(x, y)) + geom_....
  : yes... this is where things start to diverge...
So, that's about 10 minutes to hit the ground running and start producing useful stuff.
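To make those first few lines concrete, here is a minimal runnable sketch of what initial-analysis.R might look like. The data is an assumption: instead of a real "datafile", it writes ggplot2's bundled `mpg` dataset to a temporary CSV, and the columns `displ` and `hwy` plus `geom_point()` are just one possible first plot, not a recommendation.

```r
library(tidyverse)

# Hypothetical stand-in for "datafile": write a small CSV to a temp file
# so the script runs end to end without any project data.
datafile <- tempfile(fileext = ".csv")
write_csv(mpg, datafile)  # `mpg` is an example dataset shipped with ggplot2

# Read the data and take a first look at column names and types.
df <- read_csv(datafile, show_col_types = FALSE)
glimpse(df)

# A first exploratory plot; the choice of geom is where projects diverge.
p <- df %>%
  ggplot(aes(displ, hwy)) +
  geom_point()
print(p)
```

In a real project, the only lines that change at the start are usually the file path and the aesthetics passed to `aes()`.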
Once things start rolling, daily routines are similar:
- Bunch of data massaging, like:
  df %>%
    group_by(x) %>%
    filter(y %in% c("good", "fine")) %>%
    summarize(mz = median(z))
- ... and visualization:
  df %>%
    ggplot(aes(x, y)) +
    geom_... +
    facet_wrap(~w)
- ... and reporting:
rmarkdown::render("that-special-markdown.Rmd")
- ... and git commit / git push frequently.
- Talk to the stakeholders for questions, news, etc.
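The massaging and visualization steps in the list above can be sketched end to end on toy data. Everything about the data here is hypothetical: the tibble, its values, and the use of `geom_point()` are made up for illustration, while the column names `x`, `y`, `z`, and `w` follow the snippets above.

```r
library(tidyverse)

# Hypothetical toy data: x is a group, y a quality label,
# z a measurement, and w a faceting variable.
df <- tibble(
  x = c("a", "a", "a", "b", "b", "b"),
  y = c("good", "fine", "bad", "good", "good", "bad"),
  z = c(1, 3, 100, 10, 20, 500),
  w = c("u", "v", "u", "v", "u", "v")
)

# Data massaging: within each group of x, keep only the acceptable rows,
# then take the median of z per group.
summary_df <- df %>%
  group_by(x) %>%
  filter(y %in% c("good", "fine")) %>%
  summarize(mz = median(z))
print(summary_df)  # mz is 2 for group "a" and 15 for group "b"

# Visualization: one panel per level of w.
p <- df %>%
  ggplot(aes(x, z)) +
  geom_point() +
  facet_wrap(~w)
print(p)
```

Filtering out the "bad" rows before summarizing is what keeps the outliers (100 and 500) from moving the medians.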
Overall, though, it's fairly automatic, fast, and effective. Yes, routine is mostly a good thing.
What's your routine for starting a data science project in R?
Very different from mine??
Let me (and the world) know!