<p>Data Science Fun (https://dataninja.me/): stories about R, Python, and data science. The homepage of Jaimie Kwon (권재명), a data scientist in Silicon Valley. Feed generated by Jekyll, last updated 2019-09-19T04:06:31+00:00.</p>
<h1>Moving from WordPress.com to GitHub Pages to Netlify</h1>
<p>Posted 2017-12-31T00:00:00+00:00 at https://dataninja.me/2017/12/31/moving-from-wordpress-com-github-pages-to-netlify</p>
<p><strong>TL;DR: I migrated my homepage from
WordPress.com to GitHub Pages for speed and flexibility,
then to Netlify for HTTPS support.</strong></p>
<p><strong><em>Update (5/1/2018): GitHub Pages now supports HTTPS for custom domains:
<a href="https://github.blog/2018-05-01-github-pages-custom-domains-https/">https://github.blog/2018-05-01-github-pages-custom-domains-https/</a>, so Netlify is no longer necessary for HTTPS.</em></strong></p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center">WordPress.com</th>
<th style="text-align: center">GitHub Pages</th>
<th style="text-align: center">Netlify</th>
</tr>
</thead>
<tbody>
<tr>
<td>Easy to Use?</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">Use Git+Markdown</td>
<td style="text-align: center">Use Git+Markdown</td>
</tr>
<tr>
<td>Cost</td>
<td style="text-align: center">Free for basic</td>
<td style="text-align: center">Free for basic</td>
<td style="text-align: center">Free for basic</td>
</tr>
<tr>
<td>Load Speed?</td>
<td style="text-align: center">Slow</td>
<td style="text-align: center">Fast</td>
<td style="text-align: center">Fast</td>
</tr>
<tr>
<td>Flexibility</td>
<td style="text-align: center">Low</td>
<td style="text-align: center">Very flexible</td>
<td style="text-align: center">Very flexible</td>
</tr>
<tr>
<td>HTTPS for Custom Domain</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">No (supported since 5/2018)</td>
<td style="text-align: center"><strong>Yes</strong></td>
</tr>
<tr>
<td>Build logs</td>
<td style="text-align: center">NA</td>
<td style="text-align: center">No</td>
<td style="text-align: center"><strong>Yes</strong></td>
</tr>
</tbody>
</table>
<p>Table: Comparison of the 3 platforms.</p>
<h2 id="1-from-wordpresscom-">1. From WordPress.com …</h2>
<p>I had been using <a href="https://wordpress.com">WordPress.com</a> to host my homepage at
<a href="http://dataninja.me">http://dataninja.me</a> for almost a year.
It was a good solution for quickly spinning up the site, but
I hit its limits as I planned to add more professional content (Python and R Markdown material).
Overall, for my use cases, WordPress.com was too:</p>
<ul>
<li>slow (heavy overhead),</li>
<li>ugly (especially Korean text and highlighted code), and</li>
<li>inflexible (e.g. fonts cannot be changed easily; adding custom pages is hard).</li>
</ul>
<h2 id="2-moving-to-github-pages-">2. Moving to GitHub Pages …</h2>
<p>So, I considered <a href="https://pages.github.com/">GitHub Pages</a> / <a href="https://jekyllrb.com/">Jekyll</a> again.
The last time I tried it (~3 yrs ago),
the static site build process was clunky and it took some work to make the site look pretty.
The situation seems to have changed a lot over the past couple of years.
Now, the default theme (<a href="https://github.com/jekyll/minima">minima</a>)
looks OK and the build process is simpler and faster.
So, unlike WordPress.com, GitHub Pages is:</p>
<ul>
<li>fast to load,</li>
<li>pretty by default,</li>
<li>flexible, and</li>
<li>free.</li>
</ul>
<p>I use GitHub and <a href="https://daringfireball.net/projects/markdown/syntax">Markdown</a>,
especially <a href="http://rmarkdown.rstudio.com/">R Markdown</a>,
almost daily anyway. So why not?
So, I decided to switch back.
The process looked like this on OSX.
(Replace <a href="https://github.com/jaimyoung/jaimyoung.github.io">https://github.com/jaimyoung/jaimyoung.github.io</a> with your own GitHub Pages repo.)</p>
<ol>
<li>Back up the old GitHub Pages repo, if you already had one.
<ul>
<li>(In my case, this meant
moving <a href="https://github.com/jaimyoung/jaimyoung.github.io">https://github.com/jaimyoung/jaimyoung.github.io</a> to
<a href="https://github.com/jaimyoung/jaimyoung.github.io-old">https://github.com/jaimyoung/jaimyoung.github.io-old</a>)</li>
</ul>
</li>
<li>
<p>Install a brew-maintained Ruby so as not to pollute the OSX system Ruby
(of course, your OSX must already have <a href="https://brew.sh/">homebrew</a> installed):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> brew install ruby
</code></pre></div> </div>
</li>
<li>
<p>Install Jekyll per <a href="https://jekyllrb.com/docs/quickstart/">Jekyll Quickstart Guide</a>, i.e.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gem install jekyll bundler
jekyll new jaimyoung.github.io
cd jaimyoung.github.io
</code></pre></div> </div>
</li>
<li>
<p>Create the GitHub Pages repo
<a href="https://github.com/jaimyoung/jaimyoung.github.io">https://github.com/jaimyoung/jaimyoung.github.io</a>, and
make the <code class="highlighter-rouge">jaimyoung.github.io</code> directory track the GitHub repo:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo "# jaimyoung.github.io" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin git@github.com:jaimyoung/jaimyoung.github.io.git
git push -u origin master
</code></pre></div> </div>
</li>
<li>
<p>Serve the test site on <a href="http://localhost:4000/">http://localhost:4000/</a> by running:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bundle exec jekyll serve
</code></pre></div> </div>
</li>
<li>Import the old WordPress.com content per the <a href="http://import.jekyllrb.com/docs/wordpressdotcom/">WordPress.com to Jekyll import guide</a>.
<ol>
<li>
<p>First you need to install <code class="highlighter-rouge">jekyll-import</code> (<a href="https://github.com/jekyll/jekyll-import">https://github.com/jekyll/jekyll-import</a>):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gem install jekyll-import
</code></pre></div> </div>
<p>and a couple of other dependencies.</p>
</li>
<li>Then, export the old WordPress.com content to an XML file, following the guide.
(In my case, I exported from <a href="https://statkwon.wordpress.com/wp-admin/export.php">https://statkwon.wordpress.com/wp-admin/export.php</a>
to <code class="highlighter-rouge">datasciencefun.wordpress.2017-12-25.xml</code>.)</li>
<li>
<p>Copy it over to the blog directory and
import it into the site like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ruby -r rubygems -e 'require "jekyll-import";
JekyllImport::Importers::WordpressDotCom.run({
"source" => "datasciencefun.wordpress.2017-12-25.xml",
"no_fetch_images" => false,
"assets_folder" => "assets"
})'
</code></pre></div> </div>
</li>
<li>Browse the imported files (mostly under <code class="highlighter-rouge">_pages</code> folder in my case),
clean them up as needed,
and move them to correct folders or collections.</li>
</ol>
</li>
<li>Modify theme files as needed.
<ul>
<li>In my case, I had to override:
<ul>
<li><code class="highlighter-rouge">_includes/head.html</code> : to add Font Awesome</li>
<li><code class="highlighter-rouge">_includes/header.html</code> : to use a custom navigation menu at the top</li>
<li><code class="highlighter-rouge">_includes/footer.html</code> : to add a LinkedIn contact as well</li>
</ul>
</li>
<li>
<p>Find the theme files (see <a href="https://jekyllrb.com/docs/themes/">https://jekyllrb.com/docs/themes/</a>) by running:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bundle show minima
open $(bundle show minima)
</code></pre></div> </div>
<p>Then copy the files to your repo and make the necessary changes.
See my GitHub repo for the changes I made to the files above.</p>
</li>
</ul>
</li>
<li>
<p>Use <a href="https://jekyllrb.com/docs/collections/">Jekyll Collections</a>
to organize the pages for “따라 하며 배우는 데이터 과학” (my Korean data science book)
under the
<code class="highlighter-rouge">_ipds-kr/</code> directory.</p>
</li>
<li>
<p>Set up redirects for a few pages,
using <a href="https://help.github.com/articles/redirects-on-github-pages/">Redirects on GitHub Pages</a>.
In my case, I wanted to move <code class="highlighter-rouge">ipds-kr-slides-ppt</code> to <code class="highlighter-rouge">ipds-kr/slides-ppt</code>, etc.</p>
</li>
<li>
<p>Set up and add <a href="https://disqus.com/">Disqus</a> for comments.</p>
</li>
<li>
<p>Set up and add <a href="https://analytics.google.com">Google Analytics</a> for site analytics.</p>
</li>
<li>Transfer the <code class="highlighter-rouge">dataninja.me</code> custom domain from WordPress.com to
GitHub Pages per the <a href="https://help.github.com/articles/setting-up-an-apex-domain/">directions</a>.
<ul>
<li>The GitHub help page is a bit unclear; Google is your friend here, and turns up
more thorough instructions for specific domain registrars.
For namecheap.com, for example, <a href="https://www.namecheap.com/support/knowledgebase/article.aspx/9645/2208/how-do-i-link-my-domain-to-github-pages">this page</a> was helpful.</li>
</ul>
</li>
</ol>
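<p>The collection and redirect steps above can be sketched roughly as follows. This is a minimal sketch, not this site's exact setup: it assumes the <code class="highlighter-rouge">jekyll-redirect-from</code> plugin handles the redirect, and the <code class="highlighter-rouge">demo-site</code> directory, the page title, and the file name <code class="highlighter-rouge">slides-ppt.md</code> are hypothetical.</p>

```shell
#!/bin/sh
# Sketch only: demo-site, slides-ppt.md, and the title are hypothetical names.
set -e
site_dir="demo-site"
mkdir -p "$site_dir/_ipds-kr"

# Register the collection and enable the redirect plugin in _config.yml.
cat >> "$site_dir/_config.yml" <<'EOF'
collections:
  ipds-kr:
    output: true
plugins:
  - jekyll-redirect-from
EOF

# A collection page that also answers at the old WordPress-style URL.
cat > "$site_dir/_ipds-kr/slides-ppt.md" <<'EOF'
---
title: Slides
redirect_from:
  - /ipds-kr-slides-ppt/
---
Slides go here.
EOF

ls "$site_dir/_ipds-kr"
```

<p>With <code class="highlighter-rouge">output: true</code>, Jekyll renders the files in <code class="highlighter-rouge">_ipds-kr/</code> as pages under <code class="highlighter-rouge">/ipds-kr/</code>, and the <code class="highlighter-rouge">redirect_from</code> entry should generate a stub page at the old path that forwards to the new one.</p>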
<h2 id="3-then-to-netlify">3. Then to Netlify…</h2>
<p>Now everything <em>almost</em> works.
But it turned out that <strong>GitHub Pages doesn’t support HTTPS for custom domains.</strong>
This is a huge problem for me since:</p>
<ol>
<li>There are already quite a few
<strong>https:</strong>//dataninja.me links on Facebook, LinkedIn, etc.,</li>
<li>the Chrome browser doesn’t handle <a href="https://superuser.com/questions/565409/chrome-how-to-stop-redirect-from-http-to-https">an https-to-http change easily</a>,</li>
<li>Google search ranking favors https sites over http sites, and</li>
<li>there seems to be <a href="https://github.com/isaacs/github/issues/156">no plan to support https for custom domains on GitHub</a></li>
</ol>
<p>Now, Netlify (<a href="https://www.netlify.com/">https://www.netlify.com/</a>) comes to the rescue.
It is mentioned in the <a href="https://github.com/isaacs/github/issues/156">above GitHub thread</a>
as a great (free) solution that provides HTTPS support for custom domains.
I also found it mentioned on some R Markdown/bookdown/blogdown sites, so it looked reputable.</p>
<p>The process was pretty simple and took ~10 minutes:</p>
<ol>
<li>Set up a Netlify account, hook it up to the GitHub repo, and start building.
<ol>
<li>
<p>Initially, the build failed (of course) but it was easy to troubleshoot
thanks to the build logs like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 6:55:18 PM: ruby_dep-1.5.0 requires ruby version >= 2.2.5, which is incompatible with the
current version, ruby 2.1.2p95
</code></pre></div> </div>
<p>So, that’s a lot more transparent than GitHub (+1).</p>
</li>
<li>Thanks to the logs, I could track the build failure to the wrong ruby version.
To fix, I added <code class="highlighter-rouge">.ruby-version</code> with <code class="highlighter-rouge">2.4.2</code> per <a href="https://www.netlify.com/blog/2016/10/18/how-our-build-bots-build-sites/">help page</a>
and the build succeeded.</li>
<li>Now, the page is up and running at <a href="https://${random_words}.netlify.com/">https://${random_words}.netlify.com/</a>.</li>
</ol>
</li>
<li>Set up DNS.
Netlify provides their own nameservers, so it was pretty simple to follow their
directions.</li>
<li>The final step is the original requirement: getting HTTPS working.
Again, this was super simple, and the HTTPS certificate was up and running in a few minutes.</li>
</ol>
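<p>The Ruby version fix from the first step boils down to one small file. A minimal sketch, assuming the file sits in the root of the repo that Netlify builds; <code class="highlighter-rouge">demo-site</code> is a hypothetical stand-in for that directory:</p>

```shell
#!/bin/sh
# Sketch only: demo-site is a hypothetical stand-in for the Jekyll repo root.
set -e
site_dir="demo-site"
mkdir -p "$site_dir"

# Pin the Ruby version for the build (2.4.2 is the version named in the post).
printf '2.4.2\n' > "$site_dir/.ruby-version"

# Show what the build will read.
cat "$site_dir/.ruby-version"
```

<p>Commit and push the file so that the next build picks it up.</p>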
<p>After these steps, the site is now up and running at <a href="https://dataninja.me/">https://dataninja.me/</a>.
Pretty sweet!</p>
<h3 id="making-disqus-comments-working-on-netlify">Making Disqus comments work on Netlify</h3>
<p>To use Disqus comments, one adds the following lines to <code class="highlighter-rouge">_config.yml</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>disqus:
  shortname: dataninja-me
</code></pre></div></div>
<p>This works on GitHub Pages, but the Netlify build doesn’t activate the comments.
This is because the minima theme <a href="https://github.com/jekyll/minima/blob/master/_includes/disqus_comments.html">checks a condition</a> along these lines</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% if jekyll.environment == "production" %}
</code></pre></div></div>
<p>before it activates Disqus comments.
GitHub Pages sets this environment when it builds the site,
but Netlify doesn’t, hence no Disqus comments.
To fix it, per the <a href="https://jekyllrb.com/docs/configuration/">Jekyll configuration directions</a>,
set this environment variable in the Netlify site deploy settings:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>JEKYLL_ENV=production
</code></pre></div></div>
<p>(The URL looks like <a href="https://app.netlify.com/sites/pensive-keller-afeae1/settings/deploys#build-environment-variables">https://app.netlify.com/sites/pensive-keller-afeae1/settings/deploys#build-environment-variables</a> in my case).</p>
<p>Voilà, now the comments work on Netlify.</p>
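<p>The same variable could also be kept in the repo rather than in the web settings. A minimal sketch, assuming a <code class="highlighter-rouge">netlify.toml</code> file in the repo root is used; the build command and publish directory shown are the usual Jekyll defaults rather than values taken from this site, and <code class="highlighter-rouge">demo-site</code> is a hypothetical directory:</p>

```shell
#!/bin/sh
# Sketch only: demo-site is hypothetical; command/publish are common Jekyll defaults.
set -e
site_dir="demo-site"
mkdir -p "$site_dir"

# File-based Netlify configuration with the environment variable Jekyll reads.
cat > "$site_dir/netlify.toml" <<'EOF'
[build]
  command = "jekyll build"
  publish = "_site"

[build.environment]
  JEKYLL_ENV = "production"
EOF

grep JEKYLL_ENV "$site_dir/netlify.toml"
```

<p>Keeping the setting in a file means it is versioned with the site, at the cost of one more config file to maintain.</p>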
<h2 id="conclusions">Conclusions</h2>
<p>I described how I migrated my homepage from
WordPress.com to GitHub Pages for speed and flexibility,
then to Netlify for HTTPS support.</p>
<h2 id="update-in-31118">Update on 3/11/18</h2>
<p>After a notebook migration,
my Xcode build pipeline broke,
which prevented me from installing Jekyll =(
After 2 hrs of trying to reinstall Xcode,
I concluded it was not worth it, got lazy(!),
and decided to use Docker (of course).
Per <a href="https://github.com/BretFisher/jekyll-serve">https://github.com/BretFisher/jekyll-serve</a>,
this is all I need:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run -p 80:4000 -v "$(pwd)":/site bretfisher/jekyll-serve
</code></pre></div></div>
<p><strong>Lesson: spend at most 1 hr on your own IT;
look for a Docker solution after that.</strong></p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://learn.cloudcannon.com/jekyll/why-use-a-static-site-generator/">Getting started with Jekyll - Series</a> by
CloudCannon</li>
</ul>
<h1>Routine for Starting a Data Science Project in R</h1>
<p>Posted 2017-08-23T16:51:01+00:00 at https://dataninja.me/2017/08/23/routine-for-starting-a-data-science-project-in-r</p>
<p>Routine is mostly a good thing. Morning routine, gym routine, bedtime routine, etc. Thanks to a routine or a good habit, one doesn't spend much time and energy deciding what to do or how to do it, which saves energy for more important questions like "why".</p>
<p>Routine is mostly a good thing for data scientists, too. Here's my routine for starting a new data science project in R, large or small:</p>
<ul>
<li>Create a GitHub repo for the project with a sensible name: all lowercase with dashes, no underscores (~1min)</li>
<li><code>git clone</code> it to my usual project directory (<code>~/projects/</code>) (30sec)</li>
<li>Write a <code>README.md</code> describing what the project is about (~1min)</li>
<li>Fire up RStudio and create an RStudio project (<code>.Rproj</code>) in the directory (~1min)</li>
<li>Write the first R script, typically named <code>initial-analysis.R</code></li>
<li>The first few lines of the script are almost always the same, like:
<ul>
<li><code>library(tidyverse)</code></li>
<li><code>df &lt;- read_csv("datafile")</code></li>
<li><code>glimpse(df)</code></li>
<li><code>df %>% ggplot(aes(x, y)) + geom_....</code> : yes... this is where things start to diverge...</li>
</ul>
</li>
</ul>
<p>So, that's about 10min to hit the ground running and start producing useful stuff.</p>
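<p>The start-up routine above could be scripted along these lines. This is only a sketch under assumptions: the project name is hypothetical, and it initializes a local git repo where the routine above clones one created on GitHub.</p>

```shell
#!/bin/sh
# Sketch only: the project name and file contents are placeholders.
set -e
project="my-new-analysis"   # hypothetical: all lowercase, dashes, no underscores
mkdir -p "$project"
cd "$project"

# Start tracking the directory if git is available
# (the routine above clones a repo created on GitHub instead).
if command -v git >/dev/null 2>&1; then
  git init -q
fi

# A README stub saying what the project is about.
printf '# %s\n\nWhat this project is about.\n' "$project" > README.md

# The boilerplate first lines of the first R script.
cat > initial-analysis.R <<'EOF'
library(tidyverse)
df <- read_csv("datafile")
glimpse(df)
EOF

cd ..
ls "$project"
```

<p>The RStudio project file is left out here, since RStudio creates it when the directory is opened as a project.</p>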
<p>Once things start rolling, daily routines are similar:</p>
<ul>
<li>Bunch of data massaging, like:
<ul>
<li><code>df %>% </code></li>
<li><code>group_by(x) %>% </code></li>
<li><code>filter(y %in% c("good", "fine")) %>% </code></li>
<li><code>summarize(mz=median(z))</code></li>
</ul>
</li>
<li>... and visualization:
<ul>
<li><code>df %>% </code></li>
<li><code>ggplot(aes(x, y)) + </code></li>
<li><code>geom_... +</code></li>
<li><code>facet_wrap(~w)</code></li>
</ul>
</li>
<li>... and reporting:
<ul>
<li><code>rmarkdown::render("that-special-markdown.Rmd")</code></li>
</ul>
</li>
<li>... and <code>git commit</code> / <code>git push</code> frequently.</li>
<li>Talk to the stakeholders for questions, news, etc.</li>
</ul>
<p>But, overall, fairly automatic, fast, and effective. Yes, routine is mostly a good thing.</p>
<p><strong>What's your routine</strong> for starting a data science project in R?</p>
<p>Very different from mine??</p>
<p>Let me (and the world) know!</p>
<h1>“따라 하며 배우는 데이터 과학” Released</h1>
<p>Posted 2017-08-09T08:44:26+00:00 at https://dataninja.me/2017/08/09/%EB%94%B0%EB%9D%BC-%ED%95%98%EB%A9%B0-%EB%B0%B0%EC%9A%B0%EB%8A%94-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EA%B3%BC%ED%95%99-%EC%B6%9C%EC%8B%9C</p>
<p>The book has been published without a hitch and is now shipping. Some readers have already received their copies.</p>
<p>I hear it is ranked 12th in the Computer/IT category at Kyobo Book Centre, so thank you for your interest! ( <a href="http://mobile.kyobobook.co.kr/showcase/book/KOR/9791185890869">Order the book at Kyobo</a> )</p>
<p>I will keep updating the errata and supplementary material, so if you bought the book, please follow the book's <a href="https://www.facebook.com/dataninja.me/">Facebook page</a>.</p>
<p>http://mobile.kyobobook.co.kr/showcase/bestseller/KOR?categoryCode=WEEK&linkClass=33&orderClick=Ols<br />
<img src="/assets/img_0270.jpg" /></p>
<h1>Data Science DevOps and Docker</h1>
<p>Posted 2017-08-06T05:58:36+00:00 at https://dataninja.me/2017/08/05/data-science-devops-and-docker</p>
<p>Data scientists sometimes have to (help) “productionize” their work, i.e. integrate data analysis, dashboards, and predictive models into a larger process or software pipeline. For example, imagine a system that (1) monitors for a data change, (2) triggers a data analysis process whenever a change happens, and (3) takes the output of the analysis to show on a webpage and/or stores output parameters in a database for other systems to use.</p>
<p>Data scientists typically work on part (2), prototyping a bunch of R or Python code. But when it’s time to build and deploy the system, integrating such data science code is not trivial. A big challenge is that a data scientist's work environment (e.g. a MacBook laptop with R and many, many, many R packages) is typically very different from a “deployment” environment (e.g. a Linux box in AWS EC2 or a corporate VM). Installing R and a bunch of dependent R libraries on the deployment machine is frowned upon by ops and software engineers, since it’s usually a painful, fragile process.</p>
<p>In an ideal world, the R / Python code a data scientist developed on their laptop would “just work” when dropped on the deployment server(s). Too good to be true? Well, that ideal world is already here thanks to a fantastic technology called “<a href="https://www.docker.com/">Docker</a>”. Using Docker, data science analyses and prototypes can get very close to something that can be deployed quickly and efficiently, just as devops helped developers productionize and operationalize their work better. We can even call it “data science devops”.<img class=" size-full wp-image-509 alignright" src="/assets/vertical_large.png" alt="vertical_large" width="286" height="237" /></p>
<p>Essentially, the first step toward data science devops consists of two practices:</p>
<ol>
<li><strong>Make the R code into a command-line script</strong> that can be executed via <a href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html">Rscript</a>, preferably with an option parser such as <a href="https://cran.r-project.org/web/packages/argparse/index.html">R argparse</a>. This has the added benefit of forcing <a href="https://en.wikipedia.org/wiki/Reproducibility">reproducibility</a>. It also pushes data scientists to think in terms of APIs and the “<a href="https://en.wikipedia.org/wiki/Unix_philosophy">do one thing well</a>” (UNIX philosophy) mentality, which leads to a cleaner code structure.</li>
<li><b>Dockerize the R application</b>. Start with, e.g., <a href="https://hub.docker.com/r/rocker/verse/">rocker/verse</a> and <a href="https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/">add/modify</a> the <a href="https://github.com/rocker-org/rocker-versioned">Dockerfile</a> as needed.</li>
</ol>
<p>With the above two practices, on any machine with Docker installed, the app could be "deployed" like:</p>
<pre>$ docker pull your-org/my-r-app</pre>
<p>and could be run like:</p>
<pre>$ docker run your-org/my-r-app ARG1 ARG2 ...</pre>
<p>The beauty is that it will run in ANY environment where Docker is installed: your Mac or Windows laptop, an EC2 Linux host, your corporate VM, and so on.</p>
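<p>The two practices can be sketched together as follows. This is a rough sketch under assumptions, not a tested production setup: the directory, script name, and script body are hypothetical, and the script uses base R's <code>commandArgs()</code> instead of argparse to avoid extra dependencies.</p>

```shell
#!/bin/sh
# Sketch only: file names, paths, and the R script body are hypothetical.
set -e
app_dir="my-r-app"
mkdir -p "$app_dir"

# Practice 1: a command-line R script that takes its input file as an argument.
cat > "$app_dir/analyze.R" <<'EOF'
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("usage: analyze.R INPUT_CSV")
df <- read.csv(args[1])
print(summary(df))
EOF

# Practice 2: a Dockerfile that wraps the script.
# Arguments given to `docker run` are passed on to Rscript via ENTRYPOINT.
cat > "$app_dir/Dockerfile" <<'EOF'
FROM rocker/verse
COPY analyze.R /app/analyze.R
ENTRYPOINT ["Rscript", "/app/analyze.R"]
EOF

cat "$app_dir/Dockerfile"
```

<p>After building with something like <code>docker build -t your-org/my-r-app my-r-app</code>, the arguments on the <code>docker run</code> line reach the script. Input files still have to be made visible to the container, e.g. by mounting a directory with <code>-v</code>.</p>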
<p>Is it actually easy? NO. You need to spend a good ~100 hours or so before you are at home writing your own Dockerfile with confidence. Is learning how to dockerize an R app helpful? YES, very much so. Once you make a habit of developing your R analysis pipeline in a Dockerized, reproducible setting, your code will be cleaner, more reproducible, and super easy to deploy. Your dev / ops coworkers will thank you. You and your team’s productivity will improve (YMMV).</p>
<p>So, dear fellow data scientists: <a href="https://docs.docker.com/engine/installation/">install Docker</a> and <a href="https://docs.docker.com/get-started/">start learning how to use it</a>. Welcome to data science devops.</p>
<h1>Lessons Learned While Writing an R Data Science Book</h1>
<p>Posted 2017-04-21T08:08:02+00:00 at https://dataninja.me/2017/04/21/lessons-learned-while-writing-r-data-science-book</p>
<p>The publisher told me that the manuscript of my R data science book (in Korean! Not English, yet) is currently undergoing a second proofreading and will be published around June after the final review. The process has taken almost two years since 2015, when I was first asked to write it. I would like to share a few tips that I learned during the writing process.</p>
<h2 id="1-writing-a-book-is-hard">1. Writing a book is hard</h2>
<p>Understand that writing a book is hard before accepting an offer to write one. I have written dozens of ~10-page <a href="https://scholar.google.com/citations?user=tVkKxokAAAAJ">academic papers</a> and am familiar with the paper-writing process. That is why I accepted the challenge of writing a long-form book rather easily, thinking “How different would writing a book be from writing a few papers?” The two turned out to be very different!</p>
<p>Academic papers are aimed at a professional audience (hence, paradoxically, easier to write!), are short, and a draft is typically completed within a week or two. Writing an academic paper is like running a 1 km race. (Of course, the one or two weeks cover only the “writing”; the research work itself takes a lot longer.)</p>
<p>On the other hand, books are typically aimed at a general audience, are long, and take at least a few months to write. Writing a book is like running a marathon. It requires better planning, more disciplined pacing, and lots of patience and perseverance. It was a particularly challenging and humbling experience for me, as I tend to work better in bursts on many small projects.</p>
<h2 id="2-get-feedback-early-on">2. Get feedback early on</h2>
<p>Seek feedback on your writing early and often. Feedback is essential for good writing in general. (See <a href="http://stevenpinker.com/publications/sense-style-thinking-persons-guide-writing-21st-century">“The Sense of Style”</a> for a great general introduction that touches on feedback and other guidance on good writing.) When writing a book, getting feedback has the additional benefit of adding a few small milestones, which makes the whole process more incremental and agile. Feedback also makes book writing a lot more social and a less lonely endeavor.</p>
<h2 id="3-collaborate-with-google-doc">3. Collaborate with Google Docs</h2>
<p>Use online collaborative tools such as <a href="http://drive.google.com">Google Docs</a> for drafting. Google Docs lets you easily share your drafts with reviewers and receive feedback and comments in real time. Very efficient.</p>
<h2 id="4-version-control-your-codes">4. Version control your code</h2>
<p>Version control the code used for analysis and for producing charts. <a href="https://cran.r-project.org/web/views/ReproducibleResearch.html">Reproducibility</a> is one of the main themes of my upcoming book. Use GitHub to track R, Python, or other source code. You can provide the code as an appendix to the book. (The source code for my book is tracked <a href="https://github.com/Jaimyoung/ipds-kr">here</a>.) Some books are even available as a GitHub repo, like <a href="https://github.com/hadley/adv-r">Advanced R</a> by Hadley Wickham!</p>
<h2 id="5-get-the-references-right-early-on">5. Get the references right early on</h2>
<p>Maintain a list of high-quality reference materials that are OK to use for publication. Many reference materials such as articles, diagrams, and photos can be found on the web. Record the URL and the date and time of access for those materials. Check the copyright information to make sure the material may be reproduced in a book. Material on Wikipedia is typically OK to use, but one should specify the source.</p>
<p>Here’s a challenging example: I was referencing the “Aeron chair” in the book and sent the publisher a photo of one downloaded from Wikipedia, but the picture quality was not good enough for publication. Print publication needs a resolution of at least 300 <a href="https://en.wikipedia.org/wiki/Dots_per_inch">dpi</a>. I started searching the web for photos of an Aeron chair that are both high quality and copyright-free, but such photos turned out to be very difficult to find! (If you have found such a picture, please let me know 🙂)</p>
<p><img class="alignnone size-full wp-image-203" src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Aeron_chair_JN.jpg" alt="aeron_chair_jn" width="322" height="345" /><br />
Photo: Aeron Chair. (Source: <a href="https://en.wikipedia.org/wiki/Aeron_chair">https://en.wikipedia.org/wiki/Aeron_chair</a>)</p>
<h2 id="6-get-the-image-format-right">6. Get the image format right</h2>
<p>Consult with the publisher to determine the size, resolution, and font size for charts. In my case, the publisher was fine with <a href="https://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/png.html">PNG files</a>, but it took some trial and error and going back and forth to arrive at the following personal standards (see the R code snippet at the bottom):</p>
<ul>
<li>Size is 5.5 in. X 4 in.,</li>
<li>Resolution 600 dpi, and</li>
<li><a href="https://en.wikipedia.org/wiki/Point_(typography)">Text Point Size</a> = 9 (if you use <a href="https://www.rdocumentation.org/packages/ggplot2/versions/2.2.1/topics/ggsave">ggsave</a> function in R, you typically don’t need to worry about this at all)</li>
</ul>
<h2 id="wrapping-up">Wrapping up…</h2>
<p>That’s it. I wish I knew these 2 years ago.</p>
<p>Writing a book for the first time in my life was a lot harder than I had thought, but it was also a great experience. I <em>learned a lot</em> (including above tips), there’s a <em>sense of achievement</em> (like finishing a marathon<em>), and it’s *rewarding</em> to think that my work when published in June would <em>benefit some readers</em> getting better in the trade of data science. If you have the experience and knowledge to share with the world, you should definitely consider it!</p>
<h3 id="code-snippets">Code Snippets:</h3>
<p>A few lines of R code for exporting publication-quality charts. To use them, specify the file names. Experiment with different <strong>width, height, and dpi</strong> options (in that order!) until you get a satisfactory PNG file.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># A few useful lines to export plots</span><span class="w">
</span><span class="c1"># 1. base R graph</span><span class="w">
</span><span class="n">png</span><span class="p">(</span><span class="s2">"../../plots/.png"</span><span class="p">,</span><span class="w"> </span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="o">=</span><span class="s1">'in'</span><span class="p">,</span><span class="w"> </span><span class="n">pointsize</span><span class="o">=</span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="o">=</span><span class="m">600</span><span class="p">)</span><span class="w">
</span><span class="c1"># Plot body</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span><span class="c1"># 2. single ggplot</span><span class="w">
</span><span class="c1"># Produce ggplot first. Then...</span><span class="w">
</span><span class="n">ggsave</span><span class="p">(</span><span class="s2">"../../plots/.png"</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="o">=</span><span class="s1">'in'</span><span class="p">,</span><span class="w"> </span><span class="n">dpi</span><span class="o">=</span><span class="m">600</span><span class="p">)</span><span class="w">
</span><span class="c1"># 3. plot matrix from library(gridExtra)</span><span class="w">
</span><span class="c1"># Produce p1, p2, p3, p4 as individual ggplot objects first</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">arrangeGrob</span><span class="p">(</span><span class="n">p</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">ggsave</span><span class="p">(</span><span class="s2">"../../plots/.png"</span><span class="p">,</span><span class="w"> </span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="o">=</span><span class="s1">'in'</span><span class="p">,</span><span class="w"> </span><span class="n">dpi</span><span class="o">=</span><span class="m">600</span><span class="p">)</span></code></pre></figure>
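<p>For a version of the base R case that runs as-is, here is a minimal sketch. The temporary output file and the built-in <code>cars</code> dataset are stand-ins chosen for illustration, not the paths or plots used for the book, and it assumes your R build has a working PNG device.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical output path: a temporary file instead of a real plots folder
out_file <- tempfile(fileext = ".png")

# Open the PNG device: 5.5 in. x 4 in., 9 pt text, 600 dpi
png(out_file, width = 5.5, height = 4, units = "in", pointsize = 9, res = 600)

# Plot body: the built-in 'cars' dataset stands in for a real chart
plot(cars, main = "Speed vs. stopping distance")

# Close the device so the file is actually written to disk
invisible(dev.off())

# Confirm that a non-empty file was produced
print(file.exists(out_file))
print(file.size(out_file) > 0)</code></pre></figure>
<p>Naming the <code>width</code> and <code>height</code> arguments, rather than passing them by position, makes it harder to swap them by accident.</p>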
<p>* Well, I have never actually run a marathon.</p>
<p>** This is the English version of <a href="/2017/04/15/r-데이터-과학-서적을-집필하면서-배운-몇가지">this article</a>, originally written in Korean.</p>The publisher told me that my R data science book manuscript is currently undergoing a second proofreading and it will be published around June after the final review. The process took me almost two years since July 2015 when I was first asked to write it. I would like to share a few tips that I learned during the writing process.A Few Things I Learned While Writing an R Data Science Book2017-04-16T01:10:56+00:002017-04-16T01:10:56+00:00https://dataninja.me/2017/04/15/r-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EA%B3%BC%ED%95%99-%EC%84%9C%EC%A0%81%EC%9D%84-%EC%A7%91%ED%95%84%ED%95%98%EB%A9%B4%EC%84%9C-%EB%B0%B0%EC%9A%B4-%EB%AA%87%EA%B0%80%EC%A7%80<p>I heard from the publisher that the manuscript of the R data science book I finished early this year is now in its second round of proofreading and copyediting, and that it will be published around June after a final review. Considering that I was first asked to write it in July 2015, it has taken almost two years. I would like to share a few tips that I learned along the way.</p>
<ol>
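<li><strong>Before agreeing to write a book, understand that writing one is hard, and think it over carefully.</strong> I had written 10-15 page papers many times and was comfortable with that kind of work, so I readily agreed, thinking “how different can a book be?” A paper targets expert readers (which, paradoxically, makes it easier to write), is short, and has its outline and first draft done within a week or two. In running terms, it is a middle-distance race. (I am of course talking only about the “writing”; the research needed to produce something worth writing up is not included.) A book, by contrast, targets a general audience, so you have to write at their level; it is long (mine is now over 300 pages!); and it usually takes several months or more to write. It is comparable to a marathon. I can’t say that writing this first book was harder than writing ten papers. But I felt firsthand that it is work that requires looking far ahead, pacing yourself, being patient, and keeping at it steadily. For someone as impatient as I am, it was an especially tough discipline.</li>
<li><strong>Get feedback on what you write early and often.</strong> Seeking feedback regularly is a basic habit of good writing, because it keeps you at the reader’s level and improves the content. Another benefit is that feedback turns the long writing process, which can easily become a lonely, marathon-like battle with yourself, into something less lonely and more exciting.</li>
<li><strong>Use a collaboration tool such as Google Docs for the first draft.</strong> With a tool built for online collaboration, such as Google Docs, you can easily share the draft with reviewers and collect and incorporate their feedback and comments.</li>
<li><strong>Save the analysis and chart generation as code so that they are reproducible.</strong> Reproducibility is one of the important themes of this book. It helps to version-control the R code for the analyses and charts with GitHub. It also helps when you provide the code as a supplement to the book. (For reference, the source code for this book is at https://github.com/Jaimyoung/data-science-book-korean.)</li>
<li><strong>Manage the source information and copyright of cited figures and materials ahead of time.</strong> When you take reference material from the web, record the URL and the access date and time, and check the copyright. Wikipedia is fine in most cases, but you must cite the source. For example, to reference the “Aeron chair” in this book, I sent the publisher the photo from the Wikipedia article https://en.wikipedia.org/wiki/Aeron_chair, but its resolution was not high enough. Finding a high-resolution photo on the web that is free of copyright turned out to be really hard! (If any reader has found such a photo, please get in touch.)</li>
<li><strong>Decide the size, resolution, font size, and so on for charts in consultation with the publisher.</strong> For reference, this publisher said that PNG files were fine, with a size of about 5.5 in. X 4 in., a resolution of 600 dpi, and Text Point Size = 9, so I could settle the parameters of the chart export code. See the R code at the bottom of this post.</li>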
</ol>
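<p>As one possible way to follow the tip about recording sources, here is a minimal sketch that keeps a small log of cited web materials in a CSV file. The column names, the <code>references.csv</code> file name, and the use of today’s date as the access date are all assumptions made for illustration, not how the book’s references were actually tracked.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical reference log: one row per cited web resource
refs <- data.frame(
  description = "Aeron chair photo",
  url         = "https://en.wikipedia.org/wiki/Aeron_chair",
  accessed    = as.character(Sys.Date()),  # stand-in for the real access date
  license     = NA_character_,             # fill in after checking the source page
  stringsAsFactors = FALSE
)

# Write the log to a temporary folder (hypothetical location)
log_file <- file.path(tempdir(), "references.csv")
write.csv(refs, log_file, row.names = FALSE)

# Read it back to confirm what was recorded
logged <- read.csv(log_file, stringsAsFactors = FALSE)
print(logged)</code></pre></figure>
<p>Keeping such a file under version control next to the analysis code means the source and license of each figure can be checked again when the publisher asks.</p>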
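<p>Had I known these tips when I started, the writing would have gone somewhat more smoothly.</p>
<p>In any case, writing a book for the first time in my life was much harder than I expected, but it was also a good experience that taught me a lot. If you have experience or knowledge worth sharing with others, I recommend trying it at least once in your life.</p>
<p>When the book finally comes out in June, I hope it helps at least one person who is interested in data science.</p>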
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># A few useful lines to export plots</span><span class="w">
</span><span class="c1"># 1. base R graph</span><span class="w">
</span><span class="n">png</span><span class="p">(</span><span class="s2">"../../plots/.png"</span><span class="p">,</span><span class="w"> </span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="o">=</span><span class="s1">'in'</span><span class="p">,</span><span class="w"> </span><span class="n">pointsize</span><span class="o">=</span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="o">=</span><span class="m">600</span><span class="p">)</span><span class="w">
</span><span class="c1"># Plot body</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span><span class="c1"># 2. single ggplot</span><span class="w">
</span><span class="c1"># Produce ggplot first. Then...</span><span class="w">
</span><span class="n">ggsave</span><span class="p">(</span><span class="s2">"../../plots/.png"</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="o">=</span><span class="s1">'in'</span><span class="p">,</span><span class="w"> </span><span class="n">dpi</span><span class="o">=</span><span class="m">600</span><span class="p">)</span><span class="w">
</span><span class="c1"># 3. plot matrix from library(gridExtra)</span><span class="w">
</span><span class="c1"># Produce p1, p2, p3, p4 as individual ggplot objects first</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">arrangeGrob</span><span class="p">(</span><span class="n">p</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">ggsave</span><span class="p">(</span><span class="s2">"../../plots/.png"</span><span class="p">,</span><span class="w"> </span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="o">=</span><span class="s1">'in'</span><span class="p">,</span><span class="w"> </span><span class="n">dpi</span><span class="o">=</span><span class="m">600</span><span class="p">)</span></code></pre></figure>
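<p>The snippet above leaves the plots themselves out. A minimal runnable sketch of the plot-matrix case might look like the following, assuming the ggplot2 and gridExtra packages are installed. The four <code>mtcars</code> plots and the temporary output file are stand-ins chosen for illustration, not charts from the book.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(ggplot2)
library(gridExtra)

# Four placeholder plots built from the built-in 'mtcars' dataset
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
p4 <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)

# Arrange them in a 2 x 2 grid without drawing to the screen
g <- arrangeGrob(p1, p2, p3, p4, ncol = 2)

# Hypothetical output path: a temporary file instead of a real plots folder
out_file <- tempfile(fileext = ".png")

# Save the whole grid at 5.5 in. x 4 in. and 600 dpi
ggsave(out_file, g, width = 5.5, height = 4, units = "in", dpi = 600)

# Confirm that the file was written
print(file.exists(out_file))</code></pre></figure>
<p>With four panels sharing a 5.5 in. X 4 in. canvas, text can get cramped, so it is worth checking the exported file at its final print size.</p>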
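<p>** Postscript (4/22): In the end, I <a href="https://goo.gl/photos/edTEBuqRDGxXVCz7A">took the Aeron chair photo myself</a>.</p>I heard from the publisher that the manuscript of the R data science book I finished early this year is now in its second round of proofreading and copyediting, and that it will be published around June after a final review. Considering that I was first asked to write it in July 2015, it has taken almost two years. I would like to share a few tips that I learned along the way.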