Removing Redundant Hostnames with NGINX

2017-09-15     NGINX

While poring over my Google Analytics data I noticed the notification below. Obviously this is not a train smash, but it is compromising the quality of my data. And it also offends my OCD. This is what I did to fix the problem. Read more »

Creating an S3 Bucket

2017-09-14     AWS

There are many good reasons to use S3 (Simple Storage Service). This is a quick overview of how to create an S3 bucket. Read more »

Installing Docker on Ubuntu

2017-09-14     Docker Linux

This procedure works on both my laptop and a fresh EC2 instance. Read more »

Hosting a Plumber API on AWS

2017-09-14     AWS R Plumber

I’ve been putting together a small proof-of-concept API using R and plumber. It works flawlessly on my local machine and I was planning on deploying it on an EC2 instance to demo it for a client. However, I ran into a snag: despite opening the required port in my Security Group I was not able to access the API. This is what I needed to do to get it working. Read more »

Creating an AWS Spot Instance

2017-09-13     AWS

EC2 Spot Instances can provide very affordable computing on EC2 by allowing access to unused capacity at significant discounts. Read more »

Building a Local OSRM Instance

2017-09-11     R OSRM

The Open Source Routing Machine (OSRM) is a library for calculating routes, distances and travel times between spatial locations. It can be accessed via either an HTTP or a C++ API. Since it’s open source you can also install it locally, download appropriate map data and start making efficient travel calculations. These are the instructions for getting OSRM installed on an Ubuntu machine and hooking up the osrm R package. Read more »
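
A minimal sketch of querying a local OSRM instance over its HTTP API, written in Python rather than with the osrm R package that the post covers. It assumes that osrm-routed is listening on localhost port 5000 and serving a driving profile. The helper names (route_url, summarise, fetch_route) are made up for illustration. Note that OSRM expects coordinates in longitude,latitude order.

```python
import json
from urllib.request import urlopen


def route_url(origin, destination, base="http://localhost:5000"):
    """Build an OSRM route request; points are (longitude, latitude) pairs."""
    coords = ";".join(f"{lon},{lat}" for lon, lat in (origin, destination))
    # overview=false skips the route geometry, leaving just the summary.
    return f"{base}/route/v1/driving/{coords}?overview=false"


def summarise(response):
    """Return (distance in metres, duration in seconds) for the first route."""
    route = response["routes"][0]
    return route["distance"], route["duration"]


def fetch_route(origin, destination):
    """Query the (assumed) local server and summarise the reply."""
    with urlopen(route_url(origin, destination)) as reply:
        return summarise(json.load(reply))
```

Calling fetch_route() with the two points swapped is a quick way to compare the trip from A to B with the trip from B to A.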

Global Variables in R Packages

2017-09-07     R

I know that global variables are from the Devil, but sometimes you just can’t get around them. I’m building a small package for a client that relies on a data file. For various reasons that file is not part of the package and can reside in different locations on users’ machines. Furthermore there are users on both Windows and Linux machines. Read more »

Driving AWS from the Command Line

2017-08-31     AWS

Although it’s very handy (and easy) to set up some cloud resources using the AWS Management Console, once you know what you need it makes a lot of sense to automate the process. Fortunately there’s a handy little command line tool, aws, which makes this eminently possible. The AWS CLI Command Reference is the definitive resource for this tool. There’s a mind-boggling array of possibilities. We’ll take a look at a small selection of them. Read more »

Route Asymmetry in Google Maps

2017-08-23     R

I have been retrieving some route information using Rodrigo Azuero’s gmapsdistance package and noted that there was some asymmetry in the results: the time and distance for the trip from A to B was not necessarily always the same as the time and distance for the trip from B to A. Although in retrospect this seems self-evident, it merited further investigation. Read more »

Retrieving Kaggle Data from the Command Line

2017-08-21     Kaggle AWS

We’ve been building some models for Kaggle competitions using an EC2 instance for compute. I initially downloaded the data locally and then pushed it onto EC2 using SCP. But there had to be a more efficient way to do this, especially given the blazing fast bandwidth available on AWS. Enter kaggle-cli. Update: Apparently kaggle-cli has been deprecated in favour of kaggle-api. More information below. Read more »

Setting Up Time Zones in BASH

2017-08-20     BASH

Ensuring that your account is configured to run with appropriate time zone information can make your life a lot easier. Of course, if you administer your own system then you can simply set your system time to local time. However, it’s generally a better idea to set system time to Universal Time (UTC) and then configure time zone information on a per-user basis. Why does this make sense? Well, suppose that you have remote users logging onto your system. It’s very likely that a remote user will be operating in a different time zone and it’d be handy for them to have system time converted into their local time. Read more »
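
To illustrate the idea of keeping system time in UTC and converting per user, here is a small Python sketch built around the TZ environment variable, which is one common per-user mechanism on Unix systems (the excerpt does not say exactly which method the post uses). The function name format_in_zone is made up for illustration. The example uses a POSIX-style TZ value, "SAST-2", meaning a zone called SAST that is two hours ahead of UTC; names like Africa/Johannesburg should also work where the time zone database is installed.

```python
import os
import time


def format_in_zone(epoch_seconds, tz_value):
    """Format a Unix timestamp as local time under the given TZ setting."""
    previous = os.environ.get("TZ")
    os.environ["TZ"] = tz_value
    time.tzset()  # make the C library re-read TZ (Unix only)
    try:
        local = time.localtime(epoch_seconds)
        return time.strftime("%Y-%m-%d %H:%M %Z", local)
    finally:
        # Put the caller's time zone setting back the way it was.
        if previous is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = previous
        time.tzset()


# The same instant, seen by the system (UTC) and by a user two hours ahead.
print(format_in_zone(0, "UTC0"))
print(format_in_zone(0, "SAST-2"))
```

The underlying timestamp never changes; only its presentation depends on the TZ value in each user’s environment.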

Setting Up Time Zones in MySQL

2017-08-20     MySQL Django

I’m in the process of setting up a Zinnia blog on one of my Django sites. After putting all of the necessary plumbing in place I got the following message on first visiting the blog URL: "Database returned an invalid value in QuerySet.datetimes(). Are time zone definitions for your database and pytz installed?" The solution to this is to copy your system’s time zone information across to the database. Read more »

Adding a Volume to an Ubuntu EC2 Instance

2017-08-10     AWS

Some quick notes on adding a storage volume to an EC2 instance. Read more »

Remote Desktop on an Ubuntu EC2 Instance

2017-08-08     AWS

A couple of options for remote access to desktop applications on an EC2 host. Read more »

A Timeline History of R

2017-08-05     R

A record of some more or less important events in the history of R. This is a work in progress. The information is cobbled together from a range of sources. If you have pertinent items to add, please let me know via the comments. Read more »

Adding Users to an EC2 Ubuntu Instance

2017-07-24     AWS Linux SSH

By default an EC2 instance has only a single user other than root. For example, on an Ubuntu instance, that user is ubuntu. If there will be multiple people accessing the instance then it’s generally necessary for each of them to have their own account. Setting this up is pretty simple: it just requires sorting out some authentication details. Read more »

Favourite Talks from useR 2017

2017-07-23     R web scraping Docker Teaching QGIS

Docker: Persisting User Data

2017-07-20     Docker

I’m busy putting together a Docker image for a multi-user Jupyter Notebook installation. I aim to have an independent login for each of the users and each of them should also have their own storage space. That space should live outside the container, though, so that even if the container stops, the data lives on. This should mitigate user rage. Read more »

Deploying Jupyter on AWS using Docker

2017-07-18     Jupyter Docker AWS

Amazon’s EC2 Container Service (ECS) is an orchestrated system for deploying Docker containers on AWS. This post is about not using ECS. Read more »

RStudio Environment on DigitalOcean with Docker

2017-07-11     R Docker

I’ll be running a training course in a few weeks which will use RStudio as the main computational tool. Since it’s a short course I don’t want to spend a lot of time sorting out technical issues. And with multiple operating systems (and versions) these issues can be numerous and pervasive. Setting up an RStudio server which everyone can access (and that requires no individual configuration!) makes a lot of sense. Read more »

Accessing PySpark from a Jupyter Notebook

2017-07-04     Jupyter Spark

It’d be great to interact with PySpark from a Jupyter Notebook. This post describes how to get that set up. It assumes that you’ve installed Spark like this.

1. Install the findspark package.
   $ pip3 install findspark
2. Make sure that the SPARK_HOME environment variable is defined.
3. Launch a Jupyter Notebook.
   $ jupyter notebook
4. Import the findspark package, use findspark.init() to locate the Spark installation, and then load the pyspark module.

Read more »
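
The last steps might look something like the sketch below when run in a notebook cell. The helper names spark_home_is_set and load_pyspark are made up for illustration. The findspark and pyspark imports are deferred so that the environment check still works on a machine where neither is installed.

```python
import importlib.util
import os


def spark_home_is_set(env=os.environ):
    """Return True if SPARK_HOME is defined and non-empty."""
    return bool(env.get("SPARK_HOME"))


def load_pyspark():
    """Use findspark to make pyspark importable, then import it."""
    import findspark  # third-party: pip3 install findspark

    findspark.init()  # checks SPARK_HOME and adds Spark's Python libraries to sys.path
    import pyspark

    return pyspark


if __name__ == "__main__":
    if not spark_home_is_set():
        print("SPARK_HOME is not defined.")
    elif importlib.util.find_spec("findspark") is None:
        print("findspark is not installed.")
    else:
        pyspark = load_pyspark()
        print("Loaded pyspark", pyspark.__version__)
```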

Installing Hadoop on Ubuntu

2017-07-04     Linux Hadoop

This is what I did to set up Hadoop on my Ubuntu machine. Read more »

Installing Spark on Ubuntu

2017-07-04     Linux Spark

I’m busy experimenting with Spark. This is what I did to set up a local cluster on my Ubuntu machine. Before you embark on this you should first set up Hadoop. Read more »

Increasing MySQL Packet Maximum Size

2017-07-01     MySQL

In the process of uploading a massive CSV file to my Django application my session data are getting pretty big. As a result I’m getting these errors: (1153, "Got a packet bigger than 'max_allowed_packet' bytes") and (2006, 'MySQL server has gone away'). The second error is potentially unrelated. After some research it became apparent that the source of the problem is my max_allowed_packet setting. A quick check to find the current value: Read more »

Setting up ExpressVPN on Ubuntu

2017-06-23     Linux

I’ve been meaning to set up a VPN and this morning seemed like a good time to tick it off the bucket list. This is a quick outline of my experience, which included one minor hiccup. Read more »

Setting up Jupyter with Python 3 on Ubuntu

2017-06-23     Jupyter Linux

A short note on how to set up Jupyter Notebooks with Python 3 on Ubuntu. The instructions are specific to Xenial Xerus (16.04) but are likely to be helpful elsewhere too. Read more »

Deploying a Minimal API using plumber on DigitalOcean

2017-06-21     R

RSelenium and Java Heap Space

2017-06-09     R Web Scraping Selenium

I’m in the process of deploying a scraper on a DigitalOcean instance. The scraper uses RSelenium with the PhantomJS browser. I ran into a problem though. Although it worked flawlessly on my local machine, on the remote instance it broke with the following error:

    Selenium message:Java heap space

    Error: Summary: UnknownError
           Detail: An unknown server-side error occurred while processing the command.
           class: java.lang.OutOfMemoryError
           Further Details: run errorDetails method
    Execution halted

Clearly a Java memory issue. Read more »

Web Dev to Data Science


This infographic, which looks at the overlap between Web Dev and Data Science, appeared recently on the DataCamp blog. Since I’m currently straddling these two disciplines it seems rather appropriate. Read more »

Comrades Marathon 2017