Retail Data: R Package

Have you ever noticed how things seem to get really expensive at specific times of the year? Like Mother’s Day and Valentine’s Day? Have you ever felt a bit ripped off when buying an over-priced bouquet of flowers or box of chocolates? Have you ever wondered just how much those prices have been inflated? Of course you have! But it’s always been a niggling suspicion, never a fact. Where’s the evidence?

R Package for @racently

An R wrapper for the @racently API.

Durban EDGE DataQuest

A couple of quick R starter scripts for the Durban EDGE DataQuest.

An API for @racently

Retrieving running data using the @racently API.

Scraping Machinery Parts

Scraping prices from a supplier of replacement parts for heavy machinery.

Installing Prophet on CentOS

How to install the Prophet package for R on RHEL or CentOS.

Private Security and the Pareto Principle

Private Security is a big industry in South Africa. Most Private Security companies promise to provide a rapid response to every callout generated by any of their customers. There is a delicate balance between the number of response vehicles and the number of customers (and the frequency of their callouts!), which determines whether or not they are able to honour this promise. On the one hand, more response vehicles result in lower response times.

R, Docker and Checkpoint: A Route to Reproducibility

I need to deploy Shiny on a Windows machine. I also need to use {checkpoint} for package management. Using Docker seems to be the only reasonable approach to Shiny on Windows. But how easy would it be to also factor {checkpoint} into this setup? Only one reasonable way to find out: give it a try. Below is the simple Dockerfile I used. Here are the fundamental components of what it does:

All Roads Lead to Rome

I was inspired by this visualisation, showing the optimal routes (by car) from the geographic centre of the USA to all counties. The proverb “All Roads Lead to Rome” immediately came to mind and I set out to hack together something along that theme. This is what was required: Find a list of major cities in Europe and Asia. Use OSRM to generate routes from each of these cities to Rome.

Recreating 'Unknown Pleasures' graphic

For some time I’ve wanted to recreate the cover art from Joy Division’s Unknown Pleasures album. The visualisation depicts successive pulses from the pulsar PSR B1919+21, discovered by Jocelyn Bell in 1967. The first obstacle was acquiring the data. I found a D3 visualisation by Mike Bostock. This in turn pointed me to a CSV file in a gist belonging to @borgar. After reading the CSV data into pulsar I applied some light wrangling (the raw data is a matrix).

Comrades Marathon (2019) Splits

I’m looking at ways to effectively visualise the splits data for the 2019 edition of the Comrades Marathon. My objectives are to provide: an overall view of the splits across the entire field and a detailed view for individual runners (relative to the rest of the field). My working solution for visualising the global splits data is a ridgeline plot created with the {ggridges} package.

Medal Breakdown at Comrades Marathon (2019)

A quick breakdown of the medal distribution at the 2019 edition of the Comrades Marathon. This is what the medal categories correspond to:

Gold: first 10 men and women
Wally Hayward (men): 11th position to sub-6:00
Isavel Roche-Kelly (women): 11th position to sub-7:30
Silver (men): 6:00 to sub-7:30
Bill Rowan: 7:30 to sub-9:00
Robert Mtshali: 9:00 to sub-10:00
Bronze: 10:00 to sub-11:00
Vic Clapham: 11:00 to sub-12:00

Comrades Marathon (2019) Start Delay

How long does it take to cross the start line at the Comrades Marathon? If you’re lucky enough to be starting in one of the batches which is close to the front then this might be a matter of seconds to a couple of minutes. But if you’re in a batch closer to the back then this could be anything up to ten or eleven minutes. This is an agonising wait when all you want to do is start running.

A Shiny Comrades Marathon Pacing App

The Comrades Marathon is an epic ultramarathon run each year between Durban and Pietermaritzburg (South Africa). A few years ago I put together a simple spreadsheet for generating a Comrades Marathon pacing strategy. But the spreadsheet was clunky to use and laborious to maintain. Plus I was frustrated by the crude plots (largely due to my limited spreadsheet proficiency). It seemed like an excellent opportunity to create a Shiny app.

emayili: Sending Email from R

At Exegetic we do a lot of automated reporting with R. Being able to easily and reliably send emails is a high priority. There is already a selection of packages for sending email from R: {mailR}, {gmailr}, {blastula}, {blatr} (Windows), {mail} and {sendmailR}. We’ve had the most experience with the first two, both of which are really solid packages. However, {gmailr} uses the Google Mail API so it doesn’t work with all SMTP servers, and {mailR} has a dependency on {rJava} which can be a bit of a hurdle for deploying in some environments.

Setting up an R Admin Group

When I set up an R server for clients they often want to be able to install packages so that all users on the machine have access to them. This requires them to be able to install the packages onto the root filesystem rather than under their individual home directories. It would be easy enough to give them su access, but this is a risky approach. There are so many other things on the system that they could break with this level of power.

Integrating Qlik Sense and R

Qlik Sense is a tool for exploratory data analysis and visualisation. It’s powerful and versatile. It can, however, be significantly enhanced by interfacing with R. Qlik Sense does not currently integrate directly with R. However, it’s not too tricky to get the two systems talking to each other. We’ll need two things to make this happen: Rserve, a TCP/IP server which allows other programs to use R without initialising a separate R process or linking against an R library; and SSE R-plugin, a server-side extension (SSE) which provides the interface between Qlik Sense and Rserve.

Docker Images for R: r-base versus r-apt

I need to deploy a Plumber API in a Docker container. The API has some R package dependencies which need to be baked into the Docker image. There are a few options for the base image: r-base, tidyverse or r-apt. The first option, r-base, would require building the dependencies from source, a somewhat time-consuming operation. The last option, r-apt, makes it possible to install most packages using apt, which is likely to be much quicker.

RServe: Getting Started

Rserve is a server which allows other programs to use the facilities of R via TCP/IP.

Installing: since Rserve gets installed to system folders, you need to do the install as the root user.

    # Become root.
    $ sudo su
    # Run R as root.
    $ R
    > install.packages("Rserve")

Running: to launch as a daemon.

    $ R CMD Rserve

To launch in debug mode.

    $ R CMD Rserve.

JSON Payload for POST Request

Starting with a JSON body because this is the way that most API documentation will give you the payload examples.

    body = '{
      "filters": {
        "keywords": ["money", "government"],
        "award_type_codes": ["A", "B", "C", "D"]
      },
      "fields": [
        "Award ID", "Mod", "Recipient Name", "Action Date",
        "Transaction Amount", "Awarding Agency", "Awarding Sub Agency", "Award Type"
      ],
      "page": 1,
      "limit": 35,
      "sort": "Transaction Amount",
      "order": "desc"
    }'

    library(httr)

Send the body as a JSON string.
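As a rough sketch of an alternative to typing the JSON by hand, the same payload could be assembled as a nested R list and serialised with jsonlite. The endpoint is not given here, so the request itself only appears as a comment with a placeholder url; the variable name payload is my own.

```r
library(jsonlite)

# The payload from above, expressed as a nested R list.
payload <- list(
  filters = list(
    keywords = c("money", "government"),
    award_type_codes = c("A", "B", "C", "D")
  ),
  fields = c(
    "Award ID", "Mod", "Recipient Name", "Action Date",
    "Transaction Amount", "Awarding Agency", "Awarding Sub Agency", "Award Type"
  ),
  page = 1,
  limit = 35,
  sort = "Transaction Amount",
  order = "desc"
)

# auto_unbox = TRUE writes length-one vectors as scalars rather than arrays.
body <- toJSON(payload, auto_unbox = TRUE)
body

# Sending it might then look something like this (url is a placeholder):
# httr::POST(url, body = as.character(body), httr::content_type_json())
```

One thing to watch with auto_unbox: a keyword vector that happens to have a single element would be written as a scalar, so wrap it in I() if the API insists on an array.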

Where does .Renviron live on Citrix?

At one of my clients I run RStudio under Citrix in order to have access to their data. For the most part this works fine. However, every time I visit them I spend the first few minutes of my day installing packages because my environment does not seem to be persisted from one session to the next. I finally had a gap and decided to fix the problem. Where are the packages being installed?
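A generic first step towards answering that question from within R, using only base functions (a sketch, not necessarily the route taken in the post):

```r
# Library locations searched by R, in order. By default install.packages()
# installs into the first of these.
lib_paths <- .libPaths()
print(lib_paths)

# The per-user library location, if one is configured.
Sys.getenv("R_LIBS_USER")

# The home directory as R sees it, which is one of the places where R looks
# for a user-level .Renviron file.
path.expand("~")
```

If the first library path points at a location which is wiped between sessions then that would explain the disappearing packages.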

Survey Raking: An Illustration

Analysing survey data can be tricky. There’s often a mismatch between the characteristics of the survey respondents and those of the general population. If the discrepancies are not accounted for then the survey results can (and generally will!) be misleading. A common approach to this problem is to weight the individual survey responses so that the marginal proportions of the survey are close to those of the population. Raking (also known as proportional fitting, sample-balancing, or ratio estimation) is a technique for generating the required weights.
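To make the idea concrete, here is a minimal base R sketch of raking. The survey data, the population margins and the function name rake() are all made up for illustration; dedicated survey packages do this more robustly.

```r
# Hypothetical survey: 100 respondents, skewed towards women and the young.
survey <- data.frame(
  gender = rep(c("F", "M"), times = c(70, 30)),
  age    = rep(c("young", "old", "young", "old"), times = c(50, 20, 10, 20))
)

# Assumed population margins (as proportions).
targets <- list(
  gender = c(F = 0.5, M = 0.5),
  age    = c(young = 0.4, old = 0.6)
)

rake <- function(data, targets, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))
  for (i in seq_len(max_iter)) {
    w_old <- w
    for (v in names(targets)) {
      level <- as.character(data[[v]])
      # Current weighted proportion for each level of this variable.
      current <- tapply(w, level, sum) / sum(w)
      # Rescale the weights so that this margin matches its target.
      w <- w * targets[[v]][level] / current[level]
    }
    # Stop once a full pass over the variables no longer changes the weights.
    if (max(abs(w - w_old)) < tol) break
  }
  # Normalise so that the weights average to one.
  unname(w / mean(w))
}

survey$weight <- rake(survey, targets)

# The weighted margins should now be close to the targets.
tapply(survey$weight, survey$gender, sum) / sum(survey$weight)
tapply(survey$weight, survey$age, sum) / sum(survey$weight)
```

Each pass fixes one margin at a time, which slightly disturbs the others, so the passes are repeated until the weights settle down.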

Scraping the Turkey Accordion

One of the things I like most about web scraping is that almost every site comes with a new set of challenges. I recently had to scrape a few product pages from the site of a large retailer. I discovered that these pages use an “accordion” to present the product attributes. Only a single panel of the accordion is visible at any one time. So, for example, you toggle the Details panel open to see the associated content.

Installing RStudio & Shiny Servers

I did a remote install of Ubuntu Server today. This was somewhat novel because it’s the first time that I have not had physical access to the machine I was installing on. The server install went very smoothly indeed. The next tasks were to install RStudio Server and Shiny Server. The installation process for each of these is well documented on the RStudio web site: Installing RStudio Server and Installing Shiny Server.

Diagnosing RStudio Startup Issues

Yesterday I tried to start RStudio and something weird happened: the window launched but it was blank and unresponsive. I tried dpkg --remove and then re-installed. Same problem. I tried dpkg --remove followed by dpkg --purge and then re-installed. Same problem. I renamed my .R folder. Still the same problem. A sense of desperation was beginning to set in: most of my projects rely on RStudio. After trying a selection of other options I consulted the Internet Oracle and learned that I could get additional diagnostics using

What's New in R 3.5.0?

A complete list of the changes in R 3.5.0 can be found here. I’m picking out two (personal) highlights here.

Updating R on Ubuntu

Today I finally got around to updating my R to 3.5 (or, more specifically, 3.5.1). The complete instructions for doing the update on Ubuntu are available here. I’ve paraphrased them below.

eRum (2018) Top Twenty

My Top 20 highlights about eRum (2018) in Budapest.

Travelling Salesman with ggmap

I’ve been testing out some ideas around the Travelling Salesman Problem using TSP and ggmap.

Classification: Get the Balance Right

For classification problems the positive class (which is what you’re normally trying to predict) is often sparsely represented in the data. Unless you do something to address this imbalance, your classifier is likely to be rather underwhelming. Achieving a reasonable balance in the proportions of the target classes is seldom emphasised. Perhaps it’s not very sexy. But it can have a massive effect on a model.
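One simple way to address the imbalance is to downsample the majority class. This is a base R sketch on simulated data (the data frame and its columns are invented here); other options, like upsampling or class weights, are not shown.

```r
set.seed(13)

# Hypothetical data with a rare positive class (around 5% of records).
data <- data.frame(
  x = rnorm(1000),
  class = factor(sample(
    c("negative", "positive"), 1000, replace = TRUE, prob = c(0.95, 0.05)
  ))
)

table(data$class)

# Keep every positive record and an equally sized random sample of negatives.
positive <- which(data$class == "positive")
negative <- which(data$class == "negative")
keep <- c(positive, sample(negative, length(positive)))
balanced <- data[keep, ]

table(balanced$class)
```

Only the training data should be balanced like this. The data used to evaluate the model should retain the original class proportions.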

Tips for Lightning Talks

It seems a little counter-intuitive, but a 5-minute lightning talk is far more difficult to prepare (and present!) than a standard 20-minute or longer talk. The principal challenge is fitting everything that you want to say into the allotted time, while still maintaining an engaging narrative. At the recent satRday conference in Cape Town (17 March 2018) we had a number of great lightning talks. A few of the speakers gave us their tips on creating a brilliant lightning talk.

Installing rJava on Ubuntu

Installing the rJava package on Ubuntu is not quite as simple as most other R packages. Some quick notes on how to do it.

Variable Names: Camel Case to Underscore Delimited

A project I’m working on has a bunch of different data sources. Some of them have column names in Camel Case. Others are underscore delimited. My OCD rebels at this disarray and demands either one or the other. If it were just a few columns and I was only going to have to do this once, then I’d probably just quickly do it by hand. But there are many columns and it’s very likely that there’ll be more data in the future and the process will need to be repeated. Seems like something that should be easy to automate.
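A minimal sketch of how this could be automated with a regular expression in base R. The function name camel_to_snake() is my own and the pattern only handles the straightforward cases.

```r
camel_to_snake <- function(x) {
  # Insert an underscore between a lower case letter (or digit) and the
  # upper case letter which follows it, then convert everything to lower case.
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  tolower(x)
}

camel_to_snake(c("firstName", "OrderID", "already_fine"))
# "first_name" "order_id" "already_fine"
```

Applied to a data frame this becomes names(df) <- camel_to_snake(names(df)). Runs of capitals (like the ID in OrderID) are treated as a single word, which may or may not be what you want.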

Analysis of Feedback from satRday [Cape Town] 2017

We recently announced the second satRday (Cape Town) conference scheduled to take place on 17 March 2018. Obviously we want this to be bigger and better than this year’s event, so we are paying careful attention to the feedback that we received from the first event. This is a quick analysis of the feedback. We sold 192 tickets and gave out 11 complimentaries to the event. There were 107 responses to the feedback survey, which means that we heard back from more than half of the people who attended, which is hopefully a representative sample.

Durban Twitter Analysis

I was invited to give a talk at Digifest (Durban University of Technology) on 10 November 2017. Looking at the other speakers and talks on the programme I realised that my normal range of topics would not be suitable. I needed to do something more in line with their mission to “celebrate the creative spirit through multimedia projects from disciplines such as visual and performing arts” and to promote “collaboration across art, science and technology”. Definitely outside my current domain, but consistent with many of the things that I have been aspiring to. To be honest, I was pleased to be invited, but when I sat down to consider what I would talk about, I found myself at a loss. I’m not currently engaged in anything that ticks many of those boxes. But I am loath to turn down an opportunity to speak. So I made a plan. In retrospect it was not a terribly good plan. But it was workable. I decided to speak about gauging sentiment relating to the city of Durban using data from Twitter. This post touches on some of my results.

Hosting a Plumber API on AWS

I’ve been putting together a small proof-of-concept API using R and plumber. It works flawlessly on my local machine and I was planning on deploying it on an EC2 instance to demo it for a client. However, I ran into a snag: despite opening the required port in my Security Group I was not able to access the API. This is what I needed to do to get it working.

Building a Local OSRM Instance

The Open Source Routing Machine (OSRM) is a library for calculating routes, distances and travel times between spatial locations. It can be accessed via either an HTTP or C++ API. Since it’s open source you can also install it locally, download appropriate map data and start making efficient travel calculations. These are the instructions for getting OSRM installed on an Ubuntu machine and hooking up the osrm R package.

Installing MicroPython on an ESP-32

STUFF FROM THE PYCON TEAM: An excellent reference is the official MicroPython one here.

Install CP210x drivers.

Find out the COM port (in this case COM3): https://www.mathworks.com/help/supportpkg/arduinoio/ug/find-arduino-port-on-windows-mac-and-linux.html

    esptool.py --chip esp32 --port COM3 erase_flash
    wget http://micropython.org/resources/firmware/esp32-20180511-v1.9.4.bin

Note that this is now an old firmware.

    esptool.py --chip esp32 --port COM3 write_flash -z 0x1000 esp32-20180511-v1.9.4.bin

Putty over serial to COM3 with baud rate 115200.

    import webrepl_setup

On flash, the AP mode was not enabled.

    import network
    sta_if = network.

Global Variables in R Packages

I know that global variables are from the Devil, but sometimes you just can’t get around them. I’m building a small package for a client that relies on a data file. For various reasons that file is not part of the package and can reside in different locations on users’ machines. Furthermore there are users on both Windows and Linux machines.
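One common pattern for getting the effect of a global variable without touching the global environment is to keep the state in an environment which is private to the package. This is a sketch of that general pattern, not necessarily the approach taken in the post; the names pkg_env, set_data_path() and get_data_path() are hypothetical.

```r
# A private environment, created when the package is loaded, holds the state.
pkg_env <- new.env(parent = emptyenv())

# Record the location of the data file.
set_data_path <- function(path) {
  # normalizePath() smooths over path differences between Windows and Linux.
  assign("data_path", normalizePath(path, mustWork = FALSE), envir = pkg_env)
  invisible(path)
}

# Retrieve the location, failing loudly if it has not been set yet.
get_data_path <- function() {
  if (!exists("data_path", envir = pkg_env, inherits = FALSE)) {
    stop("Data path not set. Call set_data_path() first.")
  }
  get("data_path", envir = pkg_env)
}

set_data_path(file.path(tempdir(), "data.csv"))
get_data_path()
```

Other functions in the package can then call get_data_path() rather than relying on a variable in the user's workspace.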

Route Asymmetry in Google Maps

I have been retrieving some route information using Rodrigo Azuero’s gmapsdistance package and noted that there was some asymmetry in the results: the time and distance for the trip from A to B was not necessarily always the same as the time and distance for the trip from B to A. Although in retrospect this seems self-evident, it merited further investigation.

A Timeline History of R

A record of some more or less important events in the history of R. This is a work in progress. The information is cobbled together from a range of sources. If you have pertinent items to add, please let me know via the comments.

Favourite Talks from useR 2017

RStudio Environment on DigitalOcean with Docker

I’ll be running a training course in a few weeks which will use RStudio as the main computational tool. Since it’s a short course I don’t want to spend a lot of time sorting out technical issues. And with multiple operating systems (and versions) these issues can be numerous and pervasive. Setting up an RStudio server which everyone can access (and that requires no individual configuration!) makes a lot of sense. These are some notes about how I got this all set up using a Docker container on DigitalOcean.

Deploying a Minimal API using plumber on DigitalOcean

RSelenium and Java Heap Space

I’m in the process of deploying a scraper on a DigitalOcean instance. The scraper uses RSelenium with the PhantomJS browser. I ran into a problem though. Although it worked flawlessly on my local machine, on the remote instance it broke with the following error:

    Selenium message:Java heap space

    Error: Summary: UnknownError
    Detail: An unknown server-side error occurred while processing the command.
    class: java.lang.OutOfMemoryError
    Further Details: run errorDetails method
    Execution halted

Clearly a Java memory issue.

Relationship between Race Distance and Gender Ratio

satRday Cape Town: Call for Submissions

satRday Cape Town will happen on 18 February 2017 at Workshop 17, Victoria & Alfred Waterfront, Cape Town, South Africa.

Fitting a Statistical Distribution to Sampled Data

I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest. Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. So I had a look at the tools available in R for addressing this problem. The fitdistrplus package seemed like a good option.
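To show what such a fit involves under the hood, here is a base R sketch which estimates the parameters of a gamma distribution by maximum likelihood using optim(). The data are simulated and the choice of a gamma distribution is purely an assumption for illustration; fitdistrplus wraps this kind of fit (and the associated diagnostics) in its fitdist() function.

```r
set.seed(42)

# Simulated sample, standing in for the real data.
x <- rgamma(500, shape = 2, rate = 0.5)

# Negative log-likelihood. The parameters are optimised on the log scale so
# that shape and rate always remain positive.
nll <- function(par) {
  -sum(dgamma(x, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))
}

# Minimise the negative log-likelihood, starting from shape = rate = 1.
fit <- optim(c(0, 0), nll)

estimate <- c(shape = exp(fit$par[1]), rate = exp(fit$par[2]))
estimate
```

The estimates should land close to the values used to simulate the data, although with a small sample they can be some distance off.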

PLOS Subject Keywords: Association Rules

In a previous post I detailed the process of compiling data on subject keywords used in articles published in PLOS journals. In this instalment I’ll be using those data to mine Association Rules with the arules package. Good references on the topic of Association Rules are Section 14.2 of The Elements of Statistical Learning (2009) by Hastie, Tibshirani and Friedman; and Introduction to arules by Hahsler, Grün, Hornik and Buchta.
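Before handing things over to arules it might help to see the three measures by which rules are usually judged computed by hand. This is a base R sketch on a handful of invented articles; the keywords and the support() helper are made up for illustration.

```r
# Hypothetical articles, each with a set of subject keywords.
articles <- list(
  c("genetics", "evolution"),
  c("genetics", "evolution", "ecology"),
  c("genetics", "ecology"),
  c("ecology"),
  c("genetics", "evolution")
)

# Proportion of articles which contain all of the given keywords.
support <- function(items) {
  mean(sapply(articles, function(a) all(items %in% a)))
}

# Consider the rule {genetics} => {evolution}.
lhs <- "genetics"
rhs <- "evolution"

rule_support <- support(c(lhs, rhs))       # how often both occur together
confidence <- rule_support / support(lhs)  # how often rhs occurs given lhs
lift <- confidence / support(rhs)          # relative to independence

c(support = rule_support, confidence = confidence, lift = lift)
```

A lift greater than one suggests that the keywords occur together more often than would be expected if they were independent.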

ubeR: A Package for the Uber API

Uber exposes an extensive API for interacting with their service. ubeR is an R package for working with that API which Arthur Wu and I put together during a Hackathon at iXperience.

Installation: the package is currently hosted on GitHub. Installation is simple using the devtools package.

    > devtools::install_github("DataWookie/ubeR")
    > library(ubeR)

Authentication: to work with the API you’ll need to create a new application for the Rides API. Set Redirect URL to http://localhost:1410/.