A Timeline History of R

2017-08-05     R

A record of some more or less important events in the history of R. This is a work in progress. The information is cobbled together from a range of sources. If you have pertinent items to add, please let me know via the comments. Read more »

Adding Users to an EC2 Ubuntu Instance

2017-07-24     AWS Linux SSH

By default an EC2 instance has only a single user other than root. For example, on an Ubuntu instance, that user is ubuntu. If there will be multiple people accessing the instance then it’s generally necessary for each of them to have their own account. Setting this up is pretty simple: it just requires sorting out some authentication details. Read more »

Favourite Talks from useR 2017

2017-07-23     R web scraping Docker Teaching QGIS

Docker: Persisting User Data

2017-07-20     Docker

I’m busy putting together a Docker image for a multi-user Jupyter Notebook installation. I aim to have an independent login for each of the users and each of them should also have their own storage space. That space should exist outside the container though, so that even if the container stops, the data lives on. This should mitigate user rage. Read more »

Deploying Jupyter on AWS using Docker

2017-07-18     Jupyter Docker AWS

Amazon’s EC2 Container Service (ECS) is an orchestrated system for deploying Docker containers on AWS. This post is about not using ECS. Read more »

RStudio Environment on DigitalOcean with Docker

2017-07-11     R Docker

I’ll be running a training course in a few weeks which will use RStudio as the main computational tool. Since it’s a short course I don’t want to spend a lot of time sorting out technical issues. And with multiple operating systems (and versions) these issues can be numerous and pervasive. Setting up an RStudio server which everyone can access (and that requires no individual configuration!) makes a lot of sense. Read more »

Accessing PySpark from a Jupyter Notebook

2017-07-04     Jupyter Spark

It’d be great to interact with PySpark from a Jupyter Notebook. This post describes how to get that set up. It assumes that you’ve already installed Spark.
1. Install the findspark package: $ pip3 install findspark
2. Make sure that the SPARK_HOME environment variable is defined.
3. Launch a Jupyter Notebook: $ jupyter notebook
4. Import the findspark package, use findspark.init() to locate the Spark installation, and then load the pyspark module.
Read more »
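As a rough sketch of that last step, the notebook cell might look something like the code below. The helper names spark_home_defined and load_pyspark are made up for illustration, and the explicit SPARK_HOME check is an added assumption rather than something findspark requires. It presumes that findspark and Spark are already installed.

```python
import os


def spark_home_defined(env=os.environ):
    """Return True if SPARK_HOME is set to a non-empty value (hypothetical helper)."""
    return bool(env.get("SPARK_HOME"))


def load_pyspark():
    """Locate Spark via findspark, then import and return the pyspark module."""
    if not spark_home_defined():
        # Assumption for this sketch: fail early rather than let findspark guess.
        raise RuntimeError("SPARK_HOME is not set")

    import findspark  # installed with: pip3 install findspark

    findspark.init()  # makes Spark's Python libraries importable
    import pyspark    # only importable once findspark.init() has run

    return pyspark
```

In a notebook you would then call pyspark = load_pyspark() in the first cell and carry on from there.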

Installing Hadoop on Ubuntu

2017-07-04     Linux Hadoop

This is what I did to set up Hadoop on my Ubuntu machine. Read more »

Installing Spark on Ubuntu

2017-07-04     Linux Spark

I’m busy experimenting with Spark. This is what I did to set up a local cluster on my Ubuntu machine. Before you embark on this you should first set up Hadoop. Read more »

Increasing MySQL Packet Maximum Size

2017-07-01     MySQL

In the process of uploading a massive CSV file to my Django application my session data are getting pretty big. As a result I’m getting these errors: (1153, "Got a packet bigger than 'max_allowed_packet' bytes") and (2006, 'MySQL server has gone away'). The second error is potentially unrelated. After some research it became apparent that the source of the problem is my max_allowed_packet setting. A quick check to find the current value: Read more »

Setting up ExpressVPN on Ubuntu

2017-06-23     Linux

I’ve been meaning to set up a VPN and this morning seemed like a good time to tick it off the bucket list. This is a quick outline of my experience, which included one minor hiccup. Read more »

Setting up Jupyter with Python 3 on Ubuntu

2017-06-23     Jupyter Linux

A short note on how to set up Jupyter Notebooks with Python 3 on Ubuntu. The instructions are specific to Xenial Xerus (16.04) but are likely to be helpful elsewhere too. Read more »

Deploying a Minimal API using plumber on DigitalOcean

2017-06-21     R

RSelenium and Java Heap Space

2017-06-09     R Web Scraping Selenium

I’m in the process of deploying a scraper on a DigitalOcean instance. The scraper uses RSelenium with the PhantomJS browser. I ran into a problem though. Although it worked flawlessly on my local machine, on the remote instance it broke with the following error:
Selenium message:Java heap space
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: java.lang.OutOfMemoryError
Further Details: run errorDetails method
Execution halted
Clearly a Java memory issue. Read more »

Web Dev to Data Science

2017-05-11

This infographic, which looks at the overlap between Web Dev and Data Science, appeared recently on the DataCamp blog. Since I’m currently straddling these two disciplines it seems rather appropriate. Read more »

Comrades Marathon 2017

2017-04-30

Clustering Time Series Data

2017-04-25     Machine Learning

I have been looking at methods for clustering time domain data and recently read TSclust: An R Package for Time Series Clustering by Pablo Montero and José Vilar. Here are the results of my initial experiments with the TSclust package. Read more »

Bulgaria Web Summit

2017-04-16     Conference

The Bulgaria Web Summit happened on 7 and 8 April 2017 at the Inter Expo Center in Sofia, Bulgaria. Read more »

Relationship between Race Distance and Gender Ratio

2017-04-09     R

Google Quick, Draw!

2016-11-17

Spent a very diverting few minutes playing with Quick, Draw! this morning, which is one of the cool AI Experiments hosted by Google. Read more »

Simple School Maths Problem

2016-11-15

A simple problem sent through to me by one of my running friends: There are 6 red cards and 1 black card in a box. Busi and Khanya take turns to draw a card at random from the box, with Busi being the first one to draw. The first person who draws the black card will win the game (assume that the game can go on indefinitely). If the cards are drawn with replacement, determine the probability that Khanya will win, showing all working. Read more »
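One way to work this out (not necessarily the approach taken in the full post) is to treat each pair of draws as a round and sum the resulting geometric series. The sketch below does the arithmetic with exact fractions; the variable names are just for illustration.

```python
from fractions import Fraction

p_black = Fraction(1, 7)  # chance of drawing the black card on any single draw
p_red = 1 - p_black       # chance of drawing a red card (6/7)

# Khanya draws second, so she wins a round when Busi draws red and she draws black.
p_khanya_round = p_red * p_black  # 6/49
# Busi wins a round by drawing black straight away.
p_busi_round = p_black            # 1/7
# The game moves on to another round when both players draw red.
p_continue = p_red * p_red        # 36/49

# Geometric series: sum over rounds k >= 0 of p_continue**k * (win in that round).
p_khanya = p_khanya_round / (1 - p_continue)
p_busi = p_busi_round / (1 - p_continue)

print(p_khanya)  # 6/13
print(p_busi)    # 7/13
```

The two probabilities sum to 1, which is a handy check that the game ends with a winner with certainty.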

satRday Cape Town: Call for Submissions

2016-10-26     R Conference

satRday Cape Town will happen on 18 February 2017 at Workshop 17, Victoria & Alfred Waterfront, Cape Town, South Africa. Read more »

Zeynep Tufekci: Machine intelligence and human morals

2016-10-24     Machine Learning TED Talk

fast-neural-style: Real-Time Style Transfer

2016-10-07     Machine Learning

I followed up a reference to fast-neural-style from Twitter and spent a glorious hour experimenting with this code. Very cool stuff indeed. It’s documented in Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi and Fei-Fei Li. The basic idea is to use feed-forward convolutional neural networks to generate image transformations. The networks are trained using perceptual loss functions and effectively apply style transfer. What is “style transfer”? Read more »

Fitting a Statistical Distribution to Sampled Data

2016-10-05     R

I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest. Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. So I had a look at the tools available in R for addressing this problem. The fitdistrplus package seemed like a good option. Read more »

Talks about Bots

2016-10-04     Machine Learning

Seth Juarez and Matt Winkler having an informal chat about bots. Matt Winkler talking about Bots as the Next UX: Expanding Your Apps with Conversation at the Microsoft Machine Learning & Data Science Summit (2016). At the confluence of the rise in messaging applications, advances in text and language processing, and mobile form factors, bots are emerging as a key area of innovation and excitement. Bots (or conversation agents) are rapidly becoming an integral part of your digital experience: they are as vital a way for people to interact with a service or application as is a web site or a mobile experience. Read more »

Rafal Lukawiecki - Putting Science into the Business of Data Science

2016-09-30

A talk by Rafal Lukawiecki at the Microsoft Machine Learning & Data Science Summit (2016). Data science relies on the scientific method of reasoning to help make business decisions based on analytics. Let Rafal explain how his customers apply the trusted processes and the principles of hypothesis testing with machine learning and statistics towards solving their day-to-day, practical business problems. Rafal will speak from his 10 years of experience in data mining and statistics, using the Microsoft data platform for high-value customer identification, recommendation and gap analysis, customer paths and acquisition modelling, price optimization and other forms of advanced analytics. Read more »

Edward Tufte - The Future of Data Analysis

2016-09-29

A keynote talk by Edward Tufte at the Microsoft Machine Learning & Data Science Summit (2016). Introduction by David Smith.

Python: First Steps with MongoDB

2016-09-28     MongoDB Python

I’m busy working my way through Kyle Banker’s MongoDB in Action. Much of the example code in the book is given in Ruby. Despite the fact that I’d love to learn more about Ruby, for the moment it makes more sense for me to follow along with Python. Read more »

xkcd: Hand Sanitiser

2016-09-21     xkcd