Accessing Open Data from AWS

There’s a magnificent variety of open data available on AWS. To see the full list, head over to the Registry of Open Data on AWS. When you find something that’s of interest to you, click through to the respective page. The vital piece of information on this page is the Amazon Resource Name (ARN). Grab the final portion of the ARN. That’s the string that uniquely identifies the bucket on S3.
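As a minimal sketch of that last step, assuming the ARN has the usual S3 form arn:aws:s3:::bucket-name. The ARN used here is a made-up placeholder, not a real dataset, and bucket_from_arn is a hypothetical helper name.

```python
def bucket_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN such as arn:aws:s3:::my-bucket."""
    # The bucket name is whatever follows the last colon in the ARN.
    prefix, _, resource = arn.rpartition(":")
    if not prefix.startswith("arn:") or not resource:
        raise ValueError(f"not a recognisable ARN: {arn!r}")
    # Some ARNs point at a prefix inside the bucket (bucket/key); keep the bucket only.
    return resource.split("/", 1)[0]


# Hypothetical ARN, for illustration only.
example_arn = "arn:aws:s3:::example-open-data"
bucket = bucket_from_arn(example_arn)
print(f"s3://{bucket}/")
```

The resulting s3:// URL is what you would hand to an S3 client to list or copy the data.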

Refining an AWS IAM Policy for Flintrock

Flintrock is a tool for launching a Spark cluster on AWS. To get it working initially I needed an IAM (Identity and Access Management) user with the following policies: AmazonEC2FullAccess and IAMFullAccess. Without these I got errors like botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:iam::690534650866:user/datawookie is not authorized to perform: iam:GetInstanceProfile on resource: instance profile EMR_EC2_DefaultRole and botocore.exceptions.ClientError: An error occurred (UnauthorizedOperation) when calling the DescribeVpcs operation: You are not authorized to perform this operation.
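Each of the two errors above names the specific action that was denied, which is a natural starting point for a narrower policy than the two full-access ones. The sketch below builds a policy document from just those two actions. It is an illustration only: Flintrock needs more permissions than these, the "*" resource is a placeholder, and build_policy is a hypothetical helper.

```python
import json

# Actions taken directly from the two error messages above. This is a
# starting point, not the complete set of permissions Flintrock requires.
DENIED_ACTIONS = ["iam:GetInstanceProfile", "ec2:DescribeVpcs"]


def build_policy(actions):
    """Return an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # "*" is a placeholder; narrow this to specific ARNs where possible.
                "Resource": "*",
            }
        ],
    }


policy = build_policy(DENIED_ACTIONS)
print(json.dumps(policy, indent=2))
```

One way to refine such a policy is to rerun the tool, note the next denied action in the error, and add it to the list.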

Creating an Amazon Machine Image

Creating an Amazon Machine Image (AMI) makes it quick and simple to rebuild a specific EC2 setup. This post illustrates the process by creating an AMI with ethminer and NVIDIA GPU drivers. Of course you’d never use this for mining Ether because the hardware costs are still too high!

Diagnosing Killed Jobs on EC2

I’ve got a long-running optimisation problem on an EC2 instance. Yesterday it was mysteriously killed. I shrugged it off as an anomaly and restarted the job. However, this morning it was killed again. Definitely not a coincidence! So I investigated. This is what I found and how I am resolving the problem.

Creating a S3 Bucket

There are many good reasons to use S3 (Simple Storage Service) storage. This is a quick overview of how to create an S3 bucket.

Hosting a Plumber API on AWS

I’ve been putting together a small proof-of-concept API using R and plumber. It works flawlessly on my local machine and I was planning on deploying it on an EC2 instance to demo it for a client. However, I ran into a snag: despite opening the required port in my Security Group I was not able to access the API. This is what I needed to do to get it working.

Creating an AWS Spot Instance

EC2 Spot Instances can provide very affordable computing on EC2 by allowing access to unused capacity at significant discounts.

Driving AWS from the Command Line

Although it’s very handy (and easy) to set up some cloud resources using the AWS Management Console, once you know what you need it makes a lot of sense to automate the process. Fortunately there’s a handy little command line tool, aws, which makes this eminently possible. The AWS CLI Command Reference is the definitive resource for this tool. There’s a mind-boggling array of possibilities. We’ll take a look at a small selection of them.
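To give a flavour of what automating the aws tool can look like, here is a minimal sketch that assembles a CLI invocation and, optionally, runs it from a script. The helper names aws_command and run_aws are hypothetical, and the region is just an example value. Actually running the command assumes the aws CLI is installed and credentials are configured.

```python
import json
import shutil
import subprocess


def aws_command(service, operation, *args, region=None):
    """Build an argument list like: aws ec2 describe-instances --output json."""
    cmd = ["aws", service, operation, *args, "--output", "json"]
    if region:
        cmd += ["--region", region]
    return cmd


def run_aws(cmd):
    """Run a command built by aws_command() and parse its JSON output."""
    if shutil.which("aws") is None:
        raise RuntimeError("the aws CLI is not installed or not on PATH")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # Some operations print nothing on success.
    return json.loads(result.stdout) if result.stdout.strip() else None


# Build (but do not run) a command that lists EC2 instances in one region.
cmd = aws_command("ec2", "describe-instances", region="us-east-1")
print(" ".join(cmd))
```

Keeping the command construction separate from its execution makes it easy to inspect or log exactly what will be run before anything touches your account.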

Retrieving Kaggle Data from the Command Line

We’ve been building some models for Kaggle competitions using an EC2 instance for compute. I initially downloaded the data locally and then pushed it onto EC2 using SCP. But there had to be a more efficient way to do this, especially given the blazing fast bandwidth available on AWS. Enter kaggle-cli. Update: Apparently kaggle-cli has been deprecated in favour of kaggle-api. More information below.

Adding a Volume to an Ubuntu EC2 Instance

Some quick notes on adding a storage volume to an EC2 instance.

Remote Desktop on an Ubuntu EC2 Instance

A couple of options for remote access to desktop applications on an EC2 host.

Adding Users to an EC2 Ubuntu Instance

By default an EC2 instance has only a single user other than root. For example, on an Ubuntu instance, that user is ubuntu. If there will be multiple people accessing the instance then it’s generally necessary for each of them to have their own account. Setting this up is pretty simple: it just requires sorting out some authentication details.
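As a rough sketch of the authentication part, assuming "authentication details" means SSH public keys (the usual arrangement on EC2, but an assumption here): each new account needs the user's public key in ~/.ssh/authorized_keys, with tight permissions. The function name install_public_key, the demo key and the temporary home directory are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def install_public_key(home, public_key):
    """Add public_key to <home>/.ssh/authorized_keys with strict permissions."""
    ssh_dir = Path(home) / ".ssh"
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # sshd typically rejects group/world-writable directories

    auth = ssh_dir / "authorized_keys"
    lines = auth.read_text().splitlines() if auth.exists() else []
    key = public_key.strip()
    if key not in lines:
        lines.append(key)  # avoid adding the same key twice
    auth.write_text("\n".join(lines) + "\n")
    os.chmod(auth, 0o600)
    return auth


# Demonstrate against a throwaway directory rather than a real home directory.
with tempfile.TemporaryDirectory() as fake_home:
    path = install_public_key(fake_home, "ssh-ed25519 AAAAexamplekey user@example")
    print(path.read_text(), end="")
```

On a real instance the account itself has to exist first, and the .ssh directory and key file must be owned by that user, which requires root and is not handled in this sketch.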

Deploying Jupyter on AWS using Docker

Amazon’s EC2 Container Service (ECS) is an orchestrated system for deploying Docker containers on AWS. This post is about not using ECS.

Amazon EC2: Adding Swap

So, after upgrading to R 3.2.0 on my EC2 instance, I was installing newer versions of various packages and I ran into a problem with dplyr: virtual memory exhausted! Seemed like a good time to add some swap.

Amazon EC2: Upgrading R

After installing R and Shiny on my EC2 instance I discovered that the default version of R was a little dated and I wanted to update to R 3.2.0. It’s not terribly complicated, but here are the steps I took. When you launch R you should find that it is the sparkling new version. Now you will want to update all of the packages too, so launch R (as root) and then do:

Hosting Shiny on Amazon EC2

I recently finished some work on a Shiny application which incorporated a Random Forest model. The model was stored in an RData file and loaded by server.R during initialisation. This worked fine when tested locally, but when I tried to deploy the application I ran into a problem: evidently you can only upload server.R and ui.R files. Nothing else.