web scraping

Retail Data: Scraping & API

I’d been wanting to gather data on retail prices for quite some time. Finally, just before Christmas 2019, I had some time on my hands, so I started to put something together. This was the plan: a fleet of scrapers, each focusing on a specific retailer; a service to which the scrapers submit data, and which validates the data and persists it in a database; and a simple REST API to expose the data.
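To make the validation step in that plan a little more concrete, here is a minimal sketch in Python. It is not the code from the post: the record fields (`retailer`, `sku`, `price`), the `PriceRecord` class and the `validate` function are all hypothetical names chosen for illustration, and a real service would likely check more than this.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PriceRecord:
    # Hypothetical shape of one scraped price; not taken from the original post.
    retailer: str
    sku: str
    price: float
    scraped_at: datetime


def validate(payload: dict) -> PriceRecord:
    """Check a scraper submission and return a clean record, or raise ValueError."""
    # Every submission must carry these three (assumed) fields.
    for field in ("retailer", "sku", "price"):
        if field not in payload:
            raise ValueError(f"missing field: {field}")

    retailer = str(payload["retailer"]).strip()
    sku = str(payload["sku"]).strip()
    if not retailer or not sku:
        raise ValueError("retailer and sku must be non-empty")

    # Accept numbers or numeric strings, but reject anything else.
    try:
        price = float(payload["price"])
    except (TypeError, ValueError):
        raise ValueError("price must be numeric") from None
    if not math.isfinite(price) or price <= 0:
        raise ValueError("price must be a positive, finite number")

    # Timestamp the record at validation time, in UTC.
    return PriceRecord(retailer, sku, price, datetime.now(timezone.utc))
```

A record that passes `validate` would then be handed to whatever persistence layer sits behind the API; one that fails raises a `ValueError` the service can turn into an error response.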

Scraping Machinery Parts

Scraping prices from a supplier of replacement parts for heavy machinery.

Scraping the Turkey Accordion

One of the things I like most about web scraping is that almost every site comes with a new set of challenges. I recently had to scrape a few product pages from the site of a large retailer. I discovered that these pages use an “accordion” to present the product attributes. Only a single panel of the accordion is visible at any one time. So, for example, you toggle the Details panel open to see the associated content.

Favourite Talks from useR 2017

RSelenium and Java Heap Space

I’m in the process of deploying a scraper on a DigitalOcean instance. The scraper uses RSelenium with the PhantomJS browser. I ran into a problem though. Although it worked flawlessly on my local machine, on the remote instance it broke with the following error:

    Selenium message:Java heap space
    Error: Summary: UnknownError
    Detail: An unknown server-side error occurred while processing the command.
    class: java.lang.OutOfMemoryError
    Further Details: run errorDetails method
    Execution halted

Clearly a Java memory issue.

View POST Data using Chrome Developer Tools

When figuring out how to formulate the contents of a POST request it’s often useful to see the “typical” fields submitted directly from a web form.

1. Open Developer Tools in Chrome.
2. Select the Network tab (at the top).
3. Submit the form.
4. Watch the magic happening in the Developer Tools console.
5. Click on the first document listed in the Developer Tools console, then select the `Headers` tab.

That’s just scratching the surface of the wealth of information available on the Network tab.
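Once the form fields are visible in the `Headers` tab, reproducing the request in code is mostly a matter of encoding the same fields. The sketch below uses Python's standard library purely for illustration; the URL and the field names (`query`, `page`) are made up, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url: str, fields: dict) -> Request:
    """Build (but do not send) a POST request that mimics a web form submission."""
    # Encode the fields the way a browser does for a standard HTML form.
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical URL and fields, standing in for whatever the Network tab shows.
request = build_form_post(
    "https://example.com/search",
    {"query": "spare parts", "page": "1"},
)
print(request.get_method(), request.data.decode("utf-8"))
# prints: POST query=spare+parts&page=1
```

Comparing the encoded body with the form data shown in Developer Tools is a quick way to confirm that no field has been missed.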

Web Scraping and "invalid multibyte string"

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page.

Top 250 Movies at IMDb

Some years ago I allowed myself to accept a challenge to read the Top 100 Novels of All Time (complete list here). This list was put together by Richard Lacayo and Lev Grossman at Time Magazine. To start with I could tick off a number of books that I had already read. That left me with around 75 books outstanding. So I knuckled down. The Lord of the Rings had been on my reading list for a number of years, so this was my first project.