Research Spotlight: February 2017


 

Allocating more memory to OpenRefine - and other helpful information for handling large datasets

-- Contributed by Chris Evelyn, University of California - Santa Barbara, along with Deborah Paul and Shelley James, iDigBio

This month's Research Spotlight contribution came out of a recent iDigBio workshop where participants learned the basics of OpenRefine. After running into a limit on the size of dataset that could be manipulated, Chris worked out the following solution for working with large datasets from iDigBio and other biodiversity data aggregators. OpenRefine (formerly Google Refine) is a powerful tool for cleaning messy data - ideal for natural history collection managers, data managers, and researchers using biodiversity data alike. OpenRefine 2.6 is still in development and some may find Google Refine 2.5 more stable - either can be downloaded here.


 

1. The scenario: Creating and working with a database containing ~350,000 rows of data from iDigBio and VertNet for a research project. Needs included cleaning dates and standardizing collector and locality information.

2. The issue: OpenRefine typically struggles to load datasets of more than about 100,000 rows, becoming unresponsive during the 'Create Project' step.


 

3. Solution: Allow OpenRefine to allocate more memory to its processes. The default setting is 1024 MB (1 GB).

 

4. What the problem looked like: During the 'Create Project' step the "Heap usage" would max out and show something like "1000/1000MB used". In the newer OpenRefine versions this message pops up with a red highlight. Once this happens, the step keeps running without ever finishing.

 

5. Fixing the problem: There are three steps to fixing this issue:

 

A) Be sure that the Java Development Kit is installed for your operating system. Download the appropriate kit here:  http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. Manage the path so OpenRefine can find it as explained here: https://confluence.atlassian.com/doc/setting-the-java_home-variable-in-windows-8895.html

 

B) Allocate more memory. Be sure to check whether you are using the .bat file or the .exe application to run OpenRefine, as this affects how you fix the problem. Either way, you end up editing an .ini file; instructions for both fixes are explained here: https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory

 

The amount of memory needed will vary depending on the job and on what your computer can handle. For the 350,000 rows of biodiversity data (with lots of columns!) I needed about 8 times the default memory allocation (8 * 1024 MB = 8192 MB). I have 16 GB of RAM so this was fine, but definitely keep in mind how much RAM you have and what the operating system and other applications are using at the time.

 

C) The changes should take effect once you close all OpenRefine sessions and start again. If they do not, restart your computer.
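As a rough illustration of steps B and C, the Python sketch below picks a memory allocation and rewrites the matching line of a settings file. It is a sketch under stated assumptions, not the procedure from the FAQ. The function names and the 4096 MB reserve left for the operating system are hypothetical choices. The two setting formats it recognises ("-Xmx1024M" and "REFINE_MEMORY=1024M") are assumed forms for the .exe and .bat launchers, so check the FAQ linked above for the exact file and line used by your version.

```python
import re

DEFAULT_HEAP_MB = 1024  # default allocation quoted in this article


def choose_heap_mb(multiplier, total_ram_mb, reserve_mb=4096):
    """Return multiplier x the default allocation, capped so that
    reserve_mb (a hypothetical safety margin) stays free for the
    operating system and other applications."""
    wanted = multiplier * DEFAULT_HEAP_MB
    ceiling = max(total_ram_mb - reserve_mb, DEFAULT_HEAP_MB)
    return min(wanted, ceiling)


# Assumed setting formats: "-Xmx1024M" (.exe launcher) and
# "REFINE_MEMORY=1024M" (.bat launcher). Verify against the FAQ.
_HEAP_SETTING = re.compile(r"(-Xmx|REFINE_MEMORY=)\d+[MmGg]")


def set_heap_mb(settings_text, heap_mb):
    """Return settings_text with each recognised memory setting
    replaced by heap_mb megabytes; other text is left unchanged."""
    return _HEAP_SETTING.sub(
        lambda m: f"{m.group(1)}{heap_mb}M", settings_text
    )


if __name__ == "__main__":
    # The article's case: 8 x the default on a 16 GB machine.
    heap = choose_heap_mb(multiplier=8, total_ram_mb=16 * 1024)
    print(set_heap_mb("-Xmx1024M", heap))
```

To apply it, you would read the settings file as text, pass it through `set_heap_mb`, and save the result after making a backup copy. On a machine with less RAM, the cap in `choose_heap_mb` gives a smaller value than the full 8192 MB.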

Need more facets? A faceted search is a technique for grouping or organizing data by applying filters. Say you try to facet a large dataset and OpenRefine (or Google Refine) responds with a message saying you have exceeded the facet limit.

While in OpenRefine, enter http://127.0.0.1:3333/preferences into the address bar at the top of the browser window.

By clicking on "Edit", you can increase (or decrease) the facet limit.

 

Need more information? The OpenRefine GitHub Wiki and the OpenRefine website have links to many useful tips and tricks for users.

After a tutorial? Check out the OpenRefine tutorial Dimitri Brosens & Peter Desmet gave at TDWG 2016.

Other questions or issues? Send your queries to, or join, the OpenRefine Google Group.