DuckDB and GloNAF

How to process large database files

On my quest to process some VERY large files I came across DuckDB, an its R package duckdb. DuckDB is a relational (table-oriented) DBMS (database management system) that supports the Structured Query Language (SQL).

To try it out, as an example, I chose the GloNAF (Global Naturalized Alien Flora) database. This database about naturalised alien plant species, covers all terrestrial regions of the globe, and includes four tables (Taxon_x_List, List, Region, Reference) that are linked by different id variables. See Pyšek et al. (2017) for more info.

Since I haven’t seen any code to get data out of the GloNAF database, I’m sharing here the code hoping it helps someone else. I will use duckdb to load the data and perform some queries combining the different tables.

Let’s go!

As the R DuckDB documentation explains, to use DuckDB, you must first create a connection object that represents the database. We will use dbConnect() to start an in-memory database.

library(duckdb)
library(tmap)
tmap_mode('view')
library(rnaturalearth)
library(sf)
sf::sf_use_s2(FALSE)
library(tidyverse)

duck_glonaf <- dbConnect(duckdb())

We will then read the tables to the database using duckdb_register(). If you check your R Memory you will notice that the usage is little and the queries are super fast!

duckdb_register(duck_glonaf, 'list',
                read_csv('data/List_GloNAF_vanKleunenetal2018Ecology.csv'))

duckdb_register(duck_glonaf, 'region',
                read_csv('data/Region_GloNAF_vanKleunenetal2018Ecology.csv'))

duckdb_register(duck_glonaf, 'taxon',
                read_csv('data/Taxon_x_List_GloNAF_vanKleunenetal2018Ecology.csv'))

Now, we will query the database using dbGetQuery(). DuckDB supports the standard database interface methods to send queries and retrieve result sets. And, amazingly, we can also use dplyr functions to process the query.

Our query seeks to get a list of species per country. So, we will need to combine the tables Taxon_x_List and Region, using the variable region_id.

query <-
  'SELECT country_ISO, country, standardized_name
   FROM taxon JOIN region
   ON taxon.region_id = region.region_id'

dbGetQuery(duck_glonaf, query) %>%
  as_tibble() %>%
  distinct(country_ISO, standardized_name, .keep_all = T) -> GLONAF

To test our query results, let’s see a sample of 20 taxa from Uruguay <3 and Argentina.

GLONAF %>% filter(country=='Uruguay' | country=='Argentina') %>%
  select(Country=country, Taxa=standardized_name) %>%
  slice_sample(n = 20)
# A tibble: 20 × 2
   Country   Taxa                           
   <chr>     <chr>                             
 1 Argentina Phyllanthus tenellus              
 2 Uruguay   Rostraria cristata                
 3 Uruguay   Galium aparine                    
 4 Argentina Aira elegantissima                
 5 Argentina Vulpia bromoides                  
 6 Uruguay   Ipomoea cairica                   
 7 Uruguay   Fumaria capreolata                
 8 Argentina Eriochloa pseudoacrotricha        
 9 Argentina Borago officinalis                
10 Uruguay   Polypogon viridis                 
11 Argentina Lupinus arboreus                  
12 Uruguay   Melilotus indicus                 
13 Argentina Cuscuta epithymum var. macranthera
14 Argentina Myosotis azorica                  
15 Uruguay   Fraxinus americana                
16 Uruguay   Pterocarya rehderiana             
17 Argentina Barbarea intermedia               
18 Uruguay   Hedypnois rhagadioloides          
19 Argentina Lamarckia aurea                   
20 Argentina Ranunculus muricatus              

And, to finish, let’s map the number of taxa per country. To do that, we will join the data to a world map using ne_countries() from the rnaturalearth package, and finally map the number of taxa per country with tmap.

world <- ne_countries(scale = 50, returnclass = 'sf') %>%
  select(iso_a3)

GLONAF_taxa <- left_join(world %>% rename(country_ISO=iso_a3),
                         GLONAF) %>%
    group_by(country_ISO) %>%
    summarise(taxa=n_distinct(standardized_name, na.rm=T))

tm_shape(GLONAF_taxa %>% mutate(taxa=ifelse(taxa==0, NA, taxa)),
         bbox = st_bbox(world)) +
  tm_fill('taxa', palette = 'YlGnBu',
          style = 'log10_pretty',
          textNA = '0',
          colorNA='grey95',
          legend.reverse = T,
          title = 'GloNAF taxa count',
          alpha=0.5) +
  tm_view(set.view = 2)

To finish up, we disconnect the database and voilà.

dbDisconnect(duck_glonaf)

That’s all!

Florencia Grattarola
Florencia Grattarola
Postdoc Researcher

Uruguayan biologist doing research in macroecology and biodiversity informatics.