Cleaning GBIF data for the use in Ecology

Overview
Questions:
  • How can I get ecological data from GBIF?

  • How do I check and clean the data from GBIF?

  • Which ecoinformatics techniques are important to know for this type of data?

Objectives:
  • Get occurrence data on a species

  • Visualize the data to understand them

  • Clean GBIF dataset for further analyses

Requirements:
Time estimation: 0 hours 30 minutes
Supporting Materials:
Last modification: Oct 28, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

GBIF (Global Biodiversity Information Facility, www.gbif.org) is without doubt the most remarkable biodiversity data aggregator worldwide, giving access to more than 1 billion records across all taxonomic groups. The data provided via this source are highly valuable for research. However, some issues exist concerning data heterogeneity, as the records are obtained through various collection methods and from various sources.

In this tutorial we will propose a way to clean occurrence records retrieved from GBIF.

This tutorial is based on the rOpenSci tutorial by Zizka (see References).

Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Retrieve data from GBIF
    1. Get data
    2. Where do the records come from?
    3. Filtering data based on the data origin
    4. Have a look at the number of counts per record
    5. Filtering data on individual counts
    6. Have a look at the age of records
    7. Filtering data based on the age of records
    8. Taxonomic investigation
    9. Filtering
    10. Sub-step with OGR2ogr
    11. Visualize your data on a GIS oriented visualization
  3. Conclusion

Retrieve data from GBIF

Get data

Hands-on: Data upload
  1. Create a new history for this tutorial

    Click the new-history icon at the top of the history panel.

    If the new-history icon is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the files from GBIF: Get species occurrences data tool with the following parameters:
    • param-file “Scientific name of the species”: write the scientific name of a species you are interested in, for example Loligo vulgaris
    • “Data source to get data from”: Global Biodiversity Information Facility : GBIF
    • “Number of records to return”: 999999 (set a high value like this to make sure all available records are returned)
    Comment

    The spocc Galaxy tool allows you to search species occurrences across one or many data sources (GBIF, eBird, iNaturalist, EcoEngine, VertNet, BISON). Changing the number of records to return allows you to retrieve all or a limited number of occurrences. Specifying more than one data source will change the way the output dataset is formatted.

  3. Check the datatype galaxy-pencil, it should be tabular

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
    • Select tabular
      • tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
  4. Add tags galaxy-tags to the dataset
    • make them propagating tags (tags starting with #)
    • make a tag corresponding to the species (#LoligoVulgaris for example here)
    • and another tag mentioning the data source (#GBIF for example here).

    Tagging datasets like this is good practice in Galaxy, and will help you (1) find content of particular interest (using the filtering option on the history search form, for example) and (2) quickly see (notably thanks to the propagated tags) which dataset is associated with which content.

    • Click on the dataset
    • Click on galaxy-tags Add Tags
    • Add a tag starting with #

      Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

    • Check that the tag is appearing below the dataset name

Where do the records come from?

Here we propose to investigate the content of the dataset, looking notably at the “basisOfRecord” attribute, to learn more about heterogeneity related to the origin of the data collection.

Hands-on: "basisOfRecord" filtering
  1. Count tool with the following parameters:
    • param-file “from dataset”: output (output of Get species occurrences data tool)
    • “Count occurrences of values in column(s)”: c[17]
    Comment

    This tool is one of the important “classical” Galaxy tools that allow you to better synthesize the information content of your data. Here we apply it to the 17th column (corresponding to the basisOfRecord attribute), but don’t hesitate to investigate other attributes!

Question
  1. How many different types of data collection origin are there?
  2. What is your assumption regarding this heterogeneity?
  1. 5
  2. Each basisOfRecord type is related to a different collection method, and therefore to a different data quality
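For intuition, the Count tool simply tallies the distinct values of a column. A minimal Python equivalent on a few made-up records (a real GBIF download has many more columns, and the column positions depend on your actual dataset):

```python
from collections import Counter

# Toy occurrence records (synthetic; a real GBIF table has ~100 columns)
records = [
    {"name": "Loligo vulgaris", "basisOfRecord": "HUMAN_OBSERVATION"},
    {"name": "Loligo vulgaris", "basisOfRecord": "PRESERVED_SPECIMEN"},
    {"name": "Loligo vulgaris", "basisOfRecord": "HUMAN_OBSERVATION"},
    {"name": "Loligo vulgaris", "basisOfRecord": "MACHINE_OBSERVATION"},
]

# Equivalent of the Count tool: occurrences of each value in one column
counts = Counter(r["basisOfRecord"] for r in records)
for value, n in counts.most_common():
    print(value, n)
```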

Filtering data based on the data origin

Hands-on: Filter data on basisOfRecord GBIF attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: output (output of Get species occurrences data tool)
    • “With following condition”: c17=='HUMAN_OBSERVATION' or c17=='OBSERVATION' or c17=='PRESERVED_SPECIMEN'
    • “Number of header lines to skip”: 1

    Question
    1. How many records are kept and what is the percentage of filtered data?
    2. Why are we keeping only these 3 types of data collection origin?
    1. 470 records are kept; 8.79% of the records were dropped
    2. These data collection methods are the most relevant ones
  2. Add to the output dataset a propagating tag corresponding to the filtering criteria adding #basisOfRecord string for example

    • Click on the dataset
    • Click on galaxy-tags Add Tags
    • Add a tag starting with #

      Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

    • Check that the tag is appearing below the dataset name
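The Filter tool keeps a row whenever the boolean condition is true. The same membership test, plus the kept/dropped percentages discussed above, can be sketched in Python on made-up rows (not your actual records):

```python
# Keep only the three trusted collection origins, as in the Filter condition
KEEP = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}

rows = [  # (recordId, basisOfRecord) -- synthetic values
    (1, "HUMAN_OBSERVATION"),
    (2, "FOSSIL_SPECIMEN"),
    (3, "PRESERVED_SPECIMEN"),
    (4, "OBSERVATION"),
]

kept = [r for r in rows if r[1] in KEEP]
dropped_pct = 100 * (len(rows) - len(kept)) / len(rows)
print(f"kept {len(kept)} records, dropped {dropped_pct:.2f}%")
```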

Have a look at the number of counts per record

Here we propose to have a look at the number of individuals counted per record, to detect possible erroneous records.

Hands-on: Summary statistics of count
  1. Summary Statistics tool with the following parameters:
    • param-file “Summary statistics on”: out_file1 (output of Filter tool)
    • “Column or expression”: c72
  2. Add to the output dataset a propagating tag corresponding to the filtering criteria adding #individualCount string for example
Question
  1. What are the minimum and maximum individual counts?
  1. From 1 to 100
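The Summary Statistics tool reports figures like min, max, mean and standard deviation for the chosen column. As an illustration only (with made-up counts, not your actual data), the same figures can be computed with Python's statistics module:

```python
import statistics

# Synthetic individualCount values, one per record
individual_counts = [1, 2, 5, 10, 100]

print("min:", min(individual_counts))
print("max:", max(individual_counts))
print("mean:", statistics.mean(individual_counts))
print("stdev:", statistics.stdev(individual_counts))
```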

Filtering data on individual counts

Hands-on: Filter data on individualCount GBIF attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: out_file1 (output of Filter tool)
    • “With following condition”: c72>0 and c72<99
    • “Number of header lines to skip”: 1
Question
  1. How many records are kept and what is the percentage of filtered data?
  2. How can you explain this result?
  3. Which propagating tag could you add here?
  1. 50 records are kept; 89.29% of the records were dropped
  2. An important percentage of the data were dropped because many records have no value at all in the individualCount field
  3. As in the previous “count” step you are dealing with the individualCount column, so you can add an #individualCount tag to the output dataset, for example
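The condition c72>0 and c72<99 also drops records whose individualCount field is empty, which is what explains the large percentage filtered out. A small Python sketch of the same logic, on made-up values read as strings (as they come out of a tabular file):

```python
# individualCount values as strings, possibly empty, as in a tabular file
raw = ["1", "", "12", "500", "", "3"]

def in_range(value, low=0, high=99):
    """Keep strictly positive counts below a suspicious-value threshold."""
    if value == "":            # records with no count fail the c72>0 test
        return False
    n = int(value)
    return low < n < high

kept = [v for v in raw if in_range(v)]
print(kept)  # empty counts and the 500 outlier are gone
```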

Have a look at the age of records

Here we propose to have a look at the age of records, through the `year` GBIF attribute, to know if there are ancient data we may not want to keep.

Hands-on: Summary statistics of the year attribute
  1. Summary Statistics tool with the following parameters:
    • param-file “Summary statistics on”: out_file1 (output of Filter tool)
    • “Column or expression”: c41
  2. Add to the output dataset a propagating tag corresponding to the filtering criteria adding #ageOfRecord string for example
Question
  1. What are the years of the oldest and youngest records?
  2. Why do you think it is of interest to treat ancient and recent records differently?
  1. From 1903 to 2018
  2. We can assume ancient records were not made in the same way as recent ones, so keeping ancient records can increase the heterogeneity of our dataset.

Filtering data based on the age of records

Hands-on: Filter data on ageOfRecord GBIF attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: out_file1 (output of the previous Filter tool)
    • “With following condition”: c41>1945
    • “Number of header lines to skip”: 1

    Question
    1. How many records are kept and what is the percentage of filtered data?
    2. Why are we keeping only data collected after 1945?
    1. 44 records are kept; 11.76% of the records were dropped
    2. This arbitrary date allows us to keep only fairly recent records, but you can specify another year.
  2. Add to the output dataset a propagating tag corresponding to the filtering criteria adding #ageOfRecord string for example

    • Click on the dataset
    • Click on galaxy-tags Add Tags
    • Add a tag starting with #

      Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

    • Check that the tag is appearing below the dataset name
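The successive Filter steps applied so far (basisOfRecord, individualCount, year) can also be thought of as one chained boolean condition. A minimal Python sketch, using made-up rows since the real column positions depend on your download:

```python
# Synthetic rows: (basisOfRecord, individualCount, year)
rows = [
    ("HUMAN_OBSERVATION", 3, 1903),
    ("HUMAN_OBSERVATION", 5, 2012),
    ("FOSSIL_SPECIMEN", 2, 2010),
    ("OBSERVATION", 150, 2015),
    ("PRESERVED_SPECIMEN", 1, 1988),
]

KEEP_ORIGINS = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}

# All three Filter conditions in a single pass
clean = [
    (origin, count, year)
    for origin, count, year in rows
    if origin in KEEP_ORIGINS and 0 < count < 99 and year > 1945
]
print(clean)
```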

Taxonomic investigation

Hands-on: Investigate the taxonomic coverage, at the family level
  1. Count tool with the following parameters:
    • param-file “from dataset”: out_file1 (output of Filter tool)
    • “Count occurrences of values in column(s)”: c[31]
    Comment

    This column allows us to look at the different families associated with the records. Normally, as we are looking at a single species, we should obtain only one family.

Filtering

Hands-on: Filter data on family attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: out_file1 (output of Filter tool)
    • “With following condition”: c31=='Loliginidae'
    • “Number of header lines to skip”: 1
    Comment

    Here we select only the records with the family of interest, Loliginidae

Question
  1. Is the filtering of interest here?
  2. Why can keeping this step be of interest?
  1. No, because 100% of the records are kept
  2. Because this is an important check to take into account in such GBIF data treatment; and if your goal is to create your own workflow that you plan to reuse on other species, it can be of interest to keep this step
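In code terms, this step is a filter that happens to keep 100% of the records, i.e. a sanity check rather than a reduction. A small Python sketch with made-up rows:

```python
# Synthetic rows: (species, family) -- all already in the expected family
rows = [
    ("Loligo vulgaris", "Loliginidae"),
    ("Loligo vulgaris", "Loliginidae"),
]

kept = [r for r in rows if r[1] == "Loliginidae"]

# The filter is a no-op on this dataset, which is the expected outcome
assert len(kept) == len(rows)
print("all", len(kept), "records kept")
```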

Sub-step with OGR2ogr

Hands-on: Convert occurrence dataset to GIS one for visualization
  1. OGR2ogr tool with the following parameters:
    • param-file “Gdal supported input file”: out_file1 (output of Filter tool)
    • “Conversion format”: GEOJSON
    • “Specify advanced parameters”: Yes, see full parameter list.
      • In “Add an input dataset open option”:
        • param-repeat “Insert Add an input dataset open option”
          • “Input dataset open option”: X_POSSIBLE_NAMES=longitude
        • param-repeat “Insert Add an input dataset open option”
          • “Input dataset open option”: Y_POSSIBLE_NAMES=latitude
Question
  1. Did you have access to the standard output and error of the original R script?
  2. What kind of information can you retrieve here in the standard output and/or error?
  1. Yes, of course ;) A preview of stdout is visible when clicking on the history output dataset, and the full report is accessible through the information button, then stdout or stderr (here you can see warnings on the stderr)
  2. The stderr shows several warnings related to the automatic variable-name mapping from GBIF to OGR, plus information about the truncation applied to a particularly long GeoJSON value
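What OGR2ogr produces here is a GeoJSON FeatureCollection with one Point per occurrence. To illustrate the target structure (this is not the tool's actual implementation), a minimal Python sketch with made-up coordinates:

```python
import json

# Synthetic filtered occurrences: (longitude, latitude, species)
occurrences = [
    (9.19, 41.38, "Loligo vulgaris"),
    (-4.48, 48.39, "Loligo vulgaris"),
]

# Build a GeoJSON FeatureCollection, one Point feature per record;
# note GeoJSON coordinates are ordered [longitude, latitude]
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"species": sp},
        }
        for lon, lat, sp in occurrences
    ],
}

geojson = json.dumps(feature_collection)
print(geojson[:60], "...")
```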

Visualize your data on a GIS oriented visualization

From your GeoJSON Galaxy history dataset, you can launch GIS visualization.

Hands-on: Launch OpenLayers to visualize a map with your filtered records
  1. Click on the Visualize tab on the upper menu and select Create Visualization
  2. Click on the OpenLayers icon
  3. Select the GeoJSON file from your history
  4. Click on Create Visualization
  5. Select OpenLayers
Question
  1. You don’t see OpenLayers? Do you know why?

1. If you don’t see OpenLayers but other visualization types like Cytoscape, this means your datatype is json, not geojson. You have to change the datatype manually before visualizing it

Conclusion

In this tutorial we learned how to get occurrence records from GBIF, and several steps to filter these data so they are ready for further analyses! So now, let’s go for the show!

Key points
  • Take the time to look at your data first, manipulate it before analyzing it

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Ecology topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

  1. Zizka, A. Cleaning GBIF data for the use in biogeography. https://ropensci.github.io/CoordinateCleaner/articles/Cleaning_GBIF_data_with_CoordinateCleaner.html

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.


Citing this Tutorial

  1. Yvan Le Bras, Simon Benateau, Cleaning GBIF data for the use in Ecology (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/ecology/tutorials/gbif_cleaning/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018, Community-Driven Data Analysis Training for Biology, Cell Systems, doi:10.1016/j.cels.2018.05.012



@misc{ecology-gbif_cleaning,
author = "Yvan Le Bras and Simon Benateau",
title = "Cleaning GBIF data for the use in Ecology (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/ecology/tutorials/gbif_cleaning/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {{PLoS} Computational Biology}
}


Congratulations on successfully completing this tutorial!