An R package for reading, writing, integrating and publishing data using the NeXML format.
An extensive and rapidly growing collection of richly annotated phylogenetics data is now available in the NeXML format. NeXML relies on state-of-the-art data exchange technology to provide a format that can be both validated and extended, providing data quality assurance and an adaptability to the future that is lacking in other formats (Vos et al. 2012).
The stable version is on CRAN and can be installed with install.packages("RNeXML").
The development version of RNeXML is available on GitHub. With the
devtools package installed on your system, RNeXML can be installed using:
The RNeXML package provides many convenient functions to add and extract
information from nexml objects in the R environment without requiring
the reader to understand the details of the NeXML data structure, which
makes it less likely that a user will generate invalid NeXML syntax
that cannot be read by other parsers. The nexml object we have been using
in all of the examples is built on R's S4 mechanism. Advanced users may
sometimes prefer to interact with the data structure more directly using
R's S4 class mechanism and subsetting methods. Many R users are more familiar
with the S3 class mechanism (as used by the phylo objects in the ape package)
than with the S4 class mechanism used in phylogenetics packages such as
phylobase. The phylobase vignette provides an excellent introduction
to these data structures. Users already familiar with subsetting lists and other
S3 objects in R are likely familiar with the use of the $ operator, such as
phy$edge. S4 objects simply use an @ operator instead (but cannot be subset
using numeric arguments such as phy[[1]] or named arguments such as phy[["edge"]]).
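To make the contrast between `$` and `@` concrete, here is a minimal sketch using only base R. The `toy_phylo` list and the `ToyPhylo` class are hypothetical stand-ins invented for illustration; they are not the real ape or RNeXML classes.

```r
library(methods)

# S3 style: a plain list carrying a class attribute (hypothetical toy object).
phy_s3 <- structure(
  list(edge = matrix(1:4, ncol = 2), Nnode = 1L),
  class = "toy_phylo"
)
phy_s3$edge        # access by name with `$`
phy_s3[["edge"]]   # named subsetting also works
phy_s3[[1]]        # and so does numeric subsetting

# S4 style: slots are declared up front and accessed with `@`.
setClass("ToyPhylo", slots = c(edge = "matrix", Nnode = "integer"))
phy_s4 <- new("ToyPhylo", edge = matrix(1:4, ncol = 2), Nnode = 1L)
phy_s4@edge           # access by slot name with `@`
slotNames(phy_s4)     # list the available slots

# List-style subsetting is not defined for this S4 class, so it raises an error.
s4_subset_fails <- inherits(try(phy_s4[["edge"]], silent = TRUE), "try-error")
s4_subset_fails
```

The same pattern applies to any S4 object: `slotNames()` shows what is available, and `@` retrieves it.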
The nexml object is an S4 object, as are all of its components (slots). Its
hierarchical structure corresponds exactly with the XML tree of a NeXML file, with
the single exception that both XML attributes and children are represented as slots.
S4 objects have constructor functions to initialize them. We create a new
object with the command:
nex <- new("nexml")
We can see a list of slots contained in this object with
slotNames(nex)
 [1] "version"            "generator"          "xsi:schemaLocation"
 [4] "namespaces"         "otus"               "trees"
 [7] "characters"         "meta"               "about"
[10] "xsi:type"
Some of these slots have already been populated for us, for instance, the schema version and default namespaces:
nex@namespaces
nex      "http://www.nexml.org/2009"
xsi      "http://www.w3.org/2001/XMLSchema-instance"
xml      "http://www.w3.org/XML/1998/namespace"
cdao     "http://purl.obolibrary.org/obo/cdao.owl"
xsd      "http://www.w3.org/2001/XMLSchema#"
dc       "http://purl.org/dc/elements/1.1/"
dcterms  "http://purl.org/dc/terms/"
ter      "http://purl.org/dc/terms/"
prism    "http://prismstandard.org/namespaces/1.2/basic/"
cc       "http://creativecommons.org/ns#"
ncbi     "http://www.ncbi.nlm.nih.gov/taxonomy#"
tc       "http://rs.tdwg.org/ontology/voc/TaxonConcept#"
         "http://www.nexml.org/2009"
Here nex@namespaces serves the same role as the get_namespaces()
function, but provides direct access to the slot data. For instance,
with this syntax we could also overwrite the existing namespaces with
nex@namespaces <- NULL. Changing the namespace in this way is not recommended.
Some slots can contain multiple elements of the same type, such as
otus. For instance, checking the class of the characters slot,
class(nex@characters)
[1] "ListOfcharacters"
attr(,"package")
[1] "RNeXML"
shows that it is an object of class
ListOfcharacters. It is currently empty.
In order to assign an object to a slot, it must match the class definition
of the slot. We can create a new element of any given class with the new() function:
nex@characters <- new("ListOfcharacters", list(new("characters")))
Now we have a length-1 list of character matrices,
and we access the first character matrix using the list notation,
[[1]]. Here we check that its class is characters:
class(nex@characters[[1]])
[1] "characters"
attr(,"package")
[1] "RNeXML"
Direct subsetting has two primary use cases: (a) looking up (and possibly editing) a specific value of an element, or (b) adding metadata annotations to specific elements. Consider the example file
f <- system.file("examples", "trees.xml", package="RNeXML")
nex <- nexml_read(f)
We can look up the species label of the first otu in the first otus block:
nex@otus[[1]]@otu[[1]]@label
      label
"species 1"
We can add metadata to this particular OTU using this subsetting format:
nex@otus[[1]]@otu[[1]]@meta <-
  c(meta("skos:note", "This species was incorrectly identified"),
    nex@otus[[1]]@otu[[1]]@meta)
Here we use the c() function to combine this element with any existing meta annotations on this otu.
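The two rules used in this section (a slot only accepts a value of its declared class, and a new annotation is combined with the existing ones rather than replacing them) can be sketched with base R alone. The classes `ToyMeta`, `ListOfToyMeta` and `ToyOtu` below are hypothetical and only imitate the list-in-a-slot pattern; they are not RNeXML's own class definitions, and RNeXML's `c()` behaviour for meta objects is not reproduced here.

```r
library(methods)

# Hypothetical toy classes; these are NOT RNeXML's class definitions.
setClass("ToyMeta", slots = c(property = "character", content = "character"))
setClass("ListOfToyMeta", contains = "list")
setClass("ToyOtu", slots = c(label = "character", meta = "ListOfToyMeta"))

otu <- new("ToyOtu", label = "species 1")
length(otu@meta)  # a freshly created list-like slot starts out empty

# A slot only accepts a value of its declared class, so wrap the list first.
otu@meta <- new("ListOfToyMeta",
                list(new("ToyMeta", property = "dc:title", content = "old note")))

# Put a new annotation in front of the existing ones instead of overwriting them.
note <- new("ToyMeta", property = "skos:note",
            content = "This species was incorrectly identified")
otu@meta <- new("ListOfToyMeta", c(list(note), otu@meta@.Data))

# Assigning a bare list is rejected because it is not a ListOfToyMeta.
bad_assign <- try(otu@meta <- list(note), silent = TRUE)
inherits(bad_assign, "try-error")
```

Because the failed assignment raises an error before anything is stored, the object keeps its two valid annotations.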
The add_basic_meta() function takes as input an existing nexml object
(like the other
add_ functions, if none is provided it will create one), and at the time
of this writing accepts parameters including title, description, creator and
citation. Other metadata elements and corresponding parameters may
be added in the future.
Here we create a nexml object for the phylogeny
bird.orders and add appropriate metadata:
data("bird.orders")
birds <- add_trees(bird.orders)
birds <- add_basic_meta(
  title = "Phylogeny of the Orders of Birds From Sibley and Ahlquist",
  description = "This data set describes the phylogenetic relationships of the orders of birds as reported by Sibley and Ahlquist (1990). Sibley and Ahlquist inferred this phylogeny from an extensive number of DNA/DNA hybridization experiments. The ``tapestry'' reported by these two authors (more than 1000 species out of the ca. 9000 extant bird species) generated a lot of debates. The present tree is based on the relationships among orders. The branch lengths were calculated from the values of Delta T50H as found in Sibley and Ahlquist (1990, fig. 353).",
  citation = "Sibley, C. G. and Ahlquist, J. E. (1990) Phylogeny and classification of birds: a study in molecular evolution. New Haven: Yale University Press.",
  creator = "Sibley, C. G. and Ahlquist, J. E.",
  nexml = birds)
Instead of a literal string, citations can also be provided as an R
bibentry object, which is the type returned by citation() for R package citations:
birds <- add_basic_meta(citation = citation("ape"), nexml = birds)
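A bibentry does not have to come from citation(). As a sketch, the book cited above could be written as a bibentry by hand with the base utils function bibentry(); the field values are taken from the citation string used earlier. Passing it to add_basic_meta() is left commented out, since it assumes RNeXML is installed and that a hand-built bibentry is handled the same way as one returned by citation().

```r
library(utils)

# Build a bibentry for the source publication by hand.
sibley <- bibentry(
  bibtype   = "Book",
  title     = "Phylogeny and classification of birds: a study in molecular evolution",
  author    = c(person("C. G.", "Sibley"), person("J. E.", "Ahlquist")),
  year      = 1990,
  publisher = "Yale University Press",
  address   = "New Haven"
)

class(sibley)  # the object carries the "bibentry" class

# Assumed usage, not run here (requires RNeXML and an existing `birds` object):
# birds <- add_basic_meta(citation = sibley, nexml = birds)
```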
The taxize_nexml() function uses the R package taxize
[@Chamberlain_2013] to check each taxon label against the NCBI database.
If a unique match is found, a metadata annotation is added to the taxon
providing the NCBI identification number of the taxonomic unit.
birds <- taxize_nexml(birds, "NCBI")
If no match is found, the user is warned to check for possible typographic errors in the taxonomic labels provided. If multiple matches are found, the user will be prompted to choose between them.
We can get a list of namespaces along with their prefixes from the nexml object:
prefixes <- get_namespaces(birds)
prefixes["dc"]
We create a meta element containing a modification-date annotation using the meta() function:
modified <- meta(property = "prism:modificationDate", content = "2013-10-04")
We can add this annotation to our existing
birds NeXML file using the
add_meta() function. Because we do not specify a level, it is added to
the root node, referring to the NeXML file as a whole.
birds <- add_meta(modified, birds)
The built-in vocabularies are just the tip of the iceberg of established
vocabularies. Here we add an annotation from the
skos namespace which
describes the history of where the data comes from:
history <- meta(property = "skos:historyNote", content = "Mapped from the bird.orders data in the ape package using RNeXML")
Because skos is not in the current namespace list, we add it with a
URL when adding this meta element. We also specify that this annotation
be placed at the level of the
trees sub-node in the NeXML file.
birds <- add_meta(history, birds, level = "trees", namespaces = c(skos = "http://www.w3.org/2004/02/skos/core#"))
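To see why the prefix has to be declared, it helps to look at how a prefixed property such as skos:historyNote resolves to a full URI. The helper below, `expand_curie()`, is a hypothetical illustration written in base R; it is not part of RNeXML and does not describe how RNeXML resolves prefixes internally.

```r
# Hypothetical helper (not an RNeXML function): expand "prefix:term" to a full URI
# using a named character vector of namespaces, like the one get_namespaces() returns.
expand_curie <- function(property, namespaces) {
  parts  <- strsplit(property, ":", fixed = TRUE)[[1]]
  prefix <- parts[1]
  if (!prefix %in% names(namespaces)) {
    stop("Prefix '", prefix, "' is not declared in the namespace list")
  }
  paste0(namespaces[[prefix]], paste(parts[-1], collapse = ":"))
}

ns <- c(dc    = "http://purl.org/dc/elements/1.1/",
        prism = "http://prismstandard.org/namespaces/1.2/basic/")

expand_curie("prism:modificationDate", ns)

# skos has not been declared yet, so the lookup fails.
undeclared <- try(expand_curie("skos:historyNote", ns), silent = TRUE)

# After declaring the prefix, the same property resolves.
ns <- c(ns, skos = "http://www.w3.org/2004/02/skos/core#")
expand_curie("skos:historyNote", ns)
```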
For finer control of the level at which a
meta element is added,
we will manipulate the
nexml R object directly using S4 sub-setting,
as shown in the supplement.
Much richer metadata annotation is possible. Later we illustrate how
metadata annotation can be used to extend the base NeXML format to
represent new forms of data while maintaining compatibility with any
NeXML parser. The
RNeXML package can be easily extended to support
helper functions such as
taxize_nexml to add additional metadata
without imposing a large burden on the user.
A call to the
nexml object prints some metadata summarizing the data structure:
A nexml object representing:
     1 phylogenetic tree blocks, where:
     block 1 contains 1 phylogenetic trees
     46 meta elements
     0 character matrices
     23 taxonomic units
 Taxa:  Struthioniformes, Tinamiformes, Craciformes, Galliformes, Anseriformes, Turniciformes ...

 NeXML generated by RNeXML using schema version: 0.9
 size: 372.7 Kb
We can extract all metadata pertaining to the NeXML document as a whole
(annotations of the XML root node,
<nexml>) with the command
meta <- get_metadata(birds)
This returns a data.frame of available metadata. We can see the kinds of metadata recorded from the names:
Source: local data frame [10 x 7]

    meta                      property   datatype
   (chr)                         (chr)      (chr)
1     m2                      dc:title xsd:string
2     m3                    dc:creator xsd:string
3     m4                dc:description xsd:string
4     m5                            NA         NA
5     m6 dcterms:bibliographicCitation xsd:string
6     m7                    dc:creator xsd:string
7     m8                    dc:pubdate xsd:string
8     m9                            NA         NA
9    m20 dcterms:bibliographicCitation xsd:string
10   m44        prism:modificationDate xsd:string
Variables not shown: content (chr), xsi.type (chr), rel (chr), href (chr)
We can also access a table of taxonomic metadata:
Source: local data frame [23 x 5]

     otu            label about xsi.type  otus
   (chr)            (chr) (chr)    (lgl) (chr)
1    ou1 Struthioniformes  #ou1       NA   os1
2    ou2     Tinamiformes  #ou2       NA   os1
3    ou3      Craciformes  #ou3       NA   os1
4    ou4      Galliformes  #ou4       NA   os1
5    ou5     Anseriformes  #ou5       NA   os1
6    ou6    Turniciformes  #ou6       NA   os1
7    ou7       Piciformes  #ou7       NA   os1
8    ou8    Galbuliformes  #ou8       NA   os1
9    ou9   Bucerotiformes  #ou9       NA   os1
10  ou10      Upupiformes #ou10       NA   os1
..   ...              ...   ...      ...   ...
This table contains text from the otu element labels, typically used to define taxonomic names, rather than text from explicit meta elements.
We can also access metadata at a specific level (it is also possible
to extract all meta elements in a list). Here we show only the first few rows:
otu_meta <- get_metadata(birds, level="otus/otu")
otu_meta
Source: local data frame [23 x 9]

    meta property datatype content     xsi.type        rel
   (chr)    (lgl)    (lgl)   (lgl)        (chr)      (chr)
1    m21       NA       NA      NA ResourceMeta tc:toTaxon
2    m22       NA       NA      NA ResourceMeta tc:toTaxon
3    m23       NA       NA      NA ResourceMeta tc:toTaxon
4    m24       NA       NA      NA ResourceMeta tc:toTaxon
5    m25       NA       NA      NA ResourceMeta tc:toTaxon
6    m26       NA       NA      NA ResourceMeta tc:toTaxon
7    m27       NA       NA      NA ResourceMeta tc:toTaxon
8    m28       NA       NA      NA ResourceMeta tc:toTaxon
9    m29       NA       NA      NA ResourceMeta tc:toTaxon
10   m30       NA       NA      NA ResourceMeta tc:toTaxon
..   ...      ...      ...     ...          ...        ...
Variables not shown: href (chr), otu (chr), otus (chr)
We often want to combine metadata from multiple tables. For instance, in this exercise we want to include the taxonomic identifier and id value for each species returned in the character table. This helps us more precisely identify the species whose traits are described by the table.
To begin, let's generate a
NeXML file using the tree and trait data from the
geiger package's "primates" data:
data("primates")
add_trees(primates$phy) %>%
  add_characters(primates$dat, ., append=TRUE) %>%
  taxize_nexml() -> nex
(Note that we've used
dplyr's cute pipe syntax, but unfortunately our
add_ methods take the
nexml object as the second
argument instead of the first, so this isn't as elegant since we need the stupid
. to show where the piped output should go...)
We now read in the three tables of interest. Note that we tell
get_characters to give us species labels as their own column, rather than as rownames. The latter is the default only because this plays more nicely with the default format for character matrices that is expected by
geiger and other phylogenetics packages, but is in general a silly choice for data manipulation.
otu_meta <- get_metadata(nex, "otus/otu")
taxa <- get_taxa(nex)
char <- get_characters(nex, rownames_as_col = TRUE)
We can take a peek at what the tables look like, just to orient ourselves:
Source: local data frame [216 x 9]

    meta property datatype content     xsi.type        rel
   (chr)    (lgl)    (lgl)   (lgl)        (chr)      (chr)
1    m49       NA       NA      NA ResourceMeta tc:toTaxon
2    m50       NA       NA      NA ResourceMeta tc:toTaxon
3    m51       NA       NA      NA ResourceMeta tc:toTaxon
4    m52       NA       NA      NA ResourceMeta tc:toTaxon
5    m53       NA       NA      NA ResourceMeta tc:toTaxon
6    m54       NA       NA      NA ResourceMeta tc:toTaxon
7    m55       NA       NA      NA ResourceMeta tc:toTaxon
8    m56       NA       NA      NA ResourceMeta tc:toTaxon
9    m57       NA       NA      NA ResourceMeta tc:toTaxon
10   m58       NA       NA      NA ResourceMeta tc:toTaxon
..   ...      ...      ...     ...          ...        ...
Variables not shown: href (chr), otu (chr), otus (chr)

Source: local data frame [233 x 5]

     otu                       label about xsi.type  otus
   (chr)                       (chr) (chr)    (lgl) (chr)
1   ou24 Allenopithecus_nigroviridis #ou24       NA   os2
2   ou25         Allocebus_trichotis #ou25       NA   os2
3   ou26           Alouatta_belzebul #ou26       NA   os2
4   ou27             Alouatta_caraya #ou27       NA   os2
5   ou28          Alouatta_coibensis #ou28       NA   os2
6   ou29              Alouatta_fusca #ou29       NA   os2
7   ou30           Alouatta_palliata #ou30       NA   os2
8   ou31              Alouatta_pigra #ou31       NA   os2
9   ou32               Alouatta_sara #ou32       NA   os2
10  ou33          Alouatta_seniculus #ou33       NA   os2
..   ...                         ...   ...      ...   ...

Source: local data frame [6 x 2]

                         taxa        x
                        (chr)    (dbl)
1 Allenopithecus_nigroviridis 8.465900
2         Allocebus_trichotis 4.368181
3           Alouatta_belzebul 8.729074
4             Alouatta_caraya 8.628735
5          Alouatta_coibensis 8.764053
6              Alouatta_fusca 8.554489
Now that we have nice
data.frame objects for all our data, it's easy to join them into the desired table with a few obvious dplyr commands:
taxa %>%
  left_join(char, by = c("label" = "taxa")) %>%
  left_join(otu_meta, by = "otu") %>%
  select(otu, label, x, href)
Source: local data frame [233 x 4]

     otu                       label        x
   (chr)                       (chr)    (dbl)
1   ou24 Allenopithecus_nigroviridis 8.465900
2   ou25         Allocebus_trichotis 4.368181
3   ou26           Alouatta_belzebul 8.729074
4   ou27             Alouatta_caraya 8.628735
5   ou28          Alouatta_coibensis 8.764053
6   ou29              Alouatta_fusca 8.554489
7   ou30           Alouatta_palliata 8.791790
8   ou31              Alouatta_pigra 8.881836
9   ou32               Alouatta_sara 8.796339
10  ou33          Alouatta_seniculus 8.767173
..   ...                         ...      ...
Variables not shown: href (chr)
Because these are all from the same otus block anyway, we haven't selected that column, but were it of interest it is also available in the taxa table.
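For readers without the packages at hand, the shape of this join can be tried on toy data. The sketch below swaps in base R's merge() for dplyr's left_join(), with all.x = TRUE giving left-join behaviour. The three small data frames and their values (including the example.org links) are made up purely for illustration and only mimic the column names of the real tables.

```r
# Hypothetical toy tables mimicking the columns of taxa, char and otu_meta.
taxa_toy <- data.frame(otu = c("ou1", "ou2", "ou3"),
                       label = c("sp_a", "sp_b", "sp_c"),
                       stringsAsFactors = FALSE)
char_toy <- data.frame(taxa = c("sp_a", "sp_b"),
                       x = c(1.5, 2.5),
                       stringsAsFactors = FALSE)
meta_toy <- data.frame(otu = c("ou1", "ou3"),
                       href = c("http://example.org/taxon/1",
                                "http://example.org/taxon/3"),
                       stringsAsFactors = FALSE)

# Left join taxa to traits on the species label, keeping every taxon.
step1 <- merge(taxa_toy, char_toy, by.x = "label", by.y = "taxa", all.x = TRUE)

# Left join the result to the per-OTU metadata on the otu id.
joined <- merge(step1, meta_toy, by = "otu", all.x = TRUE)

# Keep the columns of interest, ordered by otu id.
joined <- joined[order(joined$otu), c("otu", "label", "x", "href")]
joined
```

Taxa without a trait value or without a metadata link are kept, with NA in the corresponding column, which is the same behaviour the left joins above rely on.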
To cite RNeXML in publications use:
Carl Boettiger, Scott Chamberlain, Hilmar Lapp, Kseniia Shumelchyk and Rutger Vos (2015). RNeXML: Implement semantically rich I/O for NeXML format. R package version 2.0.4. http://CRAN.R-project.org/package=RNeXML