Building an example BioWeatherMap dataset
I want to make an example “BioWeatherMap” based on existing metagenomic data, so I’m looking for a dataset of 16S ribosomal DNA (rDNA) sampled from tens or hundreds of environmental locations. I’d start by creating a simple map from the basic data (sequences + location) – something like the map below that was made for a marine virome project.

Sampling sites from The Marine Viromes of Four Oceanic Regions - we can do a better job of mapping than this!
Then I’d like to do something a little more abstract. The first thing that comes to mind is distorting the map so that one of the dimensions of the data, such as read density (base pairs/km^2), is constant for each pixel. This would expand the map in areas that had been sampled and shrink it in areas that were not sampled. I actually think read density is a pretty boring thing to graph (at least until there are tens of thousands of samples), but I think this technique would be a neat way to represent something like a diversity metric for each sample. These kinds of visualizations are called “cartograms.”
(Note: these are just some basic ideas. It might be the case that all density representations, such as heatmaps or the distortions I mentioned above, are inappropriate for representing sparse samples across a large area. Nonetheless, the first step is to get some data and start experimenting.)
In this cartogram “the sizes of states are proportional to the frequency of their appearance in news stories.” From Diffusion-based method for producing density-equalizing maps by Michael T. Gastner and M. E. J. Newman.
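Before distorting anything, the read density itself has to be computed. Here is a minimal sketch of that step, assuming each sample is reduced to a (lat, lon, base pairs) triple and the globe is cut into a regular latitude/longitude grid. The function names, the 10-degree cell size, and the spherical-Earth area formula are my own illustrative choices, not part of any existing tool, and this is only the input to a cartogram, not the cartogram algorithm.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, treating the Earth as a sphere


def cell_area_km2(lat_south, cell_deg):
    """Area of a cell_deg x cell_deg grid cell whose southern edge is at lat_south."""
    lat1 = math.radians(lat_south)
    lat2 = math.radians(lat_south + cell_deg)
    dlon = math.radians(cell_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * abs(math.sin(lat2) - math.sin(lat1))


def read_density(samples, cell_deg=10.0):
    """Map (row, col) grid cells to base pairs per km^2.

    samples is an iterable of (lat, lon, base_pairs) triples.
    Cells with no samples are simply absent from the result.
    """
    n_rows = math.ceil(180.0 / cell_deg)
    n_cols = math.ceil(360.0 / cell_deg)
    totals = defaultdict(int)
    for lat, lon, base_pairs in samples:
        # Clamp so that lat == 90 or lon == 180 falls in the last cell.
        row = min(math.floor((lat + 90.0) / cell_deg), n_rows - 1)
        col = min(math.floor((lon + 180.0) / cell_deg), n_cols - 1)
        totals[(row, col)] += base_pairs
    return {
        (row, col): bp / cell_area_km2(row * cell_deg - 90.0, cell_deg)
        for (row, col), bp in totals.items()
    }
```

With only tens or hundreds of sites almost every cell will be empty, which is the sparse-sampling problem mentioned in the note above.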
A tab-delimited example of a basic metagenomic dataset for constructing a map:
sample_id lat lon sequence_id 16s_sequence suspected_species
000001 32.131341 98.231332 0001 agcctagcacgga... Bacillus subtilis
000001 32.131341 98.231332 0002 agcgtaggttgac... Acinetobacter baylyi
I would be happy just with 10,000 entries in a single text file in a format similar to the one above (but note that lat/lon are identical for all sequences in a given sample). It would be even better if there were more dimensions of data. Here are some other potential columns:
- taxonomy (calculate a diversity metric for each sample from this? what else can we do with a taxonomy?)
- pathogenicity
- auto- or heterotrophic
- other metabolic information?
- GO terms, or something like them at an organismal level
- URL of the canonical species description in NCBI
- ? Please make suggestions in the comments.
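To make the per-sample diversity idea concrete, here is a rough sketch that reads a file in the tab-delimited format above and computes one diversity value per sample. It assumes the column names from the example header, and it uses the Shannon index over the suspected_species column as a stand-in for a proper taxonomy-based metric. The metric and the function names are my assumptions for illustration only.

```python
import csv
import io
import math
from collections import Counter, defaultdict


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over the species proportions."""
    counts = list(counts)
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def diversity_by_sample(handle):
    """Return {sample_id: ((lat, lon), shannon_index)} from a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    species = defaultdict(Counter)
    location = {}
    for row in reader:
        sample_id = row["sample_id"]
        species[sample_id][row["suspected_species"]] += 1
        # lat/lon are identical for every sequence in a sample, so keep one copy.
        location[sample_id] = (float(row["lat"]), float(row["lon"]))
    return {
        sample_id: (location[sample_id], shannon_index(counts.values()))
        for sample_id, counts in species.items()
    }


# The two example rows from above, with the sequences left truncated as shown.
EXAMPLE_TSV = (
    "sample_id\tlat\tlon\tsequence_id\t16s_sequence\tsuspected_species\n"
    "000001\t32.131341\t98.231332\t0001\tagcctagcacgga...\tBacillus subtilis\n"
    "000001\t32.131341\t98.231332\t0002\tagcgtaggttgac...\tAcinetobacter baylyi\n"
)

if __name__ == "__main__":
    for sample_id, (latlon, h) in diversity_by_sample(io.StringIO(EXAMPLE_TSV)).items():
        print(sample_id, latlon, h)
```

Each sample then collapses to a single (location, value) pair, which is the kind of quantity that could drive a heatmap or a cartogram.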
Synthesizing such a dataset (as a large plaintext file or as a database) will require aggregating a variety of other datasets, and I have no idea where to begin with them. If I know a particular species (Acinetobacter baylyi, for instance), is there a single entry point in NCBI for deriving all this information?
existing metagenomics datasets
I spent a couple of hours reading metagenomic papers and browsing around for datasets. Here’s a quick list of interesting resources. My naive first look didn’t turn up anything similar to the basic plaintext example above.
MEGAN – Metagenome Analysis Software & sample data
Methods for comparative metagenomics (introducing MEGAN) (paper)
UniFrac software & sample data (Look for the datasets they used to construct the phylogeny trees)
UniFrac paper (paper)
The Marine Viromes of Four Oceanic Regions (paper)
Data from Marine Viromes study (and more!)
NCBI metagenomics book, soil chapter (Waseca County Farm Soil)
16S rDNA identified from the Waseca County Farm Soil dataset (in NCBI’s nt database)
CAMERA: A Community Resource for Metagenomics (their webview of the datafiles looks interesting; the datafiles themselves are just FASTA)
CAMERA: Surface Water Marine Microbial Community Gene Expression project
inspirational infographics:
Travel-time Maps by Chris Lightfoot & Tom Steinberg
Center for Mathematical Modeling infographics by Juan Pablo De Gregorio



