Archived Data

Mining the Literature

MPDB will contain two main "branches" of data. The first is owned (or private) data, which will be entered under the "project" umbrella and be available for sharing and editing by collaborators within the project. The other branch is archived data: data which has been extracted from peer-reviewed, published manuscripts, and which will, initially, consist of the following data components:

  1. A unique publication identifier (GeoRef Accession Number (AN)) - the AN will uniquely identify article title, authors, serial/book/other title, volume and pagination, and publisher information if included
  2. Location coordinates for all unique samples identified in the publication
  3. A corresponding rock type for each unique sample identified in the publication
  4. If given, the mineral assemblage for each unique sample identified in the publication
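The four components above could be held in a record along the following lines. This is only a minimal sketch: the class and field names (ArchivedPublication, ArchivedSample, georef_an, and so on) are hypothetical and not part of any MPDB schema, coordinates are assumed to be decimal-degree latitude/longitude, and the values are placeholders rather than data from a real article.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ArchivedSample:
    """One unique sample reported in a publication (hypothetical layout)."""
    sample_id: str                         # sample label as printed in the article
    location: Tuple[float, float]          # assumed (latitude, longitude), decimal degrees
    rock_type: str                         # e.g. "pelite", "amphibolite"
    minerals: Optional[List[str]] = None   # None when no assemblage is given


@dataclass
class ArchivedPublication:
    """Minimal archive record keyed on the GeoRef Accession Number."""
    georef_an: str                         # GeoRef AN, stored as an opaque string
    samples: List[ArchivedSample] = field(default_factory=list)

    def add_sample(self, sample: ArchivedSample) -> None:
        # Sample labels must be unique within one publication.
        if any(s.sample_id == sample.sample_id for s in self.samples):
            raise ValueError(f"duplicate sample id: {sample.sample_id}")
        self.samples.append(sample)


# Placeholder values only, not taken from any real article.
pub = ArchivedPublication(georef_an="AN-PLACEHOLDER")
pub.add_sample(ArchivedSample("S-1", (44.5, -72.6), "pelite", ["garnet", "biotite"]))
pub.add_sample(ArchivedSample("S-2", (44.6, -72.7), "amphibolite"))
print(len(pub.samples), "samples recorded for", pub.georef_an)
```

Keeping the mineral assemblage optional mirrors item 4, where the assemblage is recorded only if the publication gives it.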
Once these basic items are extracted from the publication and added to the database, the data will be mined and transferred to the database. This data consists of a) images and b) numeric data. For initial population of the database, it is expected that images mined and ported to the database will be limited to:
  • maps including sample locations
  • outcrop or hand-sample photographs
  • photomicrographs (binocular, unpolarized, thin-section scans, or plane-polarized or crossed-nicols images in transmitted or reflected light)
  • outcrop, hand-sample, or thin section sketches (in older publications)
  • various types of electron-probe imagery (SE, BSE, TE, EBSD, OC, etc.)
  • X-ray element distribution maps
This list is clearly not exhaustive, and doesn't touch on a large number of image types common to publications (Cartesian or barycentric element variation diagrams, spectra, structural measurement diagrams, contoured composition diagrams, grain analytical traverses, etc.). What this list does is emphasize spatial relationships between the sample, constituent subsamples, and the chemical data collected from these subsamples. Additionally, MPDB will include a user tool kit that will allow production of basic Cartesian or barycentric composition plots from the included data, so inclusion of these image types is redundant.
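To illustrate why barycentric plots can be regenerated from stored compositions rather than archived as images, here is a minimal sketch of the standard ternary-to-Cartesian transform. The function name ternary_to_cartesian and the vertex layout (A at the lower left, B at the lower right, C at the top) are assumptions made for this example; this is not the MPDB tool kit itself.

```python
import math
from typing import Tuple


def ternary_to_cartesian(a: float, b: float, c: float) -> Tuple[float, float]:
    """Map three non-negative components onto a unit-edge ternary diagram.

    Assumed vertex layout: A = (0, 0), B = (1, 0), C = (0.5, sqrt(3)/2).
    Components are normalized to sum to 1 before plotting.
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("components must sum to a positive value")
    b_n = b / total
    c_n = c / total
    x = b_n + 0.5 * c_n               # B pulls right, C pulls halfway right
    y = (math.sqrt(3) / 2.0) * c_n    # only C contributes to height
    return x, y


# Pure end-members land on the triangle's corners.
print(ternary_to_cartesian(1, 0, 0))
print(ternary_to_cartesian(0, 1, 0))
print(ternary_to_cartesian(0, 0, 1))
```

The resulting (x, y) pairs can be passed to any ordinary scatter-plot routine, which is all a basic barycentric composition plot requires.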

The numeric data extracted from the publications will (at least initially) be restricted to compositional data from whole rock analyses or analyses of individual mineral phases: either in situ spot analyses or, for older publications, wet chemistry. This excludes isotopic data and all age data, but as the database matures, these data types may be included as well. Furthermore, no derivative data will be included (at least initially) in the database. This includes secondary data (e.g., element ratios) and tertiary (also interpretive) data, such as pressure-temperature estimates. Again, the MPDB tool kit included with the database will allow users to calculate secondary data (e.g., ratios) and to make estimations of extensive or intensive parameters which require end-user interpretation.
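As a rough illustration of why secondary data need not be stored, an element ratio can be recomputed on demand from the primary analysis. The sketch below assumes an analysis is available as a mapping of element symbol to cation proportion; the function name element_ratio and the numbers are hypothetical and invented purely for illustration.

```python
from typing import Dict, Iterable


def element_ratio(cations: Dict[str, float],
                  numerator: str,
                  denominator: Iterable[str]) -> float:
    """Return numerator / sum(denominator elements) for one analysis."""
    total = sum(cations[el] for el in denominator)
    if total == 0:
        raise ValueError("denominator elements sum to zero")
    return cations[numerator] / total


# Made-up cation proportions, for illustration only.
analysis = {"Fe": 2.1, "Mg": 0.6, "Ca": 0.2, "Mn": 0.1}

# Fe/(Fe+Mg), derived from the stored primary data rather than stored itself.
fe_ratio = element_ratio(analysis, "Fe", ["Fe", "Mg"])
print(round(fe_ratio, 3))
```

Because the ratio is derived at query time, any correction to the primary analysis propagates automatically, which is one argument for keeping derivative data out of the archive.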

Scope of the initial DB population with archived data

In the first 100-200 articles to be mined for inclusion in MPDB, we would like to achieve a broad survey of the metamorphic petrology literature. To achieve this end (inasmuch as it is possible), we will strive for a balance of

  • rock types (pelites will clearly dominate, but amphibolite, carbonate or calc-silicate, impure quartzite, and quartzofeldspathic compositions will be included)
  • metamorphic facies (to please those who love illite and sapphirine!)
  • metamorphic environments (contact vs regional metamorphism)
  • geographic location
  • journal
Additionally, we want to make available for sharing the images and compositional data of landmark papers from the metamorphic petrology literature (e.g., Eskola, Goldschmidt, Tilley, Buddington, Miyashiro, Harker, Barrow, Korzhinskii, Albee, etc.) that emphasize petrologic and field relationships as well as the collection of quantitative mineral compositional data.

Problems associated with data extraction

In the initial stages of data mining from the literature, many problems will become apparent relating to the structure of the database and to how individual images or data are correlated with data tags or metadata. For example:

  • If only averaged analyses are presented, how are they treated in the database?
  • When a multi-panel image (say, one with 4 element maps of the same garnet) is extracted, is it extracted as a single image, or broken up into component images? If it is treated as a single image file, how are multiple data labels applied to it - or are they applied at all?
  • How are location coordinates entered for samples with only general locations (say, a dot on a map)? Are bounding coordinates entered, or a best estimate (e.g., DataThief, Google Earth)? Or, is the article excluded from data mining because of the poor sample location control? What about issues with coordinate conversion from archaic projection systems (e.g., Imperial Grid)?
  • How are vagaries in the presentation of the sample mineral assemblage handled? If modes are given, should they be listed, or subsumed into a present/absent designation?
  • Issues with OCR - many articles will require scanning and subsequent text rendering - how will quality control be executed on mineral data entered into the DB, and will the user perform QC before uploading archived data?
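One possible way to frame the location question above is to let a record carry either a point or a bounding box, together with a note on how the location was obtained. This is only a sketch of one option, not a decision: the class names, the "method" labels, and the coordinates are hypothetical, and the box center is a simple average that ignores projection effects and boxes crossing the antimeridian.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BoundingBox:
    south: float   # decimal degrees
    west: float
    north: float
    east: float

    def __post_init__(self) -> None:
        if self.south > self.north or self.west > self.east:
            raise ValueError("bounding box corners are out of order")

    def center(self) -> Tuple[float, float]:
        # Simple midpoint; adequate only for small boxes.
        return ((self.south + self.north) / 2, (self.west + self.east) / 2)


@dataclass(frozen=True)
class SampleLocation:
    method: str                                   # e.g. "reported", "digitized", "bounding_box"
    point: Optional[Tuple[float, float]] = None   # (latitude, longitude)
    box: Optional[BoundingBox] = None

    def best_estimate(self) -> Tuple[float, float]:
        # Prefer an explicit point; fall back to the center of the box.
        if self.point is not None:
            return self.point
        if self.box is not None:
            return self.box.center()
        raise ValueError("location has neither a point nor a bounding box")


# Placeholder coordinates for a sample shown only as a dot on a map.
loc = SampleLocation(method="bounding_box", box=BoundingBox(44.0, -73.0, 45.0, -72.0))
print(loc.best_estimate())
```

Recording the method alongside the coordinates would let users filter out poorly located samples later instead of excluding the article at the mining stage.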
Each data mining operation will carry its own unique issues, but the common issues will become apparent as more data mining is completed. To that end, each data extraction is paired with a text log of issues that crop up during the process, some operational, some interpretive. An example is attached above. Forgive the stream-of-consciousness style, but that's how I did it.

Comments and such are welcome…