data_gatherer Documentation

Data Gatherer is a Python library for automatically extracting dataset references from scientific publications. It processes full-text articles in HTML or XML format and combines rule-based and LLM-based methods to identify and structure dataset citations.

What It Does

- Parses scientific articles from open-access sources such as PubMed Central (PMC).
- Extracts dataset mentions from structured sections (e.g., Data Availability, Supplementary Material).
- Supports two main strategies:
  - Retrieve-Then-Read (RTR): retrieves the relevant sections with hand-crafted rules, then applies an LLM to those sections only.
  - Full-Document Read (FDR): applies an LLM to the full text without section filtering.
- Outputs structured results in JSON format.
- Recognizes known repositories (e.g., GEO, PRIDE, MassIVE) through a configurable ontology.
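
The difference between the two strategies can be illustrated with a small, self-contained sketch. This is not the data_gatherer API: every name below (`REPOSITORY_PATTERNS`, `retrieve_sections`, `read_text`, `retrieve_then_read`, `full_document_read`) is hypothetical, the accession patterns are simplified assumptions, the sample article is made up, and a regex scan stands in for the LLM step so the example runs offline.

```python
import json
import re

# Hypothetical, simplified stand-in for a repository ontology:
# maps a repository name to an assumed accession pattern.
REPOSITORY_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "PRIDE": re.compile(r"\bPXD\d+\b"),
    "MassIVE": re.compile(r"\bMSV\d+\b"),
}

# Hand-crafted rule for the retrieval step: section titles worth reading.
SECTION_KEYWORDS = ("data availability", "supplementary material")


def retrieve_sections(sections):
    """Rule-based retrieval: keep sections whose title contains a keyword."""
    return [
        s for s in sections
        if any(k in s["title"].lower() for k in SECTION_KEYWORDS)
    ]


def read_text(text):
    """Stand-in for the LLM step: a regex scan, NOT a real model call."""
    found = []
    for repo, pattern in REPOSITORY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(
                {"dataset_identifier": match.group(0), "repository": repo}
            )
    return found


def retrieve_then_read(sections):
    """RTR: filter sections by rule first, then read only those sections."""
    relevant = retrieve_sections(sections)
    return read_text("\n".join(s["text"] for s in relevant))


def full_document_read(sections):
    """FDR: read every section, with no filtering."""
    return read_text("\n".join(s["text"] for s in sections))


# Made-up article with placeholder accession values.
article = [
    {"title": "Methods", "text": "Spectra were compared against PXD000001."},
    {"title": "Data Availability", "text": "Raw reads are deposited in GEO under GSE12345."},
]

rtr_results = retrieve_then_read(article)
fdr_results = full_document_read(article)
print(json.dumps({"rtr": rtr_results, "fdr": fdr_results}, indent=2))
```

In this toy input, RTR sees only the Data Availability section, while FDR also picks up the identifier mentioned in Methods. That is the general trade-off: RTR sends less text to the reader step, and FDR can catch mentions outside the expected sections.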

Use Cases

- Helping data curators and librarians identify datasets cited in publications.
- Supporting meta-analysis and secondary data discovery.
- Enabling dataset indexing and retrieval across the open-access literature.

You can find the source code in our GitHub repository.

User Guide

API Reference

- DataGatherer