NPdia Help & Documentation

Reference guide for understanding and using the database

Overview

NPdia is a manually curated database of type I polyketide synthase (T1PKS) and nonribosomal peptide synthetase (NRPS) biosynthetic pathways from actinomycetes. Each entry represents a complete biosynthetic pathway reconstructed from published literature, with every intermediate captured as a SMILES string.

Column Descriptions

MIBiG ID

The biosynthetic gene cluster (BGC) identifier from the MIBiG repository. Clicking the ID opens the corresponding MIBiG entry.

Compound

The name of the natural product(s) produced by the BGC, along with the producing organism.

Biosynthetic Class

The enzyme class responsible for biosynthesis: T1PKS, NRPS, or Hybrid (PKS-NRPS).

Steps

The total number of biosynthetic steps curated for the pathway, from starter unit loading to the final scaffold product.

Pathway Table Columns

Order

The sequential step number within the biosynthetic pathway. Steps are numbered starting from 1.

Enzyme

The gene or protein responsible for catalysing the reaction at this step, as annotated in the corresponding MIBiG GenBank file.

Module

The module number within the assembly line.

0 — Loading module (starter unit is loaded onto the first ACP/T domain)
1, 2, 3, … — Elongation modules in order
TE — Thioesterase domain (product release)
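
Because the Module column mixes integers with the label TE, a plain text sort would place 10 before 2 and may not put TE last. The following is a minimal sketch of a sort key, assuming the values appear exactly as listed above; the function name module_sort_key is illustrative and not part of NPdia, and the handling of any other value (for example a blank cell) is an assumption.

```python
def module_sort_key(module):
    """Order Module values as 0, 1, 2, ..., then TE, then anything else."""
    text = str(module).strip()
    if text.upper() == "TE":
        return (1, 0)          # thioesterase release step comes after all modules
    try:
        return (0, int(text))  # loading module 0 and numbered elongation modules
    except ValueError:
        return (2, 0)          # assumption: unrecognised or blank values go last


modules = ["TE", "10", "2", "0", "1"]
ordered = sorted(modules, key=module_sort_key)
```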

Nonlinearity

Describes any deviation from standard linear module activity. This field is left blank for standard elongation steps. Possible annotations include:

Annotation	Description
`Inactive: [domain]`	The specified domain (e.g. KR, DH, ER) is present in the gene sequence but non-functional at this step
`Missing: [domain]`	The specified domain is absent from the gene sequence but its chemical transformation is observed in the product, suggesting activity in trans or by an uncharacterised enzyme
`transAT: [gene, substrate]`	The AT domain is provided in trans by a separate enzyme rather than being part of the module itself
`Iteration: [domains, substrate]`	The module is used iteratively; the domains and substrate used in each iteration are listed
`ModuleSkip`	This module is skipped in the biosynthesis of this particular product
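
The annotations above can be split into a keyword and a list of arguments for programmatic filtering. This is a rough sketch only: it assumes each cell holds a single annotation, that arguments are comma-separated, and that the square brackets may or may not be written literally. The function name parse_nonlinearity and the gene name geneX are hypothetical.

```python
def parse_nonlinearity(cell):
    """Return (kind, args) for a Nonlinearity cell, or None if it is blank."""
    text = (cell or "").strip()
    if not text:
        return None  # blank means a standard elongation step
    kind, _, rest = text.partition(":")
    rest = rest.strip()
    # Tolerate literal square brackets around the argument list.
    if rest.startswith("[") and rest.endswith("]"):
        rest = rest[1:-1]
    args = [arg.strip() for arg in rest.split(",") if arg.strip()]
    return kind.strip(), args


# Hypothetical cell contents, for illustration only.
example = parse_nonlinearity("transAT: [geneX, Malonyl-CoA]")
```

With this shape, rows can be filtered by the first element of the returned tuple, for example to collect every step annotated as Inactive.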

Substrate

The compound(s) required to produce the intermediate shown in the Product column of the same row.

If multiple compounds are listed separated by ;, all are required (AND logic). For example, 1-3;Malonyl-CoA means that the intermediate with Product_ID 1-3 is extended by Malonyl-CoA.
If compounds are listed separated by / within parentheses, any one of them may be used (OR logic), and the variable position is represented as an R-group ([R]) in the SMILES.
A ? indicates that the substrate identity is unknown from the available literature.
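
The three rules above can be sketched as a small parser. It assumes that an OR group occupies a whole ;-separated entry and that compound names contain no semicolons; neither is guaranteed by the database. The function name parse_substrate and the OR example are hypothetical.

```python
def parse_substrate(cell):
    """Split a Substrate cell into a list of AND-ed requirements.

    Each requirement is a list of alternatives: one item for a fixed
    substrate, several items for an OR group, and an empty list for '?'.
    """
    requirements = []
    for token in (cell or "").split(";"):
        token = token.strip()
        if not token:
            continue
        if token == "?":
            requirements.append([])  # substrate unknown from the literature
        elif token.startswith("(") and token.endswith(")") and "/" in token:
            alternatives = token[1:-1].split("/")
            requirements.append([alt.strip() for alt in alternatives])
        else:
            requirements.append([token])
    return requirements


and_example = parse_substrate("1-3;Malonyl-CoA")
# Hypothetical OR group, for illustration only.
or_example = parse_substrate("1-3;(Malonyl-CoA/Methylmalonyl-CoA)")
```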

Product (SMILES)

The SMILES string representing the biosynthetic intermediate produced at this step. See the SMILES Notation section below for details.

Product_ID

A unique identifier for the intermediate produced at this step, in the format [BGC_number]-[step_number] (e.g. 55-3). These IDs are used in the Substrate column of subsequent steps to indicate which intermediate is carried forward, enabling tracing of the full biosynthetic trajectory.
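
As an illustration of this tracing, the sketch below walks backwards from a given Product_ID by looking for Product_ID-shaped entries in the Substrate column. It assumes rows are dictionaries keyed by the column names, that the BGC number is purely numeric, and that each step carries forward a single intermediate; convergent pathways would need a fuller graph traversal. The function name trace_back is hypothetical.

```python
import re

# Matches the [BGC_number]-[step_number] pattern, e.g. 55-3.
PRODUCT_ID = re.compile(r"^\d+-\d+$")


def trace_back(rows, product_id):
    """Return the chain of Product_IDs leading to product_id, earliest first."""
    by_id = {row["Product_ID"]: row for row in rows}
    chain = []
    seen = set()  # guards against accidental reference loops
    current = product_id
    while current in by_id and current not in seen:
        seen.add(current)
        chain.append(current)
        tokens = [t.strip() for t in by_id[current]["Substrate"].split(";")]
        parents = [t for t in tokens if PRODUCT_ID.match(t)]
        # Assumption: follow only the first referenced intermediate.
        current = parents[0] if parents else None
    return chain[::-1]
```

Called on the rows of one pathway with the Product_ID of its last step, this returns the identifiers of every upstream intermediate in order.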

SMILES Notation

All SMILES strings in NPdia were generated by manually tracing each biosynthetic step from the primary literature and drawing the corresponding intermediate structure in ChemDraw, followed by conversion to SMILES format.

Key conventions

Thioester terminus: The growing PKS chain is anchored to an acyl carrier protein (ACP) domain via a thioester bond during biosynthesis. In NPdia, the thioester-ACP linkage (–C(=O)–S–ACP) is represented as a free carboxylic acid (–C(=O)–OH) for simplicity and compatibility with standard cheminformatics tools.
R-groups: When a substrate step involves an OR branch (see Substrate above), the variable attachment point is denoted using standard R-group notation (e.g. [R], [R1], [R2]). These can be replaced with the appropriate substructure depending on which branch is followed.
Stereochemistry: Where stereochemistry is specified in the source literature, it is encoded in the SMILES using standard @ and @@ notation.
Incomplete structures: For steps where the full intermediate structure is not reported in the literature, the SMILES may be partial or absent. Unreported structures are left blank rather than inferred.
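
R-group labels such as [R] or [R1] are not element symbols, so some SMILES parsers may reject strings that contain them. One possible workaround, sketched below, rewrites each label as the SMILES wildcard atom (* or [*:n] with an atom class) before parsing. This is not an NPdia feature; the function name r_groups_to_wildcards and the example strings are hypothetical.

```python
import re

# Matches [R], [R1], [R2], ... as written in the Product (SMILES) column.
R_GROUP = re.compile(r"\[R(\d*)\]")


def r_groups_to_wildcards(smiles):
    """Rewrite R-group labels as wildcard atoms, keeping any numeric label."""
    def replace(match):
        number = match.group(1)
        return f"[*:{number}]" if number else "*"
    return R_GROUP.sub(replace, smiles)


# Hypothetical intermediates, for illustration only.
single = r_groups_to_wildcards("CC([R])C(=O)O")
numbered = r_groups_to_wildcards("[R1]C(=O)CC([R2])C(=O)O")
```

Strings without R-group labels pass through unchanged, so the function can be applied to the whole column.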

Downloading Data

The full NPdia dataset is available for download from the Download page in Excel format (.xlsx).

The downloaded file contains all pathway entries with the following columns: MIBiG ID, Compound, Class, Order, Enzyme, Module, Nonlinearity, Substrate, Product (SMILES), Product_ID, and associated metadata.
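
Because the file stores one row per biosynthetic step, a common first operation is to regroup the rows into pathways. The sketch below assumes the rows have already been read into dictionaries keyed by the column names listed above (for instance via pandas.read_excel followed by to_dict("records")), and that Order is an integer or an integer-like string. The function name group_pathways is hypothetical.

```python
from collections import defaultdict


def group_pathways(rows):
    """Group step rows by MIBiG ID and sort each pathway by Order."""
    pathways = defaultdict(list)
    for row in rows:
        pathways[row["MIBiG ID"]].append(row)
    for steps in pathways.values():
        steps.sort(key=lambda row: int(row["Order"]))
    return dict(pathways)
```

The length of each list should then correspond to the Steps value shown for that entry, which offers a quick consistency check after loading.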

Potential use cases

Machine learning: The curated intermediate SMILES can be used to train or benchmark models for biosynthesis prediction, retrosynthesis, or domain function annotation.
Pathway engineering: Step-by-step intermediates enable identification of branch points for combinatorial biosynthesis and rational pathway modification.
Cheminformatics: The dataset is compatible with standard toolkits such as RDKit and CDK for substructure analysis, similarity searching, and reaction mapping.
Cross-referencing: Product_IDs and MIBiG IDs allow direct integration with MIBiG and other BGC databases.

Contact

For questions, error reports, or suggestions for new entries, please contact:

[Contact email or GitHub Issues link — to be filled in]