PKAD-R

Knowledge of the pKa values of ionizable protein residues is critical for understanding fundamental protein properties such as structure, function, and interactions. We present PKAD-R, a new version of PKAD and a curated database of experimentally determined protein pKa values. The database builds upon its predecessors, PKAD and PKAD-2, with significant updates and improvements: (1) careful data curation to remove incorrect entries and consolidate redundant entries, offering alternative structures and pKa values for each unique residue; (2) database redesign to enhance usability by adding information such as protein and species names, detailed notes, and sequence identity; and (3) database expansion through the identification of 167 new (95 non-redundant) pKa entries from the literature. The database currently contains 868 unique pKa entries for wild-type structures and 144 for mutant structures, and we aim to keep updating it with new entries. PKAD-R is available as a stand-alone downloadable file and as a web server. It is designed to provide both a set of pKa entries for unique residues suitable for machine learning applications and modularity: alternative pKa values and structures are provided so that users can decide which entries to include.

We encourage experimental investigators to submit their pKa data for inclusion in the database via email to Ana Damjanovic: [email protected]

Please acknowledge the use of PKAD-R in your work:
Ada Y. Chen¹, Shailesh Kumar Panday², Emil Alexov², Bernard R. Brooks¹, Ana Damjanovic³,*
¹ Laboratory of Computational Biology, National Heart, Lung and Blood Institute, NIH, Bethesda, MD 20892
² Department of Physics and Astronomy, Clemson University, Clemson, SC 29634
³ Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218

pKa Data Descriptions

“Protein Name” and “Species”

These two columns give the protein name and species for each entry, enabling efficient searches. By searching on protein name and species, users can quickly determine whether a pKa value from the literature is already included in the database.
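A lookup of this kind can be sketched in a few lines of Python. This is a minimal sketch, assuming the downloadable file can be read as a table with “Protein Name” and “Species” header fields; the helper name `find_entries`, the “pKa” field, and the sample rows are made-up placeholders, not data from PKAD-R.

```python
import csv
import io

def find_entries(rows, protein, species):
    """Return rows whose protein name and species contain the query text (case-insensitive)."""
    protein = protein.strip().lower()
    species = species.strip().lower()
    return [
        row for row in rows
        if protein in row["Protein Name"].lower()
        and species in row["Species"].lower()
    ]

# Made-up placeholder table standing in for the downloaded file (assumed CSV layout).
SAMPLE = """Protein Name,Species,pKa
Example protein A,Homo sapiens,4.1
Example protein B,Gallus gallus,6.5
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
hits = find_entries(rows, "example protein b", "gallus")
print(len(hits), "matching entries")
```

Substring matching is used here so that a partial protein name still finds the entry; exact matching would be stricter but more sensitive to naming variants.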

“pKa Classification”

This column categorizes each entry as “Main,” “Alt. pKa,” “Alt. pKa (mutant),” or “Alt. pKa (state).” An entry labeled “Main” is the recommended pKa value for a given residue, paired with the most appropriate PDB structure. “Alt. pKa” indicates another pKa measurement for the same residue in the same protein as the “Main” entry. “Alt. pKa (mutant)” refers to a pKa measured for the same residue in a mutated version of the protein. “Alt. pKa (state)” denotes a pKa measured in a different state of the protein, such as deoxyhemoglobin versus oxyhemoglobin. For details on the selection of “Alt. pKa (mutant)” and “Alt. pKa (state)” entries, see the “Database Curation” section of the paper.
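As an illustration of how these labels might be used, the sketch below separates “Main” entries from the alternatives. It assumes each row is a dictionary keyed by column name with the labels spelled exactly as above; the function name, the “Residue” field, and the sample rows are hypothetical.

```python
MAIN = "Main"
ALT_LABELS = {"Alt. pKa", "Alt. pKa (mutant)", "Alt. pKa (state)"}

def split_by_classification(rows):
    """Separate rows into recommended ("Main") entries and alternative entries."""
    main = [r for r in rows if r["pKa Classification"].strip() == MAIN]
    alternatives = [r for r in rows if r["pKa Classification"].strip() in ALT_LABELS]
    return main, alternatives

# Hypothetical rows; the residue labels are placeholders, not database content.
rows = [
    {"Residue": "residue-1", "pKa Classification": "Main"},
    {"Residue": "residue-1", "pKa Classification": "Alt. pKa"},
    {"Residue": "residue-1", "pKa Classification": "Alt. pKa (mutant)"},
]

main, alternatives = split_by_classification(rows)
print(len(main), "main;", len(alternatives), "alternative")
```

Keeping only the “Main” rows gives one value per unique residue, while the alternatives remain available for users who want to choose a different measurement or structure.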

“Alternative PDBs”

This column lists additional PDB structures that are similar to the primary structure given in the “PDB” column and can be used in its place.

“Sequence Identity > 30%” and “Sequence Identity > 90%”

These two columns list the chains in this database, in the format PDB-ID.Chain-ID (e.g., “1EX3.A”), whose sequence identity to the current entry’s chain is greater than 30% and 90%, respectively. Sequence identity is calculated using the PairwiseAligner class from the Bio.Align module of the Biopython library.
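One possible use of these columns is to group related chains so that similar proteins do not end up on both sides of a train/test split. The sketch below parses PDB-ID.Chain-ID tokens and clusters chains that are linked to each other. The delimiter between tokens in a cell (comma, semicolon, or whitespace) is an assumption, as are the function names; “XXXX.A” and “YYYY.B” are placeholder identifiers, not real entries.

```python
import re

def parse_chain_list(cell):
    """Split a cell such as '1EX3.A, XXXX.A' into (PDB ID, chain ID) pairs."""
    tokens = [t for t in re.split(r"[,;\s]+", cell.strip()) if t]
    pairs = []
    for token in tokens:
        pdb_id, _, chain_id = token.partition(".")
        pairs.append((pdb_id.upper(), chain_id))
    return pairs

def cluster_chains(neighbors):
    """Group chains into connected components using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for chain, related in neighbors.items():
        find(chain)  # register chains that have no listed relatives
        for other in related:
            parent[find(chain)] = find(other)

    clusters = {}
    for chain in list(parent):
        clusters.setdefault(find(chain), set()).add(chain)
    return list(clusters.values())

# Each key is an entry's own chain; each value is its parsed identity column.
neighbors = {
    ("1EX3", "A"): parse_chain_list("XXXX.A"),
    ("XXXX", "A"): parse_chain_list("1EX3.A"),
    ("YYYY", "B"): parse_chain_list(""),
}
clusters = cluster_chains(neighbors)
print(len(clusters), "clusters")
```

Assigning whole clusters to either the training or the test set, rather than individual entries, is one way to limit leakage between homologous chains.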

“ResID in PDB” and “ResID in pKa paper”

The column “ResID in PDB” lists the residue ID as it appears in the corresponding PDB structure, while “ResID in pKa paper” indicates the residue ID referenced in the original pKa publication when it differs from the PDB.
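When matching database entries back to the literature, it can help to resolve a single residue ID per entry. The following sketch assumes that “ResID in pKa paper” is left empty when the publication and the PDB structure agree; that assumption, the function name, and the sample values are illustrative only.

```python
def residue_id_in_paper(row):
    """Return the residue ID used in the publication, falling back to the PDB numbering."""
    paper_id = (row.get("ResID in pKa paper") or "").strip()
    return paper_id or row["ResID in PDB"].strip()

# Placeholder rows: the first has matching numbering, the second does not.
same_numbering = {"ResID in PDB": "35", "ResID in pKa paper": ""}
shifted_numbering = {"ResID in PDB": "35", "ResID in pKa paper": "52"}

print(residue_id_in_paper(same_numbering), residue_id_in_paper(shifted_numbering))
```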

“Notes”

This column provides, for each residue, details about the selection of the most appropriate pKa value and PDB structure for the “Main” entry, as well as any additional relevant information.

“Warning”

This column flags entries under four conditions: (1) the pKa is a range or an approximation (labeled “pKa: range or ~”); (2) the residue is the C-terminus or N-terminus (labeled “C/N-term”); (3) the residue is present in the protein but missing from the PDB structure (labeled “ResID NOT exist”), likely because high flexibility and disorder make accurate structural definition difficult; (4) a mutated structure is used for a wild-type protein, but the mutation is distant from the targeted residue, so the structure can be treated approximately as wild type (labeled “approx. WT”). This column helps users quickly filter out entries that may not meet their needs, such as those unsuitable for direct use in machine learning studies.
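Filtering on this column might look like the sketch below. It assumes the warning labels appear verbatim in a “Warning” field and that a cell may hold more than one label, which is why substring matching is used; the function name, the default exclusion list, and the sample rows are hypothetical choices.

```python
# Labels excluded by default; "approx. WT" is kept here, which is a judgment call.
EXCLUDE = ("pKa: range or ~", "C/N-term", "ResID NOT exist")

def passes_warning_filter(row, excluded=EXCLUDE):
    """Return True if none of the excluded labels appears in the row's Warning cell."""
    warning = (row.get("Warning") or "").strip()
    return not any(label in warning for label in excluded)

# Placeholder rows covering an empty cell, a tolerated label, and excluded labels.
rows = [
    {"Entry": "a", "Warning": ""},
    {"Entry": "b", "Warning": "approx. WT"},
    {"Entry": "c", "Warning": "pKa: range or ~"},
    {"Entry": "d", "Warning": "C/N-term, approx. WT"},
]

kept = [row["Entry"] for row in rows if passes_warning_filter(row)]
print(kept)
```

Passing a different `excluded` tuple lets a user tighten or relax the filter, for example by also dropping “approx. WT” entries when only true wild-type structures are acceptable.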