Data Importation and Representation¶

Authors:

Tyrone Lee, Department of Biomedical Informatics, Harvard Medical School (tyrone_lee@hms.harvard.edu)
Tram Nguyen, Department of Biomedical Informatics, Harvard Medical School (Tram_Nguyen@hms.harvard.edu)
Pascal Notin, Department of Systems Biology, Harvard Medical School
Aaron W Kollasch, Department of Systems Biology, Harvard Medical School
Yilan Wang, Department of Systems Biology, Harvard Medical School
Debora Marks, Department of Systems Biology, Harvard Medical School
Ludwig Geistlinger, Department of Biomedical Informatics, Harvard Medical School

Package: ProteinGymPy

Date: November 17, 2025

Setup¶

Load the module and all required dependencies used in the vignette.

In [1]:

import proteingympy as pg
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Configure matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

/home/icarus/projects/ProteinGymPy/.venv/lib/python3.13/site-packages/nglview/__init__.py:12: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

Introduction¶

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins to address our most pressing challenges in climate, agriculture and healthcare. Despite an increase in machine learning-based protein modeling methods, assessing the effectiveness of these models is problematic due to the use of distinct, often contrived, experimental datasets and variable performance across different protein families.

ProteinGym is a large-scale, holistic set of benchmarks designed for protein fitness prediction and design, curated by Notin et al. (2023). It encompasses a broad collection of over 250 standardized deep mutational scanning (DMS) assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. Furthermore, ProteinGym integrates the performance of a diverse set of over 70 high-performing models from various subfields (e.g., mutation effects, inverse folding) into a unified benchmark.

ProteinGym datasets are publicly available as a community resource both on Zenodo and the official ProteinGym website under the MIT license.

Available datasets¶

The ProteinGymPy package provides the following analysis-ready datasets from ProteinGym:

DMS assay scores from 217 assays measuring the impact of all possible amino acid substitutions across 186 proteins. Load with get_dms_substitution_data().
AlphaMissense pathogenicity scores for ~1.6 M substitutions in the ProteinGym DMS data. Load with get_alphamissense_proteingym_data().
Reference file containing metadata associated with the 217 DMS assays, such as taxon, protein sequence length, and UniProt ID. Load with get_dms_metadata().
Five model performance metrics ("AUC", "MCC", "NDCG", "Spearman", "Top_recall") for 79 models across 217 assays, calculated on DMS substitutions in a zero-shot setting. Load with get_zero_shot_metrics().
Model scores on the DMS substitutions for 79 models in the zero-shot setting. Load with get_zero_shot_substitution_data().
Two model performance metrics ("Spearman" and "MSE") for 12 models across 217 assays, calculated on DMS substitutions in a semi-supervised setting. Load with get_supervised_metrics().
Model scores on the DMS substitutions for 12 semi-supervised models under 3 cross-validation fold schemes: contiguous, modulo, and random. Load with get_supervised_substitution_data(), selecting the scheme with the "fold_scheme" argument.
PDB files for 197 protein structures, to be used in the plot_structure() 3D visualization function.

Data import¶

ProteinGym data on Zenodo can be obtained through ProteinGymPy via the API.

DMS data¶

Deep mutational scanning is an experimental technique that measures the fitness effects of all possible single mutations in a protein (Fowler and Fields 2014).

For each position in a protein, the amino acid residue is mutated and the fitness effects are recorded. While most mutations tend to be deleterious, some can enhance protein activity. In addition to analyzing single mutations, this method can also examine the effects of multiple mutations, yielding insights into protein structure and function. Overall, DMS scores provide a detailed map of how changes in a protein's sequence affect its function.

Datasets in ProteinGymPy can be easily loaded with built-in functions.

In [2]:

# Load the DICTIONARY of DMS scores for 217 assays
dms_data = pg.get_dms_substitution_data()
print(type(dms_data)) # equivalent to R list
print(dms_data.keys()) # keys are the assay names -> elements names of the R list
print(len(dms_data)) # 217 DMS assays

Using cached file at .cache/DMS_ProteinGym_substitutions.zip.
Querying UniProt API for 185 entries...
<class 'dict'>
dict_keys(['SDA_BACSU_Tsuboyama_2023_1PV0', 'PAI1_HUMAN_Huttinger_2021', 'S22A1_HUMAN_Yee_2023_activity', 'HIS7_YEAST_Pokusaeva_2019', 'AMIE_PSEAE_Wrenbeck_2017', 'ACE2_HUMAN_Chan_2020', 'RDRP_I33A0_Li_2023', 'CASP3_HUMAN_Roychowdhury_2020', 'RL40A_YEAST_Roscoe_2014', 'SRC_HUMAN_Chakraborty_2023_binding-DAS_25uM', 'TRPC_SACS2_Chan_2017', 'GLPA_HUMAN_Elazar_2016', 'BLAT_ECOLX_Jacquier_2013', 'LYAM1_HUMAN_Elazar_2016', 'PABP_YEAST_Melamed_2013', 'Q2N0S5_9HIV1_Haddox_2018', 'DOCK1_MOUSE_Tsuboyama_2023_2M0Y', 'SCN5A_HUMAN_Glazer_2019', 'MYO3_YEAST_Tsuboyama_2023_2BTT', 'CBPA2_HUMAN_Tsuboyama_2023_1O6X', 'RAD_ANTMA_Tsuboyama_2023_2CJJ', 'A4D664_9INFA_Soh_2019', 'CATR_CHLRE_Tsuboyama_2023_2AMI', 'SERC_HUMAN_Xie_2023', 'SRBS1_HUMAN_Tsuboyama_2023_2O2W', 'PPARG_HUMAN_Majithia_2016', 'AICDA_HUMAN_Gajula_2014_3cycles', 'AACC1_PSEAI_Dandage_2018', 'A4_HUMAN_Seuma_2022', 'RS15_GEOSE_Tsuboyama_2023_1A32', 'P53_HUMAN_Giacomelli_2018_WT_Nutlin', 'VG08_BPP22_Tsuboyama_2023_2GP8', 'OXDA_RHOTO_Vanella_2023_expression', 'BLAT_ECOLX_Firnberg_2014', 'AMFR_HUMAN_Tsuboyama_2023_4G3O', 'RL40A_YEAST_Roscoe_2013', 'DYR_ECOLI_Thompson_2019', 'MTHR_HUMAN_Weile_2021', 'TPOR_HUMAN_Bridgford_2020', 'SR43C_ARATH_Tsuboyama_2023_2N88', 'MSH2_HUMAN_Jia_2020', 'NPC1_HUMAN_Erwood_2022_RPE1', 'A0A192B1T2_9HIV1_Haddox_2018', 'MET_HUMAN_Estevam_2023', 'HCP_LAMBD_Tsuboyama_2023_2L6Q', 'CCR5_HUMAN_Gill_2023', 'SPA_STAAU_Tsuboyama_2023_1LP1', 'FECA_ECOLI_Tsuboyama_2023_2D1U', 'A0A140D2T1_ZIKV_Sourisseau_2019', 'PHOT_CHLRE_Chen_2023', 'SPG1_STRSG_Olson_2014', 'TADBP_HUMAN_Bolognesi_2019', 'OXDA_RHOTO_Vanella_2023_activity', 'NKX31_HUMAN_Tsuboyama_2023_2L9R', 'HSP82_YEAST_Mishra_2016', 'EPHB2_HUMAN_Tsuboyama_2023_1F0M', 'TAT_HV1BR_Fernandes_2016', 'HXK4_HUMAN_Gersing_2022_activity', 'KKA2_KLEPN_Melnikov_2014', 'Q837P4_ENTFA_Meier_2023', 'ANCSZ_Hobbs_2022', 'SCIN_STAAR_Tsuboyama_2023_2QFF', 'PA_I34A1_Wu_2015', 'MLAC_ECOLI_MacRae_2023', 'DLG4_HUMAN_Faure_2021', 'ENV_HV1BR_Haddox_2016', 
'CCDB_ECOLI_Tripathi_2016', 'F7YBW8_MESOW_Ding_2023', 'P53_HUMAN_Giacomelli_2018_Null_Etoposide', 'ESTA_BACSU_Nutschel_2020', 'HSP82_YEAST_Flynn_2019', 'PKN1_HUMAN_Tsuboyama_2023_1URF', 'RASK_HUMAN_Weng_2022_abundance', 'CASP7_HUMAN_Roychowdhury_2020', 'UBE4B_HUMAN_Tsuboyama_2023_3L1X', 'P53_HUMAN_Giacomelli_2018_Null_Nutlin', 'DYR_ECOLI_Nguyen_2023', 'CP2C9_HUMAN_Amorosi_2021_activity', 'UBR5_HUMAN_Tsuboyama_2023_1I2T', 'A0A2Z5U3Z0_9INFA_Wu_2014', 'SHOC2_HUMAN_Kwon_2022', 'IF1_ECOLI_Kelsic_2016', 'DNJA1_HUMAN_Tsuboyama_2023_2LO1', 'POLG_HCVJF_Qi_2014', 'A0A1I9GEU1_NEIME_Kennouche_2019', 'BRCA1_HUMAN_Findlay_2018', 'GCN4_YEAST_Staller_2018', 'CSN4_MOUSE_Tsuboyama_2023_1UFM', 'BCHB_CHLTE_Tsuboyama_2023_2KRU', 'RFAH_ECOLI_Tsuboyama_2023_2LCL', 'CALM1_HUMAN_Weile_2017', 'HEM3_HUMAN_Loggerenberg_2023', 'P84126_THETH_Chan_2017', 'GAL4_YEAST_Kitzman_2015', 'A0A247D711_LISMN_Stadelmann_2021', 'I6TAH8_I68A0_Doud_2015', 'NUSA_ECOLI_Tsuboyama_2023_1WCL', 'P53_HUMAN_Kotler_2018', 'CAR11_HUMAN_Meitlis_2020_lof', 'POLG_DEN26_Suphatrakul_2023', 'VRPI_BPT7_Tsuboyama_2023_2WNM', 'UBC9_HUMAN_Weile_2017', 'RCRO_LAMBD_Tsuboyama_2023_1ORC', 'SAV1_MOUSE_Tsuboyama_2023_2YSB', 'ILF3_HUMAN_Tsuboyama_2023_2L33', 'SPTN1_CHICK_Tsuboyama_2023_1TUD', 'NRAM_I33A0_Jiang_2016', 'Q837P5_ENTFA_Meier_2023', 'CBS_HUMAN_Sun_2020', 'DN7A_SACS2_Tsuboyama_2023_1JIC', 'RBP1_HUMAN_Tsuboyama_2023_2KWH', 'HECD1_HUMAN_Tsuboyama_2023_3DKM', 'SPIKE_SARS2_Starr_2020_expression', 'REV_HV1H2_Fernandes_2016', 'SYUA_HUMAN_Newberry_2020', 'YAP1_HUMAN_Araya_2012', 'NCAP_I34A1_Doud_2015', 'D7PM05_CLYGR_Somermeyer_2022', 'TRPC_THEMA_Chan_2017', 'HSP82_YEAST_Cote-Hammarlof_2020_growth-H2O2', 'BLAT_ECOLX_Deng_2012', 'VKOR1_HUMAN_Chiasson_2020_abundance', 'TNKS2_HUMAN_Tsuboyama_2023_5JRT', 'SQSTM_MOUSE_Tsuboyama_2023_2RRU', 'BBC1_YEAST_Tsuboyama_2023_1TG0', 'GFP_AEQVI_Sarkisyan_2016', 'NUD15_HUMAN_Suiter_2020', 'UBE4B_MOUSE_Starita_2013', 'MTH3_HAEAE_RockahShmuel_2015', 'KCNE1_HUMAN_Muhammad_2023_expression', 
'ENVZ_ECOLI_Ghose_2023', 'SRC_HUMAN_Nguyen_2022', 'A4GRB6_PSEAI_Chen_2020', 'MAFG_MOUSE_Tsuboyama_2023_1K1V', 'ARGR_ECOLI_Tsuboyama_2023_1AOY', 'SPIKE_SARS2_Starr_2020_binding', 'R1AB_SARS2_Flynn_2022', 'RASK_HUMAN_Weng_2022_binding-DARPin_K55', 'ODP2_GEOSE_Tsuboyama_2023_1W4G', 'CCDB_ECOLI_Adkar_2012', 'Q59976_STRSQ_Romero_2015', 'SBI_STAAM_Tsuboyama_2023_2JVG', 'THO1_YEAST_Tsuboyama_2023_2WQG', 'A0A2Z5U3Z0_9INFA_Doud_2016', 'POLG_CXB3N_Mattenberger_2021', 'ENV_HV1B9_DuenasDecamp_2016', 'RPC1_BP434_Tsuboyama_2023_1R69', 'TPK1_HUMAN_Weile_2017', 'RL40A_YEAST_Mavor_2016', 'Q53Z42_HUMAN_McShan_2019_binding-TAPBPR', 'Q53Z42_HUMAN_McShan_2019_expression', 'PRKN_HUMAN_Clausen_2023', 'S22A1_HUMAN_Yee_2023_abundance', 'RL20_AQUAE_Tsuboyama_2023_1GYZ', 'SUMO1_HUMAN_Weile_2017', 'DLG4_RAT_McLaughlin_2012', 'RD23A_HUMAN_Tsuboyama_2023_1IFY', 'PPM1D_HUMAN_Miller_2022', 'ISDH_STAAW_Tsuboyama_2023_2LHR', 'BLAT_ECOLX_Stiffler_2015', 'LGK_LIPST_Klesmith_2015', 'PR40A_HUMAN_Tsuboyama_2023_1UZC', 'CD19_HUMAN_Klesmith_2019_FMC_singles', 'SRC_HUMAN_Ahler_2019', 'KCNE1_HUMAN_Muhammad_2023_function', 'ADRB2_HUMAN_Jones_2020', 'OTU7A_HUMAN_Tsuboyama_2023_2L2D', 'HXK4_HUMAN_Gersing_2023_abundance', 'OPSD_HUMAN_Wan_2019', 'NPC1_HUMAN_Erwood_2022_HEK293T', 'GRB2_HUMAN_Faure_2021', 'HMDH_HUMAN_Jiang_2019', 'RCD1_ARATH_Tsuboyama_2023_5OAO', 'KCNJ2_MOUSE_Coyote-Maestas_2022_function', 'RAF1_HUMAN_Zinkus-Boltz_2019', 'RPC1_LAMBD_Li_2019_high-expression', 'NUSG_MYCTU_Tsuboyama_2023_2MI6', 'PSAE_PICP2_Tsuboyama_2023_1PSE', 'ERBB2_HUMAN_Elazar_2016', 'CAPSD_AAV2S_Sinai_2021', 'SC6A4_HUMAN_Young_2021', 'MK01_HUMAN_Brenan_2016', 'GDIA_HUMAN_Silverstein_2021', 'PTEN_HUMAN_Matreyek_2021', 'TPMT_HUMAN_Matreyek_2018', 'F7YBW8_MESOW_Aakre_2015', 'Q8WTC7_9CNID_Somermeyer_2022', 'RNC_ECOLI_Weeks_2023', 'YAIA_ECOLI_Tsuboyama_2023_2KVT', 'B2L11_HUMAN_Dutta_2010_binding-Mcl-1', 'Q6WV12_9MAXI_Somermeyer_2022', 'PITX2_HUMAN_Tsuboyama_2023_2L7M', 'CAS9_STRP1_Spencer_2017_positive', 
'CUE1_YEAST_Tsuboyama_2023_2MYX', 'CP2C9_HUMAN_Amorosi_2021_abundance', 'VILI_CHICK_Tsuboyama_2023_1YU5', 'C6KNH7_9INFA_Lee_2018', 'BRCA2_HUMAN_Erwood_2022_HEK293T', 'SPG2_STRSG_Tsuboyama_2023_5UBS', 'PTEN_HUMAN_Mighell_2018', 'OTC_HUMAN_Lo_2023', 'RPC1_LAMBD_Li_2019_low-expression', 'CAR11_HUMAN_Meitlis_2020_gof', 'PIN1_HUMAN_Tsuboyama_2023_1I6C', 'SPG1_STRSG_Wu_2016', 'TCRG1_MOUSE_Tsuboyama_2023_1E0L', 'SOX30_HUMAN_Tsuboyama_2023_7JJK', 'CBX4_HUMAN_Tsuboyama_2023_2K28', 'KCNH2_HUMAN_Kozek_2020', 'VKOR1_HUMAN_Chiasson_2020_activity', 'OBSCN_HUMAN_Tsuboyama_2023_1V1C', 'KCNJ2_MOUSE_Coyote-Maestas_2022_surface', 'FKBP3_HUMAN_Tsuboyama_2023_2KFV', 'MBD11_ARATH_Tsuboyama_2023_6ACV', 'YNZC_BACSU_Tsuboyama_2023_2JVD', 'RASH_HUMAN_Bandaru_2017', 'POLG_PESV_Tsuboyama_2023_2MXD'])
217

View an example of one DMS assay.

In [3]:

# Grab the first assay
first_key = list(dms_data.keys())[0]
print(type(dms_data[first_key])) # Now we see this is a PANDAS DataFrame
print(dms_data[first_key].head()) # Display the first few rows of the DataFrame


<class 'pandas.core.frame.DataFrame'>
  UniProt_id                         DMS_id mutant  \
0     Q7WY62  SDA_BACSU_Tsuboyama_2023_1PV0   A16C   
1     Q7WY62  SDA_BACSU_Tsuboyama_2023_1PV0   A16D   
2     Q7WY62  SDA_BACSU_Tsuboyama_2023_1PV0   A16E   
3     Q7WY62  SDA_BACSU_Tsuboyama_2023_1PV0   A16F   
4     Q7WY62  SDA_BACSU_Tsuboyama_2023_1PV0   A16G   

                               mutated_sequence  DMS_score DMS_score_bin  
0  MRKLSDELLIESYFKCTEMNLNRDFIELIENEIKRRSLGHIISV  -0.533935             1  
1  MRKLSDELLIESYFKDTEMNLNRDFIELIENEIKRRSLGHIISV  -2.151397             0  
2  MRKLSDELLIESYFKETEMNLNRDFIELIENEIKRRSLGHIISV  -0.870078             1  
3  MRKLSDELLIESYFKFTEMNLNRDFIELIENEIKRRSLGHIISV  -0.328954             1  
4  MRKLSDELLIESYFKGTEMNLNRDFIELIENEIKRRSLGHIISV  -0.961885             1

For each DMS assay, the columns show the UniProt protein identifier, the DMS experiment assay identifier, the amino acid substitution at a given sequence position, the mutated protein sequence, the recorded DMS score (higher values indicate higher fitness), and a binary DMS score bin categorizing the variant as fit (1) or not fit (0). For details, see the get_dms_substitution_data documentation or the reference publication by Notin et al. (2023).

In [4]:

help(pg.get_dms_substitution_data)

Help on function get_dms_substitution_data in module proteingympy.make_dms_substitutions:

get_dms_substitution_data(cache_dir: str = '.cache', use_cache: bool = True) -> Dict[str, pandas.core.frame.DataFrame]
    Download and process ProteinGym DMS substitution data.

    Returns a dictionary of 217 DMS assays, each as a pandas DataFrame with columns:
    - UniProt_id: UniProt accession identifier
    - DMS_id: DMS assay identifier
    - mutant: substitution description (e.g. A1P:D2N)
    - mutated_sequence: full amino acid sequence
    - DMS_score: experimental measurement (higher = more fit)
    - DMS_score_bin: binary fitness (1=fit, 0=not fit)

    Args:
        cache_dir: Directory to cache downloaded files
        use_cache: If True, use cached file if it exists. If False, force a fresh download.

    Returns:
        Dictionary mapping DMS study names to DataFrames
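
The mutant strings shown above (for example A16C, or A1P:D2N for a multi-mutant) can be split into wild-type residue, position, and substituted residue. The following is a minimal sketch, not code from ProteinGymPy. It assumes that positions are 1-based and index directly into the sequence stored in mutated_sequence, which holds for the rows displayed earlier. The helper names are hypothetical, and the wild-type sequence is reconstructed from those rows by restoring A at position 16.

```python
import re

# One substitution looks like <wild-type residue><position><new residue>
SUBSTITUTION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutant(mutant):
    """Split a mutant string such as 'A1P:D2N' into (wt, position, new) tuples."""
    parsed = []
    for token in mutant.split(":"):
        match = SUBSTITUTION_PATTERN.match(token)
        if match is None:
            raise ValueError(f"Unrecognized substitution: {token!r}")
        wild_type, position, substituted = match.groups()
        parsed.append((wild_type, int(position), substituted))
    return parsed


def apply_mutant(wild_type_sequence, mutant):
    """Apply every substitution in `mutant` to the wild-type sequence."""
    residues = list(wild_type_sequence)
    for wild_type, position, substituted in parse_mutant(mutant):
        # Positions are assumed to be 1-based
        if residues[position - 1] != wild_type:
            raise ValueError(f"Expected {wild_type} at position {position}")
        residues[position - 1] = substituted
    return "".join(residues)


# Wild-type sequence inferred from the displayed rows (A restored at position 16)
wild_type_sequence = "MRKLSDELLIESYFKATEMNLNRDFIELIENEIKRRSLGHIISV"

print(parse_mutant("A1P:D2N"))  # [('A', 1, 'P'), ('D', 2, 'N')]
print(apply_mutant(wild_type_sequence, "A16C"))
```

Checking the wild-type residue before substituting catches offset mismatches early, which matters if an assay numbers positions relative to a longer reference sequence.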

Model benchmarking¶

The function pg.benchmark_models() compares performance across several variant effect prediction models, using the DMS data as ground truth. It takes one of the five available metrics and compares the performance of up to 5 of the 79 available models.

In the zero-shot setting, the effects of mutations on fitness are predicted without relying on ground-truth labels for the protein of interest. Robust zero-shot performance is particularly informative when labels are subject to several biases or scarcely available (e.g., labels for rare genetic pathologies).

Model performance was evaluated across 5 metrics:

Spearman's rank correlation coefficient (default metric)
Area Under the ROC Curve (AUC)
Matthews Correlation Coefficient (MCC), most suitable for bimodal DMS measurements
Normalized Discounted Cumulative Gains (NDCG)
Top K Recall (top 10% of DMS values)

To account for the fact that certain protein functions are overrepresented in the list of proteins assayed with DMS (e.g., thermostability), these metrics were first calculated within groups of proteins with similar functions. The final value of the metric is then the average of these averages, giving each functional group equal weight. The final values are referred to as the 'corrected average'.
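
The averaging scheme described above can be sketched as follows. This is an illustration of the idea rather than ProteinGym's own code: the function name, the group labels, and the scores are made-up toy values.

```python
from collections import defaultdict
from statistics import mean


def corrected_average(assay_scores):
    """Average within each functional group, then average the group means.

    `assay_scores` is an iterable of (functional_group, metric_value) pairs.
    Returns the corrected average and the per-group means.
    """
    by_group = defaultdict(list)
    for group, score in assay_scores:
        by_group[group].append(score)
    group_means = {group: mean(scores) for group, scores in by_group.items()}
    return mean(list(group_means.values())), group_means


# Toy values (not real benchmark results): one group is overrepresented
toy_scores = [
    ("Stability", 0.5),
    ("Stability", 0.6),
    ("Stability", 0.7),
    ("Activity", 0.4),
    ("Binding", 0.3),
]

corrected, per_group = corrected_average(toy_scores)
naive = mean([score for _, score in toy_scores])
print(f"naive mean: {naive:.3f}, corrected average: {corrected:.3f}")
```

With these toy numbers the plain mean is 0.5, while the corrected average is about 0.43, because the three assays in the overrepresented group count only once.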

Due to the often non-linear relationship between protein function and organism fitness, the Spearman’s rank correlation coefficient is typically an appropriate choice for evaluating model performance against experimental measurements. However, in situations where DMS measurements exhibit a bimodal profile, rank correlations may not be the optimal choice. Therefore, additional metrics are also provided, such as the Area Under the ROC Curve (AUC) and the Matthews Correlation Coefficient (MCC), which compare binarized model scores and experimental measurements. Furthermore, for certain goals (e.g., optimizing functional properties of designed proteins), it is more important that a model is able to correctly identify the most functional protein variants, rather than properly capture the overall distribution of all assayed variants. For such scenarios, it is beneficial to use the Normalized Discounted Cumulative Gains (NDCG) which prioritizes models that return high scores for sequences with high DMS value (corresponding to strong gain in fitness). Alternatively, the Top K Recall (with K being set to the top 10% of DMS values) can also be informative for such scenarios.
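
To make two of these metrics concrete, here is a minimal pure-Python sketch of Spearman's rank correlation and a top-10% recall. It is not the implementation used by ProteinGym, whose exact thresholding and tie handling may differ, and the scores are toy values.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(dms_scores, model_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(dms_scores), average_ranks(model_scores))


def top_k_recall(dms_scores, model_scores, fraction=0.10):
    """Share of the top-`fraction` DMS variants that the model also ranks on top."""
    n = len(dms_scores)
    k = max(1, math.ceil(fraction * n))
    top_dms = set(sorted(range(n), key=lambda i: dms_scores[i], reverse=True)[:k])
    top_model = set(sorted(range(n), key=lambda i: model_scores[i], reverse=True)[:k])
    return len(top_dms & top_model) / k


# Toy scores: the "model" is a monotonic but non-linear transform of the DMS values
dms = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
model = [value ** 3 for value in dms]
print(spearman(dms, model), top_k_recall(dms, model))
```

Because Spearman's coefficient depends only on ranks, the cubic transform leaves it at a perfect correlation even though the relationship is non-linear, which is the property the paragraph above relies on.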

To view all available zero-shot models, use the function: pg.available_models().

In [5]:

zmodels = pg.available_models()
print(zmodels)

['Site_Independent', 'EVmutation', 'DeepSequence_single', 'DeepSequence_ensemble', 'EVE_single', 'EVE_ensemble', 'Unirep', 'Unirep_evotune', 'MSA_Transformer_single', 'MSA_Transformer_ensemble', 'ESM1b', 'ESM1v_single', 'ESM1v_ensemble', 'ESM2_8M', 'ESM2_35M', 'ESM2_150M', 'ESM2_650M', 'ESM2_3B', 'ESM2_15B', 'Wavenet', 'RITA_s', 'RITA_m', 'RITA_l', 'RITA_xl', 'Progen2_small', 'Progen2_medium', 'Progen2_base', 'Progen2_large', 'Progen2_xlarge', 'GEMME', 'VESPA', 'VESPAl', 'VespaG', 'ProtGPT2', 'Tranception_S_no_retrieval', 'Tranception_M_no_retrieval', 'Tranception_L_no_retrieval', 'Tranception_S', 'Tranception_M', 'Tranception_L', 'TranceptEVE_S', 'TranceptEVE_M', 'TranceptEVE_L', 'CARP_38M', 'CARP_600K', 'CARP_640M', 'CARP_76M', 'MIF', 'MIFST', 'ESM_IF1', 'ProteinMPNN', 'ProtSSN_k10_h512', 'ProtSSN_k10_h768', 'ProtSSN_k10_h1280', 'ProtSSN_k20_h512', 'ProtSSN_k20_h768', 'ProtSSN_k20_h1280', 'ProtSSN_k30_h512', 'ProtSSN_k30_h768', 'ProtSSN_k30_h1280', 'ProtSSN_ensemble', 'SaProt_650M_AF2', 'SaProt_35M_AF2', 'PoET', 'MULAN_small', 'ProSST_20', 'ProSST_128', 'ProSST_512', 'ProSST_1024', 'ProSST_2048', 'ProSST_4096', 'ESCOTT', 'VenusREM', 'RSALOR', 'S2F', 'S2F_MSA', 'S3F', 'S3F_MSA', 'SiteRM']

Now, plot the AUC metric for 5 models.

In [6]:

fig = pg.benchmark_models(
    metric="AUC",
    models=["GEMME", "CARP_600K", "EVmutation", "VESPA", "ProtGPT2"],
)
plt.show()

Downloading benchmarks from https://zenodo.org/records/14997691/files/DMS_benchmark_performance.zip?download=1...
Download complete.
Loading AUC scores...
Loading MCC scores...
Loading NDCG scores...
Loading Spearman scores...
Loading Top_recall scores...
Benchmark data consistency verified.

[Figure: output of benchmark_models() showing AUC for the five selected models]

Here, GEMME performed best, achieving the highest AUC of the 5 selected models. If the user does not specify a metric, Spearman correlation is used by default. For more information about the models and metrics, see the function documentation for benchmark_models().

References¶

Notin, P., et al. (2023). ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. Advances in Neural Information Processing Systems, 36.
Fowler, D. M. and Fields, S. (2014). Deep mutational scanning: a new style of protein science. Nature Methods, 11, 801-807.
Chan, K. K., et al. (2020). Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2. Science, 369(6508), 1261-1265.
Lee, J. M., et al. (2018). Deep mutational scanning of hemagglutinin helps predict evolutionary fates of human H3N2 influenza variants. Proceedings of the National Academy of Sciences, 115(35), E8276-E8285.