
This function queries the CompTox Chemicals Dashboard to download and extract chemical data based on user-defined search criteria. It accepts several input types and can download a wide range of chemical properties, identifiers, and predictive data. It was inspired by the ECOTOXr::websearch_comptox function.

Usage

extr_comptox(
  ids,
  download_items = c("DTXCID", "CASRN", "INCHIKEY", "IUPAC_NAME", "SMILES",
    "INCHI_STRING", "MS_READY_SMILES", "QSAR_READY_SMILES", "MOLECULAR_FORMULA",
    "AVERAGE_MASS", "MONOISOTOPIC_MASS", "QC_LEVEL", "SAFETY_DATA", "EXPOCAST",
    "DATA_SOURCES", "TOXVAL_DATA", "NUMBER_OF_PUBMED_ARTICLES", "PUBCHEM_DATA_SOURCES",
    "CPDAT_COUNT", "IRIS_LINK", "PPRTV_LINK", "WIKIPEDIA_ARTICLE", "QC_NOTES",
    "ABSTRACT_SHIFTER", "TOXPRINT_FINGERPRINT", "ACTOR_REPORT", "SYNONYM_IDENTIFIER",
    "RELATED_RELATIONSHIP", "ASSOCIATED_TOXCAST_ASSAYS", "TOXVAL_DETAILS",
    "CHEMICAL_PROPERTIES_DETAILS", "BIOCONCENTRATION_FACTOR_TEST_PRED",
    "BOILING_POINT_DEGC_TEST_PRED", "48HR_DAPHNIA_LC50_MOL/L_TEST_PRED",
    "DENSITY_G/CM^3_TEST_PRED", "DEVTOX_TEST_PRED",
    "96HR_FATHEAD_MINNOW_MOL/L_TEST_PRED", "FLASH_POINT_DEGC_TEST_PRED",
    "MELTING_POINT_DEGC_TEST_PRED", "AMES_MUTAGENICITY_TEST_PRED",
    "ORAL_RAT_LD50_MOL/KG_TEST_PRED", "SURFACE_TENSION_DYN/CM_TEST_PRED",
    "THERMAL_CONDUCTIVITY_MW/(M*K)_TEST_PRED",
    "TETRAHYMENA_PYRIFORMIS_IGC50_MOL/L_TEST_PRED", "VISCOSITY_CP_CP_TEST_PRED", 
    "VAPOR_PRESSURE_MMHG_TEST_PRED", "WATER_SOLUBILITY_MOL/L_TEST_PRED",
    "ATMOSPHERIC_HYDROXYLATION_RATE_(AOH)_CM3/MOLECULE*SEC_OPERA_PRED",
    "BIOCONCENTRATION_FACTOR_OPERA_PRED",
    "BIODEGRADATION_HALF_LIFE_DAYS_DAYS_OPERA_PRED", "BOILING_POINT_DEGC_OPERA_PRED",
    "HENRYS_LAW_ATM-M3/MOLE_OPERA_PRED", "OPERA_KM_DAYS_OPERA_PRED",
    "OCTANOL_AIR_PARTITION_COEFF_LOGKOA_OPERA_PRED",
    "SOIL_ADSORPTION_COEFFICIENT_KOC_L/KG_OPERA_PRED",
    "OCTANOL_WATER_PARTITION_LOGP_OPERA_PRED", "MELTING_POINT_DEGC_OPERA_PRED", 
    "OPERA_PKAA_OPERA_PRED", "OPERA_PKAB_OPERA_PRED", "VAPOR_PRESSURE_MMHG_OPERA_PRED",
    "WATER_SOLUBILITY_MOL/L_OPERA_PRED",
    "EXPOCAST_MEDIAN_EXPOSURE_PREDICTION_MG/KG-BW/DAY", "NHANES",
    "TOXCAST_NUMBER_OF_ASSAYS/TOTAL", "TOXCAST_PERCENT_ACTIVE"),
  mass_error = 0,
  verify_ssl = FALSE,
  ...
)

Arguments

ids

A character vector containing the items to be searched within the CompTox Chemicals Dashboard. These can be chemical names, CAS Registry Numbers (CASRN), InChIKeys, or DSSTox substance identifiers (DTXSID).

download_items

A character vector of items to be downloaded. The available items cover a comprehensive set of chemical properties, identifiers, predictive data, and other relevant information. By default, all items are downloaded.

DTXCID

The DSSTox chemical structure identifier (DTXCID) used in the EPA's CompTox Chemicals Dashboard.

CASRN

The Chemical Abstracts Service Registry Number, a unique numerical identifier for chemical substances.

INCHIKEY

The hashed version of the full International Chemical Identifier (InChI) string.

IUPAC_NAME

The International Union of Pure and Applied Chemistry (IUPAC) name of the chemical.

SMILES

The Simplified Molecular Input Line Entry System (SMILES) representation of the chemical structure.

INCHI_STRING

The full International Chemical Identifier (InChI) string.

MS_READY_SMILES

The SMILES representation of the chemical structure, prepared for mass spectrometry analysis.

QSAR_READY_SMILES

The SMILES representation of the chemical structure, prepared for quantitative structure-activity relationship (QSAR) modeling.

MOLECULAR_FORMULA

The chemical formula representing the number and type of atoms in a molecule.

AVERAGE_MASS

The average mass of the molecule, calculated based on the isotopic distribution of the elements.

MONOISOTOPIC_MASS

The mass of the molecule calculated using the most abundant isotope of each element.

QC_LEVEL

The quality control level of the data.

SAFETY_DATA

Safety information related to the chemical.

EXPOCAST

Exposure predictions from the EPA's ExpoCast program.

DATA_SOURCES

Sources of the data provided.

TOXVAL_DATA

Toxicological values related to the chemical.

NUMBER_OF_PUBMED_ARTICLES

The number of articles related to the chemical in PubMed.

PUBCHEM_DATA_SOURCES

Sources of data from PubChem.

CPDAT_COUNT

The number of entries in the Chemical and Product Categories Database (CPDat).

IRIS_LINK

Link to the EPA's Integrated Risk Information System (IRIS) entry for the chemical.

PPRTV_LINK

Link to the EPA's Provisional Peer-Reviewed Toxicity Values (PPRTV) entry for the chemical.

WIKIPEDIA_ARTICLE

Link to the Wikipedia article for the chemical.

QC_NOTES

Notes related to the quality control of the data.

ABSTRACT_SHIFTER

Query information for the EPA's Abstract Sifter literature-search tool, returned as comptox_abstract_sifter.

TOXPRINT_FINGERPRINT

The ToxPrint chemoinformatics fingerprint of the chemical.

ACTOR_REPORT

The Aggregated Computational Toxicology Resource (ACTOR) report for the chemical.

SYNONYM_IDENTIFIER

Identifiers for synonyms of the chemical.

RELATED_RELATIONSHIP

Information on related chemicals.

ASSOCIATED_TOXCAST_ASSAYS

Assays associated with the chemical in the ToxCast database.

TOXVAL_DETAILS

Details of toxicological values.

CHEMICAL_PROPERTIES_DETAILS

Details of the chemical properties.

BIOCONCENTRATION_FACTOR_TEST_PRED

Predicted bioconcentration factor from the EPA's Toxicity Estimation Software Tool (T.E.S.T.).

BOILING_POINT_DEGC_TEST_PRED

Predicted boiling point in degrees Celsius from T.E.S.T.

48HR_DAPHNIA_LC50_MOL/L_TEST_PRED

Predicted 48-hour LC50 for Daphnia in mol/L from T.E.S.T.

DENSITY_G/CM^3_TEST_PRED

Predicted density in g/cm³ from T.E.S.T.

DEVTOX_TEST_PRED

Predicted developmental toxicity from T.E.S.T.

96HR_FATHEAD_MINNOW_MOL/L_TEST_PRED

Predicted 96-hour LC50 for fathead minnow in mol/L from T.E.S.T.

FLASH_POINT_DEGC_TEST_PRED

Predicted flash point in degrees Celsius from T.E.S.T.

MELTING_POINT_DEGC_TEST_PRED

Predicted melting point in degrees Celsius from T.E.S.T.

AMES_MUTAGENICITY_TEST_PRED

Predicted Ames mutagenicity from T.E.S.T.

ORAL_RAT_LD50_MOL/KG_TEST_PRED

Predicted oral LD50 for rats in mol/kg from T.E.S.T.

SURFACE_TENSION_DYN/CM_TEST_PRED

Predicted surface tension in dyn/cm from T.E.S.T.

THERMAL_CONDUCTIVITY_MW/(M*K)_TEST_PRED

Predicted thermal conductivity in mW/(m·K) from T.E.S.T.

TETRAHYMENA_PYRIFORMIS_IGC50_MOL/L_TEST_PRED

Predicted IGC50 for Tetrahymena pyriformis in mol/L from T.E.S.T.

VISCOSITY_CP_CP_TEST_PRED

Predicted viscosity in cP from T.E.S.T.

VAPOR_PRESSURE_MMHG_TEST_PRED

Predicted vapor pressure in mmHg from T.E.S.T.

WATER_SOLUBILITY_MOL/L_TEST_PRED

Predicted water solubility in mol/L from T.E.S.T.

ATMOSPHERIC_HYDROXYLATION_RATE_(AOH)_CM3/MOLECULE*SEC_OPERA_PRED

Predicted atmospheric hydroxylation rate in cm³/(molecule·s) from OPERA.

BIOCONCENTRATION_FACTOR_OPERA_PRED

Predicted bioconcentration factor from OPERA.

BIODEGRADATION_HALF_LIFE_DAYS_DAYS_OPERA_PRED

Predicted biodegradation half-life in days from OPERA.

BOILING_POINT_DEGC_OPERA_PRED

Predicted boiling point in degrees Celsius from OPERA.

HENRYS_LAW_ATM-M3/MOLE_OPERA_PRED

Predicted Henry's law constant in atm-m³/mole from OPERA.

OPERA_KM_DAYS_OPERA_PRED

Predicted fish biotransformation half-life (Km) in days from OPERA.

OCTANOL_AIR_PARTITION_COEFF_LOGKOA_OPERA_PRED

Predicted octanol-air partition coefficient (log Koa) from OPERA.

SOIL_ADSORPTION_COEFFICIENT_KOC_L/KG_OPERA_PRED

Predicted soil adsorption coefficient (Koc) in L/kg from OPERA.

OCTANOL_WATER_PARTITION_LOGP_OPERA_PRED

Predicted octanol-water partition coefficient (log P) from OPERA.

MELTING_POINT_DEGC_OPERA_PRED

Predicted melting point in degrees Celsius from OPERA.

OPERA_PKAA_OPERA_PRED

Predicted pKa (acidic) from OPERA.

OPERA_PKAB_OPERA_PRED

Predicted pKa (basic) from OPERA.

VAPOR_PRESSURE_MMHG_OPERA_PRED

Predicted vapor pressure in mmHg from OPERA.

WATER_SOLUBILITY_MOL/L_OPERA_PRED

Predicted water solubility in mol/L from OPERA.

EXPOCAST_MEDIAN_EXPOSURE_PREDICTION_MG/KG-BW/DAY

Predicted median exposure from ExpoCast in mg/kg-bw/day.

NHANES

National Health and Nutrition Examination Survey data.

TOXCAST_NUMBER_OF_ASSAYS/TOTAL

Number of ToxCast assays in which the chemical was active, out of the total number tested.

TOXCAST_PERCENT_ACTIVE

Percentage of active assays in ToxCast.

mass_error

Numeric value indicating the mass error tolerance for searches involving mass data. Default is 0.

verify_ssl

Logical value indicating whether SSL certificates should be verified. Default is FALSE. Note that this argument is not used on Linux.

...

Additional arguments passed to httr2::req_options(). Note that these arguments are not used on Linux.
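
The arguments above can be combined to request only a few items rather than the full default set. The following is a minimal sketch, assuming the package that provides extr_comptox() is already attached and the EPA servers are reachable; the guard around the call is an illustrative choice, not part of the function, and simply keeps the block harmless in non-interactive sessions.

```r
# Identifiers may be mixed: chemical names, CASRNs, InChIKeys, or DTXSIDs.
ids <- c("Aspirin", "50-00-0")

# Request a small subset of items instead of the full default vector.
# All names below are taken from the download_items list above.
items <- c("CASRN", "SMILES", "MOLECULAR_FORMULA", "AVERAGE_MASS")

# Only query the servers when the function is available and the session is
# interactive (assumption: the providing package has been attached).
if (exists("extr_comptox", mode = "function") && interactive()) {
  res <- extr_comptox(ids = ids, download_items = items)
  print(names(res))
}
```

Restricting download_items in this way should reduce the amount of data requested, which may help when only identifiers or a handful of properties are needed.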

Value

A named list of cleaned data frames (tibbles) containing the requested data from CompTox, for example comptox_main_data and comptox_toxval_details.
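
The output printed under Examples suggests that the result is a named list with one table per data category. The sketch below shows one way such an object might be explored. The `res` object here is a small hand-built stand-in with hypothetical shape (only a few columns, taken from the example output), so the block runs without contacting the EPA servers; a real result contains many more columns and elements.

```r
# Stand-in for a real result (assumption: mimics only a fragment of the
# structure shown in the Examples output).
res <- list(
  comptox_main_data = data.frame(
    INPUT = c("Aspirin", "50-00-0"),
    PREFERRED_NAME = c("Aspirin", "Formaldehyde"),
    CASRN = c("50-78-2", "50-00-0"),
    stringsAsFactors = FALSE
  ),
  comptox_abstract_sifter = data.frame(
    DSSTOX_LINK_TO_DASHBOARD = c("DTXSID5020108", "DTXSID7020637"),
    PREFERRED_NAME = c("Aspirin", "Formaldehyde"),
    stringsAsFactors = FALSE
  )
)

# List the available tables and their row counts.
table_rows <- vapply(res, nrow, integer(1))
print(table_rows)

# Pull one table and filter it by the original search term.
main <- res$comptox_main_data
aspirin_row <- main[main$INPUT == "Aspirin", c("PREFERRED_NAME", "CASRN")]
print(aspirin_row)
```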

Details

Please note that this function pulls data from EPA servers and may fail on some Linux systems. Those servers do not support secure renegotiation and fall back on legacy renegotiation, which newer versions of curl and OpenSSL (the libraries this function depends on under Linux) reject as unsafe by default. One workaround is to downgrade to curl v7.78.0 and OpenSSL v1.1.1. However, be aware that these older versions may introduce security vulnerabilities. Refer to this gist for instructions on how to downgrade curl and OpenSSL on Ubuntu.
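
Before attempting a downgrade, it may help to check which platform and library versions the current R session uses. The sketch below relies on the curl R package (curl::curl_version()) and base R only. The OpenSSL 3 check is a rough heuristic introduced here for illustration, not something the function itself performs.

```r
# Report the operating system and the libcurl/SSL versions linked into R.
# Requires the 'curl' package to be installed.
sysname <- Sys.info()[["sysname"]]
cv <- curl::curl_version()

cat("OS:     ", sysname, "\n")
cat("libcurl:", cv$version, "\n")
cat("SSL:    ", cv$ssl_version, "\n")

# Heuristic (assumption): OpenSSL 3.x disables unsafe legacy renegotiation
# by default, so Linux sessions linked against it are the likely candidates
# for the failure described above.
likely_affected <- identical(sysname, "Linux") &&
  grepl("OpenSSL/3", cv$ssl_version, fixed = TRUE)

if (likely_affected) {
  message("OpenSSL 3.x detected on Linux: requests to EPA servers may fail.")
}
```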

Examples

# \donttest{
# Example usage of the function:
extr_comptox(ids = c("Aspirin", "50-00-0"))
#>  Sending request to CompTox...
#> Request succeeded with status code: 202
#>  Getting info from CompTox...
#> Request succeeded with status code: 200
#> $comptox_cover_sheet
#> # A tibble: 4 × 2
#>   `Search datestamp` `2024-12-04 14:20:46`
#>   <chr>                              <dbl>
#> 1 Search term count                      2
#> 2 Found count                            2
#> 3 Not found count                        0
#> 4 Duplicate count                        0
#> 
#> $comptox_main_data
#> # A tibble: 2 × 64
#>   INPUT   FOUND_BY      PREFERRED_NAME DTXCID   CASRN INCHIKEY IUPAC_NAME SMILES
#>   <chr>   <chr>         <chr>          <chr>    <chr> <chr>    <chr>      <chr> 
#> 1 Aspirin Approved Name Aspirin        DTXCID5… 50-7… BSYNRYM… 2-(Acetyl… CC(=O…
#> 2 50-00-0 CASRN         Formaldehyde   DTXCID3… 50-0… WSFSSNU… Formaldeh… C=O   
#> # ℹ 56 more variables: INCHI_STRING <chr>, MS_READY_SMILES <chr>,
#> #   QSAR_READY_SMILES <chr>, MOLECULAR_FORMULA <chr>, AVERAGE_MASS <dbl>,
#> #   MONOISOTOPIC_MASS <dbl>, QC_LEVEL <dbl>, SAFETY_DATA <chr>, EXPOCAST <chr>,
#> #   DATA_SOURCES <dbl>, TOXVAL_DATA <chr>, NUMBER_OF_PUBMED_ARTICLES <dbl>,
#> #   PUBCHEM_DATA_SOURCES <dbl>, CPDAT_COUNT <dbl>, IRIS_LINK <chr>,
#> #   PPRTV_LINK <lgl>, WIKIPEDIA_ARTICLE <chr>, QC_NOTES <chr>,
#> #   TOXPRINT_FINGERPRINT <chr>, ACTOR_REPORT <chr>, …
#> 
#> $comptox_abstract_sifter
#> # A tibble: 2 × 3
#>   DSSTOX_LINK_TO_DASHBOARD PREFERRED_NAME `CHEMICAL/ENTITY_QUERY`
#>   <chr>                    <chr>          <chr>                  
#> 1 DTXSID5020108            Aspirin        50-78-2 OR Aspirin     
#> 2 DTXSID7020637            Formaldehyde   50-00-0 OR Formaldehyde
#> 
#> $comptox_synonym_identifier
#> # A tibble: 2 × 3
#>   SEARCHED_CHEMICAL IDENTIFIER                                        `PC-CODES`
#>   <chr>             <chr>                                             <chr>     
#> 1 Aspirin           Synonym data is too big for the Excel cell - Ref… PC-129061 
#> 2 Formaldehyde      NSC 298885|UN 2209|Formalin 40|Superlysoform|For… PC-043001 
#> 
#> $comptox_related_relationships
#> # A tibble: 60 × 7
#>    INPUT   DTXSID        PREFERRED_NAME HAS_RELATIONSHIP_WITH  RELATED_DTXSID 
#>    <chr>   <chr>         <chr>          <chr>                  <chr>          
#>  1 Aspirin DTXSID5020108 Aspirin        Searched Chemical      DTXSID5020108  
#>  2 Aspirin DTXSID5020108 Aspirin        Predecessor: Component DTXSID0020109  
#>  3 Aspirin DTXSID5020108 Aspirin        Predecessor: Component DTXSID701336718
#>  4 Aspirin DTXSID5020108 Aspirin        Transformation Product DTXSID5021708  
#>  5 50-00-0 DTXSID7020637 Formaldehyde   Searched Chemical      DTXSID7020637  
#>  6 50-00-0 DTXSID7020637 Formaldehyde   Predecessor: Component DTXSID6029709  
#>  7 50-00-0 DTXSID7020637 Formaldehyde   Predecessor: Component DTXSID6029757  
#>  8 50-00-0 DTXSID7020637 Formaldehyde   Predecessor: Component DTXSID60873853 
#>  9 50-00-0 DTXSID7020637 Formaldehyde   Predecessor: Component DTXSID60905168 
#> 10 50-00-0 DTXSID7020637 Formaldehyde   Predecessor: Component DTXSID6094144  
#> # ℹ 50 more rows
#> # ℹ 2 more variables: RELATED_PREFERRED_NAME <chr>, RELATED_CASRN <chr>
#> 
#> $comptox_toxcast_assays_ac50
#> # A tibble: 1,485 × 3
#>    INPUT                            50-00-0_DTXSID702063…¹ ASPIRIN_DTXSID5020108
#>    <chr>                            <chr>                  <chr>                
#>  1 ACEA_AR_agonist_80hr             -                      1000000.0            
#>  2 ACEA_AR_agonist_AUC_viability    -                      1000000.0            
#>  3 ACEA_AR_antagonist_80hr          -                      1000000.0            
#>  4 ACEA_AR_antagonist_AUC_viability -                      1000000.0            
#>  5 ACEA_ER_80hr                     -                      1000000.0            
#>  6 ACEA_ER_AUC_viability            -                      1000000.0            
#>  7 APR_HepG2_CellCycleArrest_1hr    -                      -                    
#>  8 APR_HepG2_CellCycleArrest_24hr   -                      1000000.0            
#>  9 APR_HepG2_CellCycleArrest_72hr   -                      1000000.0            
#> 10 APR_HepG2_CellLoss_1hr           -                      -                    
#> # ℹ 1,475 more rows
#> # ℹ abbreviated name: ¹​`50-00-0_DTXSID7020637`
#> 
#> $comptox_toxval_details
#> # A tibble: 158 × 63
#>    SEARCHED_CHEMICAL DTXSID        CASRN   NAME    SOURCE SUB_SOURCE TOXVAL_TYPE
#>    <chr>             <chr>         <chr>   <chr>   <chr>  <chr>      <chr>      
#>  1 Aspirin           DTXSID5020108 50-78-2 Aspirin NLM C… -          LD50       
#>  2 Aspirin           DTXSID5020108 50-78-2 Aspirin NLM C… -          LD50       
#>  3 Aspirin           DTXSID5020108 50-78-2 Aspirin NLM C… -          LD50       
#>  4 Aspirin           DTXSID5020108 50-78-2 Aspirin GESTI… -          DNEL syste…
#>  5 Aspirin           DTXSID5020108 50-78-2 Aspirin DOD M… TLVadj     MEG        
#>  6 Aspirin           DTXSID5020108 50-78-2 Aspirin DOD M… TLV_TWA    MEG        
#>  7 Aspirin           DTXSID5020108 50-78-2 Aspirin DOD M… TLV_TWA    MEG        
#>  8 Aspirin           DTXSID5020108 50-78-2 Aspirin ECHA … Developme… NOAEL      
#>  9 Aspirin           DTXSID5020108 50-78-2 Aspirin ECHA … Developme… NOAEL      
#> 10 Aspirin           DTXSID5020108 50-78-2 Aspirin EPA E… EPA ORD    LOEL       
#> # ℹ 148 more rows
#> # ℹ 56 more variables: TOXVAL_SUBTYPE <chr>, TOXVAL_TYPE_SUPERCATEGORY <chr>,
#> #   QUALIFIER <chr>, TOXVAL_NUMERIC <dbl>, TOXVAL_UNITS <chr>,
#> #   RISK_ASSESSMENT_CLASS <chr>, STUDY_TYPE <chr>, STUDY_DURATION_CLASS <chr>,
#> #   STUDY_DURATION_VALUE <dbl>, STUDY_DURATION_UNITS <chr>,
#> #   SPECIES_COMMON <chr>, STRAIN <chr>, LATIN_NAME <chr>,
#> #   SPECIES_SUPERCATEGORY <chr>, SEX <chr>, GENERATION <chr>, …
#> 
#> $comptox_chemical_properties
#> # A tibble: 101 × 8
#>    DTXSID        DTXCID      TYPE         NAME    VALUE UNITS SOURCE DESCRIPTION
#>    <chr>         <chr>       <chr>        <chr>   <chr> <chr> <chr>  <chr>      
#>  1 DTXSID5020108 DTXCID50108 predicted    Flash … 131.… °C    ACD/L… "ACD/Labs …
#>  2 DTXSID5020108 DTXCID50108 experimental Meltin… 138.0 °C    Alfa … "Alfa Aesa…
#>  3 DTXSID5020108 DTXCID50108 experimental Meltin… 136.5 °C    Alfa … "Alfa Aesa…
#>  4 DTXSID5020108 DTXCID50108 experimental Boilin… 140.0 °C    NIOSH  "The NIOSH…
#>  5 DTXSID5020108 DTXCID50108 experimental Boilin… 140.0 °C    Oxfor… "Until 201…
#>  6 DTXSID5020108 DTXCID50108 experimental Meltin… 122.0 °C    MolMa… "MolMall p…
#>  7 DTXSID5020108 DTXCID50108 experimental Meltin… 134.0 °C    Tokyo… "Tokyo Che…
#>  8 DTXSID5020108 DTXCID50108 experimental Meltin… 135.0 °C    Jean-… "Jean-Clau…
#>  9 DTXSID5020108 DTXCID50108 experimental Meltin… 135.0 °C    PhysP… "The PHYSP…
#> 10 DTXSID5020108 DTXCID50108 experimental Meltin… 136.0 °C    LKT L… "LKT Labor…
#> # ℹ 91 more rows
#> 
# }