Computer-aided drug design (CADD)

The application of computers and computational methods to the drug design and discovery process is referred to as computer-aided drug design (CADD). It is beneficial in hit-to-lead discovery and lead optimization, and it drastically reduces the time and cost involved in discovering a new drug. Modeling the three-dimensional structures of ligands and proteins, running simulations, and predicting binding interactions and energies are challenging tasks in drug design. Most molecular modeling methods are based on either molecular mechanics or quantum mechanics; both generate equations for calculating the total energy of the system, but they differ in some fundamental aspects.

 

 

Broadly, CADD methods are classified into two categories: structure-based drug design (SBDD) and ligand-based drug design (LBDD).

 

Structure-based drug design (SBDD)

A high-resolution protein structure, or a carefully prepared homology model of the protein, is a vital requirement of structure-based design. Protein structural information such as binding sites, cavities, and secondary binding sites is highly useful for discovering small-molecule binding agents that can modulate biological activity, and it is required for ascertaining the molecular interactions of a ligand within the binding cavity. The main goal of SBDD is to design and discover ligand molecules with high binding affinity and features complementary to the target. SBDD includes docking, molecular dynamics, and pharmacophore modeling.

 

 List of drugs identified through computational approaches

 

Target Identification

Drug target identification and validation is the initial step of the drug discovery process. A drug target is a macromolecule with an established function in the pathophysiology of a disease. Four major classes of drug targets are found in organisms: proteins (including receptors and enzymes), nucleic acids (DNA and RNA), carbohydrates, and lipids. The majority of drugs on the market are directed at protein targets. However, with the decoding of the genomes of several pathogens, nucleic acids could gain importance as drug targets in the future. Selecting potential drug targets from thousands of candidate macromolecules is a challenging task. In the post-genomic era, genomics and proteomics approaches are the most important tools for target identification. In addition, advances in high-throughput omics technologies have generated a huge amount of data on host-pathogen interactions. These data are integrated and analyzed by the scientific community through network and systems biology approaches to accelerate target identification in drug discovery programs.

 

Drug Targets

The term drug target describes a native biomolecule in the human body whose function can be modulated by a drug molecule, producing either a therapeutic effect against a disease or an adverse effect. Most drug targets are biological macromolecules. The protein targets currently exploited by available drugs mostly belong to one of four major drug target protein classes (Table); in some cases, nucleic acids are also utilized by drugs as targets.

 

 

Details of frequently used drug target protein classes

 

 

http://hit2.badd-cao.net/

 

Databases for therapeutic target information

 

 

Drug Target Identification

After identifying the biological nature and origin of a disease, identification of potential drug targets is the first step in the discovery of a drug. Drug target identification follows the hypothesis that the most promising targets are tightly linked to the disease of interest and have an established function in the underlying pathology, which can be observed with high frequency in the disease-associated population. By definition, potential drug targets need not be involved in the disease-causing process or be responsible for the disease, but they must be disease modifying.

 

Currently, various strategies for drug target identification are in practice, based on either experimental or computational approaches. Experimental approaches are mainly based on comparative genomics (expression profiling), supplemented with phenotype and genetic association analyses. Experimental approaches generally provide reliable results and, in theory, should be the first-choice methods for target identification. Even though they are more precise, they suffer from practical limitations: relatively high costs and the intensive scientific labor required for experimental profiling of the full target space (>20,000 proteins and nucleic acids) against chemical compounds, and they often end with only a few drug targets in hand. Because of these limitations, scientists and pharmaceutical companies mostly use computational methods for first-line research and then apply experimental approaches for further validation and other purposes.

 

Computational Approaches for Drug Target Identification

The development of bioinformatics has produced various resources, including databases, algorithms, and software, that push CADD forward in every aspect of the drug design process (Table). One of the most important contributions is computational drug target identification; as discussed earlier, identification of the drug target is a crucial and decisive step of the drug design process. Over the last one and a half decades, various scientific studies have been carried out with the aim of identifying drug targets with the help of bioinformatics resources, and several approaches have been proposed. These approaches handle huge amounts of genomics, transcriptomics, and proteomics data, process them efficiently, and deliver potential drug targets in a short period at low cost.

 

Currently, several computational approaches are available that utilize different types of molecular information, i.e., gene and genome sequences, molecular interaction information, and protein 3D structures. Most of these approaches are interlinked. Still, based on their underlying concept, they can be broadly classified into two types:

 

(1) homology-based approaches and

(2) network-based approaches.

 

The major features which are checked for drug target prediction are listed in Table.

 

Important features utilized in drug target identifications

 

 

 

Homology-Based Approaches

Homology-based approaches utilize sequence similarities among genes and proteins and, based on the predicted homology, make decisions much like a decision tree analysis. These methods typically apply several levels of homology tests in a top-down manner. Each level scales down the data, starting from the complete gene set or proteome, and step by step either eliminates entries classified as inappropriate or retains only those classified as appropriate. Homology-based approaches therefore end with a manageable number of potential drug targets (Fig.), and because of this scale-down nature they are also known as subtractive (genomic or proteomic) approaches.

 

The terms inappropriate and appropriate are conditional; they are tested against various biological conditions that play a decisive role in target selection. The following are the major conditional tests that help decide whether a molecule is considered further for drug target identification.

Bioinformatics resources for drug target identification and CADD

 

 

Schematic diagram of the standard flowchart for drug target identification using homology-based approach

 

 

Human Homologs

Humans possess many genes, a subset of which play indispensable biological roles and are considered housekeeping genes. Using a human housekeeping gene, or a pathogen gene homologous to one, as a drug target could create lethal conditions and harm the patient. To avoid such accidental targeting of housekeeping or other important pathway-related genes, the genes of the microbial pathogen are generally compared against the human genome, and those showing significant similarity to human housekeeping or other crucial genes are considered inappropriate and are mostly eliminated from the rest of the process.

 

Human-Microbiome Homologs

The human body, especially the gut, harbors many microbes, which have been cataloged by the Human Microbiome Project. Most of these microbes are involved in biological processes that are beneficial for humans and are thus considered beneficial microbes. Using homologs of proteins from these beneficial microbes as drug targets can harm these bacteria and thereby affect related processes in the human host, e.g., digestion and respiration. For this reason, human-microbiome homologs are considered inappropriate and are eliminated from further processing.

 

Essentiality

Identification of drug targets against a microbial pathogen assumes that essentiality of the target protein for the pathogen is an advantageous and appropriate feature. Without the function of its essential proteins, a microbial pathogen will not be able to survive. Various essential genes and proteins have been identified by experimental approaches and listed in databases; the Database of Essential Genes (DEG) is one of the most active, providing a collection of essential gene and protein sequences. Based on this concept, pathogenic genes/proteins that show homology with essential genes/proteins are considered appropriate and are included in the further process.

 

Virulence Factor Homologs

Proteins whose role in virulence and pathogenicity has been demonstrated experimentally are considered virulence factors. Many such proteins are known, especially for microbes, and their molecular information is stored in databases such as the virulence factor database (VFDB) and the database of fungal virulence factors (DFVF). Genes/proteins of a pathogen that show homology with these virulence factors can be considered appropriate and utilized as potential drug targets.

 

Drug Target Homologs

Information about known and explored drug/therapeutic targets is available, e.g., in the therapeutic target database (TTD). Homology mining against TTD is in practice, and candidate molecules that show significant homology with these known targets are considered appropriate and included for further exploration.

 

Cellular Location

The cellular location of the target protein is a very important feature and plays a crucial role in target selection. In homology-based approaches, sequence-based gene ontology (GO) annotation is used to examine the subcellular location along with the cellular component, biological process, and molecular function. Generally, targets that are easily accessible are preferred over others.

 

Role in the Biological Pathway

Biological pathways are responsible for the synthesis or metabolism of various bio-products. Some of these pathways are very important and unique, being solely responsible for their processes and products; blocking them creates a scarcity of those products and ultimately reduces the pathogen's chance of survival. Various pathway databases are available for such checks; the current literature shows that KEGG is one of the richest and most frequently used pathway databases for this purpose. Pathways that are unique to the pathogen are considered appropriate, and the genes/proteins involved in them are carried forward. In contrast, pathways that are shared with the human host, and their genes/proteins, are considered inappropriate and excluded from further consideration.

 

It has been observed that homology-based approaches are very fast, cover almost the entire target space, and need only sequence information as input. Available reviews suggest that homology-based approaches are commonly used for microbial diseases and are generally restricted to them; their use for other types of infection or disease is not common practice.
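
The filtering logic described above can be summarized in a minimal sketch, assuming that each pathogen protein has already been searched (e.g., with BLAST) against the human proteome, the beneficial gut microbiome, and an essential-gene set such as DEG, and that the best e-value per search has been recorded. All identifiers, field names, and thresholds below are illustrative, not validated cutoffs.

```python
# Minimal sketch of a subtractive (homology-based) target filter.
def is_candidate_target(protein):
    # thresholds below are illustrative, not validated cutoffs
    non_human = protein['human_evalue'] > 1e-4            # no significant human homolog
    non_microbiome = protein['microbiome_evalue'] > 1e-4  # spares beneficial gut microbes
    essential = protein['deg_evalue'] < 1e-10             # homologous to an essential gene (DEG)
    unique_pathway = protein['pathway_unique_to_pathogen']
    return non_human and non_microbiome and essential and unique_pathway

proteome = [   # pre-computed best e-values per pathogen protein (hypothetical data)
    {'id': 'VC_A0001', 'human_evalue': 3e-2, 'microbiome_evalue': 0.5,
     'deg_evalue': 1e-35, 'pathway_unique_to_pathogen': True},
    {'id': 'VC_A0002', 'human_evalue': 1e-60, 'microbiome_evalue': 1e-40,
     'deg_evalue': 1e-12, 'pathway_unique_to_pathogen': False},
]

targets = [p['id'] for p in proteome if is_candidate_target(p)]
print(targets)   # -> ['VC_A0001']
```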

 

Subtractive approach for drug target identification

 

Case Study: Subtractive Approach for Drug Target Identification

The subtractive approach is one of the best-known approaches for target identification against various pathogens. In 2011, Katara et al. presented a subtractive approach exploiting knowledge of global gene expression along with sequence comparisons to efficiently predict potential drug targets in Vibrio cholerae, the cholera-causing bacterial pathogen. Their analysis was based on 155 experimentally proven virulence genes (seed information) (Fig.). For target identification, they used co-expression-based gene mining and a multilevel subtractive approach. In the end, they reported 36 gene products as drug targets; to check the reliability of the predicted targets, they also performed gene ontology analysis with Blast2GO, examining the targets for involvement in crucial biological processes and for their cellular location. They found all 36 gene products to be reliable and concluded that they are potential drug targets.

 

Network-Based Approaches

Network-based approaches examine the effects of drugs in the context of molecular networks (e.g., protein-protein interactions, gene networks, transcriptional regulatory networks, metabolic networks, and biochemical reaction networks). In molecular network models, molecules are represented as nodes, and each edge corresponds to an interaction between two molecules; depending on the direction and importance of the interaction, edges may also carry a direction and a weight (Fig.). Drug target identification through networks is based on the fact that networks contain important nodes that are vulnerable and can be targeted in many ways. These nodes are often crucial, and sometimes essential, for the whole network structure: inhibiting such nodes can reduce the network's efficiency, and damaging them can shut the network down entirely. Network inhibition follows one of two models:

(1) partial inhibition: only part of the interactions of the target node are knocked out, and

(2) complete inhibition: all interactions around a given target node are eliminated.

 

In the drug design process, these target nodes can be considered potential drug targets. Various molecular networks (Table), including protein-interaction, regulatory, metabolic, and signaling networks, can be subjected to such analysis individually or in integrated form.

 

Various components of a standard network

 

 

Clique: a fully connected subgraph, i.e., a group of nodes in which every pair is directly connected.

 

Types of the biological network for drug target identification

Centrality Based Drug Target

Network centrality can be used as a powerful tool for network-based target identification: proteins can be prioritized according to centrality measures (e.g., degree, closeness, betweenness), which characterize the importance of a protein in the biological system.

 

Hubs as Target

Real-world networks almost always show a scale-free degree distribution, meaning that some nodes have a tremendous number of connections to other nodes (high degree), whereas most nodes have just a few. Nodes with a greater-than-average number of connections are called hubs. It is assumed that the functionality of such scale-free networks depends heavily on these hubs; if the hubs are selectively targeted, information transfer through the network is hindered and the network collapses.

 

Betweenness Centrality Based Target

Hubs are centers of local network topology and thus provide only a local picture of the network. Betweenness centrality is another measure of network centrality; unlike hubs, it identifies the central elements of the global topology and thus provides a global picture of network connections. Conceptually, the betweenness of a node is the number of times it lies on the shortest paths between pairs of other nodes (Fig.); the higher the betweenness, the more important the node is for rapid network communication. Nodes with high betweenness centrality can be utilized as potential drug targets.
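
The contrast between the two measures can be illustrated with a small sketch, assuming the networkx Python library is available; the toy interaction network below is invented purely for illustration and does not represent real proteins.

```python
import networkx as nx

# Toy protein-interaction network; edges are illustrative only
G = nx.Graph()
G.add_edges_from([
    ('A', 'B'), ('A', 'C'), ('A', 'D'), ('A', 'E'),   # A behaves like a hub
    ('E', 'F'), ('F', 'G'), ('F', 'H'),               # F bridges two regions
    ('G', 'H'),
])

degree = dict(G.degree())                      # local importance (hubs)
betweenness = nx.betweenness_centrality(G)     # global importance (bridges)

for node in sorted(G.nodes, key=lambda n: betweenness[n], reverse=True):
    print(f'{node}: degree={degree[node]}, betweenness={betweenness[node]:.2f}')
```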

 

Mesoscopic Centrality Based Target

Considering the advantages of both local and global centers of network topology for drug target identification, a third class of centrality, called mesoscopic centrality, has also been reported. Mesoscopic centrality is based neither purely on local information (such as hubs) nor purely on global information (such as betweenness centrality) about the network structure. It mainly considers long-range connections between high-degree nodes, which have a profound effect on small-world networks.

 

Weight-Based Drug Target

Recently, weighted-directed networks have also been reported for drug target identification studies. A weighted-directed network is closer to the real cellular scenario, where protein-protein interactions are characterized by their affinity and dominance (link weight) as well as their direction (e.g., in signaling), as shown in Fig. It has been observed that deleting the links with the highest weighted centralities is often more disruptive to network behavior than removing the most central links in the corresponding unweighted network topology.

 

Exploiting the complex structural information of real-world networks to measure centrality is not an easy task and requires sophisticated methods. Bioinformatics provides various tools to support network construction, visualization, and network-based analysis, e.g., of weights, centralities, and interaction directions (Table).

 

Molecular networks with different types of connectivity between nodes: (A) undirected, (B) directed, (C) weighted, and (D) weighted directed

 

 

Tools supporting molecular network analysis for drug target identification

 

Properties of an Ideal Drug Target

Identification of potential drug targets is not the last step. Through various computational approaches, a huge number of probable targets have been reported against different diseases and are available in databases and the literature. It is not advisable to recommend them directly for testing; it is recommended that they first be checked against the properties of an ideal target (Table) and then for druggability. Only those targets that fulfill most of these properties are considered ideal drug targets and recommended for further validation and testing.

 

Important properties to assess the ideal drug targets

 

Chemical databases

Chemical Information

Chemical databases are considered a powerful tool in drug design and discovery. Requirement-based searches of such databases can find molecules with the desired biological activity that may be appropriate candidates for further analysis. Some important resource databases for chemical information are summarized in Table.

 

Hit Identification

In the early stages of drug discovery, researchers test thousands of natural products or plant extracts and small molecules, looking for a potential molecule to develop into a drug. If they find a potential hit candidate, they move it to the next stage of hit-to-lead optimization. The first stage of any significant drug-finding project is therefore hit identification. In this practice, hits, i.e., small chemical compounds that bind to the protein and modify its function, are identified. Hits should ideally show some degree of specificity and potency against the target. A compound that shows activity against the desired target when tested in a suitable assay is named a hit: a chemical entity with a defined structure and reproducible activity data. Hit identification involves screening a wide range of small molecules in an in vitro assay; molecules with in vitro potency (IC50) below about 20 µM are typically considered hits. The hit validation process confirms the activity of the molecules in vivo and generates potential lead compounds. Hits free from potential toxicity are called drug-like compounds.
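
As a toy illustration of the potency cutoff mentioned above, the short sketch below flags compounds whose in vitro IC50 falls below a 20 µM threshold as hits; the compound identifiers and values are hypothetical.

```python
IC50_CUTOFF_UM = 20.0      # potency cutoff in µM, as mentioned in the text

assay_results = {          # compound id -> measured IC50 in µM (hypothetical data)
    'CPD-001': 2.4,
    'CPD-002': 85.0,
    'CPD-003': 14.9,
    'CPD-004': 210.0,
}

hits = [cpd for cpd, ic50 in assay_results.items() if ic50 < IC50_CUTOFF_UM]
print(hits)   # -> ['CPD-001', 'CPD-003']
```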

 

(A) Application of computer-aided drug design (CADD) techniques in various drug discovery stages, (B) application of CADD-based virtual screening for lead identification.

Lead discovery and development

Lead discovery and development involves the identification of a synthetic or natural chemical molecule/peptide/antibody that specifically and efficiently binds to a drug target and thereby modulates its biological function. Lead molecules (also referred to as hits) can be considered prototypes from which drug molecules are developed. The initial step in lead discovery is to identify a starting molecule that shows reasonable biological activity toward the target protein. Both experimental and computational methods are widely used for lead identification and optimization. Extensively used experimental methods for hit identification include high-throughput screening (HTS), combinatorial library screening, knowledge-based screening, and fragment-based screening. Alternatively, computational methods, commonly known as CADD techniques, such as virtual screening (VS), have emerged as powerful techniques for hit identification. In VS experiments, a molecular library consisting of millions of chemical molecules is screened computationally (in silico) in a short time; compounds predicted to be active/positive are then subjected to further biological testing, whereas compounds filtered out as inactive/negative are skipped. This VS strategy for lead identification significantly reduces cost and workload compared with HTS.

Combinatorial chemistry

Combinatorial chemistry comprises methodologies in which large chemical libraries are prepared by a single process and tested for biological activity. The technology works on the principle of producing a large number of chemical compounds by a defined procedure and assessing them for biological action. The advantage of these techniques is the reduction in time and money, which accelerates the drug development process. Combinatorial chemistry approaches are appreciated and widely accepted by the pharmaceutical sector for drug design and screening. They are generally divided into two broad classes:

 

Unbiased library

This is also well known as a “random library,” in which the library is designed from synthetic considerations without regard to a particular molecular target. The synthesized library compounds, having diverse chemical structures, are then used for hit identification against the target of interest.

 

Biased library

In the biased library approach, selected building blocks and scaffolds are utilized to generate the library. This technique is limited to particular building blocks and is focused on a selective target for biological activity.

 

Solid-phase synthesis

Solid-phase synthesis involves the synthesis of compounds on functionalized solid supports, typically polymeric beads and insoluble resin-type materials. Linker groups are attached to the resin-bead solid support, and the starting materials or reactants are then passed over the solid support to form intermediates. Libraries of compounds are synthesized using split-and-mix techniques, as in the synthesis of biological oligomers and polymeric peptides. Finally, the product is obtained by detachment from the linker/solid support, followed by washing and purification.

 

Advantages

·       Solid support provides easy purification.

·       Excess reagent can be used, providing more yield.

 

Disadvantages

·       Limited number of reactions

·       More reaction time required

·       More reagents required

·       Monitoring is difficult

·       Expensive

 

Solution-phase synthesis

Solution-phase synthesis is an alternative method for synthesizing large numbers of lead molecules. Its main disadvantage lies in the purification steps needed to obtain the desired product from the reaction mixture. Therefore, to overcome these difficulties, further technological advances have been made in purification methods so that the desired product can be obtained in good quantity and yield.

 

The techniques used for purification involve the following:

·       Polymer-supported reagents and scavengers

·       Liquid-liquid extraction techniques

·       Chromatography techniques.

 

Despite the purification problem in the solution-phase synthesis, it is considered an important tool in combinatorial chemistry for library synthesis.

 

Compound library design

Compound library design usually refers to the generation of a list of structures to be synthesized through combinatorial synthesis. There are a number of approaches that drug designers can utilize in performing this task. Designers may get fairly deep into the experimental plate design, or they may design a library as a collection of compounds that researchers would like to test, without regard for the synthesis route. There are a number of software tools for aiding library design, with a correspondingly diverse range of functionality.

 

Targeted libraries versus diverse libraries

One issue to be considered is whether the library is to be narrowly focused (a targeted library) or very diverse. In the earlier stages of a design project, diverse libraries will often be used in order to explore a wide range of chemistries. Later, strongly focused groups of compounds (possibly differing only by a single functional group) will be synthesized and tested. In general, it is easier to design a narrowly targeted library. This is done to explore possible derivatives of a known structure, usually for the purpose of increasing activity. It is typically done by first selecting a backbone structure (often a fused ring system) and then selecting synthons to be used to create derivatives of that structure. The designer can identify a point on the molecule to be altered and select a group of bioisosteric functional groups to place at that point. The term “synthon” refers to a functional group to be added at a particular point; synthons are typically molecular fragments with unfilled valence positions, not reagents to be used in a synthetic reaction.

 

Focused libraries are also easier to synthesize. It is often possible to use the same chemical reaction for all of the compounds, with just one reagent substituted. Designing diverse chemical libraries tends to be a more difficult task.

 

Researchers must contend with some rather difficult questions of chemistry:

·       How diverse should the library be?

·       Are there any two compounds that are too similar?

·       Are there gaps in the chemistries represented, where an additional compound with those specifications should be included?

·       Does the library span the space of known chemistries, or known drug-like chemistries?

Often, a diverse library is created by selecting compounds from a list of compounds already synthesized and available either in inventory or from commercial sources. This is more cost-effective than trying to synthesize a very diverse set of compounds for each drug design project. Late in the drug design process, focused libraries are made for the purpose of improving bioavailability, half-life in the bloodstream, and toxicity. Figure shows an example of a library design tool specifically for this purpose.

 

The ACD/Structure Designer program makes modifications specifically for the purpose of improving the pharmacokinetic properties of an active lead.

 

From fragments versus from reactions

Most library design tools work by allowing the user to suggest the functional group lists; the tool can then generate structures for the compounds that would be created from those lists. When examining the “nuts and bolts” of how library design tools work, there are two different approaches for defining the reagent lists: the fragment approach and the reaction approach. In the fragment approach, backbones and side chains are defined with an open valence location defined as a dummy atom, instead of a hydrogen. This fragment is called a synthon. The program can then generate a list of product compounds by connecting the functional groups in the list to the backbones, knowing that the dummy atom is the point of connection. The advantages of this approach are that the researcher can stay focused on the resulting set of molecules and that there are no ambiguities in how the researcher intends to connect the pieces together. The disadvantage is that the design process is rather disconnected from the synthesis process. Some programs can generate fragment lists automatically from a list of compounds, and some require that every fragment be edited by hand to define the connection points. Figure shows an example of a fragment-based library design program.

 

Class Pharmer from Simulations Plus is a library design tool that uses a fragment based approach.

 

The alternative is to work with a piece of software that allows the researcher to define a chemical reaction and the list of reagents. The advantage of this is that it is closer to the synthesis route, so the results do not often come up with compounds that cannot practically be synthesized with the intended reaction. This can bring to light ambiguities in the synthesis when there are multiple functional groups, thus making it possible to create several different products from the chosen reactants. Regardless of which type of library creation tool is used, the majority of the work at the library design stage is in the process of selecting the backbones and synthons.
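
A minimal sketch of the reaction-based enumeration idea follows, using RDKit's reaction SMARTS machinery with a generic amide-coupling transformation; the acid and amine SMILES are placeholder reagents, and no attempt is made to model real synthetic feasibility or reagent compatibility.

```python
# Sketch of reaction-based library enumeration with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

# amide coupling: carboxylic acid + amine -> amide
rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')

acids = [Chem.MolFromSmiles(s) for s in ('CC(=O)O', 'OC(=O)c1ccccc1')]
amines = [Chem.MolFromSmiles(s) for s in ('NCCO', 'NC1CCCCC1')]

library = set()
for acid in acids:
    for amine in amines:
        for products in rxn.RunReactants((acid, amine)):
            product = products[0]
            Chem.SanitizeMol(product)
            library.add(Chem.MolToSmiles(product))

for smiles in sorted(library):
    print(smiles)
```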

 

Non-enumerative techniques

Most library design tools are enumerative. This means that the functional group lists are used to generate structures, in the computer's memory, for every compound that could be synthesized in the library. The entire list of structures can then be fed into various types of prediction software. Enumerative techniques work well for designing small to moderate-size compound libraries, containing up to thousands or tens of thousands of compounds.

 

Non-enumerative techniques are useful for manipulating very large library designs. The backbone and functional group lists are generated, just as they are in an enumerative algorithm. However, in a non-enumerative algorithm, the structures for individual compounds are never generated. Group additivity methods are used to compute properties for the library, such as minimum, maximum, and estimated average molecular properties. Only numbers that describe the entire library are generated in this way; values for individual compounds are never computed.
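
The arithmetic behind such a group-additivity estimate can be sketched in a few lines; the scaffold and synthon molecular-weight contributions below are illustrative numbers only, and no product structure is ever built.

```python
# Group-additivity estimate of library property ranges without enumeration.
scaffold_mw = 160.2
r1_mw = [15.0, 43.1, 77.1, 91.1]     # candidate synthons at position R1
r2_mw = [17.0, 31.0, 45.1]           # candidate synthons at position R2

library_size = len(r1_mw) * len(r2_mw)                 # 12 virtual products
min_mw = scaffold_mw + min(r1_mw) + min(r2_mw)
max_mw = scaffold_mw + max(r1_mw) + max(r2_mw)
avg_mw = scaffold_mw + sum(r1_mw) / len(r1_mw) + sum(r2_mw) / len(r2_mw)

print(library_size, round(min_mw, 1), round(max_mw, 1), round(avg_mw, 1))
```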

 

The advantage of non-enumerative methods is that they can be used to explore large, diverse chemical spaces representing billions of compounds, in a way that would not be practical, or even possible, using other techniques. This is a different type of software to work with. Users must get used to thinking in terms of a large theoretical chemical space, instead of being able to see individual molecular structures.

 

Virtual screening

The virtual screening method is based on a comparative analysis between different leads (candidates for new drugs), using computational resources, and generally rests on the premise that a drug's action is directly related to its affinity for a biological target. In this context, docking is the most widely used technique; it models the interaction between a molecule and a biological target and calculates the interaction energy between them. The calculations often employ molecular dynamics, making it possible to obtain interaction free energies and, from there, inhibition constants (an experimental parameter that can be verified). There are online resources that enable virtual screening, often comparing interaction energies with biological activity data. The use of artificial intelligence techniques has been fundamental in systematizing these studies. Virtual tissue and organ models are an elegant extension of this approach, being able to anticipate even physiological and neural phenomena.

 

Virtual screening techniques

Drug discovery programs have been considered challenging and slow processes with a high failure rate. To reduce the burden of cost and time, pressure has been mounting on researchers to identify and discard unsuitable drug candidates in the early discovery phases. Although high-throughput screening coupled with combinatorial synthesis has been at the forefront of hit-to-lead identification and has saved a substantial amount of time, it still suffers from drawbacks such as consuming valuable resources and time. Over the past two decades, virtual screening has developed rapidly to make the drug discovery process faster, cheaper, and more reliable. Virtual screening is an in silico method that uses various scoring and ranking functions to screen large databases of existing or yet-to-be-synthesized chemical structures against a specific biological target. The concept of virtual screening developed from the pioneering works of Kuntz et al. and Desjarlais et al.; however, Horvath coined the term “virtual screening” in his paper on trypanothione reductase inhibitors. This led to a new concept in computational drug design for searching for bioactive agents in compound databases: agents are screened based on structural parameters predicted to be complementary to a specific molecular target or enzyme. Advances in computer hardware and algorithms have led to the progress and widespread use of virtual screening in the drug discovery process. The significance of virtual screening, assisted by other in silico tools, can be appreciated from the fact that more than 50 drug candidates have received the green light to proceed through clinical trials, and some of them have been approved for clinical use.

 

Chemical space

Chemical space is the hypothetical space containing all possible chemical structures, which probably ranges from 10^18 to 10^180 molecules. Analyzing this near-infinite chemical space in search of bioactive compounds is like looking for a needle in a haystack. As drug candidates come from diverse sources (synthetic, natural, marine, peptides, microorganisms, etc.), specific regions relevant to a biological target must be searched rather than the entire chemical space. Hence, applying biological activity as a filter enables virtual screening to find the biologically active regions of chemical space. Several bioactive compounds have failed to become successful drug candidates because of unfavorable physicochemical properties, which adversely affect their absorption, distribution, metabolism, and elimination (ADME). This led to the development of drug-likeness and ADME parameters, which have facilitated the virtual screening procedure and improved its success rate in finding the active medicinal space.
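
One common, simple way of applying such a drug-likeness filter is Lipinski's rule of five; a hedged RDKit sketch is shown below, with placeholder SMILES standing in for a compound library.

```python
# Simple drug-likeness filter (Lipinski's rule of five) with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(mol):
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

library = ['CC(=O)Oc1ccccc1C(=O)O',        # aspirin-like
           'CCCCCCCCCCCCCCCCCC(=O)O',      # long fatty acid
           'CN1CCC[C@H]1c1cccnc1']         # nicotine-like

for smiles in library:
    mol = Chem.MolFromSmiles(smiles)
    print(smiles, passes_rule_of_five(mol))
```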

 

Database

An essential part of virtual screening is the preparation of a compound database, in which compounds can be stored in 2D (SMILES) or 3D chemical structure formats (MDL SD, Sybyl mol2, CML, PDB, XYZ). Another open-source format, developed by IUPAC, is the International Chemical Identifier (InChI), which encodes chemical structures and can identify different protomeric and tautomeric states. These structures are usually annotated with other information such as molecular weight, synthetic source, amount available, stereochemistry, tautomer, conformers, and protonation state. Such structural data, together with various physicochemical and biological properties, help in screening a database by removing undesirable compounds, which in turn enriches the database with desirable ones. Researchers have shown keen interest in annotated compound databases, which contain information on both the chemical structure and its possible biological activity; these have gradually superseded traditional compound databases containing information on chemical structures only. Drug molecules collected from various sources have been broken into individual fragments using retrosynthetic principles and recombined in every possible way to create libraries of virtual compounds. Such virtual combinatorial libraries have a tremendous impact in extending the range of chemical space now available for screening. A few examples of databases (proteins, nucleic acids, complexes, and ligands) available in the public domain for free or commercial use are: AntiBase, BindingDB, BraMMT (Brazilian Malaria Molecular Targets), ChEMBL, ChemSpider, CMNPD, COlleCtion of Open NatUral producTs (COCONUT), DrugBank, DrugSpaceX, EDULISS, eMolecules, GOSTAR, MCDB, MDDR (MDL Drug Data Reports), MMsINC, OOMT (Our Own Molecular Targets), PubChem, and ZINC.
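
To illustrate the structure formats mentioned above, the short RDKit sketch below converts a single SMILES string (aspirin, used only as an example) into a mol block, an InChI, and an InChIKey; it assumes an RDKit build with InChI support.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = 'CC(=O)Oc1ccccc1C(=O)O'      # aspirin, example compound only
mol = Chem.MolFromSmiles(smiles)
AllChem.Compute2DCoords(mol)          # coordinates for the connection-table output

print(Chem.MolToMolBlock(mol))        # MDL mol/SD-style connection table
print(Chem.MolToInchi(mol))           # IUPAC InChI
print(Chem.MolToInchiKey(mol))        # hashed InChIKey
```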

 

Identification of a bioactive compound against a specific biological target with minimal adverse effects is the primary goal of a drug discovery program. However, promiscuous compounds, frequent hitters, and screening artifacts can overwhelm the truly active compounds, posing a great challenge for researchers. Pharmacologically promiscuous compounds, which act on multiple biological targets, often come out as successful hits in virtual screening but later prove to be non-drug-like: they tend to act through noncompetitive mechanisms and show poor selectivity and poorly defined structure-activity relationships.

 

When certain compounds interfere with the assay method and give false-positive results, they are called artifacts in virtual screening. Various filters are applied to identify and remove such nuisance compounds and improve the efficiency of virtual screening.

Classification

Depending on the knowledge of biological target structure, virtual screening techniques can be broadly categorized into two types: structure-/target-based virtual screening and ligand-based virtual screening.

 

Structure- or target-based virtual screening

Structure- or target-based virtual screening involves ranking ligands according to their affinity for the biological target, as evinced by the nature of the interactions formed in the ligand-receptor complex. Ligands may also be screened or categorized on the basis of their affinity toward different biological targets. Knowledge of the 3D structure of the receptor is essential; such structures are determined by X-ray crystallography, NMR spectroscopy, or electron microscopy, or are built by homology modeling.

 

The protein structure must be checked for structural disorder or missing residues which can be rectified. Identification of a binding site within the protein structure is another prerequisite. A binding site may be associated with metal ions or water molecules which play an important role in ligand binding, and this information is essential during setting up a virtual screening. A number of binding site detection algorithms are being used by various computational tools for the identification of binding sites inside the protein structure. Docking and receptor-based pharmacophore modeling are the two elementary methods for carrying out structure-based virtual screening.

 

Docking

Since the development of the molecular docking technique, it has been proposed as a filter in virtual screening. Docking is an in silico method that predicts the interactions and probable binding conformation of ligand molecules inside the binding cavity of a receptor structure. The prediction of interactions, together with low computational time and cost, makes docking a preferred method for executing virtual screening. A typical docking-based virtual screening consists of four steps: ligand setup/preparation, protein setup/preparation, docking, and post-docking analysis. The large compound database should be downsized before docking to remove false positives and unsuitable structures by applying filters such as 2D or 3D pharmacophoric features, drug-likeness properties, and chemical reactivity.
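
A minimal sketch of such a docking-based screening loop is shown below, assuming the AutoDock Vina executable is installed and that the receptor and ligands have already been prepared as PDBQT files; all paths, grid-box values, and settings are illustrative only.

```python
# Hedged sketch of a docking-based virtual screening loop using AutoDock Vina.
import glob
import subprocess

receptor = 'receptor.pdbqt'                      # prepared receptor (placeholder path)
box = ['--center_x', '10.0', '--center_y', '12.5', '--center_z', '-4.0',
       '--size_x', '20', '--size_y', '20', '--size_z', '20']

for ligand in glob.glob('ligands/*.pdbqt'):      # prepared ligand library (placeholder)
    out = ligand.replace('.pdbqt', '_docked.pdbqt')
    subprocess.run(['vina', '--receptor', receptor, '--ligand', ligand,
                    '--out', out, '--exhaustiveness', '8'] + box, check=True)
    # post-docking analysis (score ranking, pose inspection) would follow here
```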

 

Application of constraints in virtual screening

The various constraints applied in virtual screening can be divided into three classes. Covalent interaction-based constraints can filter out ligands having specific covalent interactions with the receptor. Conformational space-based constraints can screen ligands occupying a specific region in the conformational space of the binding site. Pharmacophoric or scaffold-based constraints may be applied either as pre-docking filters to remove unsuitable structures or as post-docking filters to select similar binding poses that satisfy the pharmacophoric criteria. Alternatively, shape- and similarity-based constraints and motif-based constraints may be utilized in virtual screening to ensure a similar binding mode across different ligands. A flowchart of docking-based virtual screening is presented in Fig.

 

Demerits

Docking methods cannot perform simulations with sufficient receptor flexibility or handle proteins with an induced-fit mechanism. Simulating the ligand and receptor in a polar medium, accounting for the effect of metal ions, and assigning the correct protonation state to atoms are likewise difficult to achieve with docking.

 

Receptor-based pharmacophore modelling

Structure- or receptor-based pharmacophore modeling can be applied in virtual screening to obtain new leads from a compound database. This method describes the molecular interactions within the binding site and has emerged as an alternative to docking-based virtual screening, overcoming some of the barriers observed with docking. Various pharmacophoric features are mapped onto the ligand structure to derive the structural information and interactions responsible for making a compound bioactive. This helps retrieve structures with diverse bioisosteric scaffolds, which are difficult to explore by systematic derivatization of known compounds. Several pharmacophore models can be aligned to create a common pharmacophore hypothesis, which can be used as a filter for screening the database to find hits with novel scaffolds; example tools include FLAP, GBPM, GRID, LigandScout, MOE (Molecular Operating Environment; http://www.chemcomp.com/), and Unity (Tripos; http://www.tripos.com/).

 

Ligand-based virtual screening

This method is used when the protein structure is not known. The structures of known active and inactive compounds are used as templates, based on which algorithms search for new compounds having structural similarity with the templates. Ligand-based virtual screening can be performed with the help of three methods: similarity search, ligand-based pharmacophore modeling, and machine learning methods.

 

Similarity search

The structure of a single active compound is the minimum information needed for similarity searching. The selection of appropriate descriptors is very important in carrying out similarity-search-based screening. A compound with known activity is taken as a template or reference, against which a library of compounds can be screened and ranked accordingly. This can be done using either a molecular alignment algorithm or a molecular descriptor/fingerprint algorithm.
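
A hedged RDKit sketch of a fingerprint-based similarity search follows: a known active (aspirin, chosen arbitrarily) serves as the reference, and hypothetical library compounds are ranked by the Tanimoto similarity of their Morgan fingerprints.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

reference = fingerprint('CC(=O)Oc1ccccc1C(=O)O')   # known active used as template

library = {
    'CPD-1': 'CC(=O)Oc1ccccc1C(=O)OC',   # close analog of the reference
    'CPD-2': 'c1ccccc1',                 # unrelated scaffold
    'CPD-3': 'OC(=O)c1ccccc1O',          # partially related scaffold
}

scores = {name: DataStructs.TanimotoSimilarity(reference, fingerprint(smi))
          for name, smi in library.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 2))
```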

Ligand-based pharmacophore modelling

When the receptor structure is not known, ligand-based pharmacophore modeling can be used for virtual screening. A single compound, or a group of compounds, with known activity against a specific target can be analyzed to identify different chemical features in its structure. Different conformations are generated for each molecule and then aligned to map the corresponding features. Two different approaches can be used for the generation of ligand-based pharmacophore models:

(i) A database with predefined conformations for each ligand can be used as a filter to speed up the screening, although this requires a huge storage facility to handle the large number of conformations; and

(ii) a single conformation of a known active compound can be used for generating different conformations, followed by aligning these conformations with the database structures to generate models, which can be used as a filter to screen the database. Although it does not need much storage facility, it is very slow.

 

Catalyst, Disco, and GASP are some of the programs that utilize the ligand-based pharmacophore method.

 

Machine learning method

Machine learning is an application of artificial intelligence in which a model is built on a set of experimentally determined actives and inactives. It can predict the activity of an unknown compound against a specific target and can distinguish active compounds from inactive ones. Regression models can be built using training set compounds, correlating activity with structural information. Machine learning methods also utilize information on inactive compounds to harvest structure-activity relationships from the dataset. The resulting models may be used as filters in screening large databases, and various predicted ADME and other properties can be used alongside them to downsize the hit list. Machine learning methods are of two types: unsupervised and supervised.

 

Unsupervised methods

These methods utilize descriptor information to relate the dataset structures to biological activity, helping to identify regions of the dataset containing predominantly active or inactive compounds. Since few parameters are used to build the models, overfitting is not a concern with these methods; examples include principal component analysis (PCA), K-means clustering, and self-organizing maps.

 

Supervised methods

A group of compounds, or a subset of the total dataset, with known actives and inactives is selected to form a training set for building a model. The remaining dataset compounds (the test set) are used to examine the predictive capacity of the model; this cross-validation is essential to avoid overfitting. Several models are built with different training and test set compositions, from which the single best cross-validated model is chosen. Examples include decision trees (recursive partitioning), K-nearest neighbors, artificial neural networks, and support vector machines.
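
A hedged sketch of such a supervised workflow is given below: Morgan fingerprints of a toy set of labeled compounds train a random-forest classifier, which is cross-validated and then used to score an unseen compound. The SMILES strings and activity labels are invented solely for illustration, and the sketch assumes RDKit and scikit-learn are available.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def features(smiles):
    # 1024-bit Morgan fingerprint as a numpy vector
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    arr = np.zeros((1024,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

train_smiles = ['CCO', 'CCN', 'CCCl', 'c1ccccc1O', 'c1ccccc1N',
                'CC(=O)O', 'CCCC', 'CCCCC', 'c1ccccc1', 'CCOCC']
train_labels = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]      # 1 = active, 0 = inactive (toy labels)

X = np.array([features(s) for s in train_smiles])
y = np.array(train_labels)

model = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=2).mean())    # rough cross-validated accuracy

model.fit(X, y)
print(model.predict([features('CCCO')]))            # screen an unseen compound
```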

 

Successful applications of virtual screening

A ligand-based virtual screening of approximately 718,000 commercially available compounds was carried out using three known glucocorticoid receptor antagonists as query structures. A 3D molecular similarity-based filter and clustering technique was used to downsize the database; subsequent lead identification found a compound with good activity (Ki = 16 nM). Further optimization led to the discovery of CORT118335, a phase II clinical candidate for the management of nonalcoholic steatohepatitis and schizophrenia. Virtual screening was also carried out on the AZ corporate database of about 1 million compounds using 10 known fibrinolysis inhibitors as query molecules. A 3D electrostatic and shape-based similarity approach yielded the active compound 4-PIOL, which on optimization led to the clinical trial candidate AZD6564 for the treatment of hemorrhage.

 

Liang et al. successfully screened a covalent natural products database using herb-based mapping to identify the active compounds baicalein and baicalin, which show PLK-1 inhibitory activity. Burggraaff et al. carried out a statistical and structure-based virtual screening for the discovery of several RET kinase inhibitors. Rollinger et al. discovered two novel acetylcholinesterase inhibitors (scopoline and scopoletin) by structure-based pharmacophore screening of a 110,000-compound natural products database, using the protein structure and a known inhibitor. Advanced computational resources are being implemented in virtual screening to search for lead compounds and to assist the hit-finding procedure by preselecting compounds for biological evaluation. Although positive hits obtained from virtual screening must still undergo experimental screening, the approach saves the time and valuable resources that would otherwise be wasted on the synthesis and activity testing of random compounds. Undoubtedly, virtual screening makes the drug discovery process faster, more efficient, and more economical.

 

Virtual high-throughput screening (vHTS)

In the pharmaceutical sector, the cost of developing new drug molecules remains high despite increased spending on emerging technologies. To cut the costs linked with new drug development, computational techniques have been established. Among these is virtual high-throughput screening (vHTS), which provides tools to search for novel drugs with the capability to bind a specific molecular target in the drug discovery process. In this approach, vHTS uses computational algorithms to detect novel bioactive compounds. It raises the hit rate for novel compounds because it allows more extensive target exploration than traditional combinatorial chemistry and high-throughput screening, and it elucidates the molecular characteristics responsible for therapeutic action, aiding the prediction of probable derivatives with enhanced activity. It is a computer-assisted technique in the drug development process that can cut back time and costs, lowering the attrition rate by prioritizing molecules for further experimental examination. A typical virtual high-throughput screening procedure is shown in Fig.

 

Tools for virtual high-throughput screening (vHTS)

Some of the available tools for vHTS are Octopus, PyRx Virtual Screening Tool, and Raccoon2.

  

Octopus

Octopus is an automated workflow organization tool for performing vHTS. It combines the user-friendly molecular docking interface AutoDock Vina with PyMOL and MGLTools to execute vHTS. Octopus can process molecular docking data for an unrestricted number of ligands and molecular targets, which cannot be done in other platforms such as PyRx and Raccoon2. During vHTS, it can efficiently dock numerous molecules against a group of targets. Additionally, it includes molecular target databanks for malaria, dengue, and cancer. Conveniently, it can also reduce the number of biological assays required to discover a pharmacological mechanism. It is mainly limited by the time required to draw the ligand structures and choose the preferred targets. Apart from the visual inspection of intermolecular interactions between molecules and target carried out by computational chemists, the complete procedure can run automatically.

 

PyRx

PyRx is open-source software for virtual screening of small molecules with a convenient, intuitive interface that runs on Mac OS, Windows, and Linux operating systems. In PyRx, structures of the target molecules are required as input files to initiate the virtual screening; these can be downloaded from a number of freely accessible websites such as DrugBank, PubChem, and the Protein Data Bank. By default, the integrated AutoDock Vina docking interface outputs the 10 best binding modes for each docking run. Users can also export virtual screening results as SDF and comma-separated values (CSV) files, which facilitates further analysis and filtering of the results in third-party packages.

 

Raccoon2

Raccoon2 is a graphical platform for preparing and analyzing virtual screenings that uses AutoDock Vina as the default docking program. Raccoon2 is better developed, more flexible, and more robust than the earlier Raccoon, as the code has been rewritten from scratch and designed to add new features such as the investigational characteristics of Fox, allowing simplified examination of results, summary reporting, and sharing of structures among associated laboratories. Raccoon2 runs only on Linux computational platforms with PBS and SGE schedulers. Its features include automated downloading with preprocessing of experimental results and filtering of the results by properties (energy, ligand efficiency) and interactions.

 

Techniques for virtual high-throughput screening (vHTS)

There are basically two approaches: ligand-based and structure-based vHTS.

 

Ligand-based vHTS

When the structure of the target is unknown, the measured activities of some known compounds can be used to construct a pharmacophore model. Ligand-based approaches use information on ligands, grounded in their resemblance or diversity relative to previously identified active ligands, to forecast activity. Ligands structurally similar to an active ligand are more likely to be potent than random ligands; small variations in key structural features, such as the placement of hydrogen-bonding and hydrophobic groups, help in selecting from the library a template set of the most promising contenders.

 

Ligand-based vHTS can be categorized into the following classes

Fingerprint-based methods: Molecules are anticipated to possess similar characteristics when there is a structural resemblance. Databases comprise compounds of similar structure whose biological activity is unknown and which may include compounds with the desired activity.

 

Generalized pharmacophoric method

A pharmacophore is the structural feature responsible for a compound's activity at a receptor site. In this method, a common pharmacophore layout is identified for a set of known active molecules, and the layout is then used for a 3D substructure search. The software finds spatial arrangements of atoms and functional groups in the query molecule that match this pattern.

 

Machine learning approaches

In the machine learning approaches, logic-based rules are derived, from a training set comprising identified active and inactive molecules, to define the substructural properties associated with biological activity. This delivers insight into the determinants of activity.

 

Structure-based vHTS

Structure-based vHTS depends on 3D structural information about the target protein, obtained by techniques such as crystallography, spectroscopy, and bioinformatics, to estimate the interaction energy of the tested compound. Structure-based vHTS involves finding a possible binding site for ligands on the target molecule; subsequently, ligands are docked to the target protein, and a scoring function is used to evaluate how strongly a ligand will bind to it. The approach has developed into a cornerstone of computational biology and medicinal chemistry, enriching the understanding of the biological target and of the chemistry behind ligand-protein interactions.

 

The important stages in structure-based vHTS are:

·       Preparation of library for ligand.

·       Preparation of target protein.

·       Confirmation of most promising binding position.

·       Grading of ligand-protein docked complexes.

In this technique, the binding modes of a complete virtual compound database on the target protein are determined by a docking program.

 

Applications of virtual high-throughput screening (vHTS)

It is used for three major purposes:

(1) To refine a large set of compounds to a relatively smaller set of predicted active compounds for later experimental testing;

(2) To optimize lead compounds by improving their affinity and their pharmacokinetic properties such as absorption, distribution, metabolism, excretion, and toxicity (ADMET); and

(3) To design novel active candidates by growing starting molecules with different functional groups or by assembling fragments into a novel chemical entity.

 

Approved antiviral drugs such as saquinavir, ritonavir, and indinavir are some of the drugs discovered by computer-aided drug design.

 

Pharmacophore

A pharmacophore is the active part of a molecule: it states the features responsible for the biological interaction and therapeutic activity. The pharmacophore of a molecule is determined by the active functional groups of the ligand that interact with the receptor target and promote inhibition or activation of the target protein. The molecular interaction between ligand and receptor depends on pharmacophore features involving polar, nonpolar, aromatic, ring, and charged groups. These features show different types of interactions with the amino acid residues of the receptor, such as hydrogen bonds with polar groups, hydrophobic interactions, and aromatic interactions. The pharmacophore feature in morphine, for instance, is a tertiary alkylamine; this group permits interaction with opioid receptors and elicits the biological response through an activated signaling cascade that results in analgesia and sedation.

 

Pharmacophore modeling and similarity search

Through pharmacophore screening, it is possible to identify compounds containing different scaffolds but with a similar 3D arrangement of key interacting functional groups, onto which binding site information can be incorporated.

 

Structure-based pharmacophore modeling

As per IUPAC, a pharmacophore is defined as “the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response.” Pharmacophoric features include hydrogen bond donors/acceptors, charged centers (positive or negative), hydrophobic and aromatic regions, metal-binding regions, and constraints such as distances, angles, and dihedral angles. When these pharmacophoric features, along with their 3D distributions, are assigned within the binding site, they can reveal the structural information and interactions responsible for making a compound bioactive.

 

This method can also retrieve structures with diverse bioisosteric scaffolds, which are difficult to explore by systematic derivatization of known compounds. Favorable interactions between the ligand and the receptor binding site are classified as hydrogen bonds (donors/acceptors), electrostatic charge centers (positive/negative), and hydrophobic contacts, and these are used to generate pharmacophore models. Several pharmacophore models can be aligned together to produce a common pharmacophore hypothesis. Pharmacophore modeling has been utilized for database searching, virtual screening, fragment design, scaffold hopping, and forecasting the bioactivity of hypothetical compounds.

 

Ligand-based pharmacophore modeling

When a protein structure is unavailable, ligand-based pharmacophore modeling can be employed. It has two essential steps:

(1) pharmacophoric features in the training-set compounds are analyzed;

(2) the active conformations of the training-set ligands are aligned based on chemical features or molecular field descriptors.

 

Ligand-based pharmacophore models are of two types:

 

Qualitative models: A set of active ligands (no explicit biological activity data required) with diverse structures is used to generate a common-feature-based pharmacophore hypothesis.

 

Quantitative models: A set of known active compounds (with activity data expressed as Ki or IC50) is utilized to create QSAR-based predictive pharmacophore models.

 

Examples: Catalyst includes two alternative algorithms, HypoGen and HipHop, for building pharmacophore-based models. HypoGen assigns a weighting factor to each chemical feature of the ligand responsible for bioactivity and constructs a pharmacophore model from them; in this way, several pharmacophore hypotheses can be prepared and ranked according to their ability to correlate with bioactivity. HipHop explores the surface accessibility of the active ligands suitable for interactions with the receptor to determine their absolute coordinates. Pharmacophore models are then prepared based on the chemical features assigned to those absolute coordinates in different conformations, and a number of pharmacophore hypotheses can be generated and ranked by their ability to explain bioactivity.

 

The Disco program adopts a different approach, breaking the pharmacophore into ligand points (hydrogen bonds, charged centers, hydrophobic regions) and binding-pocket interaction sites (complementary regions within the receptor, mapped by the coordinates of the ligand's heavy atoms). As in Catalyst, a set of predefined conformations limits the ability to explore the entire conformational space of the ligand.

 

The GASP program utilizes a genetic algorithm to search conformational space and generate different models. The ligand with the minimum number of common chemical features is taken as the reference or template structure. All other compounds in the training set are aligned to this template to evaluate the fitness of a specific pharmacophore model based on similarity, overlaid features, and the volume integral of the overlay. Unlike Catalyst and Disco, the overall shape, along with any steric clashes between the ligands, is taken into consideration during the generation of the final model.

  

Quantitative structure-activity relationship (QSAR)

QSAR may be defined as the application of statistical approaches, namely regression and classification methods, to the pursuit of quantitative relationships between the biological activity of a set of congeneric compounds and their structural, topological, electronic, electrotopological, steric, and physicochemical properties (among others). This field assumes that the behavior of a substance in the biological environment depends on its structural characteristics, which determine its overall properties.

 

Historical background

While the discovery of new therapeutic agents was restricted to the empirical isolation of compounds from natural sources until the mid-nineteenth century, in 1868 Crum-Brown and Fraser, working with derivatives of morphine, strychnine, and atropine in guinea pigs, postulated that the physiological action of a drug was a function of its chemical structure. In 1893, Richet established that the toxicity of a series of alcohols, ethers, and ketones was inversely proportional to their solubility in water (which was ultimately related to lipophilicity, although this concept was not yet fully established). Overton and Meyer, in 1899, postulated that the narcotic activity of some derivatives was a function of the partition coefficient between chloroform and water, which increased across the series of compounds but decreased for very lipophilic compounds. This behavior was associated with the fact that compounds with a very high affinity for lipids could not easily reach the site of action, being retained in the guinea pig's adipose tissues. Ferguson's work, in 1939, established an equivalence relationship between the chemical potential of the exobiophase and that of the endobiophase, which allows inferences about the intensity of a drug's action in the internal phase to be made by measuring it in the external phase.

 

Quantitative structure-activity relationship (QSAR)

From a set of similar molecules, physicochemical descriptors (electronic, hydrophobic, steric), topological descriptors, and others are obtained. A multivariate regression analysis is performed between these descriptors and biological activity, resulting in an equation that represents a model of the system investigated. The 3D-QSAR approach uses the three-dimensional structures of compounds, properly aligned, around which a probe (for example, a positively charged carbon) is placed on a grid of points, calculating the steric and electrostatic energies. Each of these energy values becomes a physicochemical descriptor in the model, which uses biological activity data as the dependent variable. The points of interaction favorable or unfavorable to biological activity are converted into a three-dimensional map showing regions favored by large groups and regions favored by small groups, as well as regions favored by positive groups and those favored by negative groups.

 

QSAR modeling

The conventional approach of structure-activity relationship (SAR) in drug design and discovery has brought many successes but not without a great degree of luck. QSAR method correlates various quantifiable physicochemical properties with biological activity. Usually, this relationship takes the form of an equation which also helps in eliminating the luck factor from the drug design process. Classical QSAR approaches like Free-Wilson and Hansch analysis have correlated the biological activity with certain structural and physicochemical parameters, respectively.

 

 Free-Wilson analysis

Free and Wilson developed a mathematical equation to correlate certain structural features (like the presence/absence of chemical substituents) with bioactivity. It can predict the activity of only those compounds having known substituents that have been included for developing the equation.

 

Hansch analysis

Hansch followed an extra-thermodynamic approach to develop a model in the form of an equation. He proposed that biological activity can be correlated with various physicochemical factors by a mathematical model.

 

Drug action involves two steps:

(i) Transport of drug to the site of action which mainly depends on lipophilic parameters like partition coefficient and substituent hydrophobicity constant.

(ii) Binding of a drug to the target receptor which mainly depends on electronic (Hammett’s constant) and steric parameters (Taft’s constant, Verloop’s steric parameter).

A typical Hansch equation takes the form:

log(1/C) = k1 log P + k2 σ + k3 Es + k4

where biological activity is expressed as log 1/C (because C, the concentration producing a standard response, is very small); P is the partition coefficient; σ is Hammett's constant; Es is Taft's steric factor; and k1, k2, k3, k4 are constants. The Hansch equation helps in predicting the activity of new or untested compounds and also provides information regarding the mechanism of the drug. However, the accuracy of this model depends on various factors such as the accuracy of the biological activity data, the inclusion of a large dataset, and the choice of appropriate parameters. A mixed Free-Wilson/Hansch approach has also been developed to widen the applicability of both methods. However, three-dimensional parameters cannot be taken into consideration by either of the two models.
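The coefficients of such an equation are obtained by multiple linear regression. A hedged sketch with scikit-learn, using made-up substituent constants and activities purely to show the fitting step, could look like this:

```python
# Fitting a Hansch-type equation  log(1/C) = k1*logP + k2*sigma + k3*Es + k4
# by multiple linear regression; all numbers below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: logP, Hammett sigma, Taft Es (one row per analogue)
X = np.array([[1.2,  0.23, -0.55],
              [1.8,  0.06, -1.24],
              [2.4, -0.17, -0.46],
              [0.9,  0.54,  0.00],
              [2.9,  0.00, -1.31]])
log_inv_C = np.array([3.1, 3.6, 4.0, 2.7, 4.2])   # hypothetical activities, log(1/C)

model = LinearRegression().fit(X, log_inv_C)
k1, k2, k3 = model.coef_
k4 = model.intercept_
print(f"log(1/C) = {k1:.2f}*logP + {k2:.2f}*sigma + {k3:.2f}*Es + {k4:.2f}")
print("predicted activity for a new analogue:", model.predict([[2.0, 0.10, -0.70]])[0])
```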

 

Usually, the dataset is divided into training and test set molecules. A QSAR model is built on a training set containing diverse chemical structures including active and inactive molecules. The test set molecules are used for testing the validity, predictive capacity, and accuracy of the developed QSAR model. Currently, various multidimensional QSAR models like 3D, 4D, 5D, and 6D QSAR have already been established based on multidimensional descriptors.

 

Molecular Descriptors Used in QSAR

Molecular descriptors are a numerical representation of the chemical information present within a molecule. Many parameters, such as hydrophobic, electronic, and steric parameters, along with their associated descriptors, are used for QSAR. Descriptors associated with hydrophobic parameters are the partition coefficient (log P), Hansch's substituent constant (π), the hydrophobic fragmental constant (f), the distribution coefficient (log D), apparent log P, the capacity factor in HPLC (log k, log kW), and the solubility parameter (log S). The Hammett constant (σ, σ+, σ-), Taft's inductive (polar) constant (σ*), the ionization constant (pKa, ΔpKa), and chemical shifts are descriptors used to define electronic parameters. Similarly, steric parameters are defined by Taft's steric parameter (Es), molar volume (MV), Van der Waals radius and volume, molar refractivity (MR), and parachor. Atomic net charges (Qσ, Qπ), superdelocalizability, the energy of the highest occupied molecular orbital (EHOMO), and the energy of the lowest unoccupied molecular orbital (ELUMO) are known as quantum chemical descriptors. Spatial descriptors such as Jurs descriptors, shadow indices, radius of gyration, and principal moments of inertia are also used in developing a QSAR model. The information carried by molecular descriptors depends on the representation of the molecule and the algorithm used for calculating the descriptors.
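Many of these descriptors can be computed directly from a structure. The short example below (assuming RDKit; the values are RDKit's own estimates, not experimental measurements) prints a few of the hydrophobic, steric, and hydrogen-bonding descriptors mentioned above:

```python
# Computing a handful of common QSAR descriptors with RDKit.
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin as an example input

print("MolWt     :", Descriptors.MolWt(mol))
print("logP      :", Crippen.MolLogP(mol))          # Crippen estimate of octanol/water partition coefficient
print("MR        :", Crippen.MolMR(mol))            # molar refractivity
print("TPSA      :", Descriptors.TPSA(mol))         # topological polar surface area
print("HBD / HBA :", Lipinski.NumHDonors(mol), Lipinski.NumHAcceptors(mol))
```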

 

3D-QSAR

The set of techniques that correlate biological activity with the three-dimensional structure of drugs is termed three-dimensional QSAR (3D-QSAR).

 

The main objective of this approach is to identify spatial regions in the ligand structure that are complementary to the receptor interaction site. A 3D-QSAR investigation is based on the following assertions:

1. It is the main compound that is being modeled (not one of its metabolites) that is directly responsible for the biological effect considered;

2. The conformation being considered is the pharmacophoric one;

3. The interaction with the receptor and the biological response that arises can be associated with only one conformation;

4. The site of action is the same for the entire series of compounds examined;

5. The biological activity is highly dependent on enthalpic factors;

6. Entropic contributions to the interaction process are the same for all compounds;

7. The system is supposed to be in equilibrium, and kinetic factors are usually not taken into account, as well as solvent effects.

 

The molecular alignment is crucial for the predictability of a 3D-QSAR model. For flexible molecules, assertion 2 is not always true. Preferably, one should take advantage of the maximum available information about the pharmacophoric conformation, rather than testing the various alignment methods that exist, such as atom-to-atom alignment on a common substructure, alignment based on charge and mass distribution, or alignment based on a genetic algorithm. The most used 3D-QSAR methods are CoMFA, HQSAR, and CoMSIA.

 

 

3D-QSAR generates the quantitative relationship between the biological activity of a set of compounds and their 3D structural properties (Fig.). 3D-QSAR uses a probe to determine values of 3D properties of the molecules, such as steric and electrostatic fields, and then correlates them to build a relationship model between the 3D descriptors of the molecules and their biological activity.

 

Molecular Shape Analysis (MSA)

MSA is an approach that includes conformational flexibility and molecular shape data in 3D-QSAR analysis. In MSA, the 3D structures of many compounds are superimposed to find the commonly overlapping steric volume, and the common potential energy fields between the superimposed molecules are also identified to establish a correlation between the structure and activity of a set of compounds. This analysis also provides structural insight into the shape and size of the receptor-binding site.

 

Self-Organizing Molecular Field Analysis (SOMFA)

SOMFA divides the entire molecule set into actives (+) and inactives (−), and a grid probe maps the steric and electrostatic potentials onto the grid points. The biological activity of the molecules is then correlated with the steric and electrostatic potentials using linear regression.

 

Comparative Molecular Field Analysis (CoMFA)

CoMFA is a grid-based 3D-QSAR technique. It assumes that in most cases drug-receptor interactions are governed by non-covalent interactions, and that a correlation exists between the steric and electrostatic fields of molecules and their biological activity. The steric and electrostatic fields of the ligands at the various grid points in a 3D lattice are calculated, and partial least squares (PLS) analysis is used to correlate these fields with the biological activities of the molecules.
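The PLS step can be illustrated with scikit-learn. In the sketch below the grid-field matrix is random placeholder data standing in for computed steric and electrostatic grid energies, so it only demonstrates the regression machinery, not a real CoMFA model:

```python
# Minimal sketch of the PLS regression used in CoMFA-like workflows.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_mols, n_grid_points = 20, 500
X = rng.normal(size=(n_mols, n_grid_points))      # placeholder field values per molecule
y = rng.normal(size=n_mols)                       # placeholder activities (e.g., pIC50)

pls = PLSRegression(n_components=3)               # a few latent variables, as in typical field-based models
pls.fit(X, y)
print("R^2 on training data:", pls.score(X, y))
```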

 

Comparative Molecular Similarity Indices Analysis (CoMSIA)

In CoMSIA, molecular similarity indices serve as a set of field descriptors. This 3D-QSAR technique is used to determine the common features that are important for binding with the target molecule. Here, not only steric and electrostatic features but also hydrophobic fields, hydrogen bond donors, and hydrogen bond acceptors are taken into account for predicting the biological activity of a compound.

 

3D Pharmacophore Modeling

In pharmacophore modeling, the features governing biological activity are determined from a set of known drugs that bind to a specific target. The entire structure of a molecule is not responsible for the biological activity; it is only the pharmacophore that decides the biological response. Pharmacophore modeling is used to search for new potential drugs that share the same pharmacophore as other biologically active drugs of the same target. Pharmacophore models are hypotheses about the 3D arrangement of structural features such as hydrophobic groups, aromatic rings, and hydrogen bond donors and acceptors. Structurally diverse molecules can bind to the receptor in a similar pattern, their pharmacophores interacting with the same atoms or functional groups of the receptor molecule (Fig.).

 

In the 3D QSAR model, molecules are aligned and superimposed with the core structure, and the molecular descriptors are calculated based on their conformation in the 3D space. The descriptors are correlated with biological activity, and a mathematical model is established. The descriptors in the 3D QSAR are the steric properties of the molecules, electrostatic forces, and force field descriptors.

 

Advantages of 3D QSAR over 2D QSAR

1. Not dependent on experimental values.

2. Can be applied to molecules with unusual substituents.

3. Not restricted to molecules of the same structural class, as in pharmacophore mapping.

4. Good predictive ability.

 

QSAR APPLICATIONS

Quantitative-Structure Activity Relationship (QSAR) has been used to predict therapeutic functions of molecules.

• QSAR improves compound libraries used in traditional HTS.

• QSAR can be used to direct combinatorial library synthesis (libraries can be screened against biological targets of interest).

• QSAR has been applied to de novo drug design techniques when structural information regarding the target is unknown.

The major dropouts of candidates during the later stages of drug development are due to observed pharmacokinetic defects. The long and tedious studies needed to detect them require a minimum of two years and are costly, causing delays and added expense in the pharmaceutical industry. This creates demand for predictive tools.

QSAR-based approaches have gained prominence for such predictions:

1. Quantitative Structure-Bioavailability Relationship (QSBR): A subtype of QSAR useful in analyzing the various parameters affecting drug bioavailability. A mathematical model correlates bioavailability with structural descriptors.

2. Quantitative Structure-Metabolism Relationship (QSMR): These models are useful for predicting the rate of cytochrome P450-mediated metabolism. They also predict the rate of hydrolysis of prodrugs and soft drugs.

3. Quantitative Structure-Toxicity Relationship (QSTR): Toxic effects of clinical candidates are one of the major concerns. This QSAR subtype is useful in predicting the possible toxicity outcomes of molecules through descriptor analysis.

Molecular docking

It starts with binding site prediction or identification. The binding site is usually a concave region on the periphery of the protein where interaction with the ligand occurs. Topological methods for identifying active sites, such as CASTp, allow us to study the suitable sites for this interaction. Docking uses knowledge of protein binding sites to test for interactions with small molecules.

 

Numerous positions are tested and then classified according to scoring criteria. This scoring follows a mathematical function that varies depending on the program and that distinguishes the different docking methods in their applicability. The interaction energy is obtained through an empirical expression that involves enthalpic, entropic, and hydrophobic contributions, in addition to solvation effects. Docking can be rigid, which is useful when scanning large databases of structures, or flexible, when we want to draw a more detailed profile of the ligand-receptor interaction. The effects of the solvent on the interaction can be simulated indirectly, through the introduction of a correction in the expression of the interaction energy, or explicitly, through the use of solvent boxes.

 

Molecular Docking Types

1. Rigid body docking: In this docking procedure, the target protein and ligand conformations are considered rigid. Conformational flexibility in bond angles, bond lengths, and torsion angles is not permitted. Substantial conformational changes at the molecular level are therefore not accounted for, which makes this type of docking inadequate in such cases.

2. Flexible-ligand docking: In this type of docking simulation, the target protein is considered rigid (rigid conformation). The translational, rotational and conformational degrees of freedom of the ligand can be explored.

3. Flexible docking: This type of docking provides all degrees of freedom to the target protein and ligands under investigation. It accounts for the complete conformational changes occurring at the molecular level.

Different Types of Docking Based on Interactions

The selection of appropriate algorithms, tools, and parameters is an important challenge in molecular docking. In nature, different types of molecular interactions take place, such as protein-ligand (small molecule), protein-peptide, protein-protein, protein-nucleic acid, or nucleic acid-ligand. Different docking tools have been developed keeping in mind the nature of the interacting molecules, the possible forces, and other parameters. In the field of medicinal chemistry, ligand promiscuity is a topic of discussion. Different folding patterns and structural arrangements have been deposited in large repositories such as the PDB. The search for patterns and similarities in binding sites and protein pockets allows the detection of structural changes and behavior. Docking has been classified into many categories based on the nature of the molecules involved in the interaction.

 

Protein-Ligand Docking

Structure-based design is a very powerful approach to druggable targets. Docking predicts the pose or orientation of a ligand in the binding site of a target molecule or enzyme. For flexible proteins, protein energy landscape exploration (PELE) is used for the correct assessment of binding sites and poses. Machine learning techniques such as self-organizing maps (SOMs) or k-means clustering, combined with molecular dynamics, are used to determine the complementarity of protein and ligand conformations. For free energy calculation, MTflex uses Monte Carlo integration, generating rotamers for the binding residues based on low-energy values along the free energy surface.

 

Protein-Peptide-like Ligand Docking

Peptide ligands are highly variable owing to their flexibility. Nowadays, peptides are used in medicinal applications, proving their polypharmacological effects and their suitability for targeting protein-protein interactions. Peptide docking involves calculations relating to the conformations and poses highlighted in Fig. Protein-protein interaction networks can be perturbed by differential gene expression and disease mutations. Molecular modeling approaches play an important role in optimizing the activity of known peptides and also in designing novel peptides as inhibitors.

 

Protein-Protein Docking

In protein-protein docking, protein complexes are determined through sequence alignments, structural comparisons, and multiple protein-protein interactions, within their defined conformations and docking positions. The Protein Structure Initiative provides significant structural information for the Critical Assessment of Structure Prediction (CASP). For protein-protein docking and macromolecular interactions, the Critical Assessment of PRediction of Interactions (CAPRI; http://capri.ebi.ac.uk) acts as a contest space that challenges human groups, software, and servers to correctly predict the conformations of pre-chosen interacting protein-protein targets. Protein-protein docking can be approached as a prediction for the whole complex, minimizing each protein with coarse-grained models and using local search for the binding sites. Thus, the major challenge for protein-protein docking is the flexibility of the backbone. For this reason, comprehensive computational studies need to be conducted to successfully distinguish realistic complexes from unrealistic predictions.

 

Protein-Nucleic Acid Docking/Nucleic Acid-Ligand Docking

Proteins and nucleic acids are the two main classes of biological macromolecules that act as targets for many processes and functions. Protein-RNA and protein-DNA interactions are very important for replication, transcription, splicing, translation, and nucleic acid degradation. Abnormalities in protein-nucleic acid interactions are associated with a number of neurological diseases, cancers, and many other metabolism-associated disorders. Protein-nucleic acid complexes are being solved by researchers, which may help in understanding the different interactions. NPDock is a protein-nucleic acid docking tool; it uses the DARS-RNP and QUASI-RNP statistical potentials for scoring interactions in protein-RNA complexes. RNA molecules have recently attracted attention as drug targets because of their importance in key biological processes. However, structure-based docking of small-molecule ligands to RNA is not yet as well established as protein-ligand docking. LigandRNA is a scoring function for predicting RNA-small molecule interactions. It is based on a grid algorithm, and its knowledge-based scoring potential is derived from the ligand-binding sites of known RNA-ligand complexes. LigandRNA takes an RNA receptor file and a ligand pose file as input and outputs a ranking of the poses according to their scores. A modified version of Dock6 also includes an RNA-ligand docking facility, in which RNA-ligand docking problems were addressed by incorporating a classical molecular mechanics force field for calculating the interaction between the RNA and the ligand.

 

Molecular Docking Steps

A typical docking procedure consists of four steps: ligand setup/preparation, protein setup/preparation, docking, and postdocking analysis.

 

Ligand preparation

The input file format of a chemical structure is very important because it represents the atomic coordinates, bond types, and bond order of a ligand. Docking results, along with the molecular interactions, may be affected by any of the parameters of the ligand, i.e., protonation state, tautomer, conformer, etc. Therefore, the ligand structure is converted into a 3D conformation and then refined by energy minimization using a molecular mechanics protocol.
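A minimal ligand-preparation sketch with RDKit (one possible toolkit among several; the SMILES string is an arbitrary example) covering 3D conversion and molecular mechanics minimization:

```python
# Parse a ligand, add hydrogens, build a 3D conformer, and relax it with MMFF94.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")     # example ligand (paracetamol)
mol = Chem.AddHs(mol)                              # explicit hydrogens matter for 3D geometry
AllChem.EmbedMolecule(mol, randomSeed=42)          # generate an initial 3D conformer
AllChem.MMFFOptimizeMolecule(mol)                  # molecular mechanics energy minimization
Chem.MolToMolFile(mol, "ligand_3d.mol")            # write coordinates for the docking program
```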

 

Protein preparation

A high-resolution X-ray crystallographic structure of a protein is usually preferred over other structures. In case of nonavailability of an X-ray structure, the protein structure can also be generated by homology modelling. When several X-ray structures of the protein are available, a holoenzyme (with ligand) complex structure is preferred over an apoenzyme (without ligand). If many holostructures are available, the one with a cocrystallized ligand at the binding site is preferred. Hydrogens are added to the protein structure, especially polar hydrogens, for optimizing the hydrogen bonding network. Missing side chains or residues should be checked and corrected before the final refinement of the protein, in which it undergoes minimization to remove any clashes. The binding site information is obtained from the cocrystallized ligand complex at the binding site of the enzyme. It helps in confining the 3D space of the binding site into a grid box, where a suitable binding pose of the ligand is searched. In the case of apoenzyme structures, or when the binding site information is unknown, a time-consuming blind docking is performed where the entire protein surface is scanned for a suitable binding site. Currently, many programs are available which can detect the binding sites within an apoenzyme structure based on various pharmacophoric features.

 

Docking

The pioneering work of Kuntz et al. has led to the development of many open-source as well as commercial docking programs. A docking program comprises a search algorithm and a scoring function, which together critically determine the speed and accuracy of docking. It searches and ranks various poses of the ligand within the conformational space available in the binding site of the receptor. Validation of a docking program is carried out against various benchmarking parameters to prove its speed, accuracy of prediction, and ability to distinguish actives from inactives.

 

Binding free energy calculations

Binding free energy determines how well a ligand fits within a binding site and whether the complex reaches its global energy minimum. Free energies are of crucial importance because they govern binding affinities and protein folding; as the fundamental thermodynamic quantity underlying binding, folding, and reaction kinetics, the binding free energy is important to quantify with computational techniques.

 

The binding free energy accompanying the binding of a ligand to a target protein is determined by the following equation:
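One standard thermodynamic form of this relation (given here as an assumption, since several equivalent expressions are in use) is:

ΔG_bind = ΔH − TΔS = −RT ln Ka

where ΔH and ΔS are the enthalpy and entropy changes on binding, T is the absolute temperature, R is the gas constant, and Ka is the association (binding) constant.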

 

 

Postdocking analysis

The top poses are ranked by the least binding energy score (usually with a negative sign). Poses which show steric or electrostatic clashes may be screened by applying topological filters. Energy minimization of a ligand pose inside the binding site can be performed by another program and analyzed by machine learning methods. Scaffold enrichment may be applied on the hit molecules as an alternative refinement method to recover false negatives, if they share a common structural framework with true positive ligands.

 

Based on the flexibility of ligands and receptors, docking can be of two types: rigid docking and flexible docking. When both the receptor and ligand are kept rigid, a limited search space is available, encompassing only three rotational and three translational degrees of freedom. A predefined set of ligand conformations can be used to address ligand flexibility. Various search methods, such as Monte Carlo simulations, simulated annealing, and evolutionary or genetic algorithms, have been used to incorporate ligand flexibility while the receptor is kept rigid. Molecular dynamics (MD) simulations are used to treat receptor flexibility, but they consume considerable computational resources and time.

 

Analysis of Docking Results

Analysis of the docked complex structure obtained from a molecular docking study is an essential task for visualization of the protein-ligand interaction at the atomic level using molecular modeling tools in 2D or 3D. In this analysis we can identify the number of hydrogen bonds formed between different functional groups of the ligand and the amino acid residues in the binding site of the protein, along with their bond lengths, because hydrogen bonding plays a significant role in protein-ligand interaction. Besides, we can also analyze hydrophobic and cation-pi interactions. This analysis helps the researcher choose the best interacting ligand because, in some cases, two or more ligands have the same binding energy but differ in the number of interacting amino acid residues; in such a situation we generally choose the ligand with more interactions with the target in terms of the number of interacting residues. PyMOL and Chimera are widely accepted tools for analysis of docking results, allowing selection of the different ligand poses generated during docking and visualization of the interacting residues in 3D. Besides, LigPlot is one of the most highly cited and recommended tools for analysis of docking results in 2D format.

 

Validation and Accuracy

Docking accuracy is usually assessed by the ability to reproduce the experimentally determined binding mode of a ligand. The predicted binding mode is chosen as the best-scoring protein-ligand complex. Docking time increases linearly with the number of structures used. The success rate in retrieving the binding modes of known protein-ligand complexes is an important validation for docking programs. Success is measured by the root mean square deviation (RMSD) between the experimentally observed heavy-atom positions of the ligand and those predicted by the program. Poses with an RMSD < 2 Å are considered a success, and dockings with RMSDs between 2 and 3 Å are considered a partial success. The evaluation of docking results relies mainly on a scoring function that ranks the binding poses according to specific properties. State-of-the-art docking programs correctly dock ~70-80% of ligands when tested on large sets of protein-ligand complexes.
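The RMSD criterion itself is simple to compute. The sketch below uses tiny made-up coordinate arrays in place of the real docked and crystallographic heavy-atom coordinates, which are assumed to be in the same frame of reference (no re-alignment is performed):

```python
# RMSD between a docked pose and the crystallographic reference pose.
import numpy as np

crystal_pose = np.array([[0.0, 0.0, 0.0],
                         [1.5, 0.0, 0.0],
                         [1.5, 1.4, 0.0]])
docked_pose  = np.array([[0.3, 0.1, 0.0],
                         [1.7, 0.2, 0.1],
                         [1.4, 1.6, 0.2]])

# square each per-atom displacement, average over atoms, take the square root
rmsd = np.sqrt(np.mean(np.sum((docked_pose - crystal_pose) ** 2, axis=1)))
print(f"RMSD = {rmsd:.2f} A ->", "success" if rmsd < 2.0 else
      "partial success" if rmsd < 3.0 else "failure")
```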

 

Absorption, distribution, metabolism, excretion, and toxicity (ADMET) prediction

The determination of pharmacodynamic, pharmacokinetic, and toxicological properties is essential for obtaining relevant bioactive compounds. Lipinski's rules (a molecule with a molecular mass less than 500 Da, no more than 5 hydrogen bond donors, no more than 10 hydrogen bond acceptors, and an octanol-water partition coefficient log P not greater than 5) allow determining whether a compound has potential as a drug without presenting toxicological characteristics that make its use unfeasible. Many tools for ADMET prediction are available, such as QikProp.
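A rule-of-five filter is straightforward to script. The sketch below assumes RDKit and simply counts violations of the thresholds listed above:

```python
# Count Lipinski rule-of-five violations (MW <= 500, HBD <= 5, HBA <= 10, logP <= 5).
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def lipinski_violations(smiles):
    mol = Chem.MolFromSmiles(smiles)
    rules = [Descriptors.MolWt(mol) > 500,
             Lipinski.NumHDonors(mol) > 5,
             Lipinski.NumHAcceptors(mol) > 10,
             Crippen.MolLogP(mol) > 5]
    return sum(rules)   # number of rules violated

print(lipinski_violations("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin: expected 0 violations
```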

 

ADME analysis and measures of drug-likeness

Usually, 9 out of 10 research projects in a drug discovery program face end-stage failure. These 10 projects involve the synthesis of about 10,000-20,000 molecules, followed by their activity studies. The failure rate of the drug discovery process in the pharmaceutical industry is very high, usually about 99.99%. These failures come at a huge price of approximately $500 million to $2 billion. Many candidates fail in late-stage clinical studies due to poor ADME (absorption, distribution, metabolism, elimination) properties.

 

Therefore, various computational models have been developed to predict ADME properties before clinical studies. This helps in the preselection of good drug-like candidates for synthesis and activity studies, reduces the failure rate and cost involved in clinical trials by removing compounds with bad ADME profiles, and improves the understanding of how experimental and predicted ADME parameters correlate. Gradually, various drug transport models, like Caco-2 cell permeability representing intestinal absorption, were also incorporated. Eventually, the toxicity profile was coupled with the other ADME-related properties to give ADMET parameters, which have been successfully used in virtual screening procedures for filtering large databases to select hit molecules. Data obtained from high-throughput in vitro screening assays are used by numerous computational methods and descriptors to build ADMET models. Various molecular property-based descriptors, like polar surface area, hydrogen-bonding network, and octanol/water partition coefficient, or semiempirical descriptors, can help in the quantification of pharmacokinetic or ADME properties, which are further correlated with the 3D structure by suitable models.

 

Measures of drug-likeness

Qualitative evaluation of basic descriptors is the most common way of examining the ADME profile of a molecule, as proposed by Lipinski's "rule of five." A compound is considered to have poor absorption if it violates any two of the proposed parameters (molecular weight not more than 500; number of H-bond acceptors and donors not more than 10 and 5, respectively; calculated logP not more than 5). Researchers have subclassified the descriptors based on oral versus nonoral drugs and different target diseases to examine their effects on the ADME parameters. Vieth et al. evaluated 1729 marketed drugs and reported a statistically significant difference between injectable drugs (high molecular weight, more polar) and oral drugs (low molecular weight, less polar). Moreover, the pharmacokinetic parameters for injectable drugs were more flexible compared with oral drugs. This led to the conclusion that pharmacokinetic parameters need biased property distributions according to different targets and routes of administration.

 

Aqueous solubility and lipophilicity

Aqueous solubility is an essential parameter to predict for molecules targeting the oral route of delivery. Good solubility is highly necessary for in vitro and in vivo assays and for predicting absorption in the gastrointestinal tract. Poor solubility negatively affects absorption and assay results and increases the development cost. Therefore, various quantitative structure-property relationship (QSPR) models have been established for the prediction of aqueous solubility using numerous molecular descriptors. Since the composition of gastrointestinal fluids is not taken into account, aqueous solubility alone cannot be considered an optimal model for predicting solubility. Yalkowsky and Jain developed an in silico model called the "general solubility equation" to predict aqueous solubility with good accuracy. Lipophilicity helps a drug dissolve in the lipid phase and thus pass through the bilayer lipid membranes of the gastrointestinal tract; it can be predicted by descriptors such as LogP (octanol/water partition coefficient), LogD (distribution coefficient), and the chromatographic hydrophobicity index.

 

Ionization state

Represented by pKa, it affects the solubility, lipophilicity, and permeability of a compound and thus becomes essential for good absorption of oral drug candidates. Hammett and Taft’s approach, semiempirical methods, and density functional theory are some of the methods which are used for the prediction of ionization state.

 

Permeability

Transport of a drug across the membrane by a passive method is referred to as permeability. Apart from the in vitro determination of partition coefficient and distribution coefficient, several cell line-based assays have been developed. Caco-2 (human colorectal carcinoma) cell line can be modeled to predict intestinal permeability of drugs. This model has been utilized in the early stages of drug discovery for ranking compounds based on absorption and permeability. Similarly, MDCK (Madin-Darby canine kidney) cell-based assay can also be modeled to predict permeability as well as drug-receptor interactions.

 

Blood-brain barrier

Both the aqueous solubility and the lipophilicity of a compound determine its capability to penetrate the blood-brain barrier by passive diffusion. This is measured by the parameter LogBB, the logarithm of the ratio of the concentration of drug in brain to the concentration of drug in blood. Currently, several models have been developed for the accurate prediction of LogBB using various machine learning methods.

 

Distribution

Volume of distribution can be predicted by various in silico models, which correlates the lipophilicity and solubility descriptors with free and bound fraction of drug with plasma proteins. Amo et al. have established a model to estimate the volume of distribution whose accuracy was comparable with commercial counterpart Volsurf+.

 

Metabolism

Various in silico models can predict the site of metabolism along with its substrate nature against a specific metabolic enzyme. A vast dataset of diverse chemicals can be taken to generate models and then converted into online prediction tools for determining pharmacokinetic parameters related to metabolism, e.g., fast metabolizer, SMARTCyp.

 

Excretion

Kusama et al. have developed a chemoinformatic-based model based on molecular weight, lipophilicity, charge, and protein-bound fraction in plasma. It helped in the prediction of major clearance pathways of 141 drugs with good accuracy.

 

However, lack of a larger experimental dataset is a major obstacle in the development of more accurate ADME models. Yet, computational chemists strive to develop good predictive ADME models to assist the drug discovery project.

 

ADMET screening

ADMET is the short form for absorption, distribution, metabolism, elimination, and toxicity. Over the past two and a half decades of drug design, in silico ADMET analysis has become a very useful and cost-effective tool, since poor ADMET properties are responsible for the failure of 50% or more of drug candidates in the clinical stages. ADMET analysis of drugs is principally focused on computational pharmacokinetics and toxicity modeling. Conventionally, ADMET tools were applied at the end of the drug development pipeline, but in recent times they have been applied at an early stage of development, owing to the easy accessibility of in silico ADMET tools that readily discard molecules with poor ADMET properties early on, leading to a substantially more cost-effective drug development process. Some of the commonly available tools for monitoring ADMET characteristics are ADMETlab, admetSAR, SwissADME, PreADMET, eADMET, and Tripod.

 

ADMETlab

It is a Web-based tool that relies on a database assembled from existing ADMET data, together with several basic endpoints associated with physicochemical profiles, to facilitate ADMET evaluation. There are four main components that allow users to properly evaluate ADMET properties:

(1) Assessment of drug-likeness profile using rules of five and one prediction model.

(2) Prediction of ADMET properties using 31 endpoints including,

·       3 on basic property

·       6 on absorption

·       3 on distribution

·       10 on metabolism

·       2 on elimination

·       7 on toxicity

(3) Systematic ADMET evaluation for a single chemical entity; and

(4) Similarity searching and comparing against 288,967 entries in the ADMET database.

ADMETlab is built on Python's Django framework and is freely available online at http://admet.scbdd.com/.

 

admetSAR

It is an open-access tool containing an ADMET-property database that is constantly updated by collecting, curating, and managing data from the published literature. admetSAR contains more than 210,000 ADMET-annotated data points for an exclusive collection of over 96,000 compounds, covering 45 different ADMET-related properties. The database can be accessed through a user-friendly interface using a CAS registry number, common name, or structure similarity to query a specific chemical's ADMET profile. To predict the ADMET characteristics of novel chemicals with high accuracy, the database includes

·       22 qualitative classification and

·       5 quantitative regression models

To safeguard its utility and quality, the database is updated every month or quarter with the addition of high-quality published data associated with toxicity. admetSAR is available free of cost from the Web site: http://www.admetexp.org.

 

SwissADME

It is a freely available Web tool that allows prediction of the pharmacokinetic profile, drug-likeness, and medicinal chemistry suitability of a compound; it also contains in-house methods such as BOILED-Egg, iLOGP, and the Bioavailability Radar. It is easily accessible from the login-free Web site http://www.swissadme.ch. Compared with other freely available Web-based tools for ADME and pharmacokinetics screening, such as pk-CSM and admetSAR, the strong points of this tool are:

·       Support for different input methods.

·       Ability to compute several molecules at once.

·       Options to display, save, and share the results for each individual molecule.

 

Toxicity prediction

During drug discovery and development, candidates are evaluated against diverse parameters to identify molecules with the best chance of becoming clinically useful compounds and to direct the selection and promotion of chemicals for synthesis and testing. It is important that the molecules display high biological activity along with low toxicity. Since more than 50% of candidates fail during drug development owing to toxicity, a battery of toxicity screens has been implemented in most pharmaceutical establishments to counter such failures, with the aim of discarding, already in the discovery phase, compounds that are likely to fail further down the line. Some of the available tools for toxicity screening of compounds are pk-CSM and PreADMET.

 

pk-CSM

pk-CSM is a Web-based, freely available tool for the analysis and optimization of pharmacokinetic and toxicity properties. It assists medicinal chemists to discover the balance between safety, effectiveness, and pharmacokinetic properties. The 30 inbuilt predictors are divided into five major classes:

·       predictors based on absorption, which contain seven predictors

·       predictors based on distribution, which comprise four predictors

·       predictors based on metabolism, which comprise seven predictors

·       predictors based on excretion, which comprise two predictors, and

·       predictors based on toxicity, which comprise 10 predictors

The tool predicts small molecules' pharmacokinetic and toxicity profiles. The 10 toxicity predictors in pk-CSM, which predict the toxicity profiles of compounds, are:

·       Maximum recommended tolerated dose (MRTD)

·       Oral rat acute toxicity (LD50)

·       Oral rat chronic toxicity—lowest observed adverse effect (LOAEL)

·       T. pyriformis toxicity

·       Fathead minnow toxicity (LC50)

·       hERG I inhibitor

·       hERG II inhibitor

·       AMES toxicity

·       Hepatotoxicity

·       Skin sensitization.

pk-CSM is freely available online from the Web site: http://structure.bioc.cam.ac.uk/pkcsm.

 

PreADMET

PreADMET is a Web-based tool for the analysis of ADME and toxicity data. The functions of the tool can be divided into four parts:

(i) Molecular descriptors calculation;

(ii) Drug-likeness prediction considering well-known rules;

(iii) ADME prediction; and

(iv) Toxicity prediction.

In this tool, toxicity prediction of a compound is carried out by two predictors: Ames test and rodent carcinogenicity.

 

Ames test

The Ames test predictor assesses the mutagenicity of a compound, while the rodent carcinogenicity predictor estimates a molecule's toxicity profile by comparison with NTP (National Toxicology Program) and US FDA data, which are the outcomes of two-year in vivo carcinogenicity tests in mice and rats.

 

 

Molecular dynamics

Molecular dynamics can be defined as the set of computational protocols that simulate the conformational variations experienced by a molecule under the action of the forces acting in the medium. The procedure applies Newton's equations to the internal coordinates of the molecule over a given time frame, in order to follow the variations in the molecule's internal degrees of freedom; the differential form of Newton's classical equation, Fi = mi·ai, is integrated for each atom i. The error inherent to the method becomes more pronounced the longer the time step used, which is typically on the order of 0.5 to 1 femtosecond (1 fs = 10^-15 s). In general, we start with a minimized structure and run the dynamics, recording the resulting structure from time to time and minimizing each of these conformations again, so that, in the end, we select the conformation whose energy no longer varies.

 

Molecular dynamics (MD) Process

During molecular interactions, the receptor, and especially its binding site, can undergo conformational changes, which affect the binding energy as well as the stabilization of the ligand-receptor complex. Receptor flexibility has often been overlooked by docking methods to gain speed at the expense of accuracy. Molecular dynamics has been successfully employed for simulating ligand-receptor binding, for conformational sampling, and for accurate prediction of the energetics of the system. To simulate the movement of each and every atom of the ligand and receptor, we need to quantify the velocity of, as well as the force acting on, each atom. The initial state provides the coordinates, energy, and velocity of each atom of the system. Applying the force on each atom over a very short span of time (on the order of femtoseconds), we can determine the acceleration from the Newtonian equation of motion, and subsequently the velocity and coordinates of a new position for each atom. This process is repeated to get new atomic positions with respect to the applied forces and gradually builds up a trajectory. Once the trajectory is defined, we can simulate the motion of an atom a short time into the future: the new position of the atom at a specific future time can be determined from its initial coordinates and velocity by a Taylor expansion. Then, energy minimization of these structures is carried out by molecular mechanics, which also helps in the study of conformation and energetics.
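The integration loop itself can be illustrated with a toy example. The sketch below applies the velocity Verlet scheme (one common integrator; production MD codes use this or closely related schemes) to a single particle in a harmonic potential, in arbitrary units; real simulations do the same bookkeeping for every atom using a molecular mechanics force field:

```python
# Toy velocity Verlet integration of Newton's equation F = m*a for one particle.
def force(x, k=1.0):
    return -k * x                      # harmonic spring standing in for the force field

def velocity_verlet(x, v, m=1.0, dt=0.001, n_steps=10000):
    a = force(x) / m
    trajectory = []
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt     # new position from the Taylor expansion
        a_new = force(x) / m                   # force (hence acceleration) at the new position
        v = v + 0.5 * (a + a_new) * dt         # velocity update with averaged acceleration
        a = a_new
        trajectory.append(x)
    return trajectory

traj = velocity_verlet(x=1.0, v=0.0)
print("final position:", traj[-1])
```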

 

Two important factors like the temperature of simulation and time steps are critical for carrying out MD simulations. Application of high temperature during simulation helps in overcoming the energy barriers, so that the conformations do not get trapped in local minima and can reach global minima. The selection of a small yet proper time step is essential for searching all the possible conformations. In short, molecular dynamics is referred to as solving the Newtonian equation of motion for all the atoms of the system as a function of time.

 

A common protocol for MD simulation involves the following steps:

Target structure: A 3D receptor structure of good resolution, determined by NMR or X-ray crystallography, is preferred for receptor-only simulations. It can be downloaded from the Protein Data Bank (http://www.pdb.org). In the case of ligand-receptor complex simulation, usually a docking output file, with the ligand present at the binding site, is used as input for the MD simulation.

 

Input structure: Topology parameters are generated for all the atoms present in the system which contains the necessary information about the atoms, bond connectivity, angles, their coordinates and velocities, etc. Required hydrogen atoms are added, and protonation states, terminal residues, and disulfide bridges are checked.

 

Setting the simulation environment: A simulation box (similar to the grid box in docking) is created around the binding site and is immersed into a periodic box of water molecules to solvate the protein. The water molecules are represented by various in silico models like simple point charge (SPC) or extended simple point charge (SPC/E), or three-point (TIP3P), or four-point (TIP4P). A proper dielectric constant and force field is selected along with the addition of required counterions to set the simulation environment.

 

Energy minimization: Before MD simulation, a short energy minimization (usually 500 iterations with the steepest descent algorithm) is carried out. It helps in relaxing the structure and also removes high-energy artifacts like broken hydrogen bonds which can distort the entire system.

 

Heating up the system: A high temperature of 1000 K is applied for 125 picoseconds to stabilize or equilibrate the core structure.

 

MD simulation: The simulation period is set according to the protein size and the availability of computational resources. The time interval at which the output coordinates of the system are recorded for further analysis is also set. For example, a simulation may be run for a period of 200 picoseconds with a sampling time of 1 picosecond; different conformations are then recorded at 1-picosecond intervals, confining the total number of structures to 200 frames.

 

Trajectory analysis: All conformers are retrieved and energy-minimized to rank the structures by their lowest-energy binding modes. The stability, as well as the structural integrity, of the system is determined by measuring the root-mean-square deviation of all heavy atoms with respect to the parent structure. The free energy of binding can also be calculated to compare the stability of the complex before and after the simulation.

 

Molecular dynamics simulations are helpful in the identification of cryptic binding sites, allosteric binding sites, binding pose of ligand, and accurate estimation of binding affinity. MD simulations have also been used in virtual screening for allowing receptor flexibility during screening. This method is known as a relaxed complex scheme (RCS) where MD simulation is run on the receptor structure to obtain multiple conformations, with which the potential hit candidates can be docked. Therefore, every hit candidate is associated with a series of docking scores and can be ranked based on the average docking score over a receptor. The development of new force fields, conjugation of quantum mechanics, and upgraded computational resources have significantly improved the performance and applications of MD simulation.

 

Current Tools for MD Simulation

Several tools are available to investigate atomic-level changes in biomolecules using the MD simulation method. Some provide a graphical user interface, like Desmond, while others run from the command line, like GROMACS and AMBER. Some well-known and widely used tools for MD simulation are GROMACS, AMBER, Nanoscale MD (NAMD), and CHARMM-GUI. For running such MD simulations, increased hardware power and suitable software are essential components.

 

 

Recent Advances in Hardware to Run MD Simulation

Rapid development in computer hardware is a crucial part of MD simulation. Two factors impact trajectory analysis: first, long simulations produce gigabytes to terabytes of data to store; second, new rendering engines are required for visualization using the latest video chipsets. Owing to advances in computer hardware, simulations can now be performed from the nanosecond to the microsecond range with the help of GPUs (graphics processing units) configured with the molecular simulation suite. GPU cards are supplementing the CPU (central processing unit), are becoming commodity hardware, and play a crucial role in decreasing the time needed for MD simulation. CUDA (Compute Unified Device Architecture) is a parallel computing platform whose use on GPUs increases the number of cores available, allowing long simulations to be run in reasonable time. Thanks to the emergence of GPU-CUDA technology, powerful and massively parallel clusters have been developed, such as the special-purpose supercomputer Anton and Blue Waters, which are suited to running MD simulations of biomolecules on the microsecond-to-millisecond time scale. However, such resources are available to only a limited number of researchers. To close the time-scale gap, there is an urgent need to develop newer algorithms that allow enhanced sampling in defined areas of conformational space and access to long time-scale events on available hardware. The purpose of such algorithms is to collect sufficient sampling to recover the Boltzmann distribution of the diverse conformational states, enabling accurate calculation of the thermodynamic and kinetic properties of the system. By modifying the Hamiltonian to add a bias potential, several approaches have been developed, such as hyperdynamics, local elevation, and accelerated MD. Hyperdynamics simulation requires identification of the transition state, which is not necessary for classical MD simulation. Several tools are available to perform MD simulation studies with CPUs or GPUs. A few of the widely used tools are described briefly below.

 

GROMACS

GROMACS is the most widely used software for MD simulation. It is freely available, and a brief tutorial can be accessed at this link (http://www.mdtutorials.com/gmx/) (Pronk et al. 2013). In the GROMACS simulation kit, MD simulation can be performed at various temperatures and pH values. Several commands are available to perform distinct functions and calculate specific structural parameters. GROMACS can read only the 20 natural amino acids, i.e., non-standard amino acids are not read by the GROMACS algorithms. There are also force field limitations; for instance, the GROMOS and AMBER force fields cannot read nicked DNA, although the same force fields can read the same DNA when it is not nicked. The brief methodology for MD simulation using GROMACS is shown in Fig.

 

To start, the user creates a box and fills it with solvent (water); the solvent model depends on the force field. After placing the protein in the defined box in the solvent, the charge of the system is neutralized by the addition of Na+ or Cl- ions; this is followed by minimization of the system using the steepest descent method. Then, an NVT (constant Number of particles, Volume, and Temperature) simulation is run to maintain the volume and temperature of the defined system; the temperature of the system is raised from 0 K until it attains the desired temperature set by the user. After that, an NPT (constant Number of particles, Pressure, and Temperature) simulation is run to maintain the pressure of the defined system.

 

Several parameters are set through the .mdp file. Finally, the MD simulation is performed, providing the coordinates of each step in the form of a trajectory. The trajectory can be analyzed using various tools embedded in GROMACS, like gmx rms, gmx rmsf, gmx gyrate, and gmx hbond. These data can be plotted in an interactive form using GRACE (GRaphing, Advanced Computation and Exploration of data), a Linux-based software. For example, a water-embedded protein molecule placed in a box and visualized with VMD (Visual Molecular Dynamics) is shown in Fig.
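The preparation and minimization steps can also be driven from a script. In the sketch below, the file names, force-field and water-model choices, and .mdp parameter files are placeholders; the command order follows the widely used GROMACS tutorials and may need adjustment for a specific system or GROMACS version, and interactive prompts (such as the genion solvent-group selection) are not handled:

```python
# Chain a typical GROMACS preparation/minimization workflow via subprocess.
import subprocess

commands = [
    "gmx pdb2gmx -f protein.pdb -o processed.gro -p topol.top -ff oplsaa -water spce",
    "gmx editconf -f processed.gro -o boxed.gro -c -d 1.0 -bt cubic",
    "gmx solvate -cp boxed.gro -cs spc216.gro -o solvated.gro -p topol.top",
    "gmx grompp -f ions.mdp -c solvated.gro -p topol.top -o ions.tpr",
    "gmx genion -s ions.tpr -o neutral.gro -p topol.top -pname NA -nname CL -neutral",
    "gmx grompp -f minim.mdp -c neutral.gro -p topol.top -o em.tpr",
    "gmx mdrun -v -deffnm em",   # energy minimization; NVT, NPT, and production runs follow the same pattern
]

for cmd in commands:
    subprocess.run(cmd, shell=True, check=True)   # stop if any step fails
```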

 

AMBER

The AMBER simulation suite is a collection of programs used to carry out and analyze MD simulations of proteins, carbohydrates, and nucleic acids. The three main components of the AMBER tool are preparation, simulation, and analysis. Antechamber and LEaP are the main programs for the preparation of macromolecules. The Antechamber tool converts the input files into force field descriptor files, which are read by the LEaP program for molecular modeling. The LEaP program then creates the topology files and Amber coordinates, which are used in the MD simulation. The Sander program performs the MD simulation, fixing the temperature, pressure, and pH of the defined system. Lastly, the analysis is performed by the ptraj module, which calculates RMSD, RMSF, radius of gyration (Rg), H-bonds, and cross-correlation functions.
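The preparation and simulation stages can likewise be scripted. The sketch below writes a minimal LEaP input, builds the topology and coordinate files, and runs a short energy minimization with Sander; the ff14SB/TIP3P force field choice, the file names, and the minimization settings are illustrative assumptions rather than a fixed AMBER protocol.

import subprocess
from textwrap import dedent

# LEaP input: load the protein, solvate it, neutralize it, and write topology/coordinates.
tleap_in = dedent("""\
    source leaprc.protein.ff14SB
    source leaprc.water.tip3p
    mol = loadpdb protein.pdb
    solvatebox mol TIP3PBOX 10.0
    addions mol Na+ 0
    saveamberparm mol system.prmtop system.inpcrd
    quit
""")
with open("tleap.in", "w") as fh:
    fh.write(tleap_in)

# Minimal Sander input for a short energy minimization.
min_in = dedent("""\
    Energy minimization
     &cntrl
      imin=1, maxcyc=500, ncyc=250, cut=10.0,
     /
""")
with open("min.in", "w") as fh:
    fh.write(min_in)

# Run LEaP (preparation) and Sander (minimization).
subprocess.run(["tleap", "-f", "tleap.in"], check=True)
subprocess.run(["sander", "-O", "-i", "min.in", "-o", "min.out",
                "-p", "system.prmtop", "-c", "system.inpcrd", "-r", "min.rst"],
               check=True)

Trajectory analysis (RMSD, RMSF, Rg, hydrogen bonds) is then performed by feeding an analogous input script to the ptraj module.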

 

CHARMM-GUI

CHARMM-GUI is a simulation tool for the analysis of macromolecular dynamics and associated mechanical attributes. It performs standard MD simulations by using state-of-the-art algorithms for time stepping, long-range force calculation, and periodic images. Various analyses, such as energy minimization, crystal optimization, and normal mode analysis, can be performed using CHARMM.

 

NAMD

NAMD is used to perform simulations of very large biomolecular systems and is available free of charge. The source code, documentation, and a set of compiled binary files configured for various parallel computing platforms are freely available to the user. It supports massively parallel CUDA technology. NAMD can be used with the graphical user interface software VMD; simulations can be set up and analyzed using VMD as an interface. It is also compatible with AMBER and CHARMM.

 

Drug-likeness

Drug-likeness is a qualitative concept used in drug design. Molecules with the physicochemical and therapeutic features inherent to drugs are termed drug-like, and this property is known as drug-likeness. Drug-likeness analysis generates predictive models for optimizing pharmacokinetic properties, and it filters compound libraries to remove unlikely molecules from consideration. Neural network models categorized 83% of Comprehensive Medicinal Chemistry (CMC) database molecules and ~65% of MDL Drug Data Report (MDDR) molecules as drug-like; they also classified 73% of compounds from the Available Chemical Directory (ACD) as non-drug-like.

 

Drug-Likeness Analysis Tools: Online (Freeware) Tools

Molinspiration: An online tool for molecular property calculation and biological activity prediction. Molecular properties, namely molecular weight, log P, polar surface area (PSA), number of rotatable bonds, molar volume, and number of hydrogen bond donors and acceptors (HBDs and HBAs), can be predicted. It also predicts the molecular affinity for targets, namely GPCRs, ion channels, nuclear receptors, and kinases. Molecular structures can be manipulated and processed until optimal characteristics are obtained. Website: http://www.molinspiration.com/

 

Molsoft: It provides molecular property prediction, structure prediction, binding site prediction, drug target ranking, 2D to 3D conversion of molecules, data set clustering, QSAR model building, 3D pharmacophore construction and search, and molecular visualization services. Website: http://molsoft.com/mprop

 

Chemicalize.org: A free web-based drug-likeness tool from ChemAxon. It has tools for calculation, structure search, document search, and web viewing. Query structures can be submitted using the MarvinSketch Java applet. Molecular properties such as polarizability, log P, log D, and pKa can be calculated. Website: http://chemicalize.org

 

PASS (Prediction of Activity Spectra for Substances): A software application (available online and for download) to predict the biological activity spectra of drug-like molecules. It estimates the biological activity profile of virtual molecules based on their structure. It also predicts carcinogenicity, mutagenicity, teratogenicity, and embryotoxicity. Website: http://www.way2drug.com/PassOnline/predict.php

 

PreADMET: A web tool for drug-likeness analysis and absorption, distribution, metabolism, and excretion (ADME) predictions. It also supports toxicity prediction and molecular visualization. A commercial version (PreADMET 2.0) is also available. Website: http://preadmet-bmdrc.kr/druglikeness

 

ALOGPS 2.1: The Virtual Computational Chemistry Laboratory (VCCLAB) features ALOGPS 2.1, an interactive online predictor of molecular water solubility, log P, pKa, log D, log W, and log S values. It was developed on the basis of an associative neural network (ASNN). Website: www.vcclab.org/lab/alogps

Lipinski’s rule

The concept that physicochemical properties constrain the solubility and permeability of drugs was introduced during Lipinski's analysis of the Derwent World Drug Index. Lipinski et al. confirmed that orally administered drugs are expected to occupy parts of chemical space defined by a restricted range of molecular characteristics. The physicochemical criteria for Lipinski's rule are listed in the Table.

 

Physicochemical properties for Lipinski’s rule.

 

The term Lipinski's "rule of five" (RO5) originates from the fact that the cut-off values of the relevant molecular properties are all multiples of five. In this method, drug and non-drug molecules are distinguished by a definite range of physicochemical properties and molecular characteristics.
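As a worked example, the commonly quoted RO5 cut-offs (molecular weight ≤ 500 Da, log P ≤ 5, no more than 5 hydrogen bond donors, and no more than 10 hydrogen bond acceptors) can be checked with a few lines of RDKit. This is only a minimal sketch, and the input SMILES is illustrative.

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def lipinski_violations(smiles):
    """Count the Lipinski rule-of-five violations for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Could not parse SMILES: " + smiles)
    checks = [
        Descriptors.MolWt(mol) > 500,        # molecular weight above 500 Da
        Descriptors.MolLogP(mol) > 5,        # calculated log P above 5
        Lipinski.NumHDonors(mol) > 5,        # more than 5 H-bond donors
        Lipinski.NumHAcceptors(mol) > 10,    # more than 10 H-bond acceptors
    ]
    return sum(checks)

# Aspirin (illustrative input): expected to show no RO5 violations.
print(lipinski_violations("CC(=O)Oc1ccccc1C(=O)O"))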

 

Veber rule

Veber et al., in a paper published in 2002, proposed the following:

Irrespective of molecular weight, the important predictors of good oral bioavailability are reduced molecular flexibility, estimated by the number of rotatable bonds, and a low polar surface area or a low total hydrogen bond count (donors plus acceptors). In their extensive data set, when compounds with poor oral bioavailability were separated from those with satisfactory values, the frequently applied molecular weight cut-off of 500 did not by itself significantly separate the compounds. According to the rule, the criteria for compounds with good bioavailability are mentioned in the Table; a small computational check of these criteria is sketched after the table.

 

Veber rule’s criteria for compounds.
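The Veber criteria, commonly stated as no more than 10 rotatable bonds and a topological polar surface area of at most 140 Å² (or, alternatively, a total of at most 12 hydrogen bond donors and acceptors), can be checked in the same way as the RO5. The RDKit sketch below is a minimal illustration with an assumed input.

from rdkit import Chem
from rdkit.Chem import Descriptors

def passes_veber(smiles):
    """Check the Veber oral bioavailability criteria: rotatable bonds and TPSA."""
    mol = Chem.MolFromSmiles(smiles)
    rotatable_bonds = Descriptors.NumRotatableBonds(mol)   # molecular flexibility
    tpsa = Descriptors.TPSA(mol)                           # topological PSA in square angstroms
    return rotatable_bonds <= 10 and tpsa <= 140.0

# Aspirin (illustrative input): expected to pass both criteria.
print(passes_veber("CC(=O)Oc1ccccc1C(=O)O"))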

 

Teague Rule of Three (RO3): Teague et al. described the molecular features required of drugs. According to the rule (RO3), molecules with a molecular weight in the range of 100 to 350 Da and a clog P in the range of 1 to 3 are predicted to have better physicochemical properties; a small computational check is sketched below.
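Both RO3 criteria are calculated properties, so the window can be checked directly. In the minimal RDKit sketch below, clog P is approximated by the Crippen log P, and the input SMILES is illustrative.

from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen

def in_teague_ro3_window(smiles):
    """Check the Teague RO3 window: molecular weight 100-350 Da and clog P 1-3."""
    mol = Chem.MolFromSmiles(smiles)
    mol_weight = Descriptors.MolWt(mol)
    clogp = Crippen.MolLogP(mol)       # Crippen log P as a stand-in for clog P
    return 100 <= mol_weight <= 350 and 1 <= clogp <= 3

# Indole (illustrative fragment-like input).
print(in_teague_ro3_window("c1ccc2[nH]ccc2c1"))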

Oprea Rule of Three: Oprea conducted a study on the MDDR, CMC, Current Patents Fast-Alert, New Chemical Entities, and ACD databases to assess the parameters responsible for bioavailability. According to the study, the following molecular characteristics are essential for promoting drug bioavailability.

• Number of rings, > 3

• Number of rigid bonds, > 18

• Number of rotatable bonds, > 6

Norinder Rule of Two (RO2): The Norinder rule describes the molecular features required for crossing the blood-brain barrier (BBB). This rule is based on the nitrogen and oxygen atom count.

• Rule 1: The sum of nitrogen and oxygen atoms (N + O) should be less than five (< 5); this indicates better molecular BBB permeation characteristics.

• Rule 2: The calculated log P value should be more than 0 (> 0).

The value obtained by subtracting the sum of nitrogen and oxygen atoms from log P is also indicative of better BBB permeability; a small computational check of these criteria is sketched below. These rules cannot explain the pharmacodynamic nature of molecules. Peptidomimetics, transporter substrates, and natural products do not obey these rules, possibly due to transporter effects. Drug-likeness filters recognize compounds that resemble existing drugs; however, compounds of new structural classes cannot be identified in this way, which remains a major challenge.
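Both Norinder criteria reduce to an atom count plus a calculated log P, so they are easy to script. The minimal RDKit sketch below uses the Crippen log P as the calculated log P; the input SMILES is illustrative.

from rdkit import Chem
from rdkit.Chem import Crippen

def norinder_bbb_checks(smiles):
    """Apply the Norinder rule-of-two style checks for BBB permeation."""
    mol = Chem.MolFromSmiles(smiles)
    n_plus_o = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() in ("N", "O"))
    clogp = Crippen.MolLogP(mol)            # calculated log P
    rule1 = n_plus_o < 5                    # fewer than five N + O atoms
    rule2 = clogp > 0                       # positive calculated log P
    difference = clogp - n_plus_o           # log P - (N + O); > 0 also suggests permeation
    return rule1, rule2, difference

# Caffeine (illustrative input).
print(norinder_bbb_checks("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))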

 

Traffic lights: Lobel et al. modified the RO5 based on the rules generated by other teams (Veber, Ghose, Wenlock, Monika). These prioritized molecular parameters are known as traffic lights (TLs).

• Molecular weight, < 400 Da

• Log P, < 3

• Solubility at pH 6.5, < 50 mg/L

• Polar surface area (PSA), < 120 Å²

• Rotatable bonds, < 7

PhysicoChem scores are calculated to assess the in silico prediction correlations. The score ranges from 0 to 1, and a lower score indicates a good in silico correlation.

 

 

Conclusion

CADD has immensely helped medicinal chemistry researchers to bypass or accelerate multiple steps in drug design and discovery and to identify potent clinical candidates in a short period of time. CADD is very useful in critical steps such as hit-to-lead discovery and lead optimization; therefore, it paves the way for reductions in both time and cost. Building three-dimensional structures of ligand and protein, simulation, and prediction of binding interactions and energies would otherwise be very tedious and time-consuming tasks. Compared with conventional drug design and discovery, CADD has several advantages and is classified into SBDD and LBDD. Docking, molecular dynamics, and pharmacophore modeling are the essential steps in SBDD, while similarity searching, QSAR modeling, and pharmacophore modeling are part of LBDD. In the drug discovery paradigm, establishing drug-likeness with the assistance of Lipinski's rule of five, Veber's parameters, or the rule of three can be a key approach that identifies drug-like candidates in a reasonably quick timeline. Different in silico models predict various parameters of lead compounds, such as aqueous solubility, lipophilicity, ionization state, permeability, distribution, metabolism, and excretion. Hence, a detailed pharmacokinetic profile obtained from in silico methods facilitates a robust approach to drug design, discovery, and development. Nonetheless, many innovative techniques are available in medicinal chemistry, and further advances must be encouraged to reduce cost and time still further.

 

Fragment-based drug design

Fragment-based drug design (FBDD) is a biochemical and biophysical approach to detecting very small molecules, or fragments, that can bind to specific targets and help in developing drug leads. Generally, it starts with the screening and identification of low-molecular-weight compounds of low complexity that bind to a specified target. The fragments must be small in size to avoid unsuitable interactions; the identified fragments generally show identical binding affinity, which is advantageous for further optimization. Also, because the fragments are very small relative to the target binding site, strategies such as fragment linking, merging, and growing can be applied after identification to enhance their binding potential.