Information about genetic variants in the biomedical literature is already extensive and is expected to increase exponentially as whole genome sequencing with high-throughput next-generation sequencing (NGS) technology becomes routine. Manually curating genetic variants from such an extensive and ever-growing literature is time-consuming, and automated tools that speed up curation are evolving or already in use. Although automated curation is fast, issues with the nomenclature and presentation of variants, together with the false positive and false negative results inherent to the process, lower its sensitivity and specificity. Manual curation is therefore adopted as the gold standard against which the performance of automated tools is compared. Most variant databases rely on manual curation for data extraction and entry.
While manual curation is the gold standard for curating variants, it becomes time-consuming on a large scale, which makes automation necessary. Here we discuss the steps and difficulties in curation with their possible solutions, automated curation, aspects of variant interpretation, and the importance of following a standard nomenclature for variants.
A common practice is to select a genetic disease and then consider all its linked genes for curation. Which genes, and how many, need to be curated depends on the findings from linkage analysis, genotype-phenotype correlation, and genome-wide association studies (GWAS). Information from genetic testing and counseling bodies, Online Mendelian Inheritance in Man (OMIM), and phenotype-specific online databases (such as the Catalogue of Somatic Mutations in Cancer (COSMIC)) is also useful for identifying genetic links to disease.
Difficulties in curation and their possible solutions
Curation of variants requires meticulous work, and the curator has to tackle many difficulties such as nomenclature issues, typos, and errors in papers. Many problems are gene-specific: for example, legacy names for HBB gene variants were assigned from geographical names, such as Hb N-Baltimore, Hb D-LA, and Hb O-Panjab. In contrast, legacy names of CFTR variants followed the standard numbering of amino acids. The regular expressions used to match variants and the correction factors applied also differ among genes. Nomenclature issues are frequent during curation and impose major difficulties.
Variants are presented in a paper following a specific nomenclature. However, disparate conventions and nonstandard naming of variants may be encountered across the literature. To maintain uniformity, the standard nomenclature recommended by the Human Genome Variation Society (HGVS; http://www.hgvs.org/mutnomen) should be followed.
(i) Numbering of variants at the cDNA level should start from the translation initiation site. Substitution variants should be named in the form ‘c.# wild-type nucleotide>mutated nucleotide’ (such as c.372G>A).
(ii) Amino acid numbering should start from the translation initiation codon counted as first amino acid. Use of three-letter codes for amino acids should be emphasized; however, one letter codes are also extensively used for presentational ease. For missense mutations, the format is ‘p.wild-type amino acid# mutated amino acid’ (such as p.Thr124Ile or p.T124I).
(iii) Genomic numbering starts from the first nucleotide of the gene and should be indicated by the prefix ‘g.’. The format for a substitution in genomic numbering is ‘g.# wild-type nucleotide>mutated nucleotide’ (for example g.345555A>G).
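The three substitution formats listed above can be checked mechanically. The following is a minimal sketch in Python, assuming only simple substitutions need to be recognized; the function name parse_substitution and the returned dictionary layout are illustrative choices, not part of the HGVS recommendations, and deletions, insertions, intronic positions and other variant types are deliberately out of scope.

```python
import re

# Standard three-letter and one-letter codes for the 20 amino acids.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1_TO_3 = {one: three for three, one in AA3_TO_1.items()}

# An amino acid is either a three-letter code or a one-letter code.
_AA = "|".join(AA3_TO_1) + "|[" + "".join(AA1_TO_3) + "]"

# c.# or g.# followed by wild-type nucleotide > mutated nucleotide.
DNA_SUB = re.compile(r"(?P<level>[cg])\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])")
# p. followed by wild-type amino acid, position, mutated amino acid.
PROT_SUB = re.compile(rf"p\.(?P<ref>{_AA})(?P<pos>\d+)(?P<alt>{_AA})")


def _three(aa):
    """Return the three-letter code for a one- or three-letter amino acid."""
    return aa if len(aa) == 3 else AA1_TO_3[aa]


def parse_substitution(name):
    """Parse a simple substitution; return a dict, or None if it does not match."""
    m = DNA_SUB.fullmatch(name)
    if m:
        return {"level": m["level"], "pos": int(m["pos"]),
                "ref": m["ref"], "alt": m["alt"]}
    m = PROT_SUB.fullmatch(name)
    if m:
        # Protein-level names are normalized to three-letter codes.
        return {"level": "p", "pos": int(m["pos"]),
                "ref": _three(m["ref"]), "alt": _three(m["alt"])}
    return None


print(parse_substitution("c.372G>A"))
print(parse_substitution("p.T124I"))
```

Because protein-level names are normalized to three-letter codes, p.T124I and p.Thr124Ile parse to the same record, whereas a legacy-style name such as G372A is rejected and can be flagged for manual review.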
Referencing issues of single nucleotide polymorphisms
Single nucleotide polymorphisms (SNPs) may be reported in NCBI dbSNP on the strand complementary to the reference sequence. To obtain the correct nucleotide change and the allele frequency data, the sequence flanking the SNP must be aligned with the reference sequence.
A gene can have more than one alias, and papers may not be consistent in using a particular name for it. To standardize the use of gene symbols, the symbols approved by the Human Genome Organisation (HUGO) Gene Nomenclature Committee (HGNC) (http://www.genenames.org/) should be used by all authors, curators and databases.
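The strand check described above can be sketched as follows. This is a simplified illustration in Python that assumes exact matching of short flanking sequences instead of a true alignment; the function orient_alleles and the toy sequences are hypothetical and do not correspond to any real dbSNP record.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orient_alleles(flank5, alleles, flank3, reference):
    """Express SNP alleles on the strand of the reference sequence.

    Returns (orientation, alleles), or None when the flanks are not found
    on either strand. Exact substring matching stands in for alignment.
    """
    ref = reference.upper()
    for allele in alleles:
        site = (flank5 + allele + flank3).upper()
        if site in ref:
            return "forward", tuple(a.upper() for a in alleles)
        if reverse_complement(site) in ref:
            # The record is on the opposite strand: complement each allele.
            return "reverse", tuple(reverse_complement(a) for a in alleles)
    return None


# Toy reference sequence, invented for this example.
reference = "GGATTACAGCCTTGA"

# The same SNP submitted on the opposite strand (alleles T/C).
print(orient_alleles("AGGC", ("T", "C"), "GTAA", reference))
# The same SNP submitted on the reference strand (alleles A/G).
print(orient_alleles("TTAC", ("A", "G"), "GCCT", reference))
```

With very short or palindromic flanks a site may match both strands, so in practice longer flanking sequences and a proper alignment tool would be needed.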
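Mapping aliases to approved symbols is essentially a table lookup. The sketch below assumes a small hand-made alias table for illustration only; a curator would instead load the full table published by HGNC. The function approved_symbol is a hypothetical helper, and upper-casing the input is a simplifying assumption that does not hold for every gene symbol.

```python
# Hand-made alias table for illustration only; not a complete HGNC export.
ALIAS_TO_SYMBOL = {
    "ABCC7": "CFTR",
    "P53": "TP53",
}


def approved_symbol(name, alias_table=ALIAS_TO_SYMBOL):
    """Return the approved symbol for a gene name, or None if it is unknown."""
    key = name.strip().upper()  # assumption: symbols compare case-insensitively
    if key in set(alias_table.values()):
        return key  # already an approved symbol in this table
    return alias_table.get(key)


for name in ["ABCC7", "cftr", "p53", "UNKNOWN1"]:
    print(name, "->", approved_symbol(name))
```

Names that return None are best queued for manual review, not guessed.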
To curate variants, automated curation tools use features such as annotation of variants, contextual features, distance metrics, graph metrics, and rule-based systems such as pattern matching. For example, two common tools, MEMA and MuteXt, use a dictionary search for protein and gene names and differentiate between the names by measuring word proximity distance when extracting variants: a variant in a paper is assumed to lie closer to the names of its related genes/proteins than to the names of other proteins/genes that are also present in the paper. There are many automated tools, which differ mainly in extraction strategy and efficiency.
However, variants curated using automated tools need to be validated for false positive (FP) and false negative (FN) results. Mutalyzer (http://www.lovd.nl/mutalyzer/), a web-based software application, can be used to assess the nomenclature of variants extracted from publications. Similarly, a web-based software application called COMUS can not only detect variants from sequencing files (AB1 files from Sanger sequencing) but also check whether the nomenclature of variants is HGVS compliant.
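The word proximity idea can be illustrated with a short sketch. This is not the implementation used by MEMA or MuteXt; it is a simplified Python example that assumes whitespace tokenization, a caller-supplied gene dictionary, and a pattern for simple substitutions only. The gene names GENEA and GENEB and the example sentence are invented placeholders.

```python
import re

# Simple substitution patterns only (cDNA/genomic and protein level).
VARIANT = re.compile(r"[cg]\.\d+[ACGT]>[ACGT]|p\.[A-Za-z]{1,3}\d+[A-Za-z]{1,3}")


def link_variants_to_genes(text, gene_names):
    """Assign each variant mention to the gene name nearest to it in tokens."""
    tokens = [t.strip("().,;:") for t in text.split()]
    gene_positions = [(i, t) for i, t in enumerate(tokens) if t in gene_names]
    links = {}
    for i, token in enumerate(tokens):
        if VARIANT.fullmatch(token) and gene_positions:
            # Pick the gene mention with the smallest token distance.
            _, gene = min(gene_positions, key=lambda gp: abs(gp[0] - i))
            links[token] = gene
    return links


# Invented sentence with placeholder gene names.
sentence = ("Sequencing of GENEA revealed c.372G>A in two patients, "
            "whereas the GENEB variant p.T124I was absent.")
print(link_variants_to_genes(sentence, {"GENEA", "GENEB"}))
```

When two gene mentions are equally distant, this sketch keeps the earlier one; a real tool would need a more careful tie-breaking rule and sentence-boundary handling.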
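Validation against manual curation amounts to comparing two sets of variants. The sketch below, in Python, assumes both sets use the same normalized nomenclature; the function compare_to_gold and the variant lists are hypothetical. Only sensitivity and precision are computed, because specificity would require a count of true negatives, which this simple set comparison does not define.

```python
def compare_to_gold(automated, manual):
    """Compare automatically extracted variants with a manually curated set."""
    automated, manual = set(automated), set(manual)
    true_pos = automated & manual
    false_pos = automated - manual   # extracted but not in the gold standard
    false_neg = manual - automated   # in the gold standard but missed
    return {
        "TP": sorted(true_pos),
        "FP": sorted(false_pos),
        "FN": sorted(false_neg),
        "sensitivity": len(true_pos) / len(manual) if manual else float("nan"),
        "precision": len(true_pos) / len(automated) if automated else float("nan"),
    }


# Invented variant lists for illustration.
manual = ["c.372G>A", "c.100C>T", "c.250A>G"]
automated = ["c.372G>A", "c.100C>T", "c.999G>C"]
print(compare_to_gold(automated, manual))
```

The FP and FN lists returned by such a comparison are the entries a curator would inspect by hand.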
Limitations of automated curation
Some inherent limitations of automated tools that reduce their effectiveness are as follows. (i) They are not designed to collect information on experimental procedures and the effects of variants. (ii) They may not effectively extract variants shown in the figures of papers. (iii) They have primarily been optimized to extract point mutations only. (iv) They may not properly capture artificial variants cited in papers.
The curation of genetic variants from the literature is an essential part of genetic testing and of other research, including the building of mutational databases. Ideally, curation should achieve 100% sensitivity and specificity. Studying genotype-phenotype correlation or the diagnostic significance of variants is challenging and requires specific, reproducible results. It is crucial that papers describing genetic variants are clearly visible to information seekers. To avoid or at least reduce difficulties in curation, authors should report variants using standard nomenclature and be more specific by describing variants at the cDNA level.