Bioinformatics, data science, python, R, statistics, papers observations, art, food, cute stuff
Text
A little about Anaconda
Anaconda environment for a specific Python version
At the time of writing, IPython and Jupyter Notebook didn't support Python 3.6 yet, so here are the commands to build an environment with Python 3.5 and use it from the notebooks
conda create -n python35 python=3.5 ipython ipykernel
source activate python35
python -m ipykernel install --user --name python35 --display-name "Python 3.5"
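After selecting the new kernel in a notebook, a quick check like the one below can confirm which interpreter is actually running. This is only a sketch using the standard library; the helper name `interpreter_summary` is made up for illustration.

```python
import sys


def interpreter_summary():
    """Return the running interpreter's major.minor version and its path."""
    version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)
    return {"version": version, "executable": sys.executable}


summary = interpreter_summary()
# With the python35 kernel selected, the version should read 3.5 and the
# executable path should point inside the python35 environment.
print("Python", summary["version"], "at", summary["executable"])
```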
credits:
create an environment
import error
pykernel
Text
How to create your BED file manually
Using UCSC table browser
About bed format (UCSC link, bedtools link)
BED files are used to define capture regions in the assembly and can be generated by hand (table browser) or automatically (plastid). These files are basically tab-separated text files whose extension has been changed to .bed.
This post uses information from a Biostars post
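As a minimal sketch of what a BED line contains, the snippet below parses one by hand. It assumes only the three required columns (chrom, chromStart, chromEnd) plus an optional name column; the function name `parse_bed_line` and the sample coordinates are made up for illustration. BED starts are 0-based and ends are exclusive, so the interval length is simply end minus start.

```python
def parse_bed_line(line):
    """Split one tab-separated BED line into a small dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # The name column is optional in BED, so fall back to None.
    name = fields[3] if len(fields) > 3 else None
    return {
        "chrom": chrom,
        "start": start,          # 0-based, inclusive
        "end": end,              # exclusive
        "name": name,
        "length": end - start,   # valid because of the half-open convention
    }


# Hypothetical region, not taken from any real annotation.
record = parse_bed_line("chr1\t100\t250\texample_region\n")
print(record)
```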
Download a BED file for the canonical transcripts (normally used as intervals for variant calling)
Assembly: Feb. 2009 (GRCh37/hg19);
Track: UCSC Genes;
Table: knownCanonical;
If you want specific genes, go to identifiers (names/accessions) and click paste or upload (e.g.: BRCA1, BRCA2, EGFR, DMD, CFTR); to select all genes, just skip this step;
Output format: selected fields from primary and related tables;
Click get output;
Select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript;
Select fields from hg19.kgXref: geneSymbol, refseq;
Click get output.
Now you have the canonical transcripts and their RefSeq IDs, which can be used to filter the positions at the exon level
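The downloaded table can then be turned into BED-style lines with a few lines of Python. This is a rough sketch: it assumes the output is tab-separated with a single header line whose columns look like `#hg19.knownCanonical.chrom`, and that the fields come in the order selected above. The function name and the sample row (coordinates, `GENE_A`, `uc000aaa.1`, `NM_000000`) are invented for illustration.

```python
import csv
import io


def read_canonical_table(handle):
    """Read the 'selected fields' output into a list of dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # "#hg19.knownCanonical.chrom" -> "chrom", "hg19.kgXref.refseq" -> "refseq"
    keys = [column.lstrip("#").split(".")[-1] for column in header]
    return [dict(zip(keys, row)) for row in reader if row]


# Hypothetical stand-in for the downloaded file.
sample = io.StringIO(
    "#hg19.knownCanonical.chrom\thg19.knownCanonical.chromStart\t"
    "hg19.knownCanonical.chromEnd\thg19.knownCanonical.transcript\t"
    "hg19.kgXref.geneSymbol\thg19.kgXref.refseq\n"
    "chr1\t1000\t5000\tuc000aaa.1\tGENE_A\tNM_000000\n"
)

records = read_canonical_table(sample)

# RefSeq IDs of the canonical transcripts, useful for filtering exons later.
canonical_refseq = {r["refseq"] for r in records if r["refseq"]}

# Four-column BED lines: chrom, start, end, gene symbol.
bed_lines = [
    "\t".join([r["chrom"], r["chromStart"], r["chromEnd"], r["geneSymbol"]])
    for r in records
]
print(bed_lines)
```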
Download a BED file for exons in specific genes or all genes (normally used for BAM coverage detection)
Assembly: Feb. 2009 (GRCh37/hg19);
Track: UCSC Genes or RefSeq Genes (preferable);
Table: knownGene (UCSC) or refGene (RefSeq);
If you want specific genes, go to identifiers (names/accessions) and click paste or upload (e.g.: BRCA1, BRCA2, EGFR, DMD, CFTR); to select all genes, just skip this step;
Output format: BED - browser extensible data;
Click get output;
Select Coding Exons (for exome sequencing, for example; you can also choose Exon plus splicing regions or other fields)
It will give you all CDS exons present in all possible transcripts for the genes that you selected, or for all genes
Now you can filter these exons down to the canonical transcripts that you downloaded in the first step
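That last filtering step could look roughly like this. The sketch assumes that the name column of the exon BED file starts with the transcript ID followed by `_cds_` (for example `NM_000000_cds_0_0_chr1_1001_f`); check your own download, because the naming and the presence of version suffixes may differ. The function names and all sample lines are hypothetical.

```python
def transcript_from_exon_name(name):
    """Extract the transcript ID from an exon name such as
    'NM_000000_cds_0_0_chr1_1001_f' (assumed naming scheme)."""
    return name.split("_cds_")[0]


def filter_canonical_exons(bed_lines, canonical_ids):
    """Keep only exon lines whose transcript ID is in canonical_ids."""
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # no name column, so nothing to match on
        if transcript_from_exon_name(fields[3]) in canonical_ids:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up exon lines for two transcripts of the same gene.
exons = [
    "chr1\t1000\t1200\tNM_000000_cds_0_0_chr1_1001_f\t0\t+",
    "chr1\t1000\t1150\tNM_999999_cds_0_0_chr1_1001_f\t0\t+",
]

kept = filter_canonical_exons(exons, {"NM_000000"})
print(kept)
```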
Create BED files with Python using the plastid library
Plastid example
Installation with conda:
conda install -c bioconda plastid
Text
Paper annotations #1
tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine
Intro
Each SNP record in dbSNP (Database for Short Genetic Variations) is assigned a stable and unique variant accession identifier (RSID), which is linked to aggregated information (associated gene, functional consequences and allele frequency).
The NHGRI-EBI GWAS Catalog is a collection of genome-wide association study results: genetic variants, observed across different individuals, that are associated with a trait [1].
For genomic variant information in cancer, COSMIC contains expert-curated data of somatic mutations [2].
CIViC is an open-access, open-source knowledgebase for expert-crowdsourced clinical interpretation of variants in cancer [3].
DisGeNET is a recent platform integrating information on gene-variation-disease associations from several public data sources and the literature [4].
"The first version of tmVar is a high-performance software for external evaluations comparing formats in the PubMed article and re-writing them in HGVS formats (e.g. p.Pro12Ala). However, HGVS names can still be ambiguous: one can often be linked to multiple RSIDs (e.g. rs767209585 and rs773973301 are both associated with p.Pro12Ala). Indeed, on average, one protein mutation in HGVS name maps to more than ten RSIDs".
Why not use the HGVS genomic nomenclature? HGVS isn't just the protein nomenclature; it also covers the gene, the genomic location and the protein location.
"in this work we first extended tmVar to automatically normalize the variant mentions and map them to standard dbSNP RS numbers."
Does it include variants not present in dbSNP that could be considered rare?
Using a human-annotated gold standard, they compared tmVar 2.0 against SETH, another automated tool for text-mining mutations [5], and reached an F-measure of nearly 90%.
about F1
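For reference, the F-measure (F1) is the harmonic mean of precision and recall. A minimal sketch of the calculation follows; the input values are arbitrary and are not numbers from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Arbitrary example values, chosen only to show the calculation.
example = f1_score(0.5, 1.0)
print(round(example, 3))
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a tool cannot score well by maximizing only precision or only recall.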
"Our analysis includes: (i) comparing the text-mined PMID-RSID pairs with annotated dbSNP data, (ii) analyzing variants curated in ClinVar and (iii) discovering novel connections between variants, gene and diseases"
"Our investigation revealed 161 178 missing RSID-PMID links in dbSNP and 41 889 RSIDs not found in ClinVar. Moreover, our results also include over 120 000 rare variants (MAF ≤ 0.01) in nearly 4000 genes across the genome which are presumed to be deleterious and are not frequently found in the general population."
MAF alone isn't enough to consider a variant pathogenic; maybe more information was taken into account
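To make the frequency cutoff concrete, here is a small sketch of a rare-variant filter, assuming the 0.01 threshold from the quote is inclusive. The variant records and the function name are hypothetical, and passing this filter says nothing about pathogenicity by itself.

```python
def rare_variants(variants, max_maf=0.01):
    """Keep variants with a known minor allele frequency at or below max_maf."""
    return [
        v for v in variants
        if v["maf"] is not None and v["maf"] <= max_maf
    ]


# Made-up records; "maf" is None when no population frequency is available.
variants = [
    {"id": "variant_a", "maf": 0.002},
    {"id": "variant_b", "maf": 0.15},
    {"id": "variant_c", "maf": None},
]

rare_ids = [v["id"] for v in rare_variants(variants)]
print(rare_ids)
```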
Materials and methods
"tmVar applies an ML approach to tag mutation mentions in free text, detecting terms that represent variants of multiple types (SNV, insertion, deletion, etc) and sequence context (genomic, transcript and protein) and returns its results in HGVS form".
"Before we performed normalization, we first built a comprehensive lexicon containing all possible mappings between variant mentions and RSIDs, harvested from three different sources: dbSNP, ClinVar and PubMed".
Two main strategies were used to find the corresponding RSID: pattern matching, '[Gene/Protein] ([DNAMutation] with [RSID])', and searching the lexicon for a list of candidate RSIDs. For disambiguation, they use global information from the entire article and/or variant-associated gene information, relying on GNormPlus, an end-to-end, open-source system that handles both gene mention and identifier detection.
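To illustrate the general idea only, here is a toy sketch of pulling RSIDs and protein-level HGVS mentions out of a sentence and looking up candidates in a lexicon. This is not tmVar's method (tmVar uses machine learning); the regular expressions are simplified assumptions that only cover forms like `rs123` and `p.Pro12Ala`, and the tiny lexicon reuses the example from the quote in the Intro.

```python
import re

# Simplified patterns, assumed for illustration only.
RSID_PATTERN = re.compile(r"\brs\d+\b")
PROTEIN_HGVS_PATTERN = re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b")


def find_mentions(text):
    """Return RSIDs and three-letter protein HGVS mentions found in text."""
    return {
        "rsids": RSID_PATTERN.findall(text),
        "protein_changes": PROTEIN_HGVS_PATTERN.findall(text),
    }


def candidate_rsids(mention, lexicon):
    """Look up every RSID the lexicon lists for a variant mention."""
    return sorted(lexicon.get(mention, ()))


# Toy lexicon: one HGVS name mapped to two RSIDs, as in the quoted example.
lexicon = {"p.Pro12Ala": {"rs767209585", "rs773973301"}}

text = "rs767209585 and rs773973301 are both associated with p.Pro12Ala"
mentions = find_mentions(text)
candidates = candidate_rsids("p.Pro12Ala", lexicon)
print(mentions)
print(candidates)
```

The lookup returns two candidates for a single mention, which is exactly the ambiguity that the disambiguation step (article-level context and gene information) has to resolve.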
The population frequency data come from 1000 Genomes Phase 3, ExAC, NHLBI GO ESP and gnomAD.
Results
"The tmVar RS results (62452 RS numbers in 9782 genes) were categorized using dbSNP and ClinVar annotations along multiple facets, including functional consequences (syn, non-syn, etc) based on RefSeq mRNA annotations, minor allele frequency (MAF), and clinical significance in order to prioritize their biological significance and assess their clinical impact".
Discussion
According to Table 4, OSIRIS had better results than tmVar 2.0, so OSIRIS could be used together with tmVar 2.0.
"our results could be used by other computational methods in bioinformatics research such as connecting genotypes with phenotypes and/or modeling gene-disease-variant relations [DisGeNeT][6]"
Photo

Plasmids, DNA art, Science art, watercolor print, science illustration, microbiology, bacteria, microbes, biology art, DNA, virus, giclee
Photo
I drew a big ol’ Steven Universe scramble, I really like how it turned out! If you come see me at Otakon I’m gonna be selling it as a print.
Steven Universe is a good show and I’m super glad kids have it! Some of the episodes are hit or miss for me ~*~AS A CRITIC~*~, but that’s every show, and when the writing is on point it’s the best cartoon airing (My favorite episodes have definitely been The Test and Keystone Motel, BUT, THERE’S A LOT OF GOOD ONES…). I hope we get more of it and more things like it down the line: longform serial storytelling with a unique world and aesthetic. OKAY, that’s all, see you later.
P.S. read my webcomic Paranatural if you haven’t already :^)