awk or other bioinformatics tools to filter vcf

I am trying to filter some lines in a vcf file, here is an example of lines:

1   10505   rs548419688 A   T   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1   10506   rs568405545 C   G   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1   10511   rs534229142 G   A   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9869;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1   10539   rs537182016 C   A   100 PASS    AC=3;AF=0.000599042;AN=5008;NS=2504;DP=9203;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.001;AA=.|||;VT=SNP
1   10542   rs572818783 C   T   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9007;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP

Say I want to extract the lines with AMR_AF larger than 0.5. I couldn't figure out how to do this with Awk regular expressions. I also tried vcftools, but that didn't work.

Could you please try the following.

awk 'match($0,/AMR_AF=[0-9]+\.[0-9]+|AMR_AF=[0-9]+/) && substr($0,RSTART+7,RLENGTH-7)+0>0.5'  Input_file

Explanation: awk's match function searches the line for the regex AMR_AF=digits.digits OR AMR_AF=digits. Whenever the regex matches, it sets the RSTART variable (where the match begins) and the RLENGTH variable (how long the match is). The && (AND) condition then takes the substring that starts at RSTART+7, just past "AMR_AF=", and is RLENGTH-7 characters long, which is the number itself. Adding 0 makes awk compare it as a number rather than as a string; if it is greater than 0.5, the line is printed.
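To make RSTART and RLENGTH a little more concrete, here is a minimal sketch that prints both variables together with the extracted value. The input line is a shortened, made-up version of one of the lines in the question, and the regex is written as AMR_AF=[0-9]+(\.[0-9]+)? which is meant to match the same strings as the alternation above.

```shell
# Hypothetical, shortened input line; only the INFO column matters here.
line='1 10511 rs534229142 G A 100 PASS AC=1;AMR_AF=0.0014;AFR_AF=0'

result=$(printf '%s\n' "$line" | awk '
  match($0, /AMR_AF=[0-9]+(\.[0-9]+)?/) {
    # "AMR_AF=" is 7 characters long, so skip it to reach the number.
    value = substr($0, RSTART + 7, RLENGTH - 7)
    print RSTART, RLENGTH, value
  }')

# Match position, match length, and the extracted value.
echo "$result"
```

The third item printed is the text that gets compared against the threshold.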

You can split the line on the field you choose and examine whether the numeric value of the element just after the split is larger than your threshold.

In some more detail, splitting the input yes,foo=2,bar=0.23,baz=1 on ,bar= will yield an array containing yes,foo=2 and 0.23,baz=1. In Awk, if you add 0 to the second element, it will convert as much as it can from the beginning of the value into a number, so a comparison against 0.2 is then numeric. (Without the + 0, a value with trailing text like this one is compared as a string.)

Thus

awk '{ split($0, x, /[\t;]AMR_AF=/) } x[2]+0>0.5' file.vcf

should do what you want. We split the line into x and examine the numeric value of x[2].
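As a small sketch of the splitting step on its own, using the toy input from the explanation (the variable names n, x and out are just for illustration):

```shell
# Split the toy input on ",bar=" and show what ends up in the array.
out=$(printf 'yes,foo=2,bar=0.23,baz=1\n' | awk '{
  n = split($0, x, /,bar=/)   # n is the number of pieces
  print n
  print x[1]
  print x[2]
  print x[2] + 0              # leading digits converted to a number
}')

echo "$out"
```

This prints 2, then yes,foo=2, then 0.23,baz=1, and finally 0.23, which is the value a numeric threshold test would see.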

The [\t;] in the regex allows for either a tab or a semicolon before the field's name; to be perfectly general, perhaps you should even use (^|[\t;]) to also permit the match to happen at beginning of line.

If you want to parametrize this, maybe try

awk -v field="AMR_AF" -v thres=0.5 '{ split($0, x, "(^|[\t;])" field "=") } x[2]+0>thres' file.vcf

Recall that Awk processes the script for each input line from top to bottom, where each script statement has the form

[ condition ] [ { action } ]

As the square brackets indicate, both parts are optional -- if condition is missing, the action is taken unconditionally; if action is missing, it defaults to { print $0 }. So our script will first unconditionally split the line, then conditionally print it if x[2] is larger than the threshold.
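A tiny sketch of the three forms (action only, condition only, and both), run on made-up three-line input rather than a VCF:

```shell
out=$(printf 'a 1\nb 7\nc 3\n' | awk '
  { n++ }            # action only: runs for every input line
  $2 > 2             # condition only: the default action prints the line
  END { print n }    # special END condition: runs once after the last line
')

echo "$out"
```

This prints the lines b 7 and c 3, followed by the line count 3.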

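Awk can also split on a multi-character field separator, which it treats as a regular expression, so you could use -F '[\t;]AMR_AF=' too. Here $2 still carries the rest of the line after the number, so add 0 to force a numeric comparison.

awk -F '[\t;]AMR_AF=' '$2+0>0.5' file.vcf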

Using bcftools:

bcftools view -i 'INFO/AMR_AF > 0.5' myFile.vcf

See the bcftools manual for more options.

Standard Linux tools such as awk ( Aho et al., 1987) or specific tools are widely used to handle those transformations. For VCF files, for instance, a large number of tools such as bioawk ( https://github.com/lh3/bioawk/ ), bcftools ( Danecek and McCarthy, 2017 ), SnpSift ( Cingolani et al., 2012 ), Genome Analysis Toolkit (GATK) ( McKenna et al., 2010) and hail ( https://hail.is/) (C.Seed et al., manuscript in preparation) can be used to select specific variants.

Comments
  • Welcome to SO. It's good that you let us know you tried a few things; please add those efforts to your question too.
  • Also, please be clear about which occurrence of the string you want to check for, as it is not clear from your question.
  • The vcf tag is for a calendar file format; surely this is something else?
  • Try awk '{ split($0, x, /[\t;]AMR_AF=/) } x[2]>0.5' file.vcf
  • There are no lines where AMR_AF is larger than 0.5 in your example.
  • Thank you very much for your speedy response! Just a quick question: what if I want to extract the value between AMR_AF= and ;AFR_AF and print the numeric value, using a regular expression?
  • That should be easy to figure out; split twice. You can also use the match() logic from RavinderSingh13's answer and calculate offsets from RSTART and RLENGTH to figure out the indices to extract the substring but I find it rather cumbersome.
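One way the "split twice" idea from the last comment might look, as a rough sketch: it assumes the value ends at the next semicolon (which covers the ;AFR_AF case from the question), and the input line is a shortened, made-up example.

```shell
# Hypothetical, shortened input line.
line='1 10511 rs534229142 G A 100 PASS AC=1;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0'

value=$(printf '%s\n' "$line" | awk '
  split($0, x, /[\t;]AMR_AF=/) > 1 {   # more than one piece: the field is present
    split(x[2], y, ";")                # cut the remainder at the next semicolon
    print y[1]
  }')

echo "$value"
```

For this line it prints 0.0014; lines without an AMR_AF field print nothing.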