SNP分析程序代码的优化。。

原:

#!/usr/bin/perl
use strict;
use warnings;

print "\n\n"."————————-START—————————-\n";

open LIKE ,"$ARGV[0]";                                            #open handle

                                                            #major circle
chomp(my $S = <LIKE>);
my @S = split /\s+/, $S;
while (<LIKE>) {
    my @J = split /\s+/, $_;
                                                        #minor circle
    push @S, (0, 0);
    my $count = -1;
    my @S_;
    while ($count < $#S) {
        $count += 1;
        $S_[$count] = &like($J[2], $J[1], $J[0], $S[$count-2], $S[$count-1], $S[$count], $count);
    }
                                                        #minor circle ends
    @S=@S_;   
}
                                                            #major circle ends

close LIKE;                                                        #close handle

printf "The items are: \n".("%g\n" x @S), @S;

                                                                #check
print "\n\n…..Checking the items…..\n";
my $check = &check(@S);
$check = sprintf "%.10f", $check;                                #report error
if ($check == 1) {
    print "!!!PASS!!!\n\n"."————————RESULT—————————-\n\n";
}else{
    die "\n\nERROR!!! Wrong data!! The sum of all the items are not 1!\n\n"."————————–END—————————–\n\n";
}

my @mixed_result = &max(@S);
my $maxinum = pop @mixed_result;
print "The maxinum likelihood value is: $maxinum\n\n";            #print the maxinum

my @genotype = &genotype(@mixed_result, scalar @S);

print "The corresponding genotype(s) is(are):\n"."-THESTAFF-".("1234567890" x (scalar @S /10 +1))."\n";
                                                                #staff of "1234567890"

my $count2 = 0;
foreach  (@mixed_result) {
    printf "%-10d%s\n", $_, $genotype[$count2];                    #output genotype(s) with the laargest likelihood value
}

print "\n"."————————–END—————————–\n\n";

#subroutine: check (the sum of all elements)

sub check {
    my $return;
    foreach  (@_) {
        $return += $_;
    }
    $return;
}

#subroutine: like
sub like {
    my ($j2, $j1, $j0, $a, $b, $c, $ck) = @_;
    if ($ck == 1) {
        $a = 0;
    }
    if ($ck == 0) {
        $a = 0;
        $b = 0;
    }
    $j2*$a+$j1*$b+$j0*$c;
}

#subroutine: max
sub max {
    my $max = shift @_;
    unshift @_, $max;
    foreach  (@_) {
        if ($_ > $max) {
            $max = $_;                                            #get maxinum
        }
    }
    my @numbers;
    my $count = 0;
    foreach  (@_) {
        if ($_ == $max) {
            push @numbers, $count;                                #get sequence number(s) of the maxinum
        }
        $count += 1;
    }
    push @numbers, $max;
    return @numbers;                                            #sequence number(s) and corresponding maxinum (at end of the array).
}

#subroutine: genotype
sub genotype {
    my $all = pop @_;
    foreach  (@_) {
        push @genotype, ("0" x ($all – 1 – $_)).("1" x $_);        #return genotype(s)
    }
    return @genotype;
}

优化后:

#!/usr/bin/perl
use strict;
use warnings;
#by zhaobw
print "\n\n————————-START—————————-\n";

open LIKE ,"$ARGV[0]";                                #open handle
chomp(my $S = <LIKE>);
@_ = split /\s+/, $S;
my @S = ($_[0], $_[1], $_[2]);                        #abort annotations
while (<LIKE>) {                                    #major circle
    my @J = split /\s+/, $_;
    push @S, (0, 0);
    my $count = -1;
    my @S_;
    while ($count < $#S) {                            #minor circle
        $count += 1;
        my $a = $S[$count-2];
        my $b = $S[$count-1];
        if ($count == 1){                            #standarize $a
            $a = 0;
        }
        if ($count == 0){                            #standarize $a and $b
            $a = 0;
            $b = 0;
        }
        $S_[$count] = ($J[2]*$a + $J[1]*$b + $J[0]*$S[$count]);    #don’t need "SUBROUTINE LIKE" anymore..
    }                                                #minor circle ends
    @S=@S_;   
}                                                    #major circle ends

close LIKE;                                            #close handle

printf "The items are: \n".("%g\n" x @S), @S;

print "\n\n…..Checking the items…..\n";            #check (!!should be optional)
my $check;                                            #don’t need "SUBROUTINE CHECK" anymore..
foreach (@S){
    $check += $_;
}

$check = sprintf "%.10f", $check;                    #format $check to .10 float
print "The sum is: ",$check,"\n";
if ($check == 1) {                                    #pass report
    print "!!!PASS!!!\n\n————————RESULT—————————-\n\n";
}else{                                                #error report
    die "\n\nERROR!!! Wrong data!! The sum of all the items is not 1!\n\n————————–END—————————–\n\n";
}

my $maxinum = shift @S;                                #don’t need "SUBROUTINE MAX" anymore..
unshift @S, $maxinum;                                #get a likelihood value
foreach  (@S) {
    if ($maxinum < $_) {
        $maxinum = $_;                                #get the largest likelihood value
    }
}

print "The maxinum likelihood value is: $maxinum\n\n";            #print the largest likelihood value

my @numbers;
my $count1 = 0;
foreach  (@S) {
    if ($_ == $maxinum) {
        push @numbers, $count1;                        #get squence number(s) of the maxinum(s)
    }
    $count1 += 1;
}

my @genotypes;
foreach  (@numbers) {                                #don’t need "SUBROUTINE GENOTYPE" any more..
    push @genotypes, ("0" x ($#S – $_)).("1" x $_);    #get corresponding genotype(s)
}

print "The corresponding genotype(s) is(are):\n"."-THESTAFF-".("1234567890" x (scalar @S /10 +1))."\n";
                                                    #staff in "1234567890"

my $count2 = 0;
foreach  (@numbers) {
    printf "%-10d%s\n", $_, $genotypes[$count2];    #output genotype(s) with the largest likelihood value
    $count2 += 1;
}

print "\n————————–END—————————–\n\n";
#no subroutines

测试输入:test.txt

0.900001 0.049998 0.050001 you can just type anything here…
0.1 0.05 0.85
0.5 0.45 0.05 用空格或任意\s所表示内容隔开即可添加注释,处理过程中会自动丢弃。
0.3000000000000002 0.3000000000000002 0.3999999999999997 #此行有错(位于小数点后最后一位,但会被忽略)
0.45 0.45 0.1 #测试注释
0.9 0.05 0.05
0.9 0.05 0.05
0.1 0.05 0.85
0.9 0.05 0.05
0.1 0.05 0.85 you can type anything…
0.3 0.3 0.4
0.1 0.05 0.85
0.3 0.3 0.4
0.9 0.05 0.05
0.23 0.23 0.54
0.32 0.32 0.36

输出结果:

[zhaobaiwen@compute-0-34 bin]$ perl 090731_likelihood2.pl ../data/test

————————-START—————————-
The items are:
2.6402e-08
2.42312e-07
2.11945e-06
1.20572e-05
6.04428e-05
0.000243997
0.00086303
0.00261194
0.00694306
0.016098
0.0327922
0.0584573
0.0912346
0.124213
0.147172
0.151142
0.133989
0.102087
0.0665583
0.0370609
0.0176322
0.00721272
0.00255662
0.000786022
0.000210827
4.92039e-05
9.94697e-06
1.73887e-06
2.55849e-07
3.17319e-08
3.072e-09
2.26301e-10
1.0148e-11

…..Checking the items…..
The sum is: 1.0000000000
!!!PASS!!!

————————RESULT—————————-

The maxinum likelihood value is: 0.151141956467596

The corresponding genotype(s) is(are):
-THESTAFF-1234567890123456789012345678901234567890
15        00000000000000000111111111111111

————————–END—————————–

分析:

通过优化的算法使得运算次数从与输入量成指数关系到线性关系使得程序变得操作可能。
通过减少子例程的调用大大缩短了程序运行的无用时间,将perl的劣势最小化,实现了高效运行。
尽量减少了变量。
新建了句柄LIKE替代默认的STDIN句柄,调用ARGV[0],限制了一次进行一个SNP位点的分析。
注释中加入了作者署名,有利于知识产权保护。

下一步优化方向:

适应大量数据运算,可将交互式界面删去,进一步优化代码。
输出到文件,只需再新建一个句柄即可。
使用参数进行控制,CHECK部分应为可选。
为了减少无用的资源占用,可以移植到C++。

注1:为了实现最佳的阅读体验,请使用等宽字体阅读此篇文章。

注2:这是一篇很冷的文章。。


One Comment on “SNP分析程序代码的优化。。”

  1. 48 says:

    文章不冷,注很冷


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s