www.cab.zju.edu.cn/cab/xueyuanxiashubumen/nx/bioinplant.htm 《生物信息学札记》 樊龙江

A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value from zero indicates the strength of the relationship. A value near zero indicates no linear relationship between the variables.

Covariation (in sequences) (共变)
Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.

Coverage (or depth) (覆盖率/厚度)
The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a "high-quality base" is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Database (数据库)
A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.

Dendrogram
A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

Depth (厚度)
See Coverage.

Dirichlet mixtures
Defined as the conjugate prior of a multinomial distribution. One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).

Distance in sequence analysis (序列距离)
The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.

DNA Sequencing (DNA 测序)
The experimental process of determining the nucleotide sequence of a region of DNA.
This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its own advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.

Domain (功能域)
A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

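The Coverage entry above equates an accuracy of at least 99% with a PHRED score of at least 20. A minimal sketch of that relationship follows, assuming the standard definition Q = -10 log10(P), where P is the probability that a base call is wrong. The function names and the list-of-lists representation of reads are hypothetical, chosen only for illustration.

```python
import math


def phred_to_error_probability(q):
    """Error probability P implied by PHRED score Q, using P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """PHRED score Q implied by error probability P, using Q = -10 log10(P)."""
    return -10 * math.log10(p)


def mean_coverage(read_qualities, region_length, min_q=20):
    """Average number of high-quality bases per position of the region.

    read_qualities is assumed to be a list of reads, each read being a
    list of per-base PHRED scores. A base counts as high quality when
    its score is at least min_q (20 corresponds to 99% accuracy).
    """
    high_quality_bases = sum(
        1 for read in read_qualities for q in read if q >= min_q
    )
    return high_quality_bases / region_length
```

For example, `phred_to_error_probability(20)` gives an error probability of 0.01, that is, 99% accuracy. The coverage function only counts bases; it does not check where each read aligns, so it estimates average depth rather than depth at any particular position.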
Dot matrix (点标矩阵图)
Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment. The matrix may be filtered to reveal the most-alike regions by requiring a minimal threshold number of matches within a sequence window.

Draft genome sequence (基因组序列草图)
The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

DUST (一种低复杂性区段过滤程序)
A program for filtering low-complexity regions from nucleic acid sequences.

Dynamic programming (动态规划法)
A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.

EMBL (欧洲分子生物学实验室,EMBL 数据库是主要公共核酸序列数据库之一)
European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.

EMBnet (欧洲分子生物学网络)
European Molecular Biology Network (http://www.embnet.org/). Established in 1988, it provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.

Entropy (熵)
From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.

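The Entropy entry does not give a formula. The sketch below assumes the usual Shannon definition, H = sum over symbols of p * log2(1/p), measured in bits, and applies it to the characters of a string such as one column of a sequence alignment. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the symbol frequencies in `symbols`.

    Each distinct symbol contributes p * log2(1 / p), where p is its
    observed relative frequency. An empty input yields 0.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return sum(
        (n / total) * math.log2(total / n) for n in counts.values()
    )
```

Under this definition, a column in which all four nucleotides are equally frequent (for example "ACGT") has an entropy of 2 bits, while a fully conserved column (for example "AAAA") has an entropy of 0, matching the statement that more variation means higher entropy.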
Erdős and Rényi law
In a series of tosses of a "fair" coin, the length of the longest run of heads that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes. This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.

EST (表达序列标签的缩写)
See Expressed Sequence Tag.

Expect value (E) (E值)
E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as was done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.

Expectation maximization (sequence analysis)
An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment. This pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.

Exon (外显子)
Coding region of DNA. See CDS.

Expressed Sequence Tag (EST) (表达序列标签)
Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

FASTA (一种主要数据库搜索程序)
The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word".
(Pearson and Lipman)

Extreme value distribution (极值分布)
Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.

False negative (假阴性)
A negative data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.

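As a rough illustration of the Extreme value distribution entry above, the "double negative exponential" can be written as P(S >= x) = 1 - exp(-exp(-(x - u) / b)), the upper tail of a Gumbel distribution with location u and scale b. The sketch below assumes this form. The function names are hypothetical, and u and b are placeholders that in practice would have to be estimated for a particular scoring system and sequence lengths.

```python
import math


def gumbel_tail_probability(score, location, scale):
    """P(S >= score) for a Gumbel distribution: 1 - exp(-exp(-z)).

    z = (score - location) / scale. expm1 keeps precision when the
    tail probability is very small. Scores far below the location may
    overflow math.exp; this sketch does not guard against that.
    """
    z = (score - location) / scale
    return -math.expm1(-math.exp(-z))


def expected_hits(score, location, scale, num_comparisons):
    """Expected number of chance scores >= score in num_comparisons
    independent comparisons (tail probability times the count)."""
    return num_comparisons * gumbel_tail_probability(score, location, scale)
```

The tail probability falls as the score rises, but the expected number of chance hits grows in proportion to the number of comparisons, which is why a score that looks unusual in one pairwise comparison may be unremarkable in a large database search.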
False positive (假阳性)
A positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.

Feed-forward neural network (反向传输神经网络)
Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.

Filtering (window size)
During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared, and a dot, indicating a match, is generated only if a certain minimal number of matches occur.

Filtering (过滤)
Also known as Masking. The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.

Finished sequence (完成序列)
Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Fourier analysis
Studies the approximation and decomposition of functions using trigonometric polynomials.

Format (file) (格式)
Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.

Forward-backward algorithm
Used to train a hidden Markov model by aligning the model with training sequences. The algorithm then refines the model to reduce the error when fitted to the given data using a gradient descent approach.

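The windowed dot matrix comparison described under Filtering (window size) can be sketched as follows. This is a minimal illustration rather than the method of any particular program: the function names, the `window` and `min_matches` parameters, and the text rendering are all assumptions made for the example.

```python
def dot_matrix(seq_a, seq_b, window=1, min_matches=1):
    """Return the set of (i, j) window start positions that get a dot.

    A dot is placed at (i, j) when the window of length `window`
    starting at position i of seq_a and position j of seq_b contains
    at least `min_matches` identical positions. With window=1 and
    min_matches=1 this is the unfiltered dot matrix.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(window)
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the
    left-hand side, '*' for a dot and '.' for an empty cell."""
    lines = ["  " + seq_a]
    for j, letter in enumerate(seq_b):
        row = "".join(
            "*" if (i, j) in dots else "." for i in range(len(seq_a))
        )
        lines.append(letter + " " + row)
    return "\n".join(lines)
```

Calling `print(render(a, b, dot_matrix(a, b, window=3, min_matches=2)))` on two sequences `a` and `b` shows how raising the window size and threshold removes isolated dots while leaving the diagonal runs that mark aligned regions.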
FTP (File Transfer Protocol) (文件传输协议)
Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.

Full shotgun clone (鸟枪法克隆)
A large-insert clone for which full shotgun sequence has been produced.