Author:
Lee Sung Jong, Joo Keehyoung, Sim Sangjin, Lee Juyong, Lee In-Ho, Lee Jooyoung
Abstract
We developed a sequence-structure alignment method, CRFalign, which improves upon a base alignment model based on HMM-HMM comparison by employing pairwise conditional random fields (pCRF) in combination with nonlinear scoring functions of structural and sequence features. The total scoring function consists of a base scoring part derived from HMM-HMM profile comparison plus an additional nonlinear scoring part implemented as a set of gradient boosted regression trees. In addition to sequence profile features, various structural features are employed, including secondary structures, solvent accessibilities, and environment-dependent properties, which give rise to position-dependent as well as environment-dependent match scores and gap penalties. Training is performed on reference alignments at the superfamily or twilight-zone level chosen from the SABmark benchmark set. We found that our alignment method produces a relative improvement in average alignment accuracy, especially for the alignment of remotely homologous proteins. We also found that models built from its alignments with Modeller were better than those obtained with other methods, especially for relatively hard targets. CRFalign was successfully applied to the fold recognition and multiple sequence alignment stages in the CASP11 and CASP12 competitions on protein structure prediction.
Publisher
Cold Spring Harbor Laboratory