Post-transcriptional regulation is carried out by RNA-binding proteins (RBPs) that
bind to specific RNA molecules and control their processing, localization, stability and
degradation. Experimental studies have successfully identified RNA targets associated
with specific RBPs. However, because the locations of the binding sites within the
targets are unknown and because RBPs recognize both sequence and structure elements
in their binding sites, identification of RBP binding preferences from these data remains
challenging.
The unifying theme of this thesis is the identification of RBP binding preferences from experimental
data. First, we propose a protocol to design a complex RNA pool that represents
diverse sets of sequence and structure elements to be used in an in vitro assay to efficiently
measure RBP binding preferences. This design has been implemented in the RNAcompete
method, and applied genome-wide to human and Drosophila RBPs. We show that
RNAcompete-derived motifs are consistent with established binding preferences.
We developed two computational models to learn binding preferences of RBPs from
large-scale data. Our first model, RNAcontext, uses a novel representation of secondary
structure to infer both sequence and structure preferences of RBPs, and is optimized
for use with in vitro binding data on short RNA sequences. We show that including
structure information significantly improves prediction accuracy. Our second model,
MaLaRKey, extends RNAcontext to fit motif models to sequences of arbitrary length,
and to incorporate a richer set of structure features to better model in vivo RNA secondary
structure. We demonstrate that MaLaRKey infers detailed binding models that
accurately predict binding of full-length transcripts.