If you know me, you know that I’m not a computational biology kind of gal. I’m perfectly content to think of my laptop as the magic box that lets me look at cats with captions and write WordPress posts. I say hats off to you, real computational biologists, you people who can truly understand how a principal component analysis works.*
But last year I did a stint in a computational lab, so somewhat reluctantly I learned Shell scripting. (Shell scripting is like Sesame-Street-level computer programming.) The other day I found myself needing to search a large mass of protein sequence for a motif. How to do it? The Shell command grep.
Here’s the thing. If you have a Macintosh, then grep is a super pimped-out search feature that is inside your computer right now. For searching inside large text files, it’s way more powerful than Spotlight or whatever they’re calling that magnifying glass in the corner these days. You’ll need to know some kindergarten-level computer programming to be able to use it, but it’s totally worthwhile.
Start by taking the stuff you want to search and pasting it into a text file. It’s important that it’s plain text and that there aren’t any spaces in the file name. Save and quit.
Open the computer program Terminal. It’s in Applications > Utilities.
Type ls and hit return. That lists everything in the folder you’re currently in, which is your home folder when Terminal first opens. Type cd and the name of a folder to open it. Keep going until you’ve opened the folder that has your file in it. (Your folders had better not have spaces in their names! If one does, put its name in quotes.)
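If you’d like to see the ls and cd dance before trying it on your own folders, here’s a little sketch. It assumes nothing about your computer: it builds a throwaway folder first, and the names in it (Documents, my sequences, proteins.txt) are made up for illustration.

```shell
# Everything here happens in a throwaway folder, so nothing real gets touched.
# The folder and file names are made up for this sketch.
sandbox=$(mktemp -d "${TMPDIR:-/tmp}/grepdemo.XXXXXX")
mkdir -p "$sandbox/Documents/my sequences"
printf 'MKTAYIAKQRGDSLE\n' > "$sandbox/Documents/my sequences/proteins.txt"

cd "$sandbox"          # stand-in for your home folder
ls                     # shows what is in the current folder
cd Documents           # open a folder by name
ls
cd "my sequences"      # quotes keep a name with a space in one piece
ls                     # the text file should be listed here
here=$(pwd)            # pwd prints the full path of the folder you are in
echo "$here"
```

If you ever get lost, typing cd on its own takes you back to your home folder.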
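Type grep whatyourelookingfor nameoftextfile and hit return. (No apostrophes in the search term, or the Shell will sit there waiting for a closing quote.) grep prints every line of the file that contains a match.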
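Here’s roughly what that looks like for a motif search. This is only a sketch: the three sequences are invented, RGD is just an example motif, and the file is a temporary one created on the spot. The -n and -c options are standard grep flags for line numbers and match counts.

```shell
# Made-up protein sequences, one per line, in a temporary file.
seqfile=$(mktemp "${TMPDIR:-/tmp}/proteins.XXXXXX")
printf '%s\n' 'MKTAYIAKQRGDSLE' 'GSHMLEDPVAAKW' 'TTRGDVLKQ' > "$seqfile"

grep RGD "$seqfile"                     # print every line that contains RGD
grep -n RGD "$seqfile"                  # same, with line numbers in front
motif_count=$(grep -c RGD "$seqfile")   # count the matching lines instead
echo "$motif_count"
```

One thing to keep in mind: grep works line by line, so a motif that is split across a line break in your file won’t be found.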
But can’t ⌘F do the same thing just as well? Ah, here is where grep is a pimped-out search feature. It can use regular expressions. Here’s a longer explanation of regular expressions, but in short, they let you describe a pattern to search for instead of an exact string of letters. Here’s a real simple example where I remember that the word I’m searching for starts with w, but I don’t know what comes next:
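Something along these lines would do it. Treat it as a sketch: the text file and its contents are made up, and the pattern w[a-z]* is just one way to say "a w followed by any number of lowercase letters".

```shell
# A made-up text file to search, created in a temporary location.
notes=$(mktemp "${TMPDIR:-/tmp}/notes.XXXXXX")
printf '%s\n' 'the whale swam with a wombat' 'no match on this line' > "$notes"

# Plain grep prints whole lines that contain a match somewhere.
grep 'w[a-z]*' "$notes"

# -o prints only the matched text; -w insists the match is a whole word.
w_words=$(grep -ow 'w[a-z]*' "$notes")
echo "$w_words"
```

On that made-up text, the second command should pull out whale, with and wombat, and skip swam because its w sits in the middle of the word. The single quotes around the pattern stop the Shell from fiddling with the brackets and the asterisk before grep sees them.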
And that’s the magic of grep!
* “Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components,” says Wikipedia.