For those not used to the terminology, FMTYEWTK stands for Far More Than You Ever Wanted To Know. This one is fairly light as FMTYEWTKs usually go. In any case, the question before us is, "How do you apply an edit against a list of files using Perl?" Well, that depends on what you want to do . . . .

The beginning

If you just want to read in one or more files, apply a regex to the contents, and spit out the altered text as one big stream, that is probably best done with a one-liner such as the following:

perl -p -e "s/Foo/Bar/g" <FileList>

This command calls perl with the options -p and -e "s/Foo/Bar/g" against the files listed in FileList. The first option, -p, tells perl to print each line it reads after applying the alteration. The second option, -e, tells perl to evaluate the provided substitution rather than reading a script from a file. The perl interpreter then applies this substitution to every line of all the (space-separated) files listed on the command line, and spits out one huge stream of the concatenated fixed lines.

In standard fashion, perl allows options without arguments to be concatenated with following options for brevity and convenience. Therefore, the previous example is more often written:

perl -pe "s/Foo/Bar/g" <FileList>

In-place editing

If you want to edit the files in place, editing each file before going on to the next, that's pretty easy too:

perl -pi.bak -e "s/Foo/Bar/g" <FileList>

The only change from the last command is the new option -i.bak, which as you might expect tells perl to operate on the files in place, rather than concatenating them together into one big output stream. Like the -e option, -i takes an argument, in this case an extension to add to the original file names when making backup copies; for this example I chose .bak. Warning: if you execute the command twice, you've most likely just overwritten your backups with the changed versions from the first run. You probably didn't want to do that.

Note that since -i takes an argument, I had to separate out the -e option, which otherwise would have been appended to the argument to -i, leaving us with a backup extension of .bake, which is unlikely to be correct unless you happen to be a pastry chef. Worse, perl would then have thought that "s/Foo/Bar/g" was the filename of the script to run, and would complain when it could not find a script by that name.
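
Under the hood, -i.bak amounts to roughly the following shuffle for each file; again, this is only a sketch to build intuition, not perl's exact bookkeeping:

foreach my $file (@ARGV) {
    rename $file, "$file.bak";   # the original becomes the backup
    open my $in,  '<', "$file.bak";
    open my $out, '>', $file;    # a fresh file takes the original name
    while (<$in>) {
        s/Foo/Bar/g;             # the code passed to -e
        print $out $_;           # -p's automatic print goes to the new file
    }
}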

Multiple regexes

Of course, you may want to make more extensive changes than just one regex. If you simply want to make several changes all at once, you can do that fairly easily by adding more code to the evaluated script; each statement should be separated from the next by a semicolon (technically, a semicolon terminates each statement, but the one after the final statement in a block is optional). For example, you could make a series of changes:

perl -pi.bak -e "s/Bill Gates/Microsoft CEO/g;
 s/CEO/Overlord/g" <FileList>

"Bill Gates" would then become "Microsoft Overlord" throughout the files. (Here, as in all examples, we ignore such finicky things as making sure we don't change "HERBACEOUS" to "HERBAOverlordUS"; for that kind of information, refer to a good treatise on regular expressions, such as Jeffrey Friedl's impressive book Mastering Regular Expressions, 2nd Edition. Also, I've wrapped the command to fit, but you should type it in as just one line.)

Doing your own printing

You may wish to override the behavior created by -p, which causes every line read in to be printed out after any changes made by your script. In that case, switch to the -n option; -p -e "s/Foo/Bar/" is roughly equivalent to -n -e "s/Foo/Bar/; print". This means we can do interesting stuff like the following, which removes any line whose first non-blank character is a hash mark (Perl comments, C-style preprocessor directives, etc.):

perl -ni.bak -e "print unless /^\s*#/;" <FileList>

Fields and scripts

Of course, there are far more powerful things you can do with this; for example, imagine a flatfile database, with one row per line of the file, and fields separated by colons, like so:

Bill:Hennig:Male:43:62000
Mary:Myrtle:Female:28:56000
Jim:Smith:Male:24:50700
Mike:Jones:Male:29:35200
...

Now let's say that you wanted to find everyone who was over 25, but paid less than $40,000. At the same time, you'd like to document the number and percentage of women and men found. This time, instead of providing a mini-script on the command line, we'll create a file, glass.pl, which contains the script we'll run. To run the query, the following will do the trick:

perl -naF':' glass.pl <FileList>

glass.pl contains the following:

BEGIN { $men = $women = $lowmen = $lowwomen = 0; }

next unless /:/;                    # skip anything that isn't a record
/Female/ ? $women++ : $men++;       # tally everyone by sex
if ($F[3] > 25 and $F[4] < 40000)   # over 25 and paid under $40,000
    { print; /Female/ ? $lowwomen++ : $lowmen++; }

END {
print "\n\n$lowwomen of $women women (",
      int($lowwomen / $women * 100),
      "%) and $lowmen of $men men (",
      int($lowmen / $men * 100),
      "%) seem to be underpaid.\n";
}
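
Run against just the four sample rows shown above (ignoring the trailing ...), the output would look roughly like this:

Mike:Jones:Male:29:35200


0 of 1 women (0%) and 1 of 3 men (33%) seem to be underpaid.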

Don't worry too much about the syntax, other than to note some of the AWK and C similarities; the important thing here and in later sections is to see some of the capabilities available to make these sorts of problems easily solvable in Perl. Several new features are used for this example. First, if there is no -e option to evaluate, perl assumes the first filename listed, in this case glass.pl, refers to a Perl script to be executed. Second, two new options make it easy to deal with field-based data: -a (autosplit mode) takes each line and splits its fields into the array @F, based on the field delimiter given by the -F (field delimiter) option, which can be a string or a regex. If no -F option is given, the field delimiter defaults to ' ' (a single space, which splits on runs of whitespace). By default, arrays in Perl are zero-based, so $F[3] and $F[4] refer to the age and pay fields, respectively. Finally, the BEGIN and END blocks allow the programmer to perform actions before file reading begins and after all files have been dealt with, respectively.
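
One way to picture -a is that each pass through the implicit -n loop starts with a split before your code runs; again, this is a sketch of the equivalence, not the literal expansion:

while (<>) {
    @F = split /:/, $_;   # what -a with -F':' arranges for you
    # ...the body of glass.pl runs here...
}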

File handling

All of these little tidbits have made use only of data from within the files being operated on. But what if you wanted to be able to read in data from elsewhere? For example, imagine that you had some sort of file that allows includes; in this case, we'll assume that include files are specified by relative pathname, rather than being looked up in some sort of include path. Perhaps the includes look like the following:

...
#include foo.bar, baz.bar, boo.bar
...

If you wanted to see what the file looked like with the includes placed into the master file, you might try something like this:

perl -ni.bak -e "if (s/#include\s+//) {foreach $file
 (split /,\s*/) {open FILE, '<', $file; print <FILE>}}
 else {print}" <FileList>

To make it easier to see what's going on, here's the same command with a full set of line breaks added for clarity:

perl -ni.bak -e "
        if (s/#include\s+//) {
            foreach $file (split /,\s*/) {
                open FILE, '<', $file;
                print <FILE>
            }
        } else {
            print
        }
    " <FileList>

Of course, this only expands one level of include, but then we haven't provided any way for the script to know when to stop if there's an include loop. In this little example, we take advantage of the fact that the substitution operator returns the number of changes made, so if it manages to chop off the #include at the beginning of the line, it returns a non-zero (true) value, and the rest of the code chomps off the line ending (so that the last filename in the list doesn't drag a stray newline along with it), splits apart the list of includes, opens each one in turn, and prints its entire contents. Handy shortcuts are used as well: if you open a new file using the name of an old file handle (FILE in this case), perl automatically closes the old file first; in addition, if you read from a file using the <> operator into a list (which the print function expects), it happily reads in the entire file at once, one line per list entry. The print call then prints the entire list, inserting it into the current file, as expected. Finally, the else clause handles printing non-include lines from the source, since we are using -n rather than -p.
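
Also, the one-liner silently skips any include file it can't open. If you'd rather get a warning, a slightly more defensive version of the script body might look like the following; the lexical filehandle and the warn are my own additions, and the fragment drops into the same perl -ni.bak -e "..." harness as before:

if (s/#include\s+//) {
    chomp;
    foreach my $file (split /,\s*/) {
        open my $fh, '<', $file
            or do { warn qq(Cannot open $file: $!\n); next };
        print <$fh>;   # slurp the whole include and print it
    }
} else {
    print
}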

Better file lists

The fact that it is relatively easy to handle filenames listed within other files suggests that it ought to be fairly easy to deal entirely with files read from some source other than a list at the end of the command line. The simplest case is to read all of the file contents from standard input as a single stream, which is common when building up pipes. As a matter of fact, this is so common that perl automatically switches to this mode if there are no files listed on the command line:

<Source> | perl -pe "s/Foo/Bar/g" | <Sink>

Here Source and Sink are the commands that generate the raw data and handle the altered output from perl, respectively. Incidentally, the filename consisting of a single hyphen (-) is an explicit alias for standard input; this allows the Perl programmer to merge input from files and pipes, like so:

<Source> | perl -pe "s/Foo/Bar/g" header.bar - footer.bar
 | <Sink>

In this example, a header file is read, followed by the input from the pipe source, followed by a footer file; the whole mess is read in, modified, and sent through to the out pipe. Still, as was mentioned early on, when dealing with multiple files it is usually desirable to keep the files separate, by using in-place editing or by explicitly handling each file separately. On the other hand, it can be a pain to list all of the files on the command line, especially if there are a lot of them, or they are generated programmatically. The simplest method is to read the filenames from standard input, pushing them onto @ARGV in a BEGIN block; this has the effect of tricking perl into thinking it received all of the filenames on the command line! Assuming the common case of one filename per input line, the following will do the trick:

<FilenamesSource> | perl -pi.bak -e "BEGIN {push @ARGV,
 <STDIN>; chomp @ARGV} s/Foo/Bar/g"

Here we once again use the shortcut that reading in a file in a list context (which is provided by the push) will read in the entire file; the entire contents are added, one filename per entry, to the @ARGV array, which normally contains the list of arguments to the script. To complete the trick, we chomp the line endings from the filenames, since Perl normally returns the line ending characters (a carriage return and/or a line feed) when reading lines from a file, and we don't want to consider these to be part of the filenames. (On some platforms, you could actually have filenames containing line ending characters, but then you'd have to make the Perl code a little more complex, and you deserve to figure that out for yourself for trying it in the first place.)
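
As a small refinement, you can fold the read and the chomp together and screen out names that don't refer to plain files; the grep is my own addition, not something the trick requires:

<FilenamesSource> | perl -pi.bak -e "BEGIN {chomp(@ARGV = <STDIN>);
 @ARGV = grep {-f} @ARGV} s/Foo/Bar/g"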

Response files

Another common design is to provide filenames on the command line as usual, but filenames starting with an @ are treated specially; their contents are considered to be a list of filenames to insert directly into the command line. For example, if the contents of the file names.baz (often called a response file) are:

two
three
four

then this command:

perl -pi.bak -e "s/Foo/Bar/g" one @names.baz five

should be treated as exactly equivalent to:

perl -pi.bak -e "s/Foo/Bar/g" one two three four five

To make this work, we once again need to do a little magic in a BEGIN block. Essentially, we want to parse through the @ARGV array, looking for filenames that begin with @. We pass through any unmarked filenames, but for each response file found, we read in the contents of the response file and insert the new list of filenames into @ARGV. Finally, we chomp the line endings, just as in the previous section; we then have a canonical file list in @ARGV, just as if all of the files had been specified on the command line. Here's what it looks like in action:

perl -pi.bak -e "BEGIN {@ARGV = map {s/^@// ? @{open RESP,
 '<', $_; [<RESP>]} : $_} @ARGV; chomp @ARGV} s/Foo/Bar/g"
 <ResponseFileList> 

Here's the same code with line breaks added so you can see what's going on:

perl -pi.bak -e "
        BEGIN {
            @ARGV = map {
                        s/^@// ? @{open RESP, '<', $_;
                                   [<RESP>]}
                               : $_
                    } @ARGV;
            chomp @ARGV
        }
        
        s/Foo/Bar/g
    " <ResponseFileList> 

The only tricky part is the map block. map applies a piece of code to every element of a list, returning a list of the return values of the code; the current element is represented as $_. The block we're using here checks to see whether it was able to remove a @ from the beginning of the filename. If so, it opens the file, reads the whole thing into an anonymous temporary array (that's what the square brackets are there for), and then inserts that array into the results instead of the response file's name (that's the odd @{...} construct). If there was no @ at the beginning of the filename to remove, the filename is copied directly into the map results. Once we've performed this expansion, and chomped any line endings, we can then get on with the main work, which in this case is simply our usual substitution, s/Foo/Bar/g.
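
If the map version is hard on the eyes, the same expansion can be written as an explicit loop; it does the same job, just more verbosely:

perl -pi.bak -e "
        BEGIN {
            my @files;
            foreach (@ARGV) {
                if (s/^@//) {
                    open RESP, '<', $_;
                    push @files, <RESP>;
                } else {
                    push @files, $_;
                }
            }
            chomp @files;
            @ARGV = @files;
        }

        s/Foo/Bar/g
    " <ResponseFileList>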

Recursing directories

For our final example, let's deal with a major weakness in the way we've been doing things so far -- we're not recursing into directories, but merely expecting all of the files we need to read to be listed explicitly on the command line. To perform the recursion, we need to pull out the big guns: File::Find, a Perl module that provides very powerful recursion methods and comes standard with any recent version of the perl interpreter. The command line will be deceptively simple, because all of the brains will be in the script:

perl cleanup.pl <DirectoryList>

This script will perform some basic housecleaning, marking all files readable and writable, removing those with the extensions .bak, .$$$, and .tmp, and cleaning up .log files. For the log files, we will create a master log file for archiving or perusal, containing the contents of all of the other logs, and then delete the logs so that they remain short over time. Here's the script:

use File::Find;

die "All arguments must be directories!"
    if grep {!-d} @ARGV;
open MASTER, '>', 'master.lgm';
finddepth(\&filehandler, @ARGV);
close MASTER;
rename 'master.lgm', 'master.log';

sub filehandler
{
    # Add read and write permission if the file lacks either.
    chmod(((stat _)[2] & 07777) | 0666, $_) unless (-r and -w);
    # Delete the scratch files outright.
    unlink if (/\.bak$/ or /\.tmp$/ or /\.\$\$\$$/);
    # Append each log to the master log, then delete it.
    if (/\.log$/) {
        open LOG, '<', $_;
        print MASTER "\n\n****\n$File::Find::name\n****\n";
        print MASTER <LOG>;
        close LOG;
        unlink;
    }
}

This example shows just how powerful Perl and Perl modules can be, and at the same time just how opaque Perl can appear without some experience with it. In this case, the short explanation is that the finddepth() function iterates through all of the program arguments (@ARGV), recursing into each directory and calling the filehandler() subroutine for each file. That subroutine can then examine the file and decide what to do with it. In the example, we check for readability and writability with -r and -w, fixing the file's permissions if needed with chmod (ORing the existing permission bits with 0666 to add read and write access). We then unlink (delete) any file with a name ending in any of the three unwanted extensions. Finally, if the extension is .log, we open the file, write a few header lines to the master log, copy the file into the master log, close it, and delete it.

Instead of using finddepth(), which does a depth-first search of the directories and visits them from the bottom up, we could have used find(), which does the same depth-first search, but visits them from the top down. As a side note, the master log file is written with the extension .lgm, and then renamed at the end to have the extension .log, so as to avoid the possibility of writing the master log into itself if the current directory is one of those searched.
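
Stripped of the housekeeping, the basic File::Find pattern is small; here @dirs and the wanted() subroutine are placeholder names of my own choosing, and you would normally pick just one of the two traversal calls:

use File::Find;

my @dirs = @ARGV;              # the directories to search

find(\&wanted, @dirs);         # top-down: a directory before its contents
finddepth(\&wanted, @dirs);    # bottom-up: contents before the directory

sub wanted {
    # $_ holds the current basename; $File::Find::name holds the full path
    print "$File::Find::name\n";
}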

And that's it. Sure, there's a lot more that could be done with these examples, including error checking, additional statistics, help text, etc. If you want to learn how to do this, get a copy of Programming Perl, 3rd Edition, by Larry Wall, Tom Christiansen, and Jon Orwant. This is the bible (or the camel, rather) of the Perl community, and well worth the read. Good luck!