Saturday, March 5, 2011

Are your files SM? M? L? XL? The procedural XL solution

I presented this challenge at work as a beat-the-winter-blahs activity.

But then I felt I couldn't present a merely workable solution. As the local Perl geek, I felt obliged to demonstrate that Perl could produce clean, well-written solutions, not just line noise. I wanted a balance between advanced Perl features and easy comprehension. On the Unix command line, I prefer piping commands together, rather than using a bunch of intermediate files. Similarly, in Perl I advocate using map{} and grep{}, and feeding the output of one function into the input of another. I consider those standard idioms any programmer needs to learn.
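For example, a throwaway sketch of that chaining style (the *.log file list and the 1 MB threshold are invented for illustration; this isn't part of the solution below):

use 5.010;

my $total = 0;
$total += $_
    for grep { $_ > 1_000_000 }    # keep only the big files
        map  { -s $_ }             # file name -> size in bytes
            glob '*.log';          # hypothetical input files
say $total;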

On the other hand, there were a couple of situations where exceptions and oddities in the code made me uncomfortable. Pulling them into subroutines such as calc_field_width() simplified the flow and made the code clearer, I think. I guess the lesson is that short subroutines are better. I knew that!

I'm open to suggestions for alternative ways of doing things, if you have better ideas.

The program goes through the file hierarchy, storing data in a hash of hashes, %stats:

0  HASH(0x1009a1000)
   3 => HASH(0x1009d0c90)
      'N' => 32
      'sum' => 16995
   4 => HASH(0x1009d0e40)
      'N' => 17
      'sum' => 57691
   5 => HASH(0x1009d0ea0)
      'N' => 3
      'sum' => 35747


The '3', '4', '5' refer to the number of digits in the file size, i.e., '3' refers to file sizes in the range 100-999. 'N' is the number of files in this size range, and 'sum' is the sum of all the actual file sizes for the range.

The final output shows there are 166 zero-byte files ... perhaps they were created but never populated, due to some error, or perhaps they communicated some binary data merely by their presence. Sometimes compilers use zero-length files to indicate the extent of their progress. Files with sizes between 1 and 9 bytes number 59, while another 327 fall in the 10-99 byte range. This blog's layout is too narrow, so I cut the table short, but I do have four 45GB files which store data comparing the performance of sudoku-solving algorithms.


            size     N           sum           avg
            ----     -           ---           ---
<              1   166             0             0
<             10    59           411             7
<            100   327        17,840            55
<          1,000 3,800     1,939,738           510
<         10,000 8,231    31,002,459         3,767
<        100,000 3,575   109,346,315        30,586



#!/usr/local/bin/perl5.10
# Determine file counts and byte totals for size categories
# 0, 1-9, 10-99, 100-999, etc.
#

use warnings;
use strict;
use feature ':5.10';

use Readonly;
use Getopt::Long;
use Data::Dumper;

# --------------------------------------------------------
# Constants
#

Readonly my $DOT              => q{.};
Readonly my $DOTDOT           => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};
Readonly my $LASTFIELD        => -1;
Readonly my $SPACE            => q{ };
Readonly my $TAB              => qq{\t};

Readonly my $COMMIFY_WIDTH   => 3;
Readonly my $ROUNDING_OFFSET => 1.0 / 2;

# -------------------------------------------------------
# Main
#
my %options = ( root => q{~} );

GetOptions( \%options, 'root=s' )
  or die "Error parsing command line arguments.\n";

my $stats = process_sizes( $options{root} );
say Dumper($stats) if exists $ENV{DEBUG};
display_summary($stats);

# -------------------------------------------------------
# Subroutines
#

# Explore directories, accumulating file sizes. Initialize
# dirs list with the one root, add subdirs, processing
# recursively.
#
sub process_sizes {
    my (@dirs) = @_;

    my %stats;
    while (@dirs) {
        process_one_dir( \%stats, \@dirs );
    }
    return \%stats;
}

# ........................................................
# For any directory, use glob() to get a list of all the
# files in the directory.  Ignore '.' & '..'.
#
# Spaces in directory names need to be escaped, otherwise
# glob() will think the string contains multiple 
# candidates to match.
#
sub process_one_dir {
    my ( $stats, $dirs ) = @_;

    my $one_dir = shift @{$dirs};
    say "process_one_dir => $one_dir"
        if $ENV{DEBUG};
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY
            if $entry eq $DOT or $entry eq $DOTDOT;
        process_one_entry( $entry, $dirs, $stats );
    }
}

# ........................................................
# Stick subdirectories onto the @dirs array for recursive
# searching.  Files require additional processing based on
# their size.
#
# Note: specify the file name only on the first stat()
# call on a file ( -d or other -X ). Use '_' as the file
# name in additional calls to re-use the data rather than
# making a new call.
#
sub process_one_entry {
    my ( $entry, $dirs, $stats ) = @_;

    if ( -d $entry ) {
        push @{$dirs}, $entry;
    }
    else {
        say "${TAB}record_one_size => $entry"
            if $ENV{DEBUG};
        record_one_size( $stats, -s _ );
    }
}

# ........................................................
# For empty files, update a special '0' category. Use the
# length of the size of non-empty files to determine the
# category ( equivalent to floor( log10( size )) + 1 ), and
# keep track of the number of files and the sum of the
# bytes for the category.
#
sub record_one_size {
    my ( $stats, $size ) = @_;

    my $len = $size == 0 ? 0 : length $size;
    $stats->{$len}{N}++;
    $stats->{$len}{sum} += $size;
}
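
# Aside: the equivalence can be spot-checked --
#     length 999 == 3 == int( log(999) / log(10) ) + 1
# -- but at an exact power of ten the floating-point quotient
# can land just below the integer, so length() is the safer
# route to floor( log10( size )) + 1.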

# ........................................................
#
sub print_header {
    my ($fmt) = @_;

    my @fields = (qw/size N sum avg/);
    printf $fmt, @fields;
    printf $fmt, map { q{-} x length $_ } @fields;
}

# .........................................................
# For each size category, get desired field, either 'N'
# or 'sum'. Sort these and grab the largest value to
# determine how wide the field needs to be.
#
sub calc_field_width {
    my ( $stats, $tags, $field ) = @_;

    my @values = map { $stats->{$_}{$field} } @{$tags};
    my $width = length( ( sort { $a <=> $b }
                        @values )[$LASTFIELD] );
    return $width;
}
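
# An alternative sketch ( untested ) using the core List::Util
# module instead of sort-and-take-last:
#     use List::Util qw/max/;
#     return length( max(@values) );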

# ........................................................
# Field widths need an extra 33% to allow for commafying
# numbers. The calculation will try to put a comma to the
# left of the entire number if the leftmost group has three
# digits, so drop one comma in that case. But don't drop a
# comma if the count is already zero, because that would
# make the adjustment negative.
#
sub allow_for_commas {
    my ($width) = @_;

    my $commas = int $width / $COMMIFY_WIDTH;
    $commas--
        if $commas && $width % $COMMIFY_WIDTH == 0;
    return $width + $commas;
}
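
# Worked example: a width of 6 covers '100000'; int( 6 / 3 )
# gives 2 commas, but 6 % 3 == 0 so one is dropped, and
# 6 + 1 = 7 exactly fits '100,000'. A width of 7 keeps both
# commas: 7 + 2 = 9 fits '1,000,000'.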

# ........................................................
# The keys of the hash are the exponents of the size
# categories, i.e., the log of the size, 0, 1, 2, ... Sort
# these and take the largest to determine how wide size
# field has to be. Determine the width for the N and sum
# fields. Reuse the 'sum' field width for the average,
# since it can only be less or the same width.
#
# To generate the format string, take an array of the
# %fwidth keys and use map{} to extract the corresponding
# fwidth value, stick them inside the format '%s' and join
# the various pieces using spaces. Add 33% padding to
# widths to allow for commas.
#
sub gen_fmt {
    my ($stats) = @_;

    my @tags = sort { $a <=> $b } keys %{$stats};
    my %fwidth;
    # size is N zeroes plus leading '1'
    $fwidth{size} = 1 + $tags[$LASTFIELD];
    $fwidth{count} = calc_field_width( $stats, \@tags, 'N' );
    $fwidth{sum}   = calc_field_width( $stats, \@tags, 'sum' );

    my $fmt = join $SPACE,
              map { '%' . allow_for_commas( $fwidth{$_} ) . 's' }
                  (qw/size count sum sum/);
    $fmt .= "\n";

    return ( $fmt, \@tags );
}

# ........................................................
# Commify a number. Traditional solution; see perlfaq5
# ( 'perldoc -q commas' ).
#
sub commify {
    my $text = reverse $_[0];
    $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
    return scalar reverse $text;
}
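
# E.g. commify(1939738) returns '1,939,738', as seen in the
# 'sum' column of the table above.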

# ........................................................
#
sub print_line {
    my ( $fmt, $stats, $log ) = @_;

    my $size = '1' . '0' x $log;
    printf $fmt, map { commify($_) } (
        $size,
        $stats->{$log}{N},
        $stats->{$log}{sum},
        int $ROUNDING_OFFSET
            + $stats->{$log}{sum} / $stats->{$log}{N}
    );
}

# ........................................................
# All the field widths need to be dynamic: the largest
# file might be terabytes in size, there could be millions
# of files in some size range, and the sum could be huge.
#
# Calculate the format string once, then reuse it with
# leading spaces for the header, and a leading '< ' for
# the actual sizes. Everything is ( or can be ) a string in
# Perl.
#
sub display_summary {
    my ($stats) = @_;

    my ( $fmt, $tags ) = gen_fmt($stats);
    print_header("  $fmt");

    for my $tag ( @{$tags} ) {
        print_line( "< $fmt", $stats, $tag );
    }
}

# -------------------------------------------------------- 
# End of File

Tuesday, February 22, 2011

Are your files SM? M? L? XL? Mid-Size

What I actually had in mind when I came up with the challenge was something like the following ... the sort of thing you find in SysAdmin magazine or Randal Schwartz's columns.



#!/usr/local/bin/perl5.10

use 5.010;
use strict;
use warnings;

use Readonly;

Readonly my $DOT => q{.};
Readonly my $DOTDOT => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};

die "USAGE: $0 rootdir.\n" unless @ARGV;

my @dirs = ( @ARGV );

my %stats;
while (@dirs) {
    my $one_dir = shift @dirs;
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g;    # escape spaces for glob()

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY if $entry eq $DOT or $entry eq $DOTDOT;
        if ( -d $entry ) {
            push @dirs, $entry;
        }
        else {
            my $size = -s _;
            my $len = $size == 0 ? 0 : length $size;
            $stats{$len}++;
        }
    }
}

for my $size ( sort { $a <=> $b } keys %stats ) {
    my $maxsize = 10**$size;
    say sprintf( '<%8d %d', $maxsize, $stats{$size} );
}


Starting with some directory specified as a command-line argument, process all the directory contents: ignore '.' and '..'; add subdirectories to the queue of directories waiting to be processed; and get the log10 size of each file, incrementing the associated count.

For all the sizes encountered, in increasing order, convert to a (unreachable) max size, and print the size and the number of files in that range.
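A hypothetical run (I've invented the script name here, and borrowed the counts from the table in the XL post) would look something like:

$ perl5.10 file_sizes.pl ~
<       1 166
<      10 59
<     100 327
<    1000 3800
<   10000 8231
<  100000 3575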


I can do without the File::Find module; the task at hand is pretty simple. On the other hand, my tolerance for ugly punctuation has dropped in the past few years, so I need the Readonly. Without that, it becomes ...


my %stats;
while (@dirs) {
    my $one_dir = shift @dirs;
    $one_dir =~ s{(\s)}{\\$1}g;    # escape spaces for glob()

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY if $entry eq q{.} or $entry eq q{..};


The dots would be more tolerable with an SQL 'in' operator, or a Perl6 Junction:



use Perl6::Junction qw/any/;
...
  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY if $entry eq any( q{.}, q{..} );


Using a subroutine to localize the ugliness would make the double escape bearable.



sub escape_space { my ($dir) = @_; $dir =~ s{(\s)}{\\$1}g; return $dir; }

my %stats;
while (@dirs) {
    my $one_dir = escape_space shift @dirs;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY if $entry eq any( q{.}, q{..} );



So the final result is down to 35 lines, including blanks and closing curlies.

#!/usr/local/bin/perl5.10

use 5.010;
use strict;
use warnings;
use Perl6::Junction qw/any/;

sub escape_space { my ($dir) = @_; $dir =~ s{(\s)}{\\$1}g; return $dir; }

die "USAGE: $0 rootdir.\n" unless @ARGV;

my @dirs = (@ARGV);

my %stats;
while (@dirs) {
    my $one_dir = escape_space shift @dirs;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY if $entry eq any( q{.}, q{..} );
        if ( -d $entry ) {
            push @dirs, $entry;
        }
        else {
            my $size = -s _;
            my $len = $size == 0 ? 0 : length $size;
            $stats{$len}++;
        }
    }
}

for my $size ( sort { $a <=> $b } keys %stats ) {
    my $maxsize = 10**$size;
    say sprintf( '<%8d %d', $maxsize, $stats{$size} );
}

Are your files SM? M? L? XL? Kwick-N-EZ

When I first thought up the programming exercise I described last week in Are your files SM? M? L? XL?, my intention was to have a trivial exercise for applicants to carry out. HR was passing through lots of applicants who had detailed database knowledge, but were not at all programmers. They couldn't name simple Unix commands, couldn't talk about how to carry out a task in Perl or shell or as a pipeline of Unix commands. I thought this exercise would be simple for any experienced programmer to carry out, never mind style or performance points.

Shortly after I came up with the idea, I realized it could mostly be done as a Unix pipeline.



find ~/ -type f -printf "%s\n" |
perl5.10 -n -E 'say length' |
sort -n |
uniq -c |
perl5.10 -n -E '($fill, $count, $size) = split /\s+/;
    $exp = 10**($size-1);
    say "$exp $count"'



Although I hadn't used the option before, man find indicated that find could indeed print the size of the file and nothing else. Trying to write this article on my home machine, I discovered that -printf is a characteristic of GNU find, not available on the Mac. So on other machines you may need to do more: maybe use ls -l, or have find print out the number of blocks a file takes up ... less accurate, less complete, but sufficient for a quick proof of concept.



So find is printing a series of file sizes, one per line. My original thought was to take the logarithm of the size and truncate to an integer. But Perl will only calculate loge, so I would need to manually divide that by loge 10. After clobbering myself over the head for ten minutes trying to achieve that, I realized that the number of digits in the size IS the upper limit of the integer portion of log10. perl -n reads the input line by line, and applies the -e expression to each line. Specifying perl5.10 (or later) and using -E instead of -e allows me to use say instead of print, saving two characters in the command name, avoiding a \n and sparing an explicit $_. I SHOULD chomp the newline off the input before getting its length, but I can simply subtract 1. I could subtract it here, but I found it easier to do it later.
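For the record, the change-of-base arithmetic I was wrestling with looks like this (a sketch; the one-liner sidesteps it entirely by using length):

use 5.010;

sub log10 { return log( $_[0] ) / log( 10 ); }

my $size = 12345;
say int( log10($size) ) + 1;    # 5
say length $size;               # 5 -- same bucket, no logarithms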


The output of the Perl component is a series of lines, each with a number specifying how many digits appear in the file length. sort orders them, obviously, and uniq -c replaces multiple instances of a value with a single instance and the number of times that value appears.


Little Lord Fauntleroy would chomp off the newlines at the end of each line, and eliminate the leading spaces used by uniq -c. But I'm planning to split each line on space characters, to separate the count and value fields. By splitting on one-or-more spaces, the leading spaces, however many there may be, generate a single leading field with no data, which I just ignore. In real code I would parenthesize the right-hand expression and use square brackets to slice off the values I want, as sketched below. In a one-liner, it's simpler to add a dummy variable. Use the digit-count as an exponent to obtain an unreachable upper limit ... don't forget to drop the value by one, to make up for counting the newline a few stages back. A test with an empty file, or at least one with fewer than ten characters in it, will remind you to make that adjustment. All that's left is to output the results.
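The slice version I'd use in real code might look like this (a sketch; the empty leading field is simply not selected by the slice):

my ( $count, $size ) = ( split /\s+/ )[ 1, 2 ];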

Wednesday, February 16, 2011

Are your files SM? M? L? XL?

Twenty years ago I was on a co-op work term where a Sun SPARC 10 was shared among 6 NCD X-terminals. The system had 32 MB of memory shared among the users, with 1 GB of hard drive for total storage. Today I have 12 GB of memory and many terabytes of hard drive ... programming experiments, video, music, and all my photography. The largest files are larger today than in the past ... I have performance profiling data from the Sudoku experiments I wrote about that is bigger than the total file system I worked with twenty years ago.

But what's more important? small files or large?

Very large files will be rare; otherwise you would run out of space. Very small files may be very important, but even a large number of them will not take up much space. Most space will be devoted to something in between ... but where, exactly, is the bulk of storage devoted?

The (mental) challenge is to determine how file size is distributed, given some starting directory. My opinion is that exponential categories are appropriate, that is, 10..99 bytes, 100..999 bytes, 1000..9999 bytes, etc. Categorizing and formatting are a personal choice, determined in part by what is convenient, so long as useful information is displayed.