

Showing posts from 2011

Are your files SM? M? L? XL? The Procedural XL Solution

I presented this challenge at work as a beat-the-winter-blahs activity.

But then I felt I couldn't present a merely operative solution. As the local Perl geek, I felt obliged to demonstrate that Perl could produce clean, well-written solutions, not just line noise. I wanted a balance between advanced Perl features and easy comprehension. On the Unix command line, I prefer piping commands together rather than using a bunch of intermediate files. Similarly, in Perl I advocate using map{} and grep{}, and feeding the output of one function into the input of another. I consider those standard idioms that any programmer needs to learn.

On the other hand, there were a couple of situations where exceptions and oddities in the code made me uncomfortable. Pulling logic out into subroutines such as calc_field_width() simplified the code and made it clearer, I think. I guess the lesson is that short subroutines are better. I knew that!

I'm open to suggestions for alternative ways of doing…

Are your files SM? M? L? XL? Mid-Size

What I actually had in mind when I came up with the challenge was something like the following ... the sort of thing you find in SysAdmin magazine or Randal Schwartz's columns.


use 5.010;
use strict;
use warnings;

use Readonly;

Readonly my $DOT => q{.};
Readonly my $DOTDOT => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};

die "USAGE: $0 rootdir.\n" unless @ARGV;

my @dirs = ( @ARGV );

my %stats;
while (@dirs) {
my $one_dir = shift @dirs;
$one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g; # escape spaces for glob()

while ( my $entry = glob "$one_dir/*" ) {
next ENTRY if $entry eq $DOT or $entry eq $DOTDOT;
if ( -d $entry ) {
push @dirs, $entry;
else {
my $size = -s _;
my $len = $size == 0 ? 0 : length $size;

for my $size ( sort { $a <=> $b} keys %stats ) {
my $maxsize = 10**$size;
say sprintf( '<%8d %d', $maxsize, $stats{$size})…

Are your files SM? M? L? XL? Kwick-N-EZ

When I first thought up the programming exercise I described last week in Are your files SM? M? L? XL?, my intention was to have a trivial exercise for applicants to carry out. HR was passing through lots of applicants who had detailed database knowledge, but were not at all programmers. They couldn't name simple Unix commands, couldn't talk about how to carry out a task in Perl or shell or as a pipeline of Unix commands. I thought this exercise would be simple for any experienced programmer to carry out, never mind style or performance points.

Shortly after I came up with the idea, I realized it could mostly be done as a Unix pipeline.

find ~/ -type f -printf "%s\n" |
perl5.10 -n -E 'say length' |
sort -n |
uniq -c |
perl5.10 -n -E '
my ($fill, $count, $size) = split /\s+/;
my $exp = 10**($size-1);
say "$exp $count"'

Although I hadn't used the option before, man find in…

Are your files SM? M? L? XL?

Twenty years ago I was on a co-op work term where a Sun SPARCstation 10 was shared among six NCD X-terminals. The system had 32 MB of memory shared among the users, and 1 GB of hard drive storage in total. Today I have 12 GB of memory and many terabytes of hard drive ... programming experiments, video, music, and all my photography. The largest files are larger today than in the past ... I have performance profiling data from the Sudoku experiments I wrote about that is bigger than the total file system I worked with twenty years ago.

But which matters more: small files or large?

Very large files will be rare; otherwise you would run out of space. Very small files may be very important, but even a large number of them will not take up much space. Most space will be devoted to something in between ... but exactly which sizes claim the bulk of the storage?
The (mental) challenge is to determine how file size is distributed, given some starting directory. My opinion is that exponential categories are appropriate, that is, 10.…