Posts

Showing posts from February, 2011

Are your files SM? M? L? XL? Mid-Size

What I actually had in mind when I came up with the challenge was something like the following ... the sort of thing you find in SysAdmin magazine or Randal Schwartz's columns.


#!/usr/local/bin/perl5.10

use 5.010;
use strict;
use warnings;

use Readonly;

Readonly my $DOT => q{.};
Readonly my $DOTDOT => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};

die "USAGE: $0 rootdir.\n" unless @ARGV;

my @dirs = ( @ARGV );

my %stats;
while (@dirs) {
    my $one_dir = shift @dirs;
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g; # escape spaces for glob()

    ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
        next ENTRY if $entry eq $DOT or $entry eq $DOTDOT;
        if ( -d $entry ) {
            push @dirs, $entry; # queue the subdirectory for a later pass
        }
        else {
            my $size = -s _; # reuse the stat() result from the -d test
            my $len = $size == 0 ? 0 : length $size; # digit count picks the bucket
            $stats{$len}++;
        }
    }
}

for my $size ( sort { $a <=> $b} keys %stats ) {
my $maxsize = 10**$size;
say sprintf( '<%8d %d', $maxsize, $stats{$size})…

Are your files SM? M? L? XL? Kwick-N-EZ

When I first thought up the programming exercise I described last week in Are your files SM? M? L? XL?, my intention was to have a trivial exercise for applicants to carry out. HR was passing through lots of applicants who had detailed database knowledge, but were not at all programmers. They couldn't name simple Unix commands, couldn't talk about how to carry out a task in Perl or shell or as a pipeline of Unix commands. I thought this exercise would be simple for any experienced programmer to carry out, never mind style or performance points.

Shortly after I came up with the idea, I realized it could mostly be done as a Unix pipeline.


find ~/ -type f -printf "%s\n" |\
perl5.10 -n -E 'say length' |\
sort -n |\
uniq -c |\
perl5.10 -n -E '
($fill, $count, $size) = split /\s+/;
$exp = 10**($size-1);
say "$exp $count" '



Although I hadn't used the option before, man find in…

Are your files SM? M? L? XL?

Twenty years ago I was on a co-op work term where a Sun Sparc 10 was shared among 6 NCD X-terminals. The system had 32 MB of memory shared among the users and 1 GB of hard drive storage in total. Today I have 12 GB of memory and many terabytes of hard drive space ... programming experiments, video, music, and all my photography. The largest files are larger today than in the past ... the performance profiling data from the Sudoku experiments I wrote about is bigger than the total file system I worked with twenty years ago.

But which is more important, small files or large?

Very large files will be rare, otherwise you would run out of space. Very small files may be very important, but even a large number of them will not take up much space. Most space will be devoted to something in between ... but where is the bulk of storage devoted?
The (mental) challenge is to determine how file size is distributed, given some starting directory. My opinion is that exponential categories are appropriate, that is, 10.…
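
To make the idea of exponential categories concrete, here is a minimal sketch in plain shell. The excerpt above is cut off, so the exact category boundaries are an assumption: this follows the Mid-Size script on this page, where a file falls into the bucket labelled by the smallest power of ten strictly greater than its size. The name size_bucket is hypothetical and is not part of any of the original posts.

```shell
#!/bin/sh
# size_bucket (hypothetical helper): print the exclusive upper bound of the
# decimal bucket for a file size given in bytes.
# Assumption: buckets are powers of ten, and an empty file goes in "<1".
size_bucket() {
    sb_size=$1
    if [ "$sb_size" -eq 0 ]; then
        echo 1
        return
    fi
    sb_digits=${#sb_size}      # number of decimal digits in the size
    sb_bound=1
    sb_i=0
    while [ "$sb_i" -lt "$sb_digits" ]; do
        sb_bound=$((sb_bound * 10))
        sb_i=$((sb_i + 1))
    done
    echo "$sb_bound"
}

# Show the bucket for a few sample sizes.
for s in 0 7 512 4096 1048576; do
    printf '%s bytes -> <%s\n' "$s" "$(size_bucket "$s")"
done
```

With these boundaries a 512-byte file is counted under "<1000" and a 1 MiB file under "<10000000". Counting how many files land under each bound gives the distribution the challenge asks for.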