PWC 322 String Format

Task 1 String Format

You are given a string and a positive integer. Write a script to format the string, removing any dashes, in groups of size given by the integer. The first group can be smaller than the integer but should have at least one character. Groups should be separated by dashes.

  • Example 1

    • Input: $str = "ABC-D-E-F", $i = 3
    • Output: "ABC-DEF"
  • Example 2

    • Input: $str = "A-BC-D-E", $i = 2
    • Output: "A-BC-DE"
  • Example 3

    • Input:$str = "-A-B-CD-E", $i = 4
    • Output: "A-BCDE"

Thought process

The obvious first step is going to be to remove any dashes that might already be present.

Then, we’ll want to form groups of $i, starting from the right end of the string — or reverse the string to start from the left.

$str = reverse $str =~ s/-+//gr;

The r modifier on the substitution is going to return the result of the operation, which can then be fed straight into the reverse function. Without it, the s/// operation would return the number of matches, which is not useful this week.

There are at least three ways to operate on groups of characters: (1) regular expression match; (2) substring operators; (3) some kind of split operation. Like an aspiring Pokemon trainer, let’s try to catch ’em all.

Solution the first: regular expressions

A simple global replacement should work: every group of i characters can be replaced by itself plus a dash, and then we can reverse the string again to get back its original order.

$str = reverse $str =~ s/.{$i}/$&-/gr;
  • .{$i} is regex for i occurrences of any character
  • $& is regex for “whatever matched”

There’s an untidy detail: if the length of the string is a multiple of i, this will leave us with an extra dash at the end of the string. I’m going to speculate that there’s a regular expression involving a look-ahead assertion to test for the end-of-string, but I’m going to do the easy thing and simply trim a leading dash if that’s what we ended up with.

sub strFmtRE($str, $i)
{
    $str = reverse $str =~ s/-+//gr;
    $str = reverse $str =~ s/.{$i}/$&-/gr;
    return $str =~ s/^-//r;
}

Solution the second: substrings

From the length of the string, we can divide by $i to
calculate the number of dashes to insert, and then we can insert them into the string at regular intervals, $i characters apart.

sub strFmtSubstr($str, $i)
{
    $str = reverse $str =~ s/-+//gr;
    my $d = int( (length($str) - 0.5) / $i);

    for my $p ( map { $_ * $i } reverse 1 .. $d )
    {
        substr($str, $p, 0, '-');
        $logger->debug("i=$i p=$p str=$str");
    }
    return reverse $str;
}
  • The first line once again removes dashes and reverses the string.
  • $d is the number of dashes we need to insert into the string.
  • In the for statement from right to left:
    • reverse 1 .. $d creates a sequence that counts down from $d to 1. Unlike Python, we can’t specify a descending sequence in Perl; it has to be the reverse of an ascending sequence.
    • map { $_ * $i } transforms that sequence into a set of offsets where dashes are needed
  • substr($str, $p, 0, '-') inserts the '-' character. It selects a 0-length string at position $p, and then replaces it with the fourth argument.
  • And this is why the loop counts down instead of up: every time we insert a character, the length of the string grows. If we operated from left to right, we would be changing the positions where dashes are needed. By operating from right to left, the fact that the end of the string is getting longer doesn’t affect the decreasing values of $p.

It’s not really necessary to reverse the string in this solution. We could calculate $p as offsets from the end of the string. But we didn’t.

Solution the third: unpack

As it happens, over the weekend I was looking at the draft second edition of Dave Cross’s book “Data Munging in Perl”. That left the unpack function hovering around in my brain, which is a particularly easy solution to this problem, except for (a) the part where you have to be aware that the unpack function exists, and (b) the part where you have to realize that 90% of the rather daunting pack/unpack tutorial is irrelevant for this.

unpack uses its own pattern language to describe templates for data. When a string matches the template, unpack hands you an array of matching parts. A string of 4 characters is represented by the pattern A4. A repeating set of 4-character substrings is represented as (A4)*.

sub strFmtUnpack($str, $i). 
{
    return scalar
      reverse
        join("-",
          unpack("(A$i)*",
            reverse $str =~ s/-//gr));
}

Let’s read this backwards.

  • At the end, we have the same reversal of the string with its dashes deleted. This will become the second argument of unpack.
  • The template argument for unpack is "A($i)*", to select groups of $i characters.
  • Unpack will return an array of i-character strings. It turns out that the * in the template of unpack will give us the trailing bit of $str that is less than $i long as a bonus array element. How lucky for us.
  • join puts the dashes in where we want them
  • except that the string is backwards now, so we need to reverse it again
  • and we need to force scalar context because otherwise reverse likes to operate on array elements, and we need it to work on the strings in its arguments.

Battle of the solutions

Time to benchmark. Last week, we saw that a regular expression solution was a clear winner. I’ll make up a long-ish string of, say, 1000 characters, and ask to format it in some random size chunks. My environment is a five-year-old MacBook Pro M1 laptop, running Perl 5.40 on MacOS Sequoia.

sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;

    my $str = "xyzzy" x 200;
    my $i   = 23;

    cmpthese($repeat, {
            regex  => sub { strFmtRE($str, $i) },
            substr => sub { strFmtSubstr($str, $i) },
            unpack => sub { strFmtUnpack($str, $i) },
        });
}

Drum roll please:

$ perl ch-1.pl -b 200000
           Rate substr  regex unpack
substr  54054/s     --   -54%   -81%
regex  117647/s   118%     --   -58%
unpack 281690/s   421%   139%     --

Not too surprisingly, unpack, which was designed for this kind of thing, kicks butt.

Leave a Reply