Task 1 String Format
You are given a string and a positive integer. Write a script to format the string, removing any dashes, in groups of size given by the integer. The first group can be smaller than the integer but should have at least one character. Groups should be separated by dashes.
-
Example 1
- Input:
$str = "ABC-D-E-F"
,$i = 3
- Output:
"ABC-DEF"
- Input:
-
Example 2
- Input:
$str = "A-BC-D-E"
,$i = 2
- Output:
"A-BC-DE"
- Input:
-
Example 3
- Input:
$str = "-A-B-CD-E"
,$i = 4
- Output:
"A-BCDE"
- Input:
Thought process
The obvious first step is going to be to remove any dashes that might already be present.
Then, we’ll want to form groups of $i
, starting from the right end of the string — or reverse the string to start from the left.
$str = reverse $str =~ s/-+//gr;
The r
modifier on the substitution is going to return the result of the operation, which can then be fed straight into the reverse
function. Without it, the s///
operation would return the number of matches, which is not useful this week.
There are at least three ways to operate on groups of characters: (1) regular expression match; (2) substring operators; (3) some kind of split operation. Like an aspiring Pokemon trainer, let’s try to catch ’em all.
Solution the first: regular expressions
A simple global replacement should work: every group of i
characters can be replaced by itself plus a dash, and then we can reverse the string again to get back its original order.
$str = reverse $str =~ s/.{$i}/$&-/gr;
-
.{$i}
is regex fori
occurrences of any character -
$&
is regex for “whatever matched”
There’s an untidy detail: if the length of the string is a multiple of i
, this will leave us with an extra dash at the end of the string. I’m going to speculate that there’s a regular expression involving a look-ahead assertion to test for the end-of-string, but I’m going to do the easy thing and simply trim a leading dash if that’s what we ended up with.
sub strFmtRE($str, $i)
{
$str = reverse $str =~ s/-+//gr;
$str = reverse $str =~ s/.{$i}/$&-/gr;
return $str =~ s/^-//r;
}
Solution the second: substrings
From the length of the string, we can divide by $i
to
calculate the number of dashes to insert, and then we can insert them into the string at regular intervals, $i
characters apart.
sub strFmtSubstr($str, $i)
{
$str = reverse $str =~ s/-+//gr;
my $d = int( (length($str) - 0.5) / $i);
for my $p ( map { $_ * $i } reverse 1 .. $d )
{
substr($str, $p, 0, '-');
$logger->debug("i=$i p=$p str=$str");
}
return reverse $str;
}
- The first line once again removes dashes and reverses the string.
-
$d
is the number of dashes we need to insert into the string. - In the
for
statement from right to left:-
reverse 1 .. $d
creates a sequence that counts down from$d
to 1. Unlike Python, we can’t specify a descending sequence in Perl; it has to be the reverse of an ascending sequence. -
map { $_ * $i }
transforms that sequence into a set of offsets where dashes are needed
-
-
substr($str, $p, 0, '-')
inserts the'-'
character. It selects a 0-length string at position$p
, and then replaces it with the fourth argument. - And this is why the loop counts down instead of up: every time we insert a character, the length of the string grows. If we operated from left to right, we would be changing the positions where dashes are needed. By operating from right to left, the fact that the end of the string is getting longer doesn’t affect the decreasing values of
$p
.
It’s not really necessary to reverse the string in this solution. We could calculate $p
as offsets from the end of the string. But we didn’t.
Solution the third: unpack
As it happens, over the weekend I was looking at the draft second edition of Dave Cross’s book “Data Munging in Perl”. That left the unpack
function hovering around in my brain, which is a particularly easy solution to this problem, except for (a) the part where you have to be aware that the unpack
function exists, and (b) the part where you have to realize that 90% of the rather daunting pack/unpack tutorial is irrelevant for this.
unpack
uses its own pattern language to describe templates for data. When a string matches the template, unpack
hands you an array of matching parts. A string of 4 characters is represented by the pattern A4
. A repeating set of 4-character substrings is represented as (A4)*
.
sub strFmtUnpack($str, $i).
{
return scalar
reverse
join("-",
unpack("(A$i)*",
reverse $str =~ s/-//gr));
}
Let’s read this backwards.
- At the end, we have the same reversal of the string with its dashes deleted. This will become the second argument of
unpack
. - The template argument for
unpack
is"A($i)*"
, to select groups of$i
characters. - Unpack will return an array of
i
-character strings. It turns out that the*
in the template ofunpack
will give us the trailing bit of$str
that is less than$i
long as a bonus array element. How lucky for us. -
join
puts the dashes in where we want them - except that the string is backwards now, so we need to
reverse
it again - and we need to force scalar context because otherwise
reverse
likes to operate on array elements, and we need it to work on the strings in its arguments.
Battle of the solutions
Time to benchmark. Last week, we saw that a regular expression solution was a clear winner. I’ll make up a long-ish string of, say, 1000 characters, and ask to format it in some random size chunks. My environment is a five-year-old MacBook Pro M1 laptop, running Perl 5.40 on MacOS Sequoia.
sub runBenchmark($repeat)
{
use Benchmark qw/cmpthese/;
my $str = "xyzzy" x 200;
my $i = 23;
cmpthese($repeat, {
regex => sub { strFmtRE($str, $i) },
substr => sub { strFmtSubstr($str, $i) },
unpack => sub { strFmtUnpack($str, $i) },
});
}
Drum roll please:
$ perl ch-1.pl -b 200000
Rate substr regex unpack
substr 54054/s -- -54% -81%
regex 117647/s 118% -- -58%
unpack 281690/s 421% 139% --
Not too surprisingly, unpack
, which was designed for this kind of thing, kicks butt.