Blackrabbit Coding challenge: finding the median letter

Internship assignment, blackrabbit, how to find the median letter


By. Jacob

Jacob Kristensen (Turbulentarius) is a Web Developer based in Denmark. He is currently pursuing a Bachelor's degree in Web Development at Zealand, focusing on learning React and refining his existing skills.

Edited: 2025-07-05 12:52

Recently, I was asked to do a coding challenge as part of an internship application — which is very odd, since I was just looking for an unpaid internship in connection with my bachelor's in web development. I was certainly not expecting a company to be fishing for the best they could find.

The challenge was named black_rabbit and was seemingly developed some 9–10 years ago by a certain Danish web company that shall remain unnamed in this post.

I had to determine the median value in five different texts to make a bunch of PHPUnit tests pass. There was no mention of what kind of median they wanted me to find (a median is not just a median). After an epic guessing session, I determined they wanted the median of a list of letters sorted by frequency (the number of occurrences of each letter in the text). The list is sorted from lowest to highest frequency, and the letter in the middle is the median. For the fifth text, the result is 46 distinct letters when Unicode characters are taken into account and capital letters are converted to lowercase:

array(46) {
  ["ò"]=> int(1)
  ["ñ"]=> int(1)
  ["ù"]=> int(1)
  ["ö"]=> int(1)
  ["ó"]=> int(1)
  ["û"]=> int(1)
  ["à"]=> int(2)
  ["ì"]=> int(2)
  ["ü"]=> int(2)
  ["â"]=> int(4)
  ["ï"]=> int(4)
  ["ë"]=> int(4)
  ["ä"]=> int(4)
  ["ê"]=> int(5)
  ["è"]=> int(12)
  ["é"]=> int(21)
  ["æ"]=> int(21)
  ["ú"]=> int(78)
  ["í"]=> int(140)
  ["á"]=> int(413)
  ["q"]=> int(698)
  ["z"]=> int(858)
  ["x"]=> int(877)
  ["j"]=> int(1131)
  ["v"]=> int(5860)
  ["k"]=> int(6199)
  ["b"]=> int(11514)
  ["p"]=> int(11624)
  ["y"]=> int(13827)
  ["g"]=> int(14000)
  ["c"]=> int(15902)
  ["w"]=> int(17823)
  ["f"]=> int(18122)
  ["u"]=> int(18705)
  ["m"]=> int(19857)
  ["l"]=> int(28718)
  ["d"]=> int(36645)
  ["r"]=> int(41793)
  ["s"]=> int(45473)
  ["i"]=> int(49504)
  ["o"]=> int(51296)
  ["n"]=> int(53264)
  ["h"]=> int(54184)
  ["a"]=> int(63735)
  ["t"]=> int(65270)
  ["e"]=> int(91867)
}
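
The rule described above (count each letter, sort by ascending frequency, pick the middle) can be sketched on a tiny input. This is only an illustration: the function name medianLetter is hypothetical and not part of the challenge, it assumes the mbstring extension is available, and it picks the lower middle for an even number of letters.

```php
<?php
// Hypothetical helper for illustration; not taken from the challenge repository.
function medianLetter(string $text): ?string
{
    // Keep only characters classified as letters, then lowercase them (Unicode-aware).
    $lettersOnly = mb_strtolower(preg_replace('/\P{L}/u', '', $text) ?? '', 'UTF-8');

    // Split into whole characters and count how often each one occurs.
    $chars = preg_split('//u', $lettersOnly, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($chars);
    if ($counts === []) {
        return null; // no letters at all
    }

    // Sort by ascending frequency; the keys (letters) are preserved.
    asort($counts);
    $keys = array_keys($counts);

    // Lower middle for an even count, the single middle for an odd count.
    return (string) $keys[intdiv(count($keys) - 1, 2)];
}

// Frequencies: d=1, c=2, b=3, a=4, so the lower middle of [d, c, b, a] is "c".
echo medianLetter('aaaa bbb cc d'), "\n";
```

Letters with equal counts have no defined "correct" order here, which is one more reason the expected median can differ between implementations.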

Surprisingly, the fifth text kept failing; according to their predefined results in the PHPUnit test, the fifth median should be "z" — but I kept getting "x" for the lower median and "j" as the higher median.

It turned out their original test was flawed; the only way I could get "z" as the median while having the previous tests pass was by using PHP's non-Unicode functions to work on the strings!

Sure enough, when I looked over the ancient pull requests on their ridiculously public repository, there was an old solution that relied on these functions, where all the test cases were passing. Yet, it was clearly counting capitalized versions of UTF-8 characters that should have been converted to lowercase.

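So, seemingly, their version has 49 letters in the array because they also counted some of the capitalized versions of UTF-8 characters (e.g., Á, É, Æ) as separate letters. Zero-indexing does not change that count: 49 items occupy indices 0 through 48, with a single middle element at index 24.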
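
To see how a few extra entries could move the median, here is a small sketch using the letter order from the dump above. The three uppercase letters are an assumption made for illustration: I do not know exactly which capitals ended up in their array, only that rare ones would sort ahead of "z".

```php
<?php
// The 46 letters from the dump, in ascending order of frequency.
$letters = [
    'ò', 'ñ', 'ù', 'ö', 'ó', 'û', 'à', 'ì', 'ü', 'â', 'ï', 'ë', 'ä', 'ê', 'è', 'é',
    'æ', 'ú', 'í', 'á', 'q', 'z', 'x', 'j', 'v', 'k', 'b', 'p', 'y', 'g', 'c', 'w',
    'f', 'u', 'm', 'l', 'd', 'r', 's', 'i', 'o', 'n', 'h', 'a', 't', 'e',
];

// Assumption: three unconverted capitals with low counts, so they land before "z".
$withCapitals = array_merge(['Á', 'É', 'Æ'], $letters);

// Index of the lower middle element (the only middle element for odd counts).
function middleIndex(array $list): int
{
    return intdiv(count($list) - 1, 2);
}

echo count($letters), ' letters: ', $letters[middleIndex($letters)], "\n";                // 46 letters: x
echo count($withCapitals), ' letters: ', $withCapitals[middleIndex($withCapitals)], "\n"; // 49 letters: z
```

Under that assumption, three rare extra entries are enough to turn "x" into the "z" their test expects.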

They were probably using strtolower instead of mb_strtolower.

$content = file_get_contents($filePath);
$letters = preg_replace('/[^\p{L}]/u', '', $content);
return strtolower($letters);

In this solution, they correctly used preg_replace with the pattern /[^\p{L}]/u, which strips everything that is not a letter. This is identical to the pattern I came up with. Mind you, I also tried iterating over the entire input, which is extremely slow compared with the regular expression.

However, the problem seems to be their use of strtolower(), which works on single bytes and does not lowercase multibyte UTF-8 letters such as Á, É, and Æ, so you end up with a mix of lowercase and uppercase letters. You might not notice, though, because str_split also works on bytes: it breaks every multibyte character into separate bytes, which corrupts the resulting array and turns the UTF-8 characters into unreadable gibberish.
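
The difference between the byte-based and the Unicode-aware functions is easy to check on a single word. This is a minimal sketch with a sample word of my own choosing, assuming the source file is saved as UTF-8 and mbstring is installed:

```php
<?php
$word = 'ÆBLE'; // "Æ" takes two bytes in UTF-8, the other letters one byte each

// str_split() cuts the string into single bytes, so "Æ" is torn into two pieces.
$bytes = str_split($word);
echo count($bytes), "\n"; // 5

// preg_split() with the /u flag splits on character boundaries instead.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n"; // 4

// mb_strtolower() understands the multibyte letter.
echo mb_strtolower($word, 'UTF-8'), "\n"; // æble

// strtolower() is byte-based, so its result for "Æ" is not the lowercase letter.
var_dump(strtolower($word) === mb_strtolower($word, 'UTF-8')); // bool(false)
```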

Text file: text5.txt

Here is my solution:

$filePath = 'txt/text5.txt';

if (!file_exists($filePath)) {
  throw new \RuntimeException("File does not exist: $filePath");
}

// Open the file for reading
$handle = fopen($filePath, 'r');
if (!$handle) {
  throw new \RuntimeException("Cannot open file: $filePath");
}

// Attempt to obtain a shared lock with timeout
$startTime = microtime(true);
$locked = false;

// LOCK_NB makes flock() return immediately instead of blocking, so we can retry until the timeout
do {
  $locked = flock($handle, LOCK_SH | LOCK_NB);
  if (!$locked) {
    usleep(100000); // 100 ms
  }
} while (!$locked && (microtime(true) - $startTime) < 2.0);

if (!$locked) {
  fclose($handle);
  throw new \RuntimeException("Could not obtain shared lock on file within 2 seconds: $filePath");
}

$letters = [];
// Read line by line: a fixed-size fread() can cut a multibyte UTF-8 character
// in half, which makes preg_replace() with the /u flag fail and return null
while (($line = fgets($handle)) !== false) {
  // Remove everything that is not classified as a letter, then lowercase (Unicode-aware)
  $line = mb_strtolower(
    preg_replace('/[^\p{L}]/u', '', $line) ?? '',
    'UTF-8'
  );
  // Split the line into an array of characters (works with Unicode)
  $charList = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);

  foreach ($charList as $char) {
    if (!isset($letters[$char])) {
      $letters[$char] = 0;
    }
    $letters[$char]++;
  }
}
// Sort by ascending letter frequency
asort($letters);

flock($handle, LOCK_UN);
fclose($handle);

// Sorting done; now we look for the median

$totalLetters = count($letters);

if ($totalLetters < 1) {
  throw new \RuntimeException('No letters to count.');
}

// If even
if ($totalLetters % 2 === 0) {
  $median = intdiv($totalLetters, 2) - 1; // Pick the lower middle (-1) for even results
} else {
  $median = intdiv($totalLetters, 2); // Pick the middle element of an odd result
}

// This part avoids iterating over the $letters array to find the middle
$keys = array_keys($letters); // Get the keys (letters) while preserving array order
$letter = $keys[$median];
$occurrences = $letters[$letter]; // Number of times the median letter occurs

echo "\n  Median: " . $letter . "\n";
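
A note on reading UTF-8 files in pieces: a fixed-size byte read, such as fread($handle, 8192), can end in the middle of a multibyte character, and a pattern with the /u flag then rejects the whole piece. The sketch below simulates this with substr() and an in-memory stream rather than a real file; the sample strings are my own.

```php
<?php
$text = 'aé'; // 3 bytes: "a" plus a two-byte "é"

// Simulates a fixed-size read that stops after the first byte of "é".
$firstChunk = substr($text, 0, 2);

var_dump(mb_check_encoding($firstChunk, 'UTF-8'));   // bool(false)
var_dump(preg_replace('/\P{L}/u', '', $firstChunk)); // NULL, the chunk is lost

// Reading by line is safe: the newline byte never occurs inside a multibyte character.
$stream = fopen('php://memory', 'w+');
fwrite($stream, "aé\nñb\n");
rewind($stream);

$allLinesValid = true;
while (($line = fgets($stream)) !== false) {
    $allLinesValid = $allLinesValid && mb_check_encoding($line, 'UTF-8');
}
fclose($stream);

var_dump($allLinesValid); // bool(true)
```

Reading by line trades a little memory for correctness; a file with one extremely long line would be read in a single piece.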

My result:

Text 1: m (high), f (low)
Text 2: w (high), m (low)
Text 3: w (high), g (low)
Text 4: w (middle); uneven count, so high/low is not applicable
Text 5: x (high), j (low)
