Finding nonprintable characters with a test

Our current application includes a lot of static content created by content editors. They check in static HTML files, and we include these files in various parts of the application. The problem is that they sometimes copy and paste from applications such as Outlook or Word, which can introduce nonprintable characters into the content. These characters render as garbage on the website.

After this happened a couple of times, we decided to write a test that catches nonprintable characters before they reach the site:

class NonPrintableCharactersTest < Test::Unit::TestCase
  def test_for_non_printable_characters_in_content
    assert_equal "", `find #{RAILS_ROOT}/content -name '*.html' | xargs grep -n '[^[:space:][:print:]]'`
  end
end

We use find to list all of the HTML files in the content folder, then hand that list to grep through xargs, using the bracket expression

'[^[:space:][:print:]]'

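which matches any character that is neither whitespace nor printable. When no file contains such a character, grep prints nothing, the backticks return an empty string, and the assertion passes. When the test fails, the output looks like: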

Loaded suite test/non_printable_characters_test
Started
F
Finished in 0.86005 seconds.

  1) Failure:
test_for_non_printable_characters_in_content(NonPrintableCharactersTest) [test/non_printable_characters_test.rb:5]:
<""> expected but was
<"/some/path/to/content/tmp.html:48:character �\n">.

1 tests, 1 assertions, 1 failures, 0 errors

The failure message shows the file and line number of each offending character, so it is easy to find and fix.
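
If shelling out to find and grep is not an option, the same check can be sketched in plain Ruby. This is a minimal sketch, not the test we actually run: the helper name non_printable_lines and the NON_PRINTABLE constant are hypothetical, and it assumes the content files are meant to be UTF-8. Ruby's [:print:] class is Unicode-aware, so it may not flag exactly the same characters that grep flags under your locale; lines that are not valid UTF-8 are reported as well.

```ruby
# Hypothetical helper (not from the original test): returns "path:line"
# entries for every line that contains a character that is neither
# whitespace nor printable, or that is not valid UTF-8 at all.
NON_PRINTABLE = /[^[:space:][:print:]]/

def non_printable_lines(content_dir)
  offenders = []
  # Walk every .html file under content_dir, including subfolders.
  Dir.glob(File.join(content_dir, "**", "*.html")).sort.each do |path|
    # Assumption: files are UTF-8. Line numbers start at 1, like grep -n.
    File.foreach(path, encoding: "UTF-8").with_index(1) do |line, number|
      # Check validity first so the regexp never sees a broken string.
      bad = !line.valid_encoding? || line.match?(NON_PRINTABLE)
      offenders << "#{path}:#{number}" if bad
    end
  end
  offenders
end
```

Inside the test, the assertion would then become assert_equal [], non_printable_lines("#{RAILS_ROOT}/content"), and a failure lists the offending file and line pairs much as the grep version does.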