Finding nonprintable characters with a test

written by paul on September 30th, 2008 @ 07:26 PM

Our current application includes a lot of static content created by content editors. They check in static HTML files, and we include these files in various parts of the application. The problem is that they sometimes copy and paste from applications such as Outlook or Word, which can introduce nonprintable characters into the content. These characters then show up as stray symbols on the website.

After this happened a couple of times, we decided to write a test that catches nonprintable characters before they reach the site:


class NonPrintableCharactersTest < Test::Unit::TestCase
  def test_for_non_printable_characters_in_content
    assert_equal "", `find #{RAILS_ROOT}/content -name '*.html' | xargs grep -n '[^[:space:][:print:]]'`
  end
end

We use find to list all of the HTML files in the content folder. Then we pipe that list through xargs to grep, using the regular expression

'[^[:space:][:print:]]'

which matches any character that is neither whitespace nor printable. The test asserts that the command prints nothing, so any match makes it fail. The output of a failing test looks like:


Loaded suite test/non_printable_characters_test
Started
F
Finished in 0.86005 seconds.

  1) Failure:
test_for_non_printable_characters_in_content(NonPrintableCharactersTest) [test/non_printable_characters_test.rb:5]:
<""> expected but was
<"/some/path/to/content/tmp.html:48:character �</span></p>\n">.

1 tests, 1 assertions, 1 failures, 0 errors

The failure message shows the file and line number of the offending character (line 48 of tmp.html in this example), so it is easy to find and fix.
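
The same check can also be sketched in pure Ruby, which avoids shelling out and does not depend on how xargs splits file names. This is only a sketch under a few assumptions: the helper name non_printable_lines is hypothetical, the content files are assumed to be UTF-8, and lines containing invalid UTF-8 byte sequences (which pasted Windows-1252 text tends to produce) are treated as offending. Ruby's POSIX bracket classes are Unicode-aware, so the results may differ slightly from grep's, depending on the locale grep runs in.

```ruby
# Matches any character that is neither whitespace nor printable,
# mirroring the grep bracket expression '[^[:space:][:print:]]'.
NON_PRINTABLE = /[^[:space:][:print:]]/

# Hypothetical helper (not from the original post): returns one
# "path:line_number:line" string per offending line in the .html
# files under dir, similar to the output of grep -n.
def non_printable_lines(dir)
  offenders = []
  Dir.glob(File.join(dir, "**", "*.html")).sort.each do |path|
    # Assumption: content is UTF-8. Invalid bytes are read as-is, not transcoded.
    File.foreach(path, encoding: "UTF-8").with_index(1) do |line, number|
      # Check validity first: matching a regex against a string with
      # invalid byte sequences would raise ArgumentError.
      bad = !line.valid_encoding? || line.match?(NON_PRINTABLE)
      # scrub replaces invalid bytes with U+FFFD so the report is a valid string.
      offenders << "#{path}:#{number}:#{line.scrub.chomp}" if bad
    end
  end
  offenders
end
```

With a helper like this loaded by the test, the assertion could become assert_equal [], non_printable_lines("#{RAILS_ROOT}/content"), and a failure would list every offending file and line number.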

Comments

  • Thom Parkin on 01 Oct 15:38

    Simply BRILLIANT!! This illustrates the 'magic' of Ruby that makes it so much fun.
  • Mark on 02 Oct 12:36

    @Thom: This is actually a shell command wrapped in a Ruby test. It can be adapted to any scripting language.
  • Michael Graff on 28 Feb 00:19

    Just note that the find is not entirely safe. I'd add "-X" to the command line:

    "-X: The -X option is a modification to permit find to be safely used in conjunction with xargs(1). If a file name contains any of the delimiting characters used by xargs, a diagnostic message is displayed on standard error, and the file is skipped. The delimiting characters include single (') and double (") quotes, backslash (\), space, tab and newline characters."

    While generally harmless in your script, if it were a destructive tool (like rm) it could hit the wrong file.
