Finding nonprintable characters with a test
Our current application includes a lot of static content created by content editors. They check in static HTML files, and we include these files in various parts of the application. The problem is that they sometimes copy and paste from applications such as Outlook or Word, which can introduce unprintable characters into the application. These characters show up strangely on the website.
After this happened a couple of times, we decided to write a test to ensure that we would always catch the unprintable characters:
class NonPrintableCharactersTest < Test::Unit::TestCase
  def test_for_non_printable_characters_in_content
    # grep prints nothing when every file is clean, so any output is a failure
    assert_equal "", `find #{RAILS_ROOT}/content -name '*.html' | xargs grep -n '[^[:space:][:print:]]'`
  end
end
We use find to get a list of all of the HTML files in the content folder. Then we pipe this list through xargs to grep, using the regular expression '[^[:space:][:print:]]', which matches any character that is neither whitespace nor printable. The output of this test looks like:
Loaded suite test/non_printable_characters_test
Started
F
Finished in 0.86005 seconds.
1) Failure:
test_for_non_printable_characters_in_content(NonPrintableCharactersTest) [test/non_printable_characters_test.rb:5]:
<""> expected but was
<"/some/path/to/content/tmp.html:48:character �</span></p>\n">.
1 tests, 1 assertions, 1 failures, 0 errors
The failure message shows the file and line number of each offending character, so it is easy to fix.
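The same check can also be sketched without shelling out, which may help on machines that lack find or grep. This is a minimal sketch rather than the code we run: the helper name non_printable_lines and the constant NON_PRINTABLE are hypothetical, the files are assumed to be UTF-8, and lines containing invalid byte sequences are reported as hits. Ruby's Unicode-aware character classes will not always agree with grep's locale-dependent ones, so the two versions can differ on non-ASCII text.

```ruby
require "find"

# Same bracket expression as the grep command: any character that is
# neither whitespace nor printable.
NON_PRINTABLE = /[^[:space:][:print:]]/

# Hypothetical helper. Returns "path:line_number:line" strings, one per
# offending line, in the style of `grep -n` output.
def non_printable_lines(content_dir)
  hits = []
  Find.find(content_dir) do |path|
    next unless File.file?(path) && path.end_with?(".html")

    # Assumption: content files are UTF-8.
    File.foreach(path, encoding: "UTF-8").with_index(1) do |line, number|
      # Check validity first, because matching a regexp against a string
      # with invalid bytes raises ArgumentError.
      if !line.valid_encoding? || line.match?(NON_PRINTABLE)
        hits << "#{path}:#{number}:#{line.scrub.chomp}"
      end
    end
  end
  hits
end
```

In a test, the assertion would then become something like assert_equal [], non_printable_lines("#{RAILS_ROOT}/content").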
Comments
-
Simply BRILLIANT!! This illustrates the 'magic' of Ruby that makes it so much fun.
-
@Thom: This is actually a shell command wrapped in a Ruby test. It can be adapted to any scripting language.
-
Just note that the find is not entirely safe. I'd add "-X" to the command line. From the man page: "-X: The -X option is a modification to permit find to be safely used in conjunction with xargs(1). If a file name contains any of the delimiting characters used by xargs, a diagnostic message is displayed on standard error, and the file is skipped. The delimiting characters include single (') and double (") quotes, backslash (\), space, tab and newline characters." While generally harmless in your script, if it were a destructive tool (like rm) it could hit the wrong file.
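On systems where find has no -X option, the delimiter problem is commonly handled by separating file names with NUL bytes, using find -print0 together with xargs -0. The following is a minimal sketch of that variant, not the original test: the helper name grep_non_printable is hypothetical, and it assumes a find and xargs that support -print0 and -0 (both the GNU and BSD versions do) and a grep that supports -H.

```ruby
require "shellwords"

# Hypothetical helper. Returns the raw grep output for all *.html files
# under content_dir; an empty string means no offending characters.
def grep_non_printable(content_dir)
  dir = Shellwords.escape(content_dir)
  # -print0 and -0 delimit file names with NUL bytes, so names containing
  # spaces, quotes or backslashes reach grep intact instead of being split.
  # -H prefixes every match with its file name, even for a single file.
  `find #{dir} -name '*.html' -print0 | xargs -0 grep -nH '[^[:space:][:print:]]'`
end
```

Unlike -X, which skips awkward file names and prints a diagnostic, this approach still checks them.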