In the previous article we looked at writing our own CodeSniffer standard based on pre-existing rules or sniffs.
This article covers in depth how CodeSniffer actually works, as groundwork for the next article: writing a sniff from scratch.
The PHP Tokenizer
CodeSniffer works by extending the output of PHP's native tokenizer function, token_get_all().
Given the following section of code:
<?php function DoSomething(array $foo) { print_r($foo); } ?>
and running it through PHP's native tokenizer, we get the following output:
PHP Tokenized version of foo.php
PHP’s tokenizer only identifies a limited subset of PHP syntax as listed here.
All other potential tokens are either identified as a string, signified by the sub-array [0] integer value of 307 (the constant T_STRING), or simply returned as the raw string value of the token, i.e. array values 17, 18 and 20 in the sample output above.
To map the above integer values to PHP's constant names, you can use the PHP function token_name().
For example:
$ php -r "print(token_name(369));"
T_CLOSE_TAG
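The two steps above (tokenizing, then mapping integer ids to names) can be combined in one short script. This is a minimal sketch: it holds the sample function in a string rather than reading foo.php, and the `$summary` variable is our own naming. Integer ids differ between PHP versions, so the sketch prints names instead of numbers.

```php
<?php
// The sample function from above, held in a string so it can be tokenized directly.
$source = '<?php function DoSomething(array $foo) { print_r($foo); } ?>';

$summary = [];
foreach (token_get_all($source) as $index => $token) {
    if (is_array($token)) {
        // Recognised token: [0] is the integer id, [1] the matched text, [2] the line number.
        $summary[] = sprintf('%d: %s %s', $index, token_name($token[0]), json_encode($token[1]));
    } else {
        // Unrecognised token: PHP hands back the raw character(s) as a plain string.
        $summary[] = sprintf('%d: (string) %s', $index, json_encode($token));
    }
}

echo implode(PHP_EOL, $summary), PHP_EOL;
```

Brackets, braces and the semicolon show up as plain strings, which is exactly the gap CodeSniffer fills in later.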
Once the PHP tokenizer has run, a lot of code is still lumped together as T_STRING or left untokenized, so CodeSniffer takes these simple tokens and expands them further.
CodeSniffer introduces new constants such as T_TRUE, T_FALSE, T_NULL, T_PARENT, T_OPEN_CURLY_BRACKET and so on.
This gives CodeSniffer considerable scope to handle much finer detail of PHP syntax.
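To make the idea concrete, here is a rough sketch of what such a refinement pass could look like. It is not CodeSniffer's own code: the lookup tables, the `refineTokens()` function and the `UNMAPPED` label are all hypothetical, and the type names are kept as strings rather than real constants.

```php
<?php
// Hypothetical lookup tables; the names mirror the constants listed above.
$keywordTypes = ['true' => 'T_TRUE', 'false' => 'T_FALSE', 'null' => 'T_NULL', 'parent' => 'T_PARENT'];
$symbolTypes  = ['{' => 'T_OPEN_CURLY_BRACKET'];

function refineTokens(string $source, array $keywordTypes, array $symbolTypes): array
{
    $refined = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            $type    = token_name($token[0]);
            $content = $token[1];
            // PHP reports true, false, null and parent as generic T_STRING tokens.
            $key = strtolower($content);
            if ($token[0] === T_STRING && isset($keywordTypes[$key])) {
                $type = $keywordTypes[$key];
            }
        } else {
            // Plain string tokens get a named type where our table knows one.
            $content = $token;
            $type    = $symbolTypes[$token] ?? 'UNMAPPED';
        }
        $refined[] = ['type' => $type, 'content' => $content];
    }
    return $refined;
}

$refined = refineTokens('<?php if ($a === null) { return true; }', $keywordTypes, $symbolTypes);
foreach ($refined as $token) {
    echo $token['type'], ' ', json_encode($token['content']), PHP_EOL;
}
```

Every token now carries a name, so a sniff can ask for "null" or "opening curly bracket" directly instead of inspecting raw strings.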
The PHP CodeSniffer Tokenizer
When CodeSniffer first loads, the standard in use is determined from the command line or from the stored config. The standard is then loaded and all of that standard’s sniffs are loaded.
CodeSniffer calls the register() method of each sniff and builds a hash that maps every token to the sniff classes interested in it.
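A minimal sketch of that registration step might look like the following. The two sniff classes and `buildListenerMap()` are hypothetical stand-ins written for this illustration; only the idea of a register() method returning token ids comes from CodeSniffer itself.

```php
<?php
// Hypothetical sniffs: only register() matters for this step.
class FunctionNameSniff
{
    public function register(): array
    {
        return [T_FUNCTION];
    }
}

class WhitespaceSniff
{
    public function register(): array
    {
        return [T_WHITESPACE, T_FUNCTION];
    }
}

// Build the token => listening classes hash described above.
function buildListenerMap(array $sniffs): array
{
    $listeners = [];
    foreach ($sniffs as $sniff) {
        foreach ($sniff->register() as $tokenId) {
            $listeners[$tokenId][] = get_class($sniff);
        }
    }
    return $listeners;
}

$listeners = buildListenerMap([new FunctionNameSniff(), new WhitespaceSniff()]);
foreach ($listeners as $tokenId => $classes) {
    echo token_name($tokenId), ' => ', implode(', ', $classes), PHP_EOL;
}
```

Keying the hash by token means the lookup during processing is a single array access per token, however many sniffs the standard contains.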
CodeSniffer then starts looking for the files to check. If a directory is specified, CodeSniffer walks through it and its subdirectories collecting every file with a matching extension, then processes each file in turn.
If a list of files or a single file is specified, then the above step is skipped and CodeSniffer starts parsing the file(s) as defined in the parameters.
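The file discovery described above can be sketched with PHP's SPL iterators. The function name `findFilesToCheck()` and the default extension list of just `php` are assumptions made for this example, not CodeSniffer's actual defaults. The demo builds a throwaway directory so the sketch runs anywhere.

```php
<?php
function findFilesToCheck(string $path, array $extensions = ['php']): array
{
    if (is_file($path)) {
        // An explicit file skips the directory walk entirely.
        return [$path];
    }

    $files    = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($path, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // Keep only regular files whose extension is in the allowed list.
        if ($file->isFile() && in_array(strtolower($file->getExtension()), $extensions, true)) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}

// Demo: a temporary tree with two PHP files and one file that should be ignored.
$dir = sys_get_temp_dir() . '/sniff-demo-' . uniqid();
mkdir($dir . '/sub', 0777, true);
file_put_contents($dir . '/foo.php', '<?php');
file_put_contents($dir . '/sub/bar.php', '<?php');
file_put_contents($dir . '/notes.txt', 'ignored');

$found = findFilesToCheck($dir);
print_r($found);
```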
Once CodeSniffer has tokenized the file under analysis into one (rather large) multidimensional array of language syntax tokens, the rest is quite simple.
CodeSniffer breaks each file under examination down and does a series of context checks before processing the tokens and calling all the registered sniffs.
These checks are:
- Bracket Map: checking braces
- Scope Map: checking for class, function and conditional statement scopes
- Level Map: checking for class, function and conditional statement levels
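As a rough illustration of the first and last of these checks, the sketch below pairs each opening bracket with its closer and records a curly-brace nesting level for every token. It works on PHP's raw tokens rather than CodeSniffer's expanded ones, the function name `mapBracketsAndLevels()` is our own, and it deliberately ignores awkward cases such as braces inside interpolated strings. The scope map is not covered.

```php
<?php
function mapBracketsAndLevels(array $tokens): array
{
    $closerFor = ['(' => ')', '{' => '}', '[' => ']'];
    $stack     = [];  // indexes of opening brackets still waiting for a closer
    $brackets  = [];  // opener index => closer index
    $levels    = [];  // token index => curly-brace nesting depth
    $depth     = 0;

    foreach ($tokens as $index => $token) {
        // Brackets come back from PHP's tokenizer as plain strings.
        $text = is_array($token) ? null : $token;

        // A closing brace sits at the same level as its opening brace.
        if ($text === '}') {
            $depth = max(0, $depth - 1);
        }
        $levels[$index] = $depth;
        if ($text === '{') {
            $depth++;
        }

        if ($text !== null && isset($closerFor[$text])) {
            $stack[] = $index;
        } elseif ($text !== null && in_array($text, $closerFor, true) && $stack !== []) {
            $brackets[array_pop($stack)] = $index;
        }
    }

    return ['brackets' => $brackets, 'levels' => $levels];
}

$tokens = token_get_all('<?php function DoSomething(array $foo) { print_r($foo); } ?>');
$maps   = mapBracketsAndLevels($tokens);
print_r($maps['brackets']);
```

With these maps in hand, a sniff can jump straight from an opening brace to its closer instead of rescanning the token list.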
If we look at the CodeSniffer Tokenized version of foo.php we can see the levels of our sample script above.
During the initialisation phase, each sniff in the standard registered the tokens it wants to be invoked for.
CodeSniffer then runs through each token in the file from beginning to end and calls the process() method of every sniff that registered an interest in that token.
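Putting registration and processing together, the main loop can be sketched roughly as follows. This is a simplified model under stated assumptions: the sniff class, its message and `runSniffs()` are hypothetical, and process() here receives the plain token array and an index, where the real CodeSniffer passes a file object along with the stack pointer.

```php
<?php
// Hypothetical sniff: flags function names that start with an uppercase letter.
class FunctionNameCaseSniff
{
    public function register(): array
    {
        return [T_FUNCTION];
    }

    // $stackPtr is the index of the token that triggered this call.
    public function process(array $tokens, int $stackPtr): array
    {
        // The function name is the first T_STRING after the function keyword.
        for ($i = $stackPtr + 1, $count = count($tokens); $i < $count; $i++) {
            if (is_array($tokens[$i]) && $tokens[$i][0] === T_STRING) {
                $name = $tokens[$i][1];
                return $name === lcfirst($name)
                    ? []
                    : ["Function name \"$name\" should start with a lowercase letter"];
            }
        }
        return [];
    }
}

function runSniffs(string $source, array $sniffs): array
{
    // Initialisation: record which sniffs listen for which token.
    $listeners = [];
    foreach ($sniffs as $sniff) {
        foreach ($sniff->register() as $tokenId) {
            $listeners[$tokenId][] = $sniff;
        }
    }

    // Processing: walk the tokens once and call every interested sniff.
    $tokens   = token_get_all($source);
    $messages = [];
    foreach ($tokens as $stackPtr => $token) {
        if (!is_array($token)) {
            continue; // plain string tokens have no id to listen for in this sketch
        }
        foreach ($listeners[$token[0]] ?? [] as $sniff) {
            foreach ($sniff->process($tokens, $stackPtr) as $message) {
                $messages[] = $message;
            }
        }
    }
    return $messages;
}

$messages = runSniffs(
    '<?php function DoSomething(array $foo) { print_r($foo); } ?>',
    [new FunctionNameCaseSniff()]
);
print_r($messages);
```

Run against our sample function, the sketch reports DoSomething, since the name starts with an uppercase letter.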
Finally all of the errors and warnings generated by those sniffs are organised into the desired report type and displayed.
So now we have an insight into how CodeSniffer works. In the next and final post in this series on CodeSniffer, we'll look at writing a new sniff.