Building a File Parser with an Abstract Syntax Tree


I was already familiar with the Swift compiler architecture, but I only recently came across SwiftSyntax, which, in its own words, is:

“[…] a set of Swift bindings for the libSyntax library. It allows for Swift tools to parse, inspect, generate, and transform Swift source code.”

Well, that seems to be exactly what I needed for CoherentSwift 🎉

It is used by the Swift compiler in its very first task and is responsible for generating an Abstract Syntax Tree (AST), which is then passed on to the later stages. I suggest you get familiar with how this process works by reading a little about the compiler architecture.

Among the tools that already use SwiftSyntax we find code formatters such as SwiftRewriter and swift-format, and others like Piranha, Uber’s own tool for refactoring code related to stale flags. One thing they have in common: these tools modify your code.

To help us understand how it works and what you can do with SwiftSyntax we have the post from Mattt on NSHipster, which sums up to:

I really suggest you take a look at it.

SyntaxRewriter

As seen above, most of the tools using SwiftSyntax modify code, which in my view is the library’s biggest strength. These tools care about one snippet of code at a time: they go incrementally through pieces of TokenSyntax and act differently depending on each token’s tokenKind. This is done through SyntaxRewriter’s visit(_:) methods.

This couldn’t be more straightforward! Let’s see…

  1. Visit a TokenSyntax;
  2. Given its kind, do something with it;
  3. Return the modified token.
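The three steps above can be sketched roughly as follows. This is only an illustration and assumes a recent swift-syntax release (509 or later), where parsing lives in the SwiftParser module and the token visit returns a TokenSyntax; older releases differ in the details. The RenameRewriter type and the foo/bar names are made up for the example.

```swift
import SwiftParser
import SwiftSyntax

/// Hypothetical rewriter: renames every identifier token `foo` to `bar`.
final class RenameRewriter: SyntaxRewriter {
    override func visit(_ token: TokenSyntax) -> TokenSyntax {
        // 1. We are visiting a single token.
        // 2. Decide what to do based on its kind.
        guard case .identifier("foo") = token.tokenKind else {
            return token // Leave every other token untouched.
        }
        // 3. Return a modified copy of the token.
        return token.with(\.tokenKind, .identifier("bar"))
    }
}

// Parse a small source string and run the rewriter over the whole tree.
let original = Parser.parse(source: "let foo = 1\nprint(foo)")
let rewritten = RenameRewriter().visit(original)
print(rewritten.description)
```

Because the rewriter only swaps the kind of matching tokens, everything else in the file is passed through as it was parsed.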

SyntaxFactory 🏭

This is a set of high-level APIs for creating code. Anything, literally anything. It isn’t necessarily convenient, though, as you have to create each piece of syntax one by one. To create a static property such as static let shared = Something(), we’d make at least 5 calls to the APIs:

  1. makeStaticKeyword;
  2. makeLetKeyword;
  3. makeIdentifier;
  4. makeEqualToken;
  5. ObjectCreationExpression

It’s at best exhausting to let code write code.

You can check the available APIs going through all the 5525(!) lines of SyntaxFactory’s code.

But… Hm… I don’t want to edit or write code 🤔

With CoherentSwift I want neither to create nor to edit code, and here SwiftSyntax is extremely helpful, just not as much as it could be.

Similar to Rewriter and Factory, we have SyntaxVisitor. It allows us to go through TokenSyntax (a.k.a. walk all nodes of the tree), and by overriding its visit(_:) methods we can parse/analyse a given TokenSyntax. The visit methods return a SyntaxVisitorContinueKind, an enumeration:
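For reference, the enumeration has two cases. The sketch below mirrors it as a standalone declaration so that it compiles on its own; in real code you would use the type that ships with SwiftSyntax rather than redeclare it.

```swift
/// Standalone mirror of SwiftSyntax's SyntaxVisitorContinueKind, for illustration only.
enum SyntaxVisitorContinueKind {
    /// Keep going and visit the children of the current node.
    case visitChildren
    /// Do not visit the children of the current node.
    case skipChildren
}

// A visit method picks one of the two cases as its return value.
let decision: SyntaxVisitorContinueKind = .skipChildren
print(decision)
```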

Think of this basically as a Bool: for every visit, should the visitor move forward and also pay a visit to the node’s children, or should it stop right there?

Visits

In the AST there is a syntax type for everything, and at least twice as many methods to go with them. Before going through a few specific examples, let’s cover the generic visit(_:) method.

visit

This couldn’t be more generic: it goes through every node. If you want to parse them all in the same way, great, and if you want to deal with different TokenKind values in different ways, a switch statement is your friend.
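A minimal sketch of that switch, assuming a recent swift-syntax release (509 or later) with its SwiftParser module. The TokenCounter type is hypothetical and simply sorts tokens into identifiers and keywords.

```swift
import SwiftParser
import SwiftSyntax

/// Hypothetical visitor that looks at every token and switches on its kind.
final class TokenCounter: SyntaxVisitor {
    private(set) var identifiers: [String] = []
    private(set) var keywordCount = 0

    override func visit(_ token: TokenSyntax) -> SyntaxVisitorContinueKind {
        switch token.tokenKind {
        case .identifier(let name):
            identifiers.append(name) // Names of types, properties, and so on.
        case .keyword:
            keywordCount += 1        // struct, var, func, ...
        default:
            break                    // Punctuation and everything else.
        }
        return .visitChildren
    }
}

let counter = TokenCounter(viewMode: .sourceAccurate)
counter.walk(Parser.parse(source: "struct Point { var x: Int }"))
print(counter.identifiers, counter.keywordCount)
```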

However, we do have visit(_:) methods for specific syntax types too. If we want to parse entire classes:
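A minimal sketch of such an override, assuming a recent swift-syntax release (509 or later). Note that in these releases the class name is exposed as name and the members live under memberBlock, so the property names differ slightly from older versions. ClassSummary and ClassInspector are hypothetical names for this example.

```swift
import SwiftParser
import SwiftSyntax

/// Hypothetical record of what we read off a class declaration.
struct ClassSummary {
    let name: String
    let modifiers: [String]
    let memberCount: Int
}

final class ClassInspector: SyntaxVisitor {
    private(set) var summaries: [ClassSummary] = []

    override func visit(_ node: ClassDeclSyntax) -> SyntaxVisitorContinueKind {
        summaries.append(ClassSummary(
            name: node.name.text,
            modifiers: node.modifiers.map { $0.name.text },
            memberCount: node.memberBlock.members.count
        ))
        return .visitChildren // Nested classes are visited as well.
    }
}

let classSource = """
final class Cache {
    var items: [String] = []
    func clear() {}
}
"""
let inspector = ClassInspector(viewMode: .sourceAccurate)
inspector.walk(Parser.parse(source: classSource))
print(inspector.summaries)
```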

This gives us easy access to the class syntax, with immediate properties such as identifier, attributes, modifiers, members and more.

But remember, all of this is abstract: each syntax node holds more and more abstract syntax nodes within it, and at some point we also find properties and functions among them.

Overriding multiple visits

I find it useful to override multiple visit(_:) methods, as I want to process different kinds of syntax in completely different ways, for example:
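The sketch below is not CoherentSwift's actual code, only an illustration of the idea under the assumption of a recent swift-syntax release (509 or later). The hypothetical DeclarationLogger handles classes, structs, functions and variables each in its own override.

```swift
import SwiftParser
import SwiftSyntax

/// Hypothetical visitor with one override per kind of declaration.
final class DeclarationLogger: SyntaxVisitor {
    private(set) var log: [String] = []

    override func visit(_ node: ClassDeclSyntax) -> SyntaxVisitorContinueKind {
        log.append("class \(node.name.text)")
        return .visitChildren
    }

    override func visit(_ node: StructDeclSyntax) -> SyntaxVisitorContinueKind {
        log.append("struct \(node.name.text)")
        return .visitChildren
    }

    override func visit(_ node: FunctionDeclSyntax) -> SyntaxVisitorContinueKind {
        log.append("func \(node.name.text)")
        return .skipChildren // Ignore local variables inside the body.
    }

    override func visit(_ node: VariableDeclSyntax) -> SyntaxVisitorContinueKind {
        for binding in node.bindings {
            // Only simple `var name` patterns are handled in this sketch.
            if let pattern = binding.pattern.as(IdentifierPatternSyntax.self) {
                log.append("property \(pattern.identifier.text)")
            }
        }
        return .skipChildren
    }
}

let declSource = """
class Cache {
    var items: [String] = []
    func clear() { let count = 0 }
}
struct Point { var x = 0 }
"""
let logger = DeclarationLogger(viewMode: .sourceAccurate)
logger.walk(Parser.parse(source: declSource))
print(logger.log)
```

Returning skipChildren from the function visit is a design choice for this example: it keeps local variables such as count out of the property list.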

I also want to keep a record of this tree so that I can measure the cohesion of the code. For that I want to know specific things:

  1. Which high-level definitions (Struct, Class) I find in a file
  2. Which properties are members of these definitions
  3. Which methods are members of these definitions
  4. Which properties are members of these methods
  5. Which methods are private
  6. Which properties are static
  7. Which high-level definitions have extensions
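One way to record the answers to these questions is a small set of plain value types. The sketch below is an assumption about how such a record could look, not CoherentSwift's real model: the names CSMethod and CSProperty are the ones used in this post, while CSDefinition, CSDefinitionKind and every field are hypothetical.

```swift
/// Hypothetical model types; only the names CSMethod and CSProperty come from the post.
struct CSProperty: Equatable {
    let name: String
    let isStatic: Bool
}

struct CSMethod: Equatable {
    let name: String
    let isPrivate: Bool
    /// Properties referenced inside the method body.
    var properties: [CSProperty] = []
}

enum CSDefinitionKind {
    case classType
    case structType
}

struct CSDefinition {
    let name: String
    let kind: CSDefinitionKind
    var properties: [CSProperty] = []
    var methods: [CSMethod] = []
    var hasExtensions = false
}

// Record a class `Cache` with one property and one method that uses it.
let items = CSProperty(name: "items", isStatic: false)
var cache = CSDefinition(name: "Cache", kind: .classType)
cache.properties.append(items)
cache.methods.append(CSMethod(name: "clear", isPrivate: false, properties: [items]))

let privateMethods = cache.methods.filter { $0.isPrivate }.map { $0.name }
let staticProperties = cache.properties.filter { $0.isStatic }.map { $0.name }
print(cache.name, cache.methods.map { $0.name }, privateMethods, staticProperties)
```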

This is enough for CoherentSwift to measure the cohesion of the given code.

When cohesion is high, it means that the methods and variables of the class are co-dependent and hang together as a logical whole.

  • Clean Code pg. 140

Struggles with Parsing

In my recent experience, it’s extremely easy to go through the high-level tokens nested within another one, but going down the rabbit hole trying to map an entire class at once turned out to be very error-prone. For this reason, I’m following two approaches at the moment:

  • Map immediate members of a definition within the same visit.

    For this purpose, I’ve created a factory that gives back the expected properties for a given node.

  • Expect another visit to a different syntax (e.g. FunctionDeclSyntax)

    Here, the factory also processes high-level tokens looking for properties and returns our very own CSMethod after assigning the found [CSProperty] to it.

The downside of the second approach is that, with all the abstract syntaxes, climbing up the tree is as error-prone as climbing down, so I had to keep track of the currentDefinition being processed, if any.

Post Visits

If you’ve seen enough of visit(_:), how about visitPost(_:)?

You read that right: for every visit there is a visitPost. A visitPost is called after a visit has been paid to the given syntax and all its descendants, and it therefore doesn’t return any value. It’s very useful for post-processing.
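As a sketch of how visitPost can help with the currentDefinition bookkeeping mentioned earlier, the hypothetical visitor below pushes a class name in visit and pops it again in visitPost, so each method is attributed to the definition that encloses it. It assumes a recent swift-syntax release (509 or later) and is not CoherentSwift's actual implementation.

```swift
import SwiftParser
import SwiftSyntax

/// Hypothetical visitor that tracks which definition is currently being processed.
final class MethodOwnerTracker: SyntaxVisitor {
    private var definitionStack: [String] = []
    private(set) var methodsByDefinition: [String: [String]] = [:]
    private(set) var events: [String] = []

    /// The innermost definition being visited, if any.
    private var currentDefinition: String? { definitionStack.last }

    override func visit(_ node: ClassDeclSyntax) -> SyntaxVisitorContinueKind {
        definitionStack.append(node.name.text)
        events.append("enter \(node.name.text)")
        return .visitChildren
    }

    override func visitPost(_ node: ClassDeclSyntax) {
        // Called once the class and all of its descendants have been visited.
        events.append("leave \(node.name.text)")
        definitionStack.removeLast()
    }

    override func visit(_ node: FunctionDeclSyntax) -> SyntaxVisitorContinueKind {
        if let owner = currentDefinition {
            methodsByDefinition[owner, default: []].append(node.name.text)
        }
        return .skipChildren
    }
}

let nestedSource = """
class Outer {
    func a() {}
    class Inner { func b() {} }
    func c() {}
}
func free() {}
"""
let tracker = MethodOwnerTracker(viewMode: .sourceAccurate)
tracker.walk(Parser.parse(source: nestedSource))
print(tracker.methodsByDefinition)
print(tracker.events)
```

A stack is used instead of a single optional so that nested definitions restore the outer one when they end; the top-level function free has no owner and is simply ignored here.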

Conclusion

SwiftSyntax is very powerful and, depending on what you want to achieve, not always easy to use, but it has certainly increased the accuracy of CoherentSwift’s measurement as part of the upcoming 0.5.0 release.