Coverage-guided grammar fuzzing is a well-established technique for finding bugs in complex targets such as language runtimes, browser engines, and XSLT processors. By constraining mutations to samples that still conform to a predefined grammar, the approach ensures structural validity while using code-coverage feedback to guide the fuzzer toward interesting inputs. A Google Project Zero researcher has now published a detailed analysis of what they consider the two core weaknesses of this method, along with a mitigation strategy used in their open-source Jackalope fuzzer.
Issue 1: More Coverage Does Not Equal More Bugs
The first problem is that code coverage is a poor proxy for bug-triggering conditions, a limitation that affects coverage-guided fuzzing broadly but is especially acute in language fuzzing. Many bugs require specific chains of function calls where the output of one function feeds into another. As a concrete illustration, the researcher points to a recent libxslt bug requiring the document() XPath function’s result to be passed as input to generate-id().
A coverage-guided fuzzer may discover both functions independently, storing them in separate corpus samples. Because the union of coverage from those two samples equals the coverage of the combined, bug-triggering sample, the fuzzer has no signal that combining them would be valuable. Even a single sample containing both functions operating on independent data produces the same coverage, while remaining far from triggering the bug. When only two function calls need chaining, random mutation will eventually stumble onto the combination. With three or more required calls, the cost grows sharply and coverage feedback provides little help. The researcher notes that a purely generative fuzzer, with no coverage feedback at all, might find such bugs equally efficiently in some cases.
A dataflow-aware coverage metric, one that could detect novel data paths between functions, would theoretically address this, but no practical implementation of that approach is currently known to the researcher.
Issue 2: Corpus Samples Converge Toward Similarity
The second issue is that mutational grammar fuzzers tend to accumulate corpus samples that are structurally very similar to one another. Because each new sample is derived from an existing corpus entry, and only saved if it triggers new coverage, the corpus can grow into a collection of nearly identical inputs. This reduces the diversity of paths explored and can leave large regions of the target’s logic underexplored.
A Simple Mitigation
The researcher describes a straightforward technique applied within Jackalope to counter both issues. While the specifics extend beyond the excerpted source material, the broader point is that practitioners using grammar fuzzers should be aware that corpus diversity and inter-sample combination strategies matter as much as raw coverage growth.
The post also notes that these limitations are not unique to grammar fuzzing. Other structure-aware fuzzing techniques, including approaches like Fuzzilli for JavaScript engine fuzzing, face similar challenges when coverage alone guides corpus selection.
