On the Topic of Jets: Disentangling Quarks and Gluons at Colliders

We introduce jet topics: a framework to identify underlying classes of jets from collider data. Because of a close mathematical relationship between distributions of observables in jets and emergent themes in sets of documents, we can apply recent techniques in "topic modeling" to extract jet topics from data with minimal or no input from simulation or theory. As a proof of concept with parton shower samples, we apply jet topics to determine separate quark and gluon jet distributions for constituent multiplicity. We also determine separate quark and gluon rapidity spectra from a mixed Z-plus-jet sample. While jet topics are defined directly from hadron-level multi-differential cross sections, one can also predict jet topics from first-principles theoretical calculations, with potential implications for how to define quark and gluon jets beyond leading-logarithmic accuracy. These investigations suggest that jet topics will be useful for extracting underlying jet distributions and fractions in a wide range of contexts at the Large Hadron Collider.

In this letter, we introduce a data-driven technique to extract underlying distributions for different jet types from mixed samples, using quark and gluon jets as an example. We call our method "jet topics" because of a mathematical connection to topic modeling, an unsupervised learning paradigm for discovering emergent themes in a corpus of documents [40]. Jet topics are defined directly from measured multi-differential cross sections, requiring no inputs from simulation or theory. In this way, jet topics offer a practical way to define jet classes, allowing us to label "quark" and "gluon" jet distributions at the hadron level without reference to the underlying partons.
At colliders like the LHC, it is nearly impossible to kinematically isolate pure samples of different jets (i.e. quark jets, gluon jets, boosted W jets, etc.). Instead, collider data consist of statistical mixtures $M_a$ of $K$ different types of jets. For any jet substructure observable $x$, such as jet mass, the distribution $p_{M_a}(x)$ in mixed sample $M_a$ is a mixture of the $K$ underlying jet distributions $p_k(x)$:

$$p_{M_a}(x) = \sum_{k=1}^{K} f_k^{(a)}\, p_k(x), \tag{1}$$

where $f_k^{(a)}$ is the fraction of jet type $k$ in sample $a$, with $\sum_{k=1}^{K} f_k^{(a)} = 1$ for all $a$ and $\int dx\, p_k(x) = 1$ for all $k$. For the specific case of quark ($q$) and gluon ($g$) jet mixtures, we have:

$$p_{M_a}(x) = f_q^{(a)}\, p_q(x) + \left(1 - f_q^{(a)}\right) p_g(x). \tag{2}$$

Of course, there are well-known caveats to this picture of jet generation, which go under the name of "sample dependence". For instance, "quark" jets from the Z+jet process are not exactly identical to "quark" jets from the dijet process due to soft color correlations with the entire event [37], though these correlations are power suppressed in the small-jet-radius limit [41][42][43]. Also, more universal quark/gluon definitions can be obtained using jet grooming methods [44][45][46][47][48][49][50][51][52]. Here, we assume that sample-dependent effects can either be quantified or mitigated, taking Eq. (2) as the starting assumption for our analysis.

Mixed quark/gluon samples were previously studied in the context of Classification Without Labels (CWoLa) [53] (see also [54][55][56][57][58]). Via Eq. (2), one can prove that the optimal binary mixed-sample classifier, $p_{M_1}(x)/p_{M_2}(x)$, is a monotonic rescaling of the optimal quark/gluon classifier, $p_q(x)/p_g(x)$. This means that a classifier trained to optimally distinguish $M_1$ (e.g. Z+jet) from $M_2$ (e.g. dijets) is optimal for distinguishing quark from gluon jets without requiring jet labels or aggregate class proportions. The CWoLa framework, though, does not directly yield information about the individual quark and gluon distributions $p_q(x)$ and $p_g(x)$.
With jet topics, and with topic modeling more generally, one can obtain the full distributions $p_k(x)$ and fractions $f_k^{(a)}$ in Eq. (1), subject to requirements which will be spelled out below. As originally posed, topic modeling aims to expose emergent themes in a collection of text documents (a corpus) [40]. A topic is a distribution over words in the vocabulary. Documents are taken to be unstructured bags of words. Each document arises from an unknown mixture of topics: a topic is sampled according to the mixture proportions and then a word is chosen according to that topic's distribution over the vocabulary. As long as each topic has words unique to it, known as anchor words [59,60], topic-modeling algorithms can learn the underlying topics and proportions from the corpus alone. Intriguingly, the generative process for producing counts of words in a document is mathematically identical to producing jet observable distributions via Eq. (1), as summarized in Table I. For the case of quark/gluon jet mixtures, we have suggestively depicted the process of writing "jet documents" in Fig. 1. Anchor words are analogous to having phase-space regions where each of the underlying distributions is pure, and the presence of these anchor bins is necessary for jet topics to yield the underlying "quark" and "gluon" distributions.
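The generative process described above can be sketched in a few lines of Python. This is a toy illustration only: the five-bin quark and gluon distributions, the quark fraction, and the function name `sample_jet_document` are assumptions made for the example, not quantities from the analysis.

```python
import numpy as np

def sample_jet_document(n_jets, quark_fraction, p_q, p_g, rng):
    """Return histogram counts for one mixed sample (a "jet document").

    Each jet is first assigned a type (the "topic") according to the quark
    fraction, then its observable bin (the "word") is drawn from the
    distribution of that type.
    """
    n_bins = len(p_q)
    is_quark = rng.random(n_jets) < quark_fraction
    bins_if_quark = rng.choice(n_bins, size=n_jets, p=p_q)
    bins_if_gluon = rng.choice(n_bins, size=n_jets, p=p_g)
    bins = np.where(is_quark, bins_if_quark, bins_if_gluon)
    return np.bincount(bins, minlength=n_bins)

# Hypothetical binned distributions. Each has one bin where the other
# vanishes (an "anchor bin"), mirroring anchor words in a text corpus.
p_q = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
p_g = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

rng = np.random.default_rng(0)
counts = sample_jet_document(200_000, 0.8, p_q, p_g, rng)
observed = counts / counts.sum()
expected = 0.8 * p_q + 0.2 * p_g  # two-component mixture of the inputs
```

With enough jets, the normalized histogram `observed` approaches the mixture `expected`, which is the sense in which a mixed sample plays the role of a document built from two topics.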
Due to its theoretical transparency and asymptotic guarantees, we use the Demix method [60] to extract jet topics, though other algorithms yield comparable results. The key idea is to undo the mixing of the two fundamental distributions in Eq. (1) by maximally subtracting the two mixtures from one another, such that the zeros of the subtracted distributions correspond to the anchor bins. Adopting the notation of Ref. [60], let $\kappa(M_1|M_2)$ be the largest subtraction amount $\kappa$ such that

$$p_{M_1}(x) - \kappa\, p_{M_2}(x) \geq 0 \quad \text{for all } x. \tag{3}$$

We refer to $\kappa$ as the reducibility factor (equivalently, the minimum of the mixed-sample likelihood ratio $p_{M_1}(x)/p_{M_2}(x)$). The jet topics $T_1$ and $T_2$ are then the normalized maximal subtractions of $M_2$ from $M_1$ and vice versa,

$$p_{T_1}(x) = \frac{p_{M_1}(x) - \kappa(M_1|M_2)\, p_{M_2}(x)}{1 - \kappa(M_1|M_2)}, \tag{4}$$

and analogously for $p_{T_2}(x)$. The jet topics are unique and universal, in that they are independent of the mixtures used to construct them.

FIG. 1. The generation of mixed samples of quark and gluon jets, highlighting the correspondence with topic models. Each jet is either a quark or gluon jet, sampled according to the underlying quark fraction. The observable is then sampled according to a universal distribution for that jet type. Each mixed-sample observable distribution is then a mixture of the two universal distributions, giving rise to a "jet document".

The goal is for the topic distributions $p_{T_1}(x)$ and $p_{T_2}(x)$ to match the underlying quark and gluon jet distributions $p_q(x)$ and $p_g(x)$. There are three required conditions for this to occur. Two of them (shared with CWoLa) are sample independence and different purities, i.e. that the jet samples are obtained from Eq. (2) with different values of $f_q^{(a)}$. The third condition is the presence of anchor bins, which can be stated more formally as:

Mutual irreducibility: Each underlying distribution $p_k(x)$ is not a mixture of the remaining underlying distributions plus another distribution [55].
Note that this is a much weaker requirement than the distributions being fully separated. In the quark/gluon context, a necessary and sufficient condition for mutual irreducibility is that the reducibility factors $\kappa(q|g) = \kappa(g|q) = 0$ for feature representation $x$. We later explore the implications of this condition for QCD. With these three conditions satisfied, the mixture proportions are uniquely determined via the reducibility factors. Taking $f_q^{(1)} > f_q^{(2)}$ without loss of generality,

$$\kappa(M_1|M_2) = \frac{1 - f_q^{(1)}}{1 - f_q^{(2)}}, \qquad \kappa(M_2|M_1) = \frac{f_q^{(2)}}{f_q^{(1)}}. \tag{5}$$

Even without mutual irreducibility, the extracted jet topics will still relate to the underlying quark and gluon distributions. Specifically, jet topics yield the "gluon-subtracted quark distribution":

$$p_{q|g}(x) = \frac{p_q(x) - \kappa(q|g)\, p_g(x)}{1 - \kappa(q|g)}, \tag{6}$$

and the "quark-subtracted gluon distribution", defined analogously. By universality, the topics calculated from pure samples via Eq. (6) and from mixtures via Eq. (4) are identical. These may be useful in their own right, particularly if the quark/gluon fractions are uncertain but $\kappa(q|g)$ and $\kappa(g|q)$ can be determined analytically or from simulation (see Fig. 4 below).
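As a rough illustration of the maximal-subtraction idea, the following Python sketch computes reducibility factors, topics, and quark fractions from binned, normalized histograms. It assumes noiseless toy histograms with exact anchor bins and ignores statistical uncertainties entirely. The function names and toy numbers are hypothetical, and the fraction formula assumes mutual irreducibility with sample 1 the more quark-rich, so that $\kappa(M_1|M_2) = (1-f_q^{(1)})/(1-f_q^{(2)})$ and $\kappa(M_2|M_1) = f_q^{(2)}/f_q^{(1)}$.

```python
import numpy as np

def reducibility_factor(p_a, p_b):
    """Largest kappa such that p_a - kappa * p_b >= 0 in every bin."""
    p_a, p_b = np.asarray(p_a, float), np.asarray(p_b, float)
    populated = p_b > 0  # bins with p_b == 0 place no constraint on kappa
    return float(np.min(p_a[populated] / p_b[populated]))

def extract_topic(p_a, p_b):
    """Normalized maximal subtraction of p_b from p_a (requires kappa < 1)."""
    kappa = reducibility_factor(p_a, p_b)
    return (np.asarray(p_a, float) - kappa * np.asarray(p_b, float)) / (1.0 - kappa)

def quark_fractions(kappa_12, kappa_21):
    """Solve kappa_12 = (1-f1)/(1-f2) and kappa_21 = f2/f1 for (f1, f2)."""
    f1 = (1.0 - kappa_12) / (1.0 - kappa_12 * kappa_21)
    return f1, kappa_21 * f1

# Toy mutually irreducible "quark" and "gluon" histograms (assumed values).
p_q = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
p_g = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

# Two mixtures with different quark purities.
p_m1 = 0.8 * p_q + 0.2 * p_g
p_m2 = 0.3 * p_q + 0.7 * p_g

topic_1 = extract_topic(p_m1, p_m2)  # recovers p_q in this toy setup
topic_2 = extract_topic(p_m2, p_m1)  # recovers p_g in this toy setup
f1, f2 = quark_fractions(reducibility_factor(p_m1, p_m2),
                         reducibility_factor(p_m2, p_m1))
```

In real data the minimum of the ratio is dominated by low-statistics bins, so a practical implementation would need an uncertainty-aware choice of the anchor bin rather than the bare minimum used here.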
We now turn to a practical demonstration of the jet topics method for realistic quark and gluon samples. Following Ref. [37], we consider two mixed jet processes at the LHC: the quark-enriched Z+jet process and the gluon-enriched dijets process. See Ref. [61] for alternative selections for quark- or gluon-enriched samples. The parton shower Pythia 8.226 [62,63] is used to generate 500k jets at $\sqrt{s} = 13$ TeV including hadronization and multiple parton interactions (i.e. underlying event). Detector-stable, non-neutrino particles are clustered into anti-$k_t$ jets [64] with radius $R = 0.4$. We use the constituent multiplicity within a jet as the feature representation $x$, since it is known to be a good quark/gluon discriminant [18].
In Fig. 2, we present the result of extracting two jet topics from these samples. Shown are the constituent multiplicity distributions from the original Z+jet and dijet samples, from Pythia-labeled Z+quark and Z+gluon samples, and from the jet topics $T_1$ and $T_2$ using Eq. (4). Uncertainties are estimated by assuming $\pm\sqrt{N}$ bin-count uncertainties and only considering bins with more than 30 events. We determine the $\kappa$ values of Eq. (3) by selecting the most constraining (anchor) bin: that with the lowest upper uncertainty bar on the ratio. Remarkably, the two extracted jet topics overlap very well with the underlying quark and gluon distributions, providing practical evidence that Eq. (4) works as desired, at least for constituent multiplicity. We verified that similar results could be obtained from samples with different $p_T$ cuts and from mixtures of dijets at different rapidities. This approach is similar to the template extraction procedure in Ref. [24], with the important distinction that the quark/gluon fractions need not be specified a priori.

In Fig. 3, we use the extracted jet topics to construct separate jet rapidity spectra for quark and gluon jets in the Z+jet samples. Binning the Z+jet sample into 10 rapidity bins in $|y| < 2$, we find the mixture of the two topics extracted above that most closely matches the constituent multiplicity histogram in each rapidity bin, minimizing the squared error to find the best mixture. This is an example of the general problem of extracting sample fractions $f_k^{(a)}$ from various mixed samples. As desired, the extracted topic cross sections in Fig. 3 track the true quark and gluon rapidity cross sections.
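The per-bin fraction fit described above can be sketched as a one-parameter least-squares problem. The code below is an illustrative assumption about how such a fit might look, not the procedure used for Fig. 3: the topics, the three "rapidity bins", their true fractions, and the per-bin cross sections are all made-up toy values, and `fit_topic_fraction` is a hypothetical helper.

```python
import numpy as np

def fit_topic_fraction(hist, topic_1, topic_2):
    """Least-squares fraction f in hist ~ f * topic_1 + (1 - f) * topic_2."""
    hist, topic_1, topic_2 = (np.asarray(a, float) for a in (hist, topic_1, topic_2))
    diff = topic_1 - topic_2
    # Closed-form minimizer of sum((hist - topic_2 - f * diff)**2) over f,
    # then clipped so the result is a valid fraction.
    f = np.dot(hist - topic_2, diff) / np.dot(diff, diff)
    return float(np.clip(f, 0.0, 1.0))

# Toy topics (assumed), standing in for the extracted T1 and T2.
topic_1 = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
topic_2 = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

# Hypothetical rapidity bins: true topic-1 fraction and total cross section.
true_fractions = [0.9, 0.7, 0.5]
sigma_bins = [12.0, 10.0, 6.0]

fitted = []
for f_true in true_fractions:
    hist = f_true * topic_1 + (1.0 - f_true) * topic_2  # noiseless toy histogram
    fitted.append(fit_topic_fraction(hist, topic_1, topic_2))

# Topic-wise cross sections per rapidity bin: fraction times total.
topic_1_xsec = [f * s for f, s in zip(fitted, sigma_bins)]
topic_2_xsec = [(1.0 - f) * s for f, s in zip(fitted, sigma_bins)]
```

Because the objective is a convex quadratic in a single parameter, clipping the unconstrained minimizer to [0, 1] gives the constrained optimum, so no numerical optimizer is needed for the two-topic case.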
Thus, just from a collection of mixed-sample histograms, one can make progress toward extracting both the underlying distributions $p_k(x)$ and the fraction of each jet topic $f_k^{(a)}$. Crucially, Figs. 2 and 3 are just novel projections of the hadron-level multi-differential jet cross section $d^3\sigma / dp_T\, dy\, dn_{\text{const}}$ on two independent samples, making jet topics implementable on existing LHC jet measurements (e.g. [33]). The agreement between the operationally-defined jet topics and the theoretically-ambiguous quark and gluon distributions may even suggest using mutual irreducibility of the final-state distributions to define "quark" and "gluon" jets.
From the perspective of first-principles QCD, the implications of mutual irreducibility are simple yet profound. For the reducibility factors $\kappa(q|g)$ and $\kappa(g|q)$ to be zero, there must be phase-space regions almost entirely dominated by quark or gluon jets. In the leading-logarithmic (LL) limit, mutual irreducibility can be achieved with any jet substructure observable that counts the number of parton emissions, such as "soft drop multiplicity" [26]. At LL order, quark and gluon jets have the same emission profile, differing only by a color factor in their emission density, $C_F = 4/3$ for quarks and $C_A = 3$ for gluons. Ignoring the $\Lambda_{\text{QCD}}$ regulator, counting these (infinitely many) emissions results in arbitrarily well-separated quark and gluon Poissonian distributions [26], and therefore mutual irreducibility. Beyond LL order, though, naive quark/gluon definitions may not lead to mutual irreducibility, since running-coupling, higher-order, and non-perturbative effects generically contaminate the anchor bins. That said, as long as these effects maintain sample independence (perhaps achieved via grooming), then one can still use Eq. (6) to define subtracted "quark" and "gluon" labels.
Interestingly, many jet substructure observables do not lead to quark/gluon mutual irreducibility, even at LL accuracy. Consider for instance the jet mass $m$ (or any jet angularity [66][67][68]). Jet mass exhibits Casimir scaling at LL order, meaning that the cumulative density functions $\Sigma_i(m)$ are related to each other by [19,20]

$$\Sigma_g = \Sigma_q^{C_A/C_F}. \tag{7}$$

The probability distributions are then given by $p_i = d\Sigma_i/dm$. Substituting this into Eq. (3) immediately yields, for all observables with Casimir scaling:

$$\kappa(g|q) = 0, \qquad \kappa(q|g) = \frac{C_F}{C_A}, \tag{8}$$

since $C_A/C_F = 9/4 > 1$ and $\Sigma_q$ takes all values between 0 and 1. Because of Eq. (8), jet mass alone is not sufficient to extract the quark distribution at LL order without additional information.
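For completeness, the intermediate step behind the Casimir-scaling reducibility factors can be written out. The following is a short derivation sketch that assumes only the LL relation $\Sigma_g = \Sigma_q^{C_A/C_F}$, the definition $p_i = d\Sigma_i/dm$, and the reading of the reducibility factor as the minimum of a likelihood ratio:

```latex
\begin{align}
p_g(m) &= \frac{d}{dm}\,\Sigma_q(m)^{C_A/C_F}
        = \frac{C_A}{C_F}\,\Sigma_q(m)^{C_A/C_F - 1}\, p_q(m), \\
\kappa(g|q) &= \min_m \frac{p_g(m)}{p_q(m)}
        = \min_{\Sigma_q \in [0,1]} \frac{C_A}{C_F}\,\Sigma_q^{C_A/C_F - 1} = 0, \\
\kappa(q|g) &= \min_m \frac{p_q(m)}{p_g(m)}
        = \min_{\Sigma_q \in [0,1]} \frac{C_F}{C_A}\,\Sigma_q^{1 - C_A/C_F}
        = \frac{C_F}{C_A} = \frac{4}{9}.
\end{align}
```

The first minimum is reached as $\Sigma_q \to 0$ and the second at $\Sigma_q = 1$, both because the exponent $C_A/C_F - 1 = 5/4$ is positive. The gluon distribution therefore has an anchor region at low $\Sigma_q$, while the quark distribution does not.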
On the other hand, if the reducibility factors are known, then the subtracted distributions in Eq. (6) can be inverted. This is shown in Fig. 4 for the jet mass, where the "quark" topic has been corrected using the value $\kappa(q|g) = 0.40$ at 35 GeV determined from the Pythia Z + q/g distributions, which is known to differ from the LL expectation [37]. This analysis is performed up to 35 GeV to avoid sample-dependent effects in the high-mass tails of the distributions. The qualitative behavior of the topics agrees with the LL predictions of Eqs. (7) and (8): no correction is needed to obtain the "gluon" topic, and the "quark" topic is a non-trivial mixture of the jet topics. Given the good agreement seen here, it would be interesting to apply jet [...]

FIG. 4 (caption fragment). There is good agreement between the $\kappa$-corrected quark topic (gray) and the pure Z+quark distribution (red).
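The inversion mentioned above amounts to adding back the subtracted component. As a hedged sketch, suppose the "gluon" topic already equals the gluon distribution (i.e. $\kappa(g|q) = 0$) and that $\kappa(q|g)$ is known. Solving the gluon-subtracted quark distribution, $(p_q - \kappa\, p_g)/(1-\kappa)$, for $p_q$ gives $p_q = (1-\kappa)\,p_{q|g} + \kappa\, p_g$. The toy histograms and the helper name `correct_topic` below are assumptions for illustration only.

```python
import numpy as np

def correct_topic(subtracted_topic, other_topic, kappa):
    """Undo a maximal subtraction: p = (1 - kappa) * p_sub + kappa * p_other."""
    subtracted_topic = np.asarray(subtracted_topic, float)
    other_topic = np.asarray(other_topic, float)
    return (1.0 - kappa) * subtracted_topic + kappa * other_topic

# Toy distributions (assumed): the gluon has an anchor bin (last bin ratio
# p_q / p_g is smallest but nonzero), so kappa(g|q) = 0 and kappa(q|g) = 0.1.
p_q = np.array([0.5, 0.3, 0.15, 0.05])
p_g = np.array([0.0, 0.2, 0.3, 0.5])
kappa_qg = 0.1

# What the topic extraction would return for the "quark" topic here.
quark_topic = (p_q - kappa_qg * p_g) / (1.0 - kappa_qg)

# Correcting with the known reducibility factor restores p_q.
corrected = correct_topic(quark_topic, p_g, kappa_qg)
```

The correction is only as good as the input value of the reducibility factor, which in the text is taken from simulation rather than from the data themselves.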
There are many potential uses for the jet topics framework at the LHC. Focusing just on quark and gluon jets, one often wants to separately measure quark and gluon distributions from mixed data samples, without relying on theory or simulation for fraction estimates.
To determine PDFs, it would be beneficial to isolate different partonic subprocesses, and this could be feasible as long as jet topics is applied both to data and to fixed-order QCD calculations. Similar subprocess isolation might be useful in mono-jet searches for dark matter by aiding in signal/background discrimination or in setting improved limits on specific new physics models [71,72]. For extracting the strong coupling constant $\alpha_s$ from (groomed) jet shape distributions, it would be beneficial to determine the quark and gluon jet fractions using data-driven methods, since there are uncertainties associated with whether $\alpha_s$ comes multiplied by $C_F$ or $C_A$ [73]. The extracted topic fractions could also be used to augment training with CWoLa, since the classifier operating points could then be determined entirely from data. In heavy ion collisions, quarks and gluons are expected to be modified differently in medium due to their different color charges, and jet topics may allow for fully data-driven studies of separate quark and gluon jet modifications.
In conclusion, phrasing jet mixtures as a topic modeling problem makes available a variety of new and more sophisticated statistical and mathematical tools for jet physics (see e.g. [59,60,74-88]), including recent efforts to determine the appropriate number of topics to use from data [89][90][91]. We emphasize that jet topics can be applied to any set of multi-differential cross sections, in experiment or in theory, as long as the criteria of sample independence, different purities, and mutual irreducibility are met. Furthermore, mutual irreducibility need not be assumed if the subtracted distributions in Eq. (6) are sufficient for the intended application, or if the reducibility factors are known from theory or simulation. Of course, experimental studies are needed to understand the systematic and statistical uncertainties associated with jet topics for LHC measurements and searches, and theoretical studies are needed to determine the interplay of jet topics with precision calculations. It would also be interesting to design jet substructure observables specifically targeted for mutual irreducibility. More generally, topic models may find applications in collider physics beyond jets and in other disciplines beyond collider physics, since extracting signal and background distributions from mixtures is a ubiquitous challenge faced when analyzing and interpreting rich data sets.