Scale-free Networks Well Done

We bring rigor to the vibrant activity of detecting power laws in empirical degree distributions in real-world networks. We first provide a rigorous definition of power-law distributions, equivalent to the definition of regularly varying distributions in statistics. This definition allows the distribution to deviate from a pure power law arbitrarily but without affecting the power-law tail exponent. We then identify three estimators of these exponents that are proven to be statistically consistent -- that is, converging to the true exponent value for any regularly varying distribution -- and that satisfy some additional niceness requirements. Finally, we apply these estimators to a representative collection of synthetic and real-world data. According to their estimates, real-world scale-free networks are definitely not as rare as one would conclude based on the popular but unrealistic assumption that real-world data comes from power laws of pristine purity, void of noise and deviations.


I. INTRODUCTION
Scale-free and power-law are sacral words in network science, a mature field that studies complex systems in nature and society by representing these systems as networks of interacting elements [1][2][3][4]. The most basic property of any network, second only to the network size and average degree, is the degree distribution, and the early days of network science were filled with the surprising and exciting news that degree distributions in many real-world networks of completely different origins are scale-free, i.e., "close to power laws." This property means that the node degrees in a network are highly variable and lack a characteristic scale, with a multitude of profound and far-reaching implications for a wide spectrum of structural and dynamical properties of networks [1][2][3][4][5][6][7][8]. These implications are the reason why these scale-free findings were extremely impactful, and why they steered the whole field of network science in the direction it has followed for nearly two decades. They impacted essentially all the key aspects of network science, from the basic tasks of network modeling, all the way down to concrete applications, such as prediction and control of the dynamics of real-world complex systems, or identifying their "Achilles' heels" [1][2][3][4].
Yet there is one glaring problem behind all these exciting developments. The problem is that scale-free networks do not have any widely agreed-upon rigorous definition. Specifically, it is quite unclear what it really means for a degree sequence in a given real-world network to be power-law or "close" to a power law. This lack of rigor has led and still leads to confused controversy and never-ending heated debates [9][10][11][12][13][14][15][16][17][18][19][20][21]. This controversy has culminated in the recent attempt in [19] to tell how "scale-free" real-world networks are, leading to the provocative but misleading conclusion that "scale-free networks are rare." Faced with the question of whether a given real-world network is scale-free or not, one first has to decide if s/he can trust the data: does the measured degree sequence reflect the actual degree sequence in the network? We do not address this question here, and assume that we can trust the data. Under this assumption, the next questions are:
1. What exactly does it mean that a distribution is approximately a power law?
2. What are the correct, i.e., statistically consistent, methods to estimate the tail exponents of this power law from the measured degree sequence?
3. How likely is it that the measured sequence comes from a power law with the estimated exponent?
Here we address all three of these questions.
One of the most frequently seen formulas in the early days of network science was

P(k) ∼ k^{−γ}.   (1)

Its intended story was to tell that the fraction P(k) of nodes of degree k in a network under consideration decays with k approximately as a power law with exponent γ. The symbol '∼' could mean anything, usually something like "roughly proportional." The literature was also abundant with plots of empirical probability mass/density functions (PDFs) P(k) and complementary cumulative distribution functions (CCDFs) F(k) of degrees k drawn on the loglog scale to illustrate that these functions are "roughly straight lines," so that the network is power-law, thus deserving a publication. The first popular attempt to introduce some rigor into this vibrant activity came in [18], when network science was about a decade old. In [18], Eq. (1) was taken literally to mean that P(k) for k ≥ k_min is exactly proportional to k^{−γ}, i.e.,

P(k) = c k^{−γ},  k ≥ k_min,   (2)

where c is the normalization constant. But the complex mixtures of stochastic processes driving the evolution of many different real-world networks are of different origins and nature. Worse, they are all prone to different types and magnitudes of noise and fluctuations. Therefore, basic common sense suggests that these processes can hardly produce beautifully clean power-law dependencies void of any deviations from (2). This is similar to how one cannot expect Newton's laws on Earth with friction to yield results as beautiful as Newton's laws in empty space without friction [21]. That is why it is not surprising that if one looks for such idealized power-law dependencies in real-world networks, s/he is doomed to find them quite rare [19].
And as far as power-law network models are concerned, even the most basic such model, preferential attachment, is known to have a degree distribution with a power-law tail, but the exact expression for the degree distribution in preferential attachment networks is not a pure power law (2), as shown in [22,23]. In fact, power-law network models with the pristine purity of (2) are an exception rather than the rule.
For all these reasons, in statistics one considers the class of regularly varying distributions [24][25][26][27] instead of rarely used pure power laws. Compared to the unjustifiably restrictive distribution class (2), the class of regularly varying distributions is much larger. In particular, it contains all the distributions whose PDFs are given by

P(k) = ℓ(k) k^{−γ},   (3)

thus allowing for deviations from pure power laws by means of a slowly varying function ℓ(k), i.e., a function that varies slowly at infinity, classic examples including functions converging to constants or log^a k for any constant a. Compared to (2), any distribution in this class has the same tail, but it can have drastically different shapes at finite degrees. The exact shape of ℓ(k) is of much less significance than the value of the tail exponent γ, because it is γ, and not ℓ(k), that is solely definitive for a number of important structural and dynamical properties of networks [5][6][7][8][28][29][30][31][32][33][34]. Therefore, following this well-established tradition in statistics, in Section II we define a distribution to be power-law if it is regularly varying.
The next question, which we address in Section III, is how to properly estimate the value of γ under the assumption that a given degree sequence comes from a regularly varying distribution. This question has attracted extensive research attention in statistics and finance [24][25][26][27][35][36][37][38], where a variety of estimators have been developed for this task. We identify the maximal subset of such estimators that, to the best of our knowledge, are the only currently existing estimators that
1. are applicable to any regularly varying distribution;
2. are statistically consistent, i.e., have been proven to converge to the true γ, if applied to increasing-length sequences sampled from any regularly varying distribution; and
3. can be fully automated by means of the double bootstrap method, which has been proven to yield the optimal estimation of γ for any finite sequence of numbers sampled from any regularly varying distribution.
It is important to stress here that (2) is just one representative of the extremely wide class of regularly varying distributions (3). Therefore, as opposed to the methods in [18,19] that may be consistent only under the assumption that a given degree sequence comes from a pure power law (2), the estimators that we discuss in Section III are proven to be consistent under the much more general assumption that the sequence comes from any impure power law, including any distribution that satisfies (3) with any non-trivial slowly varying function ℓ(k).
In Section IV we validate that this is indeed the case by applying these estimators to a wide range of synthetic sequences sampled from a variety of regularly varying distributions, as well as to degree sequences in paradigmatic network models: the configuration model, preferential attachment, and random hyperbolic graphs. In all the considered cases, all the considered estimators converge as expected, in contrast to the PLFit algorithm from [18,19], which behaves abnormally whenever the regularly varying distribution has a non-trivial slowly varying function ℓ(k). Remarkably, one such case is preferential attachment. The degree distribution in preferential attachment networks was proven to be scale-free [39], but the hypothesis testing methodology in [19] rejects this statement for high percentages of simulated preferential attachment networks. The reasons behind this abnormality are also documented in Section IV.
As opposed to the PLFit, most of the estimators considered in this paper have been proven to be consistent not only under the assumption that the sampling distribution P(k) is regularly varying, but also under the even more general assumption that it is any distribution belonging to the maximum domain of attraction of any extreme value distribution with any index ξ, which is the main parameter of an extreme value distribution. The extreme value distributions are the n → ∞ limit distributions of rescaled maximum values among n samples from any given distribution P(k). If P(k) is regularly varying, then ξ is strictly positive, and the tail exponent γ and extreme value index ξ are related by

γ = 1 + 1/ξ.   (4)

If P(k) is not regularly varying, then ξ is either negative or zero, in which case the tail exponent γ is undefined. None of the considered estimators estimates γ directly.
They are all based on extreme value theory, so that they estimate the index ξ instead. The last question from the list of the three questions above is about hypothesis testing. Given any degree sequence, one can always apply to it any ξ-estimator, which will always return some ξ-estimate ξ̂. How likely is it that this sequence comes from a regularly varying distribution with exponent γ = 1 + 1/ξ̂? Clearly, if ξ̂ is either negative or zero, then this question is ill-posed, since one cannot even tell what γ is. But what if ξ̂ is positive?
Section V is dedicated to the explanation that even in this case one cannot devise any hypothesis test to answer question (3) above. If one desires to play here with p-values and their vagaries [40], then this desire is fundamentally flawed, since such hypothesis testing activity is simply impossible with regularly varying distributions. Intuitively, the main reason for this impossibility is the infinite amount of "degrees of freedom" contained in the space of slowly varying functions ℓ(k) that makes the space of regularly varying distributions nonparametric. In particular, there is an infinite number of regularly varying distributions such that for any finite sequence length, degree sequences of this length sampled from them do not appear to be regularly varying, and, the other way around, there is an infinite number of distributions that are not regularly varying, but such that random sequences of any finite length sampled from them appear regularly varying.
In view of this extremely important but badly overlooked observation, which is one of the key points in the paper, the best strategy one can follow is to consult as many consistent γ-estimators as possible to see whether they agree on the ranges of their γ-estimates on a given sequence [24]. And this is indeed the strategy we follow in Section V to define what it means for a given network to be power-law. If at least one of the considered estimators returns a negative or zero value of ξ̂, then we call the network not power-law, but if all the estimators agree that ξ̂ > 1/4, then we say that the network is power-law. If neither of these conditions is satisfied, then we say that the network is hardly power-law. The threshold ξ̂ = 1/4 between the power-law and hardly power-law ranges is completely arbitrary, and one is free to choose any nonnegative value of ξ for this threshold, determining the value of γ above which one can hardly call a network power-law. We chose this value to be γ = 1 + 1/ξ = 5 for the reasons discussed in Section V.
Finally, in Section VI, we implement all the considered estimators in a software package [41] available to the public, and apply them to 115 real-world networks with more than 1,000 nodes collected from the KONECT database [42]. The collection contains many paradigmatic networks from different domains. Some of them were found to be power-law in the past (the Internet, for instance), while others were documented to be not power-law (road networks are a classic example). We find that the considered consistent estimators mostly agree with this classification, while overall, according to the definitions above, these estimators report that 49% of the considered undirected networks are power-law. Among the considered directed networks, 24% are power-law according to both in- and out-degree sequences, while 82% are power-law according to either in- or out-degree sequence. The bipartite networks exhibit a similar picture according to the estimators: 35% of them are power-law with respect to both types of nodes, while 74% are power-law according to at least one type of nodes.
In summary, if we relax the unrealistic requirement that degree distributions in real-world networks must be pure power laws, and allow for real-world impurity via regularly varying distributions, then upon the application of the state-of-the-art methods in statistics to detect such distributions in empirical data, we find that one can definitely not call scale-free networks "rare."

II. POWER-LAW DISTRIBUTIONS
We define a distribution to be power-law if it is regularly varying [25,26]. A distribution with PDF P(k) is regularly varying if its CCDF is

F(k) = ℓ(k) k^{−α},   (6)

where α > 0, and ℓ(x) is a slowly varying function, meaning that

lim_{x→∞} ℓ(tx)/ℓ(x) = 1   (7)

for any t > 0. If the PDF of a distribution satisfies (3) with some slowly varying function, then the distribution is regularly varying, i.e., its CCDF satisfies (6) with some other slowly varying function. The converse may or may not be true, as discussed in Appendix A. The definitions above formalize the point that the distribution exhibits a power-law tail at high degrees, but has an arbitrary shape at small degrees. They follow the well-established convention in probability, statistics, and finance [24][25][26][27][35][36][37][38], where regularly varying distributions are the best studied subclass of much larger classes of distributions, such as heavy-tailed and others, see Appendix A.
We also note that the definition in (6) allows one to make rigorous sense out of the '∼' sign in (1). If one draws on the loglog scale the functions log(ck) k^{−α} and C k^{−α}, for instance (both regularly varying, since both log(ck) and C are slowly varying functions of k for any constants c, C > 0), all one would see is a straight line at large k. This observation is formalized by Potter's Theorem [25, Theorem 1.5.6], stating that for any slowly varying function ℓ(k) and any α > 0, lim_{k→∞} ℓ(k) k^{−α} = 0. Therefore, in both cases one would be tempted to write F(k) ∼ k^{−α}, so that the power-law definition (6) is a perfect way to hide any distributional peculiarities that do not asymptotically influence the power-law shape of the distribution tail.
We emphasize here that due to the nature of slowly varying functions, definition (6) is intrinsically asymptotic, dealing with the k → ∞ limit. In particular, this implies that a distribution satisfying (6) can take any form for all degrees k < K below an arbitrarily large but fixed threshold K > 0.
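As a quick numerical illustration of the defining property of slow variation, ℓ(tx)/ℓ(x) → 1 as x → ∞ for any fixed t > 0, one can compare ℓ(x) = log x, which is slowly varying, with ℓ(x) = x^0.1, which is not. A minimal Python sketch (the function names are ours):

```python
import math

def variation_ratio(ell, t, x):
    """Return ell(t*x) / ell(x); this tends to 1 as x -> infinity
    if and only if ell is slowly varying."""
    return ell(t * x) / ell(x)

log_ell = math.log                # slowly varying
pow_ell = lambda x: x ** 0.1      # NOT slowly varying

for x in (1e3, 1e6, 1e12):
    print(f"x = {x:.0e}: "
          f"log ratio = {variation_ratio(log_ell, 2.0, x):.4f}, "
          f"power ratio = {variation_ratio(pow_ell, 2.0, x):.4f}")
```

The log ratio drifts toward 1 (slowly, as slowly varying functions do), while the power ratio stays at 2^0.1 ≈ 1.07 no matter how large x becomes.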
The simplest and most frequently seen examples of regularly varying distributions can be found in Appendix A.

III. CONSISTENT ESTIMATORS OF THE TAIL EXPONENT
We now turn to the question of how to estimate the tail exponent of a regularly varying distribution given a finite collection of samples (e.g., node degrees) from it. We employ three estimators, Hill [43], Moments [44], and Kernel [45], that are proven to be statistically consistent at this task. Consistency means that as the number of samples increases, the estimated values of the exponent are guaranteed to converge to the true exponent value regardless of the slowly varying function ℓ(k). This is in contrast with the PLFit, the power-law exponent estimator in [18,19], which is likely to be consistent only if the regularly varying distribution is a "clean power law," i.e., the zeta or Pareto distribution with constant ℓ(k), and possibly if ℓ(k) converges to a constant.
We note that all the considered estimators are consistent under the assumption that the data they are applied to is a collection of i.i.d. (independent identically distributed) samples from a regularly varying distribution. There is no, and cannot be any, hypothesis testing procedure that would tell whether a given sequence (of degrees in a (real-world) network) is an i.i.d. sequence from a regularly varying distribution, as we explain in detail in Section V. Therefore the application of these estimators to degree sequences of real-world networks can be justified only indirectly. In particular, their consistency has been recently proven for a wide range of preferential-attachment models, in which degree sequences are not exactly i.i.d. [46]. In the case of the configuration model [47][48][49], it is known that a degree sequence sampled i.i.d.'ly from a distribution with finite variance is graphical with positive probability [5, Theorem 7.21]. This probability is very close to 1/2 for any distribution with a finite mean that takes odd values with positive probability, a surprising fact proven in [50]. This means that random graphs with a power-law degree distribution can be sampled by first sampling i.i.d.'ly a degree sequence from the distribution, and then constructing a graph with this degree sequence using known techniques [51]. Such a graph exists with non-zero probability because the degree sequence is graphical with this probability. Applied to graphs constructed this way, the estimators are consistent because the degree sequences in these graphs are i.i.d. Yet proving the consistency of these estimators applied to other network models is an open research area, which is only tangentially related to justifying their application to real-world networks, since there cannot be any "ultimately best" model for any real-world network.
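The sampling pipeline just described (draw degrees i.i.d.'ly, fix the parity of the degree sum, wire the graph) can be sketched in a few lines of Python. The stub-matching routine below is the textbook construction and may produce self-loops and multi-edges, which is standard for the configuration model; the exponent and network size are illustrative, and all function names are ours:

```python
import random

def sample_pareto_degrees(n, gamma, rng):
    """Draw n i.i.d. degrees with a power-law tail P(k) ~ k^{-gamma}:
    inverse-transform sampling from a continuous Pareto with
    alpha = gamma - 1, floored to integers >= 1."""
    alpha = gamma - 1.0
    return [int(rng.random() ** (-1.0 / alpha)) for _ in range(n)]

def configuration_model(degrees, rng):
    """Wire a (multi)graph with the given degree sequence by pairing
    half-edges ("stubs") uniformly at random."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))

rng = random.Random(42)
degrees = sample_pareto_degrees(10_000, gamma=2.5, rng=rng)
if sum(degrees) % 2 == 1:     # the degree sum must be even to be realizable
    degrees[0] += 1
edges = configuration_model(degrees, rng)
print(len(edges))             # each edge consumes two stubs
```

The same construction is available as `configuration_model` in networkx, which likewise returns a multigraph.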
We also note that these estimators are actively employed in practice, in particular in financial mathematics [36,38,[52][53][54], where regularly varying distributions are abundant, the estimation of rare events is of key importance, and where the i.i.d. assumption cannot be proved to hold in real-world data either.
None of the considered estimators estimates the PDF or CCDF tail exponents γ or α = γ − 1 directly. Instead, they estimate the inverse of the latter, ξ = 1/α, because the consistency of their operation is proven using the fact that the regularly varying distributions comprise exactly the maximum domain of attraction (MDA) of the Fréchet distribution, which has ξ > 0 as its main parameter, and which is the limit distribution of the rescaled maximum value among n i.i.d. samples from a regularly varying distribution with exponent α; see Appendix B for details.
While the Hill estimator is consistent under the assumption that a given sequence is sampled only from a regularly varying distribution, the two other estimators are consistent for degree sequences sampled from any distribution belonging to the MDA of any extreme-value distribution. In fact, there exist only three families of extreme-value distributions. In addition to the Fréchet family with ξ > 0, there are only the Weibull and Gumbel families with ξ < 0 and ξ = 0, respectively. Roughly, the MDA of the Weibull family consists of distributions with upper-bounded supports, while the Gumbel MDA contains all other distributions, which can be either light-tailed or heavy-tailed, but not regularly varying. Appendix B contains all the details.
The key point here, which we rely upon in the next section, is that if the estimators, applied to a particular degree sequence, return either negative values of ξ, or values of ξ close to zero, then this sequence is quite unlikely to come from the Fréchet MDA, i.e., from a regularly varying distribution. Yet again, there is no way to quantify this unlikeliness rigorously, as explained in Section V.
Applied to n data samples x_1, x_2, …, x_n, the estimators operate by first sorting the data in non-increasing order, x_(1) ≥ x_(2) ≥ … ≥ x_(n), and then limiting their consideration only to the κ largest data samples x_(1), x_(2), …, x_(κ), where κ is a free parameter. Since the κ-th order statistic is a random variable representing the κ-th largest element among n i.i.d. samples from a distribution, the κ parameter is known as the number of order statistics. The estimators thus operate only on the κ-tail of the empirical distribution represented by the κ order statistics. Given this tail, different estimators provide different expressions, documented in Appendix B, for the estimated value ξ̂_{κ,n} of ξ, which depends on κ. These expressions rely on different aspects of the order statistics contained in the tail, but all these expressions are consistent, meaning that

ξ̂_{κ,n} → ξ as κ, n → ∞, κ/n → 0,   (8)

for all the estimators. The convergence above is usually in probability, although in some cases stronger results, such as almost sure convergence or asymptotic normality, are available under additional assumptions on the data. It is important to note here that in proving this convergence, the number of order statistics κ cannot be fixed; it must diverge with n to incorporate more and more data in the tail, so that the estimated value of ξ is less and less affected by the fluctuations in the tail. Yet κ cannot be equal to n either, since in this case the estimated ξ would be affected by the slowly varying function ℓ(k). This implies that when applied to finite-size data samples, these estimators will not give a good estimate of ξ for either small or large values of κ. One option to deal with this problem in practice is to investigate the plot of ξ̂_{κ,n} as a function of κ in order to find the value of κ where this function is "most flat/constant." This subjective approach clearly cannot be rigorous.
Worse, on real-world data, these functions can behave violently; see for instance the figures in Chapter 4 of [24] or in [55], so that finding such a flat region of ξ̂_{κ,n} may be quite problematic.
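As a concrete sketch of such a κ-scan, the Hill estimator in one commonly cited form, ξ̂_{κ,n} = (1/κ) Σ_{i=1}^{κ} log(x_(i)/x_(κ+1)), can be implemented and scanned over κ in a few lines of Python (function names are ours; the double bootstrap is deliberately omitted here):

```python
import math
import random

def hill_estimate(samples, kappa):
    """Hill estimate of xi = 1/alpha from the kappa largest order statistics:
    xi_hat = (1/kappa) * sum_{i=1..kappa} log(x_(i) / x_(kappa+1)),
    with x_(1) >= x_(2) >= ... the samples in non-increasing order."""
    xs = sorted(samples, reverse=True)
    ref = xs[kappa]                  # the (kappa+1)-th largest sample
    return sum(math.log(x / ref) for x in xs[:kappa]) / kappa

# Pure Pareto samples with alpha = 2, so the true xi = 1/alpha = 0.5.
rng = random.Random(1)
samples = [rng.random() ** (-1.0 / 2.0) for _ in range(100_000)]

for kappa in (100, 1_000, 10_000):   # scan the number of order statistics
    print(kappa, round(hill_estimate(samples, kappa), 3))
```

On this clean Pareto input the scan is flat around ξ̂ ≈ 0.5; for an impure power law the small-κ and large-κ ends of the scan drift away from the true value, which is exactly why κ must be tuned.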
Fortunately, for the three estimators that we consider, the double bootstrap method documented in Appendix C is proven to find the optimal value κ* of κ. Optimality means here that the error between the estimated and true values of ξ is minimized; see Appendix C. The double bootstrap method is also proven not to break consistency, meaning that as a function of n, the value of κ*_n diverges sublinearly, so that in view of (8), the estimated value of ξ, ξ̂_{κ*_n,n}, converges to the true ξ: ξ̂_{κ*_n,n} → ξ as n → ∞.
In addition to the Hill, Moments, and Kernel estimators, the Pickands estimator [56] and its generalized versions [57] are also often considered. However, only for one of these generalizations has the double bootstrap method been proven to be consistent; see Appendix B. Worse, in application to real-world data, the Pickands estimator has been shown to be unstable and volatile [57,58] and to have poor efficiency [45,59]. Many other ξ-estimators exist, see [60] for a review, but the proofs of consistency of the double bootstrap method are available only for the Hill, Moments, Kernel, and Pickands estimators.
Therefore, to the best of our knowledge, the Hill, Moments, and Kernel estimators are the maximal subset of consistent, stable, and efficient estimators for which the double bootstrap method, which automatically determines the optimal value of κ, is proven to be both optimal and consistent. The reason we consider not one but all such estimators is mentioned above: since, as we explain in Section V, there can be no hypothesis test to tell whether a given degree sequence is an i.i.d. sequence sampled from a regularly varying distribution, the best one can do is to consider as many consistent estimators as possible, testing as many different aspects of the degree sequence as possible, and see if they agree in their estimations [24].

IV. VALIDATION OF ESTIMATOR PERFORMANCE
In Appendix D we perform an in-depth validation of all three estimators based on extreme value (EV) theory from the previous section. We apply them to a collection of random sequences sampled from various distributions, as well as to degree sequences in three popular network models: the configuration model, preferential attachment, and random hyperbolic graphs. We also juxtapose these validation results against the performance of the PLFit algorithm from [18,19].
The results of these experiments are as expected: all the EV estimators converge to the true value of ξ if the distribution is regularly varying, and they do not converge if it is not. They also converge even in the case where we sample not from a fixed regularly varying distribution, but from a sequence of distributions that are not regularly varying yet converge to a regularly varying distribution: the case of a Pareto distribution with a diverging exponential cutoff. On degree sequences in network models where individual degrees are not i.i.d. samples from a fixed degree distribution, the estimators converge as well, even though the i.i.d. assumption no longer holds.
As far as the PLFit is concerned, we find in Appendix D that if the sampled distribution is sufficiently "nice," then the estimation accuracy and convergence rates of the PLFit are comparable to those of the EV estimators. However, in cases where the distribution is not so nice and further from a pure power law, the performance of the PLFit is substantially worse than that of the consistent EV estimator triplet. Specifically, the PLFit converges much more poorly than the EV estimators on distributions that can be fitted by power laws with wrong exponents in the region of small degrees; see Appendix D for details.
Remarkably, one example of this kind is one of the most basic models of power-law networks, preferential attachment. This explains the perplexing finding in [19] that preferential attachment networks are not power-law, contradicting the basic well-known fact that they are. However, this abnormal behavior of the PLFit should not come as a surprise, because the PLFit is likely to be consistent only on pure power laws and possibly on regularly varying distributions with slowly varying functions ℓ(k) converging to constants, although none of this has been proven, as opposed to the EV estimators that have already been proven to be consistent for any slowly varying ℓ(k).
For these and a number of other lower-level technical reasons, all documented in Appendix D, we exclude the PLFit from the subsequent considerations, and do not recommend using it on any real-world data, especially because there can be no way of knowing or learning the ℓ(k) function from any finite degree sequence coming from a real-world network, and so there is no way of knowing how wrong the PLFit estimation is on such a sequence. The impossibility of learning ℓ(k) is a consequence of the impossibility of testing the hypothesis that a given finite degree sequence is sampled from a regularly varying distribution with some ℓ(k). We explain this extremely important but commonly overlooked point in detail next.

V. POWER-LAW NETWORKS
There is no way to test the hypothesis that any given number was sampled from any given distribution that contains the number in its support. Yet if one has a long sequence of numbers, then there is a multitude of hypothesis testing procedures to measure how likely it is that this sequence was sampled from the distribution. The longer the sequence, the more reliable such procedures are, and any good procedure will give a definitive answer as the sequence length approaches infinity. This statistical methodology is widely known to work not only for a fixed distribution, but also for many parametric families of distributions. In the latter case, the testing involves one additional step: the parameters of the distribution are first to be estimated from the sequence using a consistent estimator.
A variation of this standard approach is at the core of [18,19], where the parametric family of distributions is the zeta/Pareto family. Their parameters, the tail exponents, are estimated using a combination of the likelihood maximization and Kolmogorov-Smirnov (KS) distance minimization techniques documented in Appendix D. Finally, the hypothesis testing procedure is the KS test, yielding a popular p-value reflecting roughly how likely it is that a given sequence comes from the pure power law with the estimated exponent.
We now come to the key point that this or any other hypothesis testing approach is not, and cannot be, applicable to regularly varying distributions, simply because these distributions do not form a parametric family. Instead, they are a nonparametric class of distributions of an asymptotic nature with an uncountably infinite number of "degrees of freedom" contained in the slowly varying functions ℓ(k) (Appendix A). These functions can be arbitrarily bad, breaking pure power laws for any arbitrarily large number of degree samples or range of degrees, while the true tail of the distribution can be inferred only in the limit of infinitely long sampled sequences, which one never has in practice.
To illustrate this extremely important point, we consider several examples, first of some artificial/adversarial nature, and then some studied networks and models. The first example is a regularly varying distribution with support on R_+ whose PDF, with γ > 2, α = γ − 1, and a constant X > 0, is

P(x) = (1 − X^{−α})/X  for x ∈ [0, X],
P(x) = α x^{−γ}        for x > X.

In words, this distribution is uniform on [0, X], and Pareto for x > X. It is regularly varying for any constant X > 0 because for x > X the slowly varying function of its CCDF, ℓ(x), is a constant. Suppose now that X ≫ 1. It is clear that if we draw n < X^α samples from this distribution, then there is no way to infer from these samples that this distribution is regularly varying with exponent γ, because almost all samples will be from the uniform part of the distribution. Only when the number of samples becomes n ≫ X^α ≫ 1 can we expect to discover that this distribution has a power-law tail. Clearly, one can replace the uniform part of the distribution with an arbitrary function, thus reflecting the reality of degree sequences observed in many real-world networks much more closely.
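A hypothetical sampler makes this adversarial example tangible: draw from the uniform part on [0, X] with probability 1 − X^{−α} and from the Pareto tail with probability X^{−α}, so that the CCDF is exactly x^{−α} for x > X. With X^α = 1000, a sample of size n = 100 essentially never reaches the tail (the parameter values are illustrative):

```python
import random

def sample_uniform_pareto(n, alpha, X, rng):
    """Uniform on [0, X] w.p. 1 - X**(-alpha); Pareto tail beyond X
    w.p. X**(-alpha), so that the CCDF is x**(-alpha) for x > X."""
    p_tail = X ** (-alpha)
    out = []
    for _ in range(n):
        if rng.random() < p_tail:
            out.append(X * rng.random() ** (-1.0 / alpha))  # conditional tail
        else:
            out.append(X * rng.random())                    # uniform bulk
    return out

rng = random.Random(7)
alpha, X = 1.5, 100.0            # X**alpha = 10**3
for n in (100, 1_000_000):
    s = sample_uniform_pareto(n, alpha, X, rng)
    print(n, sum(x > X for x in s))  # samples that ever reach the tail
```

Only once n greatly exceeds X^α does the power-law tail show up in the sample at all.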
As another example, consider the Pareto distribution with a fixed exponential cutoff at X > 1:

P(x) = c x^{−γ} e^{−x/X},  x ≥ 1,

where c is the normalization constant.
This distribution is not regularly varying, yet if our sample size is n < X^α, where α = γ − 1, then we will be tempted to conclude that it is, and that the exponent is γ, because almost all samples will be from the Pareto part of the distribution. Only if n ≫ X^α will we see that this distribution does not have a power-law tail.
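This second deceptive regime is equally easy to reproduce. The sketch below samples a cutoff distribution of the form x^{−γ} e^{−x/X}, x ≥ 1 (the normalization constant is irrelevant for rejection sampling), by rejection from a pure Pareto proposal; parameter values are illustrative and the function name is ours:

```python
import math
import random

def sample_cutoff_pareto(n, gamma, X, rng):
    """Rejection-sample from P(x) ~ x**(-gamma) * exp(-x / X), x >= 1.
    Proposal: pure Pareto x = U**(-1/alpha) with alpha = gamma - 1;
    accept with probability exp(-x / X), which thins the far tail."""
    alpha = gamma - 1.0
    out = []
    while len(out) < n:
        x = rng.random() ** (-1.0 / alpha)    # pure Pareto proposal
        if rng.random() < math.exp(-x / X):   # the exponential cutoff
            out.append(x)
    return out

rng = random.Random(3)
samples = sample_cutoff_pareto(10_000, gamma=2.5, X=1_000.0, rng=rng)
# With n well below X**alpha, the cutoff is barely felt and the
# sample is practically indistinguishable from a pure Pareto one.
print(len(samples), round(max(samples)))
```

Feeding such a sample to any tail-exponent estimator would "confirm" a power law that the distribution does not actually have.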
To see that such deceiving situations can occur in quite reasonable network models and even real-world networks, we refer to two examples. The first one is superlinear preferential attachment, with the probability of connections from new to existing nodes of degree k proportional to k^δ, where δ > 1. For any such δ the limit degree distribution is not regularly varying: the number of nodes with degrees exceeding a certain fixed threshold is finite [61]. Yet this threshold becomes larger as δ approaches 1. The threshold is also a growing function of the average degree k̄, i.e., the number of links that new nodes establish. More importantly, the larger this threshold, the more slowly the degree distribution approaches its limit, appearing as a reasonably "clean" power law in its vast preasymptotic regime. For example, for δ = 1.15 and k̄ = 4, there are no noticeable deviations from this seemingly pure power-law behavior until the network size reaches about 10^17 [61].
The other example is the causal set of the universe, a real-world network representing the Lorentzian structure of spacetime of our accelerating universe at the Planck scale [62]. The degree distribution in this network is a double power law in the limit of the infinite age of the universe: the low- and high-degree portions of the distribution have exponents γ = 3/4 and γ = 2, respectively, so that the overall distribution is regularly varying with exponent γ = 2. However, for any finite age of the universe, only a part of this limit distribution is present. In particular, in the present-day universe the high-degree portion is not present at all, a fact directly related to the unexplained "why now?" coincidence in cosmology: the (dark) matter and dark energy densities in the universe are of the same order of magnitude in Planck units today.
Only when dark energy starts dominating in the future, and the universe becomes at least twice as old, will the high-degree γ = 2 portion of the degree distribution start appearing [62]. By that time, the size of the network corresponding to the observable part of the universe will be about 10^247, the observable spacetime volume in Planck units.
All these examples illustrate the point that based on any given finite degree sequence (of a real-world network), there is absolutely no way to tell how likely the hypothesis is that this sequence was sampled from a regularly varying distribution. In view of this impossibility, the best strategy one can follow is to simply rely on the estimates of the consistent estimators discussed in the previous section [24]. If the estimates of ξ that these estimators report on a given sequence are all positive, then it might be the case that this sequence comes from a regularly varying distribution. Yet if these estimates are negative or close to zero, then the chances of that are slim. However, there is absolutely no, and cannot be any, rigorous way to quantify these chances, using p-values or any other methods, for the reasons above. This is the key point in the paper.
In view of these considerations, we take a conservative approach and propose the following definition of power-law networks, based on the values of ξ that the three estimators from the previous section return upon their application to the degree sequence in a given network: a network is power-law if all three estimators return ξ > 1/4; it is hardly power-law if all three estimators return ξ > 0 but at least one of them returns ξ ≤ 1/4; and it is not power-law if at least one estimator returns ξ ≤ 0. In purely intuitive and non-rigorous terms, the larger the ξ, the more likely it is that the degree sequence comes from a distribution with a power-law tail. These chances are the smaller, the closer the positive ξ is to zero, and we take a conservative approach to doubt that the network is power-law if ξ ≤ 1/4. If ξ ≤ 0, these chances are really slim. Unfortunately, as discussed above, it is principally impossible to attach any rigorous quantifiers to this intuition.
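The classification just described can be sketched as a small function. It takes the three ξ estimates and applies the ξ > 0, ξ > 1/4, and ξ > 1/2 thresholds used in the text; the exact handling of the boundary values is our assumption, and the class labels are ours.

```python
def classify_network(xis, hpl_threshold=0.25, dsm_threshold=0.5):
    """Classify a network from the tail-index estimates `xis` returned
    by the three consistent estimators (Hill, Moments, Kernel)."""
    if any(xi <= 0 for xi in xis):
        return "not power-law"           # at least one estimator says NPL
    if any(xi <= hpl_threshold for xi in xis):
        return "hardly power-law"        # all positive, but some <= 1/4
    if all(xi > dsm_threshold for xi in xis):
        return "power-law (DSM)"         # divergent second moment, gamma < 3
    return "power-law"
```

For instance, estimates (0.6, 0.7, 0.9) yield a power-law network with a divergent second moment, while (0.1, 0.6, 0.4) yields a hardly-power-law one.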
We also note that the choice of the hardly-power-law threshold ξ = 1/4 is completely arbitrary, and in view of the considerations above we perhaps should not have defined any hardly-power-law networks at all, and should simply call a network power-law if all the ξs are positive. Yet if ξ = 0.01, for instance, then γ = 101, and to call a network with such a γ power-law would be quite a stretch of terminology. To avoid this, we simply select the largest value of γ that is known to us to still matter, that is, to correspond to a critical point: we are unaware of any such value larger than γ = 5, which appears in the Curie-Weiss model [63].
Another aspect of our conservatism is that the values of ξ > 0 close to zero do not really exclude the possibility that the distribution is not heavy-tailed. The lognormal distribution, for example, is not regularly varying, but it is heavy-tailed and belongs to the Gumbel MDA with ξ = 0, see Appendix B. Yet this MDA contains light-tailed distributions as well, so that one cannot really tell whether the distribution is light-tailed or heavy-tailed if ξ = 0. On the other hand, a positive ξ does tell us that the distribution is regularly varying, but it is difficult to distinguish between ξ̂ converging slowly to zero and ξ just being quite small, so that γ is quite large, which is yet another motivation to define the hardly-power-law regime.
Finally, power-law networks with divergent second moments of their degree distributions, meaning γ < 3 and ξ > 1/2, are particularly interesting for a variety of reasons. For example, they are particularly robust thanks to the absence of the percolation threshold [6], they are ultrasmall worlds versus small worlds [64], the degree correlations in them are unavoidable due to structural constraints [65], etc. Therefore, we also define a subclass of power-law networks with divergent second moments of their degree distributions: A power-law network has a divergent second moment (DSM) if all the estimators return values of ξ > 1/2.

VI. REAL-WORLD NETWORKS
Here we apply the Hill, Moments, and Kernel estimators to a collection of degree sequences in real-world networks from the KONECT database [42]. The database is a curated collection of real-world networks categorized by several network attributes such as link directionality, weights, network size, etc. The database uses a unified edge list format to store the data, which simplifies the automation of data processing. Better yet, the database allows one to sort networks by their properties, and to filter out networks with possibly incomplete information. This is in contrast with other large network collections, such as ICON [66], that link their entries to third-party databases of various formats and origins, which makes it quite difficult to collect and process the data in an automated manner.

FIG. 1. The shown network names are their codenames used in the KONECT database [42], and they also appear in Tables I-III. Each panel shows the empirical complementary cumulative distribution function (CCDF) F̄(k) of the degree sequence in log-log scale. The straight lines visualize the estimated values of the CCDF exponents α = 1/ξ. The filled circles mark the optimal numbers of order statistics κ* found by the double bootstrap method; the estimators operate only on the κ* largest degrees. The estimated values of α are α̂ = 1/ξ̂(κ*), where ξ̂(κ) is the estimated value of the tail index ξ as a function of κ. For non-positive values of ξ̂(κ*), α̂ is undefined, so the legends in panels (c,f,i) show ξ̂ = ξ̂(κ*) instead.
To streamline data processing, we do not consider networks in the database that are not downloadable in the KONECT edge list format. We also ignore temporal networks to avoid arbitrariness in selecting the temporal scale for data aggregation. Among database entries that possibly represent the same real-world network (for example, Wikipedia (EN) hyperlinks and Wikipedia (EN) links, both representing the English Wikipedia), we select only one entry. We also ignore networks that are marked as incomplete in the database. Finally, since the estimation of ξ cannot be reliable for networks of a small size, we only consider networks larger than n = 1000 nodes. The KONECT database contains not only undirected networks, but also directed and bipartite ones. For the latter two classes, we obtain not one but two degree sequences for each network: the in- and out-degree sequences for directed networks, and two degree sequences for each of the two types of nodes in bipartite networks. We also remove all self-loops and multi-edges from all the collected networks. After these filtering steps, we are left with 115 networks of three different types: undirected (35), directed (49), and bipartite (31). The degree sequences of these networks are available at the software package repository [41].
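The final preprocessing step (removing self-loops and multi-edges, then extracting the degree sequence) can be sketched as follows; the function name and output conventions are ours, not those of the KONECT tooling.

```python
from collections import Counter

def clean_degree_sequence(edges, directed=False):
    """Drop self-loops and multi-edges from an edge list, then return the
    degree sequence (undirected) or the (in, out) sequences (directed)."""
    simple = set()
    for u, v in edges:
        if u == v:
            continue  # drop self-loops
        e = (u, v) if directed else (min(u, v), max(u, v))
        simple.add(e)  # the set drops multi-edges
    if directed:
        indeg, outdeg = Counter(), Counter()
        for u, v in simple:
            outdeg[u] += 1
            indeg[v] += 1
        return ([indeg[x] for x in sorted(indeg)],
                [outdeg[x] for x in sorted(outdeg)])
    deg = Counter()
    for u, v in simple:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg.values(), reverse=True)
```

For example, the undirected edge list [(1,2), (2,1), (1,1), (2,3)] reduces to the two simple edges (1,2) and (2,3), with degree sequence [2, 1, 1].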
We then feed the obtained degree sequences to the three estimators. Figure 1 shows the exponent estimation results that the estimators produce on some paradigmatic real-world networks in different domains, while Tables I, II, and III contain the full lists of these estimations for the undirected, directed, and bipartite networks, respectively. We see that many networks that were found to be power-law in the past are classified as such by these estimators as well. These include the Internet, WWW, human protein interactions, social group memberships, citations, and product recommendation networks. Conversely, the networks that are known not to be power-law are classified as not power-law: the California road network or the out-degree distribution in the directed network of Amazon product recommendations, for instance.
We emphasize again the importance of using as many consistent estimators as possible: on any finite degree sequence, different estimators are not guaranteed to return the same estimate of ξ, as they may explore different parts of the distribution, especially if the slowly varying function ℓ(k) is non-trivial, Appendix C. That is why we use the maximal subset of stable and efficient estimators for which the double bootstrap method to determine the optimal number of order statistics κ* is proven to be consistent.
Finally, Figure 2 summarizes the estimation results for γ = 1 + 1/ξ in Tables I-III by classifying all the considered networks into the not power-law (NPL), hardly power-law (HPL), and power-law (PL) classes, the latter containing the subclass of power-law networks with divergent second moments (DSM), defined in the previous section. We see that the percentages of power-law and DSM undirected networks are 49% and 29%, respectively. Among the considered directed networks, 24% and 6% have both in- and out-degree sequences that are power-law and DSM, while 82% and 45% of these networks have either an in- or out-degree sequence that is power-law and DSM, with the majority of those being in-degree sequences. The bipartite networks exhibit a similar picture: 35% and 13% of them are power-law and DSM according to both types of nodes, while 74% and 55% are power-law and DSM according to at least one type of nodes.
While these results cannot be directly compared to the ones in [19], they paint quite a different picture. Simply put, one can hardly call scale-free networks rare by any stretch of what rare might mean.

VII. CONCLUSION AND DISCUSSION
In summary, even though there are many hypothesis tests to tell whether a given degree sequence comes from a pure power law, that is, a Pareto or zeta distribution, our key point is that there is no, and cannot be any, hypothesis test to tell whether a given finite sequence comes from an impure power law, i.e., a regularly varying distribution. If one relies on an appropriate hypothesis test to look for real-world networks in which the degree sequences follow pure power laws, as it is done in [19], then s/he is doomed to rarely find such heavenly purity, as there can hardly be any reasons for its presence in earthly complex systems. However, if we allow for impurity in these systems, then we face a rather different picture as presented here.
The impossibility of hypothesis testing for regularly varying distributions is the reason why one cannot attach any statistical weight, such as p-values, to the statement that a given finite sequence is regularly varying or not. Yet many other aspects of the current state of affairs in statistics related to detecting power laws in empirical network data do allow for improvement, so we comment on some of them here.
First, the existing consistent estimators for such inferences are based on extreme value (EV) theory. These estimators cannot differentiate between heavy-tailed and light-tailed distributions, simply because the maximum domain of attraction of the Gumbel EV distribution contains distributions of both types, for example the light-tailed normal and the heavy-tailed lognormal distributions, Appendix B. Since for many applications in network science the important question is whether a degree distribution is heavy- or light-tailed, versus regularly varying or not, it is of particular interest to devise other estimators, not based on EV theory, that would be capable of differentiating between these two types of distributions.
Yet even for EV-based estimators there are many paths to improve their applicability and rigorous guarantees. For example, it would be nice to relax the i.i.d. assumption for these estimators, and to prove their consistency in application to network models. The first step in this direction was made in [46]. We saw in our experiments in Appendix D that all the considered estimators converge in all the considered network models, but there are no proofs for this convergence for any network models other than preferential attachment, to the best of our knowledge.
Another important open problem is the convergence speed. All we currently know is that the considered estimators converge to the true value of the power-law exponent γ on sequences of random numbers of increasing length n sampled i.i.d. from any regularly varying distribution with this γ, but we do not know how quickly this convergence occurs. The speed of this convergence likely depends not only on γ but also on the slowly varying function ℓ(k). Thus, the problem is to obtain bounds, as functions of n, on the error of the estimation of γ for a given γ and ℓ(k). Can such bounds be obtained for certain classes of ℓ(k)s?
More pertinent to networks, and also closely related to the convergence speed, is the question of dealing with not one sequence but with sequences of sequences. For some real-world networks there exist data not only on one snapshot of a network but also on historical series of such snapshots. In this case, we have not one degree sequence but a series of degree sequences. One can then apply the estimators to these series, obtaining a series of estimates. Given such estimate series and the length of the sequence attached to each element of the series, i.e., the network size, can one extract any additional information about the convergence of the series, and possibly devise some tests of the hypothesis that the series comes from a regularly varying generative process? To the best of our knowledge, these questions are wide open.
Another network-specific issue is that degree sequences are integer- rather than real-valued, and the considered estimators are known to be unstable and to converge quite slowly in the case of integer-valued regularly varying distributions, Appendix C. We circumvent this issue in our experiments by adding symmetric uniform noise to the degrees, but it would be nice to design estimators that work reliably on integer-valued data directly.
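The noise-injection workaround can be sketched in one line. The half-width 1/2 below is a natural choice that keeps the integers recoverable by rounding and does not affect the tail exponent, though the exact noise parameters used in any particular experiment are an assumption here.

```python
import random

def add_uniform_noise(degrees, rng):
    """Turn an integer-valued degree sequence into a real-valued one by
    adding symmetric uniform noise on (-1/2, 1/2); a regularly varying
    tail keeps the same exponent under this perturbation."""
    return [k + rng.uniform(-0.5, 0.5) for k in degrees]

rng = random.Random(0)
degrees = [3, 3, 5, 7, 12, 40]
noisy = add_uniform_noise(degrees, rng)
```

Rounding each noisy value recovers the original integer degree, so no information is lost.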
Another down-to-earth issue is the second-order condition assumed by the double bootstrap method, Appendix C. This condition is violated by vanilla power-law distributions such as the Pareto or zeta distributions. We saw in our experiments in Appendix D that the estimators converge in these cases as well, but there are no proofs that the double bootstrap method is optimal in these cases.
Finally, we comment on the important issue of cutoffs that often causes much confusion. Here we have to differentiate between many possibilities of what a cutoff might mean. Two classes of such possibilities are finite-size effects and true cutoffs. In the first case, a cutoff is just an illusion due to a finite system size. If one samples an insufficiently large number of i.i.d. samples from a regularly varying distribution, the empirical distribution of these samples may appear to have a cutoff, even though the distribution, by definition of regularly varying, does not have any cutoffs. In simple terms, the tail of the empirical distribution may bend downwards, but this effect is simply due to the insufficient number of samples. In such cases, if one explores the empirical distribution tail, s/he finds only a few data points there. We note that in such cases, the EV theory gives not only the expected value of the maximum among these samples, but also its distribution, Appendix B.
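A tiny simulation makes the finite-size "cutoff" illusion concrete: the sampled Pareto distribution has no cutoff at all, yet the empirical tail simply runs out of data points at around n^{1/α}. All parameter values below are illustration choices.

```python
import random

rng = random.Random(1)
n, alpha = 1000, 1.5
xs = [rng.random() ** (-1.0 / alpha) for _ in range(n)]  # Pareto(1, alpha)

# The true CCDF is x**(-alpha), with no cutoff; the expected number of
# sample points above a threshold t is n * t**(-alpha).
t = (n / 10.0) ** (1.0 / alpha)     # ~10 points expected above t
deep = (10.0 * n) ** (1.0 / alpha)  # ~0.1 points expected above deep
points_at_t = sum(1 for x in xs if x > t)
points_deep = sum(1 for x in xs if x > deep)
```

In a log-log plot of the empirical CCDF, the region beyond t is supported by only a handful of points, which is what produces the apparent downward bend.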
In networks, however, this maximum can simply not be greater than n, the degree sequence length equal to the network size, and there are other kinds of degree correlations and degree sequence constraints that are forced by the network structure, many documented in [65], for instance. These constraints can be such that the degree distribution does have true cutoffs. More generally, it may very well happen that the process driving the evolution of a given network is such that its degree distribution does converge to a distribution with true cutoffs, sharp or soft. Examples are the preferential attachment model with a preference kernel which is constant above a certain degree threshold [67,Section 4], or the causal set of the universe [62] from Section V. In these cases, one has to further differentiate between the following two possibilities. First, the cutoff can be constant, that is, independent of the network size/degree sequence length. In this case, the distribution is not regularly varying by definition, so that one cannot call it power-law.
But the other possibility is that the cutoff diverges with the network size. In this case we have a scenario that can possibly be modeled by random sequences of varying length n sampled from a sequence of distributions parameterized by n. If their cutoff diverges with n, then the latter sequence may or may not converge to a regularly varying distribution in the n → ∞ limit. In Appendix D we considered an example of this sort, diverging natural exponential cutoffs, where the n-dependent distributions do converge to the regularly varying Pareto distribution. We saw there that even in this case the considered estimators converge to the true values of γ, even though the key assumptions behind the proofs of their convergence are violated. Proving the consistency of these and other estimators for sequences of random numbers sampled from sequences of distributions converging to regularly varying distributions is thus yet another open problem.
Notwithstanding these open problems, the consistent estimators considered in this paper represent the current state of the art in detecting power laws in empirical data. Their implementation is available in [41], and their application to a representative collection of degree sequences in real-world networks reveals a picture quite different from the one painted in [19].

ACKNOWLEDGMENTS

We thank S. Resnick

Appendix A: Heavy-tailed distributions

Here we briefly review the taxonomy of distributions with heavy tails and provide the definition of the simplest and most frequently seen regularly varying distributions. All the distribution classes mentioned here are characterized by the key property that their tails decay more slowly than exponentially. The most general class is that of the heavy-tailed distributions. We focus on distributions with support on R+. Chapters 2 and 3 in [26] contain further details. A distribution with CCDF F̄(x) is heavy-tailed if for every λ > 0, lim sup_{x→∞} F̄(x) e^{λx} = ∞. In words, this definition literally says that the tail F̄(x) of the distribution decays more slowly than exponentially.
The class of heavy-tailed distributions is quite vast and general which makes it rather difficult to work with them in their full generality. Therefore, many different narrower and more tractable subclasses of heavy-tailed distributions have been defined and studied, see Figure 3 for an overview of the landscape of heavy-tailed distributions. For completeness, we briefly discuss two important subclasses that encapsulate regularly varying distributions, which are our main interest.
a. Long-tailed distributions. A distribution with CDF F(x) is called long-tailed [26, Definition 2.21] if its CCDF satisfies, for any fixed y > 0, lim_{x→∞} F̄(x + y)/F̄(x) = 1. Not every heavy-tailed distribution is long-tailed: a standard counterexample is a heavy-tailed distribution whose PDF f, defined via the indicator function 1, is concentrated on a sparse collection of intervals, so that F̄(x + y)/F̄(x) does not converge to 1 and f is not long-tailed. b. Subexponential distributions. Let (F * F)(x) be the convolution of the CDF F(x) with itself. That is, F * F is the CDF of X + X′, where X and X′ are independent random variables with CDF F. A distribution with CDF F(x) is said to be subexponential [26, Definition 3.1] if lim_{x→∞} (F * F)-bar(x)/F̄(x) = 2, where (F * F)-bar denotes the CCDF of the convolution. This definition means that if X and X′ are independent samples from a subexponential distribution, then the CCDF of X + X′ is asymptotically twice as large as the CCDF of the original distribution. This property implies, for instance, that if the sum Σ_{i=1}^{n} X_i of n independent samples from a subexponential distribution exceeds some large threshold, then it is because just one X_i has exceeded this threshold. This is in contrast to independent samples from a Poisson distribution, for instance, whose sums exceeding a large threshold do not contain, with high probability, any terms exceeding this threshold.
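This "single big jump" contrast is easy to verify numerically. The sketch below compares sums of Pareto samples (subexponential) with sums of Poisson samples (light-tailed); the thresholds and sample sizes are arbitrary illustration choices.

```python
import random

def pareto(rng, alpha=1.5):
    return rng.random() ** (-1.0 / alpha)  # Pareto(1, alpha)

def poisson(rng, lam=1.0):
    """Count unit-rate arrivals in [0, lam); avoids exp(-lam) underflow."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

rng = random.Random(3)
terms = 10

# Pareto: among large sums, one single term usually carries most of the sum.
big, jump = 0, 0
for _ in range(100_000):
    xs = [pareto(rng) for _ in range(terms)]
    s = sum(xs)
    if s > 150.0:
        big += 1
        jump += max(xs) > s / 2
pareto_frac = jump / big

# Poisson: large sums are built from many moderate terms instead.
big_p, jump_p = 0, 0
for _ in range(50_000):
    ks = [poisson(rng) for _ in range(terms)]
    s = sum(ks)
    if s >= 20:
        big_p += 1
        jump_p += max(ks) >= s / 2
poisson_frac = jump_p / big_p
```

In the Pareto case the largest term dominates most large sums, while in the Poisson case it essentially never does.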
The class of subexponential distributions is contained in that of long-tailed distributions [26, Lemma 3.2], hence they are heavy-tailed. In fact, the containment is strict. However, unlike the heavy-tailed versus long-tailed case, examples of long-tailed distributions that are not subexponential are more involved, see Section 3.7 in [26].
Our main interest is in regularly varying distributions, which form a subclass of subexponential distributions [26,Theorem 3.29]. This hierarchy endows regularly varying distributions with all the nice theoretical properties of the subexponential and long-tailed ones, but in contrast to these more general classes, regularly varying distributions are equipped with a concise and tractable representation that makes them very convenient to work with in statistical inference settings.

Regularly varying distributions
A function f(x) is said to be regularly varying at infinity with index α [25,26] if there exists a slowly varying function ℓ(x) such that f(x) = ℓ(x) x^{−α}, where a slowly varying function ℓ(x) is defined as a function satisfying, for any t > 0, lim_{x→∞} ℓ(tx)/ℓ(x) = 1. The simplest examples of slowly varying functions are functions converging to constants, or log^a(bx) for any a ∈ R and b > 0. The full class of slowly varying functions is of course much richer, and it is fully characterized by Karamata's Representation Theorem [24, Corollary 2.1], stating that ℓ(x) = c(x) exp(∫_1^x ε(t)/t dt) for some functions c, ε such that c(x) → c > 0 and ε(x) → 0 as x → ∞. The theory of regular variation is a rich and well-developed one, and for further details we refer to [25].
A distribution is defined to be regularly varying if its CCDF F (x) is a regularly varying function. In Section II we also define a power-law distribution to be a regularly varying distribution.
We note that if the PDF of a distribution is regularly varying, then so is its CCDF, with another slowly varying function, according to Karamata's theorem [24, Theorem 2.1] in the case of continuous distributions, and to [68, Lemma 9.1] in the case of discrete ones. The converse is not generally true, and depends on the exact form of the slowly varying function ℓ(x). A simple example of a distribution whose PDF is not regularly varying but whose CCDF is, is given by the PDF P(x) = c sin²(x) x^{−3} with support x ≥ 1, and c the normalization constant. This PDF is not regularly varying since ℓ(x) = c sin²x is not slowly varying. However, the CCDF of this distribution, F̄(x) = ℓ̄(x) x^{−2}, where ℓ̄(x) = (c/2)[sin²x + x sin(2x) − 2x² Ci(2x)] and Ci(x) = −∫_x^∞ cos(t)/t dt is the cosine integral, is regularly varying because ℓ̄(x) is slowly varying: it converges to the constant c/4 as x → ∞.
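The claim that ℓ̄(x) = x² F̄(x) approaches a constant can be checked numerically. The sketch below evaluates the unnormalized integral (setting c = 1), for which the limit stated above is 1/4; the finite upper integration limit and step count are our numerical choices.

```python
import math

def x2_ccdf(x, upper=2000.0, steps=400_000):
    """Trapezoidal estimate of x**2 * integral_x^upper sin(t)**2 / t**3 dt.
    The neglected tail beyond `upper` is below x**2 / (2 * upper**2)."""
    f = lambda t: math.sin(t) ** 2 / t ** 3
    h = (upper - x) / steps
    total = 0.5 * (f(x) + f(upper))
    for i in range(1, steps):
        total += f(x + i * h)
    return x * x * total * h
```

At moderately large x the value oscillates around 1/4 with amplitude of order sin(2x)/(4x), consistent with ℓ̄(x) being slowly varying rather than convergent at any finite x.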
As a subclass of heavy-tailed distributions, regularly varying distributions can model data with high variability, yet here we stress again that they are far from being as general as heavy-tailed distributions. In particular, if given data fail to be regularly varying, it does not necessarily mean that they are not heavy-tailed or even subexponential. The simplest example of a subexponential distribution that is not regularly varying is the lognormal distribution. On the other hand, regularly varying distributions are a vast generalization of the pure power laws exclusively considered in [18,19], i.e., of the Pareto distribution if x is continuous, or of the generalized zeta distribution (2) if x is integer-valued.

Simplest examples of regularly varying distributions
To make the definition (A3) more concrete, here we give the simplest examples of regularly varying distributions, both continuous and integer-valued ones.
The ultimately simplest example is the continuous Pareto distribution with scale x* and shape α, or exponent γ = α + 1: P(x) = α(x*)^α x^{−γ} for x ≥ x*. Here the slowly varying function is simply the constant ℓ_Pareto(x) = α(x*)^α, which does not vary at all. There are two simple ways to turn a continuous regularly varying distribution into an integer-valued one, both of which again belong to the class of regularly varying distributions. In the first example, we simply take the integer k to be the floor of the continuous value, k = ⌊x⌋, so that P(k) = (x*)^α[k^{−α} − (k + 1)^{−α}] = ℓ_floor(k) k^{−γ}, where ℓ_floor(k) converges to α(x*)^α as k → ∞. We note that in this example the slowly varying function is not a constant, yet it approaches a constant asymptotically. The second example is a mixed Poisson distribution [69] with Pareto mixing. The easiest way to define a mixed Poisson distribution is via the procedure to sample from it: as its name suggests, first sample x from the Pareto distribution, and then sample k from the Poisson distribution with mean x. The resulting PMF of k is thus P(k) = α(x*)^α Γ(k − α, x*)/k! = ℓ_mPois(k) k^{−γ}, where Γ(k, x) is the upper incomplete Gamma function. The function ℓ_mPois(k) is slowly varying, and its k → ∞ limit is, as in the previous example, α(x*)^α. Mixed Poisson distributions appear often as exact degree distributions in network models with hidden variables [70], also known as inhomogeneous random graphs [71], or more generally, graphon-based W-random graphs [72]. Both the expected value and the tail exponent of the mixed Poisson k are equal to those of the Pareto x, versus the floored Pareto, in which the expected value of k is k̄ = ζ(α) when x* = 1, where ζ is the Riemann zeta function.
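A quick numerical sanity check of the two integer-valued constructions (parameter values are illustration choices): with x* = 1 and α > 1, the floored Pareto has mean Σ_{k≥1} k^{−α} = ζ(α), while the mixed Poisson inherits the Pareto mean α/(α − 1).

```python
import random

def pareto(rng, alpha, xstar=1.0):
    return xstar * rng.random() ** (-1.0 / alpha)

def poisson(rng, lam):
    """Count unit-rate arrivals in [0, lam); safe for large means lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

rng = random.Random(5)
alpha, n = 2.5, 200_000
floored = [int(pareto(rng, alpha)) for _ in range(n)]            # k = floor(x)
mixed = [poisson(rng, pareto(rng, alpha)) for _ in range(n)]     # k ~ Poisson(x)

zeta_alpha = sum(k ** -alpha for k in range(1, 100_000))  # ~ zeta(2.5)
mean_floored = sum(floored) / n
mean_mixed = sum(mixed) / n
pareto_mean = alpha / (alpha - 1.0)                       # = 5/3 here
```

Both empirical means land on their respective predictions, illustrating that only the mixed Poisson preserves the mean of the underlying Pareto.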

Appendix B: Consistent estimators for tail exponents of regularly varying distributions
Here we give the definitions of the three consistent estimators of the tail index of a regularly varying distribution that we use to infer the tail exponents in synthetic and real-world degree sequences. Two other consistent estimators that are included in our software package [41] are defined here as well.
Although we work only with regularly varying distributions, the used estimators are actually designed to estimate the index of an extreme value distribution. In fact, the consistency results are proven under the assumption that the distribution belongs to the maximum domain of attraction of an extreme value distribution. It turns out that any regularly varying distribution satisfies this assumption. Therefore, we start with a brief review of extreme value distributions and their maximum domains of attraction, and then explain how these concepts are employed by the consistent estimators of tail exponents.

Extreme value distributions and their maximum domains of attraction
Let x_1, ..., x_n be an i.i.d. sequence sampled from some distribution P(x), and denote by m_n = max_{1≤i≤n} x_i the largest value in the sequence. Extreme value theory is concerned with the properties of the distribution of m_n, whose CDF is given by F^n(x). The typical question is whether there is a non-degenerate limit law, i.e., not a delta-function distribution, for µ_n = (m_n − d_n)/c_n for some appropriately chosen n-dependent constants c_n > 0 and d_n ∈ R. A degenerate limit for µ_n exists for any distribution, as one can always select d_n = 0 and any c_n growing with n faster than the expected value of m_n, in which case the distribution of µ_n would approach the delta-function distribution centered at zero. However, a non-degenerate limit exists only if lim_{x→X_F} F̄(x)/F̄(x−) = 1, (B1), where X_F = sup{x : F(x) < 1} is the right endpoint of the distribution, which can be infinite, and F(x−) = lim_{t→∞} F(x − 1/t) is the left limit of F at x. In words, this requirement states that F̄(x) must be sufficiently flat at its right end and must not jump there. For the Poisson distribution, for instance, the ratio F̄(k)/F̄(k − 1) tends to 0 as k → ∞, while for the geometric distribution with success probability p, F̄(k)/F̄(k − 1) → 1 − p, violating (B1) in both cases.
If a distribution P(x) does satisfy (B1), so that a non-degenerate limit distribution of µ_n = (m_n − d_n)/c_n does exist, this latter distribution P(µ) is called an extreme value distribution [73,74]. An important result [75] (see also [73, Proposition 0.3]) states that extreme value distributions are parameterized by an index parameter ξ ∈ R, and that the class of extreme value distributions consists of just three subclasses, the Fréchet, Gumbel, and Weibull distributions, corresponding, respectively, to ξ > 0, ξ = 0, and ξ < 0. The CDFs F(µ) of these three distributions can be grouped into the CDF of the generalized extreme value distribution, F(µ) = exp[−(1 + ξν)^{−1/ξ}] with ν = (µ − l)/s, (B2), understood as F(µ) = exp(−e^{−ν}) in the ξ = 0 case, where l ∈ R and s > 0 are known as, respectively, the location and scale parameters. The supports of the distributions are ν ≥ −1/ξ for ξ > 0, ν ≤ −1/ξ for ξ < 0, and ν ∈ R for ξ = 0. A distribution P(x) is said to belong to the maximum domain of attraction (MDA) of an extreme value distribution P(µ) if there exist n-sequences of constants c_n > 0 and d_n ∈ R such that the distribution of µ_n = (m_n − d_n)/c_n converges to P(µ). The crucially important fact, originally proven in [76], is that the regularly varying distributions are exactly all the distributions comprising the MDA of the Fréchet distribution, see also [73, Proposition 1.11] and [74, Theorem 1.4.20], so that any regularly varying distribution with PDF and CCDF tail exponents γ and α belongs to the MDA of the Fréchet distribution with index ξ = 1/α = 1/(γ − 1). The sequences d_n and c_n in this regularly varying/Fréchet case can be chosen as d_n = 0 and c_n = F̄^{−1}(1/n), and the limit Fréchet CDF is F(µ) = e^{−µ^{−1/ξ}} with support µ ≥ 0. If the distribution P(x) is Pareto with scale x*, for example, then c_n = x* n^ξ = x* n^{1/(γ−1)}, related to the known observation that the expected maximum degree in power-law networks with exponent γ is proportional to n^{1/(γ−1)} [65]. We note that the expressions above specify not only the expected values but also the full limit distributions of such maxima.
We also note that the mean of the limit Fréchet distribution F(µ) = e^{−µ^{−1/ξ}} is µ̄ = Γ(1 − ξ) if ξ < 1 (γ > 2), and that this mean is infinite if ξ ≥ 1 (γ ≤ 2), so that if γ > 2 and n is large, one can approximate the expected value of m_n in the Pareto case by m̄_n ≈ x* Γ(1 − ξ) n^{1/(γ−1)}. To complete the picture of the classification of distributions based on their MDAs, the MDA of the Weibull distribution consists of all distributions with an upper-bounded support, X_F < ∞, and CCDFs satisfying F̄(X_F − 1/t) = ℓ(t) t^{1/ξ} for t → ∞, some slowly varying function ℓ(t), and ξ < 0, which is the same ξ as in (B2) [74]. The MDA of the Gumbel distribution, ξ = 0, contains both heavy-tailed distributions, e.g., the lognormal, and light-tailed ones, e.g., the normal. The key point, however, is that if a distribution is regularly varying with tail exponent γ, then it is in the MDA of the Fréchet distribution with index ξ = 1/(γ − 1), which is what all the following estimators actually estimate.
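The approximation for the expected maximum can be checked with a short simulation; all parameter values below are illustration choices, with x* = 1.

```python
import math
import random

rng = random.Random(11)
alpha = 4.0               # gamma = 5, xi = 1/alpha = 0.25 (finite variance)
xi = 1.0 / alpha
n, trials = 5_000, 400

# empirical mean of m_n over repeated Pareto(1, alpha) samples of length n
maxima = [max(rng.random() ** (-1.0 / alpha) for _ in range(n))
          for _ in range(trials)]
mean_max = sum(maxima) / trials

# predicted expected maximum: x* * Gamma(1 - xi) * n**xi
predicted = math.gamma(1.0 - xi) * n ** xi
```

The empirical mean of the maxima lands within a few percent of x* Γ(1 − ξ) n^ξ, illustrating the n^{1/(γ−1)} scaling of the expected maximum degree.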

Hill's estimator
Hill's estimator [43] was introduced to analyze the tail behavior of a distribution without any assumptions about its shape, other than that it belongs to the Fréchet MDA. Given an i.i.d. sample x_i, i = 1, ..., n, and its order statistics x_(1) ≥ x_(2) ≥ ... ≥ x_(n), the estimator is defined by ξ̂^Hill_{κ,n} = (1/κ) Σ_{i=1}^{κ} ln[x_(i)/x_(κ+1)]. Theorems 4.1 and 4.2 in [24] prove that if κ/n → 0 and κ → ∞ as n → ∞, then this estimator is statistically consistent, i.e., satisfies (8), for any distribution in the MDA of the Fréchet distribution. In other words, the estimator is statistically consistent for any regularly varying distribution with any tail exponent γ > 1, or equivalently, index ξ > 0.
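A minimal sketch of Hill's estimator, tested on pure Pareto samples, for which the estimator has no bias; the sample size and choice of κ are illustration choices, not the double-bootstrap-optimal ones.

```python
import math
import random

def hill_estimator(xs, kappa):
    """Hill's estimate of the tail index xi from the kappa largest values:
    the mean of ln(x_(i) / x_(kappa+1)) over the top kappa order statistics."""
    order = sorted(xs, reverse=True)       # x_(1) >= x_(2) >= ...
    x_next = order[kappa]                  # x_(kappa+1) in 0-based indexing
    return sum(math.log(order[i] / x_next) for i in range(kappa)) / kappa

rng = random.Random(8)
alpha = 2.0                                # true xi = 1/alpha = 0.5, gamma = 3
xs = [rng.random() ** (-1.0 / alpha) for _ in range(10_000)]
xi_hat = hill_estimator(xs, kappa=1000)
```

On this pure Pareto sample the estimate lands near the true ξ = 0.5, with fluctuations of order ξ/√κ.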

Moments estimator
The Moments estimator [44] is a modification of Hill's estimator that is statistically consistent not only for distributions from the MDA of the Fréchet distribution, but also for distributions from the MDAs of the Gumbel and Weibull distributions, i.e., for any ξ ∈ R. To define it, denote M^(j)_{κ,n} = (1/κ) Σ_{i=1}^{κ} [ln x_(i) − ln x_(κ+1)]^j for j = 1, 2. With this notation, the Moments estimator is ξ̂^Mom_{κ,n} = M^(1)_{κ,n} + 1 − (1/2)[1 − (M^(1)_{κ,n})²/M^(2)_{κ,n}]^{−1}. It converges almost surely if κ/n → 0 and κ → ∞ as n → ∞, and there exists a constant δ > 0 such that (log n)^δ/κ → 0.
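A sketch of the Moments (Dekkers-Einmahl-de Haan) estimator as defined above. For a pure Pareto sample, M^(1) ≈ ξ and M^(2) ≈ 2ξ², so the correction term equals 1 and the estimate reduces to Hill's; again, the sample size and κ are illustration choices.

```python
import math
import random

def moments_estimator(xs, kappa):
    """Moments estimator of xi, consistent for any xi in R."""
    order = sorted(xs, reverse=True)
    logs = [math.log(order[i]) - math.log(order[kappa]) for i in range(kappa)]
    m1 = sum(logs) / kappa                       # first log-moment
    m2 = sum(v * v for v in logs) / kappa        # second log-moment
    return m1 + 1.0 - 0.5 / (1.0 - m1 * m1 / m2)

rng = random.Random(8)
alpha = 2.0                                      # true xi = 0.5
xs = [rng.random() ** (-1.0 / alpha) for _ in range(10_000)]
xi_hat = moments_estimator(xs, kappa=1000)
```

The estimate again lands near ξ = 0.5, though with somewhat larger fluctuations than Hill's, the price paid for validity over all of ξ ∈ R.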

Kernel estimator
Similar to the Moments estimator, the Kernel estimator [45] is consistently applicable to distributions with any ξ ∈ R. As its name suggests, the Kernel estimator uses a kernel, which is a function φ : [0, 1] → [0, ∞) that can be chosen by the user, and that must satisfy a set of conditions for the estimator to be consistent [45]. The estimator also employs a parameter λ > 1/2 to get rid of possible singularities. Finally, instead of using an integer-valued κ to determine the range of the order statistics to consider for ξ-estimation, the estimator relies on a continuous bandwidth parameter h > 0 for that purpose. Thanks to this modification, as a function of h, the estimator tends to be smoother than the other estimators.
Given the chosen kernel φ, denote φ_h(u) := φ(u/h)/h. With this notation, the Kernel estimator is defined via φ_h-weighted sums of the log-spacings of the order statistics; we refer to [45] for its lengthy explicit form. The consistency of this estimator for n → ∞, h → 0, and hn → ∞ is proven in [45].
For the experiments in this paper, which are also the default settings in [41], we prepare a list of bandwidths h_1, …, h_s that are logarithmically spaced in the interval [1/n, 1], where n is the sequence length and s = [0.3n]. The logarithmic spacing is chosen to scan the tail of the degree sequence more densely. The estimator is then evaluated at each h-value h_i, i = 1, …, s. For λ, we use the setting in [45], where λ = 0.6. In our software package [41] the values of λ and s can be changed to any other values λ > 1/2 and s > 0. Package [41] also implements the bi- and tri-weight kernels φ^(1) and φ^(2) from [45], where φ^(1) is used for the tail estimation, and the combination of φ^(1) and φ^(2) is used to find the optimal h*, as described in Section C 1. Once such an h* is found, the value of κ* is set to nh* in [41].
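The logarithmically spaced bandwidth grid described above can be generated as in the following sketch (our reading of the defaults; the actual code in [41] may differ):

```python
import numpy as np

def bandwidth_grid(n, frac=0.3):
    """s = [frac * n] bandwidths, logarithmically spaced in [1/n, 1],
    so that the tail of the degree sequence is scanned more densely."""
    s = int(frac * n)
    return np.logspace(-np.log10(n), 0.0, s)
```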

Smooth Hill estimator
Although Hill's estimator is consistent, as a function of the number of order statistics κ it can be highly irregular for finite-size data samples. The plots of such functions are even known as Hill Horror Plots, Section 4.4.2 in [24]. These horrors make it essentially impossible to examine these plots in search of the stable regime of ξ^Hill_{κ,n}, i.e., the region of κs where ξ^Hill_{κ,n} is approximately constant. The value that the estimator yields in this constant regime is then one's best estimate of ξ, but if these plots are highly irregular, then this estimation procedure is unavoidably subjective. Even though no results presented in this paper rely on such subjective manipulations (instead, we rely on the statistically consistent double bootstrap method to find κ*, Section C 1), in practice one may wish to investigate such plots to get deeper insight into the data at hand. To this end, one usually uses either the smoothed version of Hill's estimator or the Pickands estimator, both of which are included in [41].
The smooth Hill estimator [77] is defined, for any integer parameter r ≥ 2, by

ξ^smooHill_{κ,n} = (1/((r − 1)κ)) Σ_{j=κ+1}^{rκ} ξ^Hill_{j,n},

which is just an average of Hill's estimators over the range [κ + 1, rκ]. This estimator is also statistically consistent, for any r, as proven in [77]. The practical advantage of this smooth estimator compared to the original Hill's estimator is that by averaging the latter, the former suppresses its erratic behavior, making it easier to identify its stable region.
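The averaging is straightforward to sketch (names ours; requires rκ < n):

```python
import numpy as np

def smooth_hill_estimator(x, kappa, r=2):
    """Average of Hill's estimates over the range [kappa + 1, r * kappa]."""
    x = np.sort(np.asarray(x, dtype=float))[::-1]
    hill = lambda j: np.mean(np.log(x[:j]) - np.log(x[j]))
    return float(np.mean([hill(j) for j in range(kappa + 1, r * kappa + 1)]))
```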

Pickands estimator
The Pickands estimator [56] is defined by

ξ^Pickands_{κ,n} = (1/log 2) log[(x_(κ) − x_(2κ)) / (x_(2κ) − x_(4κ))].

The consistency of the estimator is also proven in [56]. Similarly to the Moments and Kernel estimators, it is consistently applicable to distributions in the MDAs of extreme value distributions with any ξ ∈ R. In practice this estimator provides a simple way to check whether the assumption that the data comes from a regularly varying distribution makes sense. Specifically, if the function ξ^Pickands_{κ,n} of κ is negative for all κ, then this assumption can hardly be true.
In contrast to the other estimators, the Pickands one has an issue dealing with integer-valued data containing ties. For instance, if there are many data points with the same value (many nodes with the same degree), it can happen that for some κ, x (2κ) = x (4κ) , in which case ξ Pickands κ,n is undefined. This drawback can however be remedied, in a provably consistent manner, by adding uniform noise to the integer-valued data, as explained in Section C 2.
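A sketch of the Pickands estimator, with the optional de-tieing noise for integer-valued data (interface ours):

```python
import numpy as np

def pickands_estimator(x, kappa, rng=None):
    """Pickands estimate of xi from the order statistics x_(kappa), x_(2kappa),
    x_(4kappa); requires 4 * kappa <= n. If rng is given, uniform noise on
    [-1/2, 1/2] is added first to break ties in integer-valued data."""
    x = np.asarray(x, dtype=float)
    if rng is not None:
        x = x + rng.uniform(-0.5, 0.5, size=x.shape)
    x = np.sort(x)[::-1]  # descending order statistics
    a, b, c = x[kappa - 1], x[2 * kappa - 1], x[4 * kappa - 1]
    return float(np.log((a - b) / (b - c)) / np.log(2.0))
```

Without the noise, many equal degrees can make x_(2κ) = x_(4κ) and the estimate undefined; with it, the order statistics are distinct almost surely.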
It is known that in practice the Pickands estimator is quite volatile as a function of the number of order statistics κ, and that it has large asymptotic variance [57] and poor efficiency [45]. Attempts to cure this poor behavior resulted in a number of different versions of generalized Pickands estimators [57,78-82], all of them using the idea of linear combinations of log-spacings of the order statistics x_(1), …, x_(n). Yet, to the best of our knowledge, the consistency of the double bootstrap method has been proven [83] for only one version, the one defined in [78].
Appendix C: Estimating the tail exponent of an empirical degree sequence

Here we discuss the technical details concerning the application of the consistent estimators discussed in the previous section to empirical degree sequences coming from either synthetic or real-world networks.

Finding the optimal number of order statistics
All the estimators in Section B depend on the number of order statistics κ. That is, all the estimators operate only on the κ largest-value data samples (degrees). The consistency of all the estimators is proven only in the limit of both κ and n, the number of samples (nodes), going to infinity. Therefore, when applied to a finite empirical degree sequence, these estimators have the value of κ as a free parameter. The main focus of this section is the double bootstrap method that algorithmically identifies an optimal κ-value κ* in a statistically consistent manner, meaning that the value ξ_{κ*,n} estimated by these estimators with κ = κ* provably converges to the true value of ξ as n → ∞.
The identification of an optimal value κ * has been an active research topic in extreme value statistics for several decades [83,84]. The existing methods for choosing κ * can be roughly split in two classes: (1) heuristic approaches that propose to study tail index estimates plotted as functions of κ, and (2) theoretical approaches based on the minimization of the asymptotic mean-squared-error (AMSE) of the estimator.
The heuristic methods mainly consider various ways of identifying regions of κ where estimators show stable behavior, i.e., where the estimator plot is relatively flat as a function of κ. Examples of such approaches are the automated eyeball method [77], or picking a fixed small percentage (typically 5% or 10%) of the largest-value data samples. Such methods, involving (semi-)subjective ad hoc choices, may not be robust.
The main idea behind the theoretical methods is as follows. Suppose x_1, …, x_n is an i.i.d. sequence sampled from a distribution that belongs to the domain of attraction of the generalized extreme value distribution (B2) with a given ξ. Denote by ξ_{κ,n} the estimated value of ξ returned by a given estimator applied to the κ largest elements in this sequence. Observe that since the sequence is random, ξ_{κ,n} is a random variable. Define the asymptotic mean squared error between the true and estimated ξs as

AMSE(κ, n) = E[(ξ_{κ,n} − ξ)^2].

The main goal is to find the optimal κ-value κ* that minimizes this error:

κ* = argmin_κ AMSE(κ, n).

To estimate κ* in this paper we use the AMSE-based double bootstrap method developed in [45,83-85] because of its proven consistency, stability, and applicability to the considered estimators. The method finds a consistent optimal value κ* for a given consistent estimator by employing not only this estimator, but also another consistent estimator. The two estimators are applied to two collections of bootstrap samples from the original data, estimating ξ at all possible values of κ in these collections, and the value κ* is then determined as the value of κ at which the two estimators agree most in their estimation of ξ, according to the empirical AMSE evaluated on the bootstrap collections.
Specifically, the double bootstrap method operates using the following steps, with two parameters: r, the number of bootstrap samples, and t ∈ (0, 1), defining the first and second bootstrap sample sizes n_1 = [√t n] and n_2 = [tn], where n is the original sequence length. In all our experiments in this paper, and in the software package [41], these parameters are set to r = 500 and t = 1/2 by default, so that the size of the second bootstrap sample is n_2 = n/2.
1. Draw r bootstrap samples of size n_1 from the original data by sampling with replacement.
2. Apply the two estimators to each bootstrap sample, estimating ξ at all possible values of κ.
3. Find κ*_1 that minimizes the empirical AMSE between the two estimates with respect to the r bootstrap samples.
4. Repeat the same procedure for the smaller bootstrap sample size n_2 = [tn], and find κ*_2 in the same manner.
5. The optimal value of κ for the original data is then given by κ* = A(κ*_1, n_1, n) (κ*_1)^2 / κ*_2, where A(κ*_1, n_1, n) is a pre-factor that depends on κ*_1, n_1, n, and whose exact form depends on the two estimators used.
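To make the first part of the procedure concrete, here is a heavily simplified sketch for the Hill double bootstrap, where the mismatch between the 1st (Hill) and 2nd moment estimators is measured by q(κ) = M^(2) − 2(M^(1))², a quantity that vanishes in expectation for pure Pareto tails. The function names and this particular form of the empirical AMSE are our illustration, not the exact statistic of [84, 85]:

```python
import numpy as np

def log_moments(x_desc, kappa):
    """First and second log-moments M(1), M(2) of the top-kappa order statistics."""
    logs = np.log(x_desc[:kappa]) - np.log(x_desc[kappa])
    return np.mean(logs), np.mean(logs ** 2)

def bootstrap_kappa_star(x, n1, r, rng):
    """The kappa minimizing the bootstrap-averaged squared mismatch
    q(kappa) = M(2) - 2 * M(1)**2 over r bootstrap samples of size n1."""
    kappas = np.arange(2, n1 - 1)
    amse = np.zeros(len(kappas))
    for _ in range(r):
        xb = np.sort(rng.choice(x, size=n1, replace=True))[::-1]
        for i, k in enumerate(kappas):
            m1, m2 = log_moments(xb, k)
            amse[i] += (m2 - 2.0 * m1 ** 2) ** 2
    return int(kappas[np.argmin(amse / r)])
```

Repeating the minimization at the smaller bootstrap size and combining the two minimizers through the pre-factor completes the procedure.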
Following the derivations in [45,[83][84][85], we use the following combinations of consistent estimators for the double bootstrap procedure applied to the Hill, Kernel, and Moments estimators: (1) the 1st (Hill) and 2nd moment estimators for the Hill double bootstrap; (2) the 2nd and 3rd moment estimators for the Moments double bootstrap; (3) the biweight and triweight kernel estimators for the Kernel double bootstrap. We note that in principle any combination of consistent estimators can be used in the double bootstrap method, but proofs of the consistency of such combinations must be carried out for each combination, so that we use the combinations that are already proven to be consistent and optimal.
We also note that these proofs are based on an additional assumption that the regularly varying distribution of the samples satisfies the second-order condition [86], [87, Definition 2.3.1]. To define it for a given distribution with CDF F(x), let U(x) = F^{−1}(1 − 1/x) be the inverse of the function 1/(1 − F(x)). If the distribution is in an MDA of some extreme value distribution, it is known [87, Theorem 1.1.6] that there exists a positive function a(x) such that, for any t > 0,

lim_{x→∞} (U(tx) − U(x))/a(x) = b_ξ(t) := (t^ξ − 1)/ξ.

The second order condition concerns the scaling of (U(tx) − U(x))/a(x) − b_ξ(t) as x → ∞. The distribution is said to satisfy the second order condition if there exist a function A(x) with lim_{x→∞} A(x) = 0 and a non-degenerate function H(t), not equal to c b_ξ(t) for any constant c ≠ 0, such that for any t > 0,

lim_{x→∞} [(U(tx) − U(x))/a(x) − b_ξ(t)] / A(x) = H(t).   (C4)

A simple example of a regularly varying CDF that satisfies the second order condition is a combination of two power laws with parameters d > 0 and δ > α > 0, supported on x ≥ x*, where x* is the root of F(x). The simplest example of a distribution that does not satisfy the second order condition is the Pareto distribution. To see this, note that in the case of Pareto, U(x) = x* x^{1/α}, so that a(x) = α^{−1} x* x^{1/α}, and

(U(tx) − U(x))/a(x) = α(t^{1/α} − 1) = b_ξ(t),  ξ = 1/α.

Hence the left-hand side in (C4) is always zero, meaning that no non-degenerate function H(t) exists. Applied to sequences sampled from distributions that do not satisfy the second order condition, all the considered estimators with the double bootstrap method are still consistent, because they are consistent for any sequence of κ → ∞ with κ/n → 0, but the values of κ* that the double bootstrap algorithm produces might not be optimal on such sequences. However, in our experiments we find that even in these cases the double bootstrap procedure performs well, and the values of ξ_{κ*,n} quickly converge to the true ξs as n → ∞ in most such cases, Fig. 7.
Further technical details on the double bootstrap procedure for the Hill estimator can be found in [84,85], for the Moments estimator in [83], and for the Kernel estimator in [45], where the consistency of double bootstrapping applied to these estimators is also proven.

Working with integer data
A common issue with all the known consistent estimators is their instability, i.e., the erratic behavior of the ξ_{κ,n}s as functions of the sampled sequences and the number of order statistics κ, on integer-valued sequences [55,88], which is the case with degree sequences. For instance, merely rounding samples in sequences sampled from a continuous regularly varying distribution makes the estimators unstable [55], even though such rounded sequences are still regularly varying with the same exponent, Section A 3. In other words, the estimators remain consistent on integer-valued regularly varying distributions, but they tend to be unstable and exhibit slow convergence in such cases.
To resolve this issue, we simply add uniform symmetric noise to the integer-valued sequences x_i, i = 1, …, n, that is, to all sequences considered in this paper. Specifically, instead of applying the estimators to the x_i, we apply them to y_i = x_i + u_i, where the u_i are i.i.d. samples from the uniform distribution on [−1/2, 1/2]. This does not affect the tail exponent: if x is a regularly varying random variable with tail index ξ > 0, and u is a uniform random variable on [−(1/2)·10^{−p}, (1/2)·10^{−p}], where p ≥ 0, then x + u is also regularly varying with the same exponent [88, Theorem 5.3.1]. Adding such noise greatly improves the stability and convergence of the estimators; see Fig. 4 and compare it with Fig. 7(a,f,k).

Example of the estimator operation using the double bootstrap method
To emphasize the importance of using as many consistent estimators as possible in application to degree sequences in real-world networks, here we consider an example of how the estimators work in conjunction with the double bootstrap method, showing that different estimators may explore different parts of the empirical degree distribution for any finite sequence, thus explaining why they may return different estimations on such sequences, especially if the slowly varying function ℓ(k) is not trivial.

[Fig. 4 caption: The integers are fed to the estimators as is, without adding the uniform symmetric noise. The RRMSE is larger than with noise, cf. Fig. 7(a,f,k).]

Figure 5 shows that the Hill estimator yields a higher estimation of α = 1/ξ than the other two estimators applied to the in-degree sequence of the Libimseti network. This happens because the value of the optimal number of order statistics κ* returned by Hill's double bootstrap is substantially lower than for the other two estimators, so that the Hill estimator considers a smaller part of the distribution tail. The value of Hill's κ* is smaller because it is based on finding the minimum of the AMSE as a function of the number of order statistics κ, and as we can see in the figure, these minima occur at quite different values of κ for the Hill versus the two other estimators.
This effect is actually expected in small-sized sequences sampled from regularly varying distributions with non-trivial slowly varying functions ℓ(k). Figure 6 shows the details behind estimator convergence in two different cases, with a "nice" and a "not so nice" slowly varying function ℓ(k). The figure illustrates the point that the farther ℓ(k) is from a constant, the larger the network size must be for all the estimators to yield similar values of κ* and ξ. The estimators are guaranteed to converge to the true value of ξ for any ℓ(k), but only in the n → ∞ limit, and to the best of our knowledge, there are no results on (or bounds for) the speed of this convergence, partly because this speed may depend in an unknown way on some properties of ℓ(k). That is why using as many consistent estimators as possible in application to real-world data is the best strategy one can follow.

Appendix D: Validation on synthetic sequences and network models
Here we show that the estimators based on extreme value (EV) theory (the Hill, Moments, and Kernel estimators equipped with the double bootstrap procedure, implemented in [41]) yield the expected results when applied to synthetic degree sequences and to network models. We also compare the estimations that these estimators produce with those of the PLFit [18], which is based on a combination of techniques inspired by maximum-likelihood estimation (MLE) and Kolmogorov-Smirnov (KS) distance minimization. We use the plfit.m code version 1.0.11 by Aaron Clauset available at [89].

Synthetic sequences
Here we sample different numbers n of positive integers k ∈ N + from the distributions listed below. The set of chosen distributions is intended to be diverse and representative of distributions claimed to be observed in real-world networks. In cases where the distribution has support on non-negative integers k ∈ N, we discard all the zero entries from the sequence since they would correspond to nodes of degree k = 0 in networks. The parameter γ in all the distributions below can be any real number greater than 1.
Zeta distribution: The distribution PDF (or PMF, to be precise) is

P(k) = k^{−γ}/ζ(γ),  k = 1, 2, …,

where ζ(γ) is the Riemann zeta function. This is the "clean" integer-valued power-law distribution with constant slowly varying function ℓ(k) = 1/ζ(γ).

Pareto-mixed Poisson distribution: For each sample, we first sample a real number x from the Pareto distribution with PDF α x_0^α x^{−γ} for x ≥ x_0, and then sample an integer k from the Poisson distribution with mean x:

P(k) = ∫_{x_0}^∞ (x^k e^{−x}/k!) α x_0^α x^{−γ} dx,

where α = γ − 1, and we set x_0 = 1 in the experiments. The Pareto-mixed Poisson distribution is ubiquitous in network models with hidden variables [70], also known as inhomogeneous random graphs [71], and more generally, as graphon-based W-random graphs [72]. This is one of the simplest regularly varying distributions with a non-constant ℓ(k) that converges to a constant, ℓ(k) → α x_0^α, Section A 3.
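The Pareto-mixed Poisson sampling procedure translates directly into code (a sketch; names ours):

```python
import numpy as np

def sample_pareto_mixed_poisson(n, gamma, x0=1.0, rng=None):
    """Sample n degrees: x ~ Pareto(alpha = gamma - 1, x0), then k ~ Poisson(x).
    Zero entries are discarded, as they correspond to nodes of degree 0."""
    rng = rng or np.random.default_rng()
    alpha = gamma - 1.0
    x = x0 * rng.uniform(size=n) ** (-1.0 / alpha)  # Pareto via inverse CDF
    k = rng.poisson(x)
    return k[k > 0]
```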
Pareto distribution with natural exponential cutoff. We sample a random number x from the Pareto distribution with the exponential cutoff at n^ξ, whose normalization involves the exponential integral function E_γ, with ξ = 1/α and α = γ − 1. We then round x to the closest integer k = [x]. We set x_0 = 1. The value n^ξ, where the exponential decay becomes prominent, corresponds to the natural cutoff [65], which is proportional to the exact expected maximum value (B6) among n i.i.d. samples from the Pareto distribution with exponent γ. This is an example not of a fixed distribution, but of an n-dependent family of distributions. For any fixed n, the distribution is not regularly varying since it has an exponential versus power-law tail. Yet as n increases, the location n^ξ of the "soft beginning" of the exponential tail diverges, so that in the n → ∞ limit the distributions in this family converge to the pure Pareto distribution with exponent γ, which is regularly varying.

[Fig. 6 caption: An example of convergence of the estimators on the i.i.d. sequences of two different sizes, "small" n_s = 1,000 and "large" n_l = 5,000, sampled from two regularly varying distributions with the same tail exponent γ = 3, but with different slowly varying functions ℓ(k): the Pareto-mixed Poisson distribution and the double power law, see Section D 1 for details. The parameters for the Pareto-mixed Poisson are γ = 3 and x_0 = 1, while for the double power law they are γ = 3, γ_0 = 1.5, c = 500, r = 0.1, and x_0 = 1. The first, second, and third rows follow the same style and notations as panels (a,b,c) in Fig. 5, respectively. One can see that while all the estimators yield similar estimates for the sequences sampled from the Pareto-mixed Poisson distribution for both small and large sequence sizes, the estimates of different estimators for the small sequence size n_s are far apart in the case of the double power law distribution with "uglier" ℓ(k). In the latter case, the estimators start to agree on their estimates only for the large sequence size n_l. One can also see that the main reason for this effect is that the AMSE minima occur at far-apart locations if the sample size is small and ℓ(k) is not so "nice."]

Pareto distribution with a constant exponential cutoff. The sampling is the same as in the previous example, except that the location of the exponential cutoff does not depend on n, and is fixed to be 100 instead of n^ξ. This is an example of a distribution which is not regularly varying.
Double power law. We sample a random number x from the double power-law distribution whose PDF interpolates between two power laws, with exponent γ_0 for x ≪ c and exponent γ for x ≫ c. Here c is the location of the switch between the two power laws, r is the parameter that controls how smooth this switch is, and the normalizing constant β is expressed in terms of the Gauss hypergeometric function 2F1, with α = γ − 1. As in all other examples, given this random x, we round it to the integer k = [x]. In our experiments, we set γ_0 = 1.5, c = 500, r = 0.1, and x_0 = 1. This distribution is regularly varying with exponent γ, which we vary in the experiments. Yet, as discussed in Section V, it may be difficult for the estimators to see that it is indeed γ and not γ_0 if n is small. Distributions of this form characterize the degree distribution in the causal set of the universe [62], and they also frequently appear in astrophysics [90].
To assess the accuracy of the estimators, we sample s = 100 random sequences for each combination of the distributions listed above, the values of γ, and the numbers of samples n in a sequence. On each sampled sequence j, j = 1, …, s, each estimator returns an estimated value ξ_j of ξ. Given the collection of these ξ_j s, we compute the relative root-mean-squared error (RRMSE), a standard metric used to assess the accuracy of tail index estimation [91,92], defined as

RRMSE = (1/ξ) √[(1/s) Σ_{j=1}^{s} (ξ_j − ξ)^2],

and show the results in Fig. 7 both for the extreme value (EV) estimators (Hill, Kernel, Moments) and for the MLE-based PLFit [18]. We observe that all the results are as expected. On sequences sampled from distributions that are regularly varying, all the EV estimators converge. They also converge in the case where the distributions are not regularly varying for any finite sample size n, but converge to a regularly varying distribution as n → ∞: the Pareto distribution with the diverging natural cutoff. No estimator converges in the case of a fixed distribution which is not regularly varying: the Pareto distribution with a fixed exponential cutoff.
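One common form of this metric (the exact normalization used in (D7) may differ slightly) is sketched below:

```python
import numpy as np

def rrmse(xi_hats, xi_true):
    """Relative root-mean-squared error of a collection of estimates."""
    xi_hats = np.asarray(xi_hats, dtype=float)
    return float(np.sqrt(np.mean((xi_hats - xi_true) ** 2)) / xi_true)
```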
Also as expected, the PLFit yields a lower estimation error in the case of the zeta distribution. This is because the zeta distribution satisfies PLFit's main assumption of a clean power law with constant ℓ(k), but does not satisfy the second order condition, thus affecting the optimality of the double bootstrap, Section C 1. In other cases with reasonably "nice" regularly varying distributions, with ℓ(k) quickly converging to a constant, the accuracy and convergence rates of the EV and PLFit estimators are comparable. However, as soon as a regularly varying distribution is not really nice, as in the double power law case with non-constant ℓ(k) over a wide range of degrees k, the PLFit estimations are completely off, unlike those of the EV estimators. This is also expected, for the reasons discussed in Section D 3.

Network models
The main motivation to test the performance of the EV estimators not only on synthetic sequences of numbers sampled from various distributions, but also on degree sequences in network models, is to see whether and how their performance is affected by possible non-i.i.d.-ness of the latter sequences. To this end we consider three paradigmatic network models in which the degree distributions have been proven to converge to a regularly varying distribution, and in which the degree sequences are not i.i.d.: 1) the erased configuration model (ECM) [93], 2) preferential attachment (PA) [94], and 3) hyperbolic random graphs (HRG) [95].
Erased configuration model. We sample varying-length i.i.d. sequences of random integers from the zeta distributions with different values of the exponent, and then either accept or reject the sequence based on whether the sum of its elements is even or odd. Each number in the sequence is the number of stubs attached to a node in a network to be formed. We match pairs of stubs uniformly at random, and then delete loops and multi-edges. For any finite sample size, the degree sequence in the resulting network is neither zeta-distributed nor i.i.d., but it converges to the original zeta distribution as the sample size tends to infinity [93,Theorem 2.1].
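The stub-matching and erasure steps of the ECM can be sketched as follows (the zeta sampling and parity check of the degree sequence are omitted; names ours):

```python
import numpy as np

def erased_configuration_model(degrees, rng):
    """Match stubs uniformly at random, then erase self-loops and multi-edges.
    Assumes sum(degrees) is even."""
    stubs = np.repeat(np.arange(len(degrees)), degrees)
    rng.shuffle(stubs)
    edges = set()
    for u, v in zip(stubs[0::2], stubs[1::2]):
        if u != v:                             # erase self-loops
            edges.add((min(u, v), max(u, v)))  # the set erases multi-edges
    return edges
```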
Preferential attachment. We use the redirection implementation in [96]: each new node picks an already existing node uniformly at random, and then connects either to it with probability 1 − r, or to its random neighbor with probability r. We use this redirection probability to control the exponent of the power-law tail of the degree distribution, because this distribution converges to a regularly varying distribution with exponent γ = 1 + 1/r [96].

Hyperbolic random graphs. The degree distribution in random geometric graphs in hyperbolic spaces converges to regularly varying Pareto-mixed Poisson distributions (A4), and, as opposed to the previous two models, these graphs also have non-vanishing clustering [95]. We use the software package developed in [97] and available at [98] to generate these graphs. We fix the average degree parameter to k̄ = 10 and the temperature parameter to T = 0, corresponding to strongest clustering, and vary the γ parameter.

[Fig. 8 caption: RRMSE of the estimators and the PLFit [18] applied to the network models described in Section D 2. The first, second, and third columns show the results for the erased configuration model, preferential attachment, and hyperbolic random graphs, respectively. The second, third, and fourth rows show the results for γ = 2.1, γ = 2.5, and γ = 3.0, respectively. The network size varies in the range n ∈ [10^3, 10^5] everywhere.]
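The redirection growth rule for PA admits a compact sketch (ours, not the implementation of [96]):

```python
import numpy as np

def preferential_attachment_redirection(n, r, rng):
    """Each new node t picks an existing node u uniformly at random and attaches
    to it with probability 1 - r, or to a neighbor of u (here, for simplicity,
    u's own attachment target) with probability r, yielding a power-law tail
    with exponent gamma = 1 + 1/r."""
    target = np.zeros(n, dtype=int)  # target[t]: the node that t attached to
    degree = np.zeros(n, dtype=int)
    for t in range(1, n):
        u = int(rng.integers(t))     # uniform over the existing nodes 0..t-1
        if rng.random() < r:
            u = int(target[u])       # redirect
        target[t] = u
        degree[u] += 1
        degree[t] += 1
    return degree
```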
For each model, we vary the γ over the three values γ = 2.1, 2.5, and 3.0, and vary the network size n from 10 3 to 10 5 . For each combination of the model, γ, and n, we generate 100 random networks, read off their degree sequences, and feed them to all the considered estimators. We then compute the RRMSE (D7), and show the results in Fig. 8.
We observe that in all the considered cases, all the EV estimators converge, even though the degree sequences are not i.i.d. The slow convergence in some cases is explained by the slow convergence of the degree distributions in these finite-sized networks to their limits. This is the case, for example, in the HRGs with γ = 2.1: the degree distribution in HRGs converges to its Pareto-mixed Poisson limit more slowly the closer γ is to 2 [95].
The most notable results are for PA. Here the EV estimators clearly outperform the PLFit if γ = 2.1 or γ = 3, while all the estimators are on par if γ = 2.5 for the reasons that we discuss in the next section.

Anatomy of the PLFit
To understand the poor convergence of the PLFit in the double power law and preferential attachment cases in the previous two sections, it is instructive to first recall how exactly the PLFit works. The starting point of the PLFit operation is a sequence of possible γ-values γ_s to experiment with. By default, this sequence is linearly spaced in the region [1.5, 3.5] with step 0.01 in the code [89] released with [18]. These default settings are mentioned in the code header, but it is worth noting that they are never mentioned in [18], and that the estimated values of γ on some real-world networks reported in [18] are larger than 3.5, meaning that on these (and possibly other) networks the code was not executed with its default settings. The default values of γ_s in the code released with [19] are linearly spaced in [1.01, 6.50] with step 0.01. They are mentioned neither in [19] nor in the code header.
Given the sequence γ_s and a degree sequence of length n supplied as input data, the PLFit algorithm first finds the sequence of unique degree values k_t appearing in the degree sequence. For each value k_t, the algorithm computes the vector of log-likelihood values

L_ts = −n_t log ζ(γ_s, k_t) − γ_s Σ_{i=1}^{n} 1{k_i ≥ k_t} log k_i,

where n_t is the number of nodes with degrees k_i ≥ k_t, 1 is the indicator function, and

ζ(γ_s, k_t) = Σ_{j=0}^{∞} (j + k_t)^{−γ_s}

is the Hurwitz zeta function. This likelihood is based on the assumption that degrees greater than or equal to k_t are i.i.d. samples from a pure power law with exponent γ_s, i.e., from the generalized zeta distribution (2) with parameters γ_s, k_min = k_t, and the normalization constant c = 1/ζ(γ_s, k_t). Among all the considered values γ_s, the algorithm then identifies the one, γ*_t, that corresponds to the maximum value of L_ts for the given k_t. For the same k_t, the algorithm then computes the Kolmogorov-Smirnov (KS) distance D^KS_t between the generalized zeta distribution with parameters γ*_t and k_min = k_t, and the empirical CDF of degrees k_i ≥ k_t. This procedure is then repeated for each k_t observed in the sequence, and the estimates γ̂ and k̂_min that the algorithm eventually returns are those that correspond to the minimum D^KS_t, i.e., k̂_min = k_{t*} and γ̂ = γ*_{t*}.

The algorithm is thus a convoluted mixture of two optimization strategies: one is based on likelihood maximization, while the other deals with KS distance minimization. Its consistency has not been proven even for pure power laws, i.e., for the Pareto or generalized zeta distributions, but it is likely to be consistent in these cases. If the distribution is not a pure power law but a general regularly varying distribution, the consistency of the algorithm is a question that has not been rigorously explored at all. The algorithm may be consistent, or it may not be.
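The loop just described can be condensed into the following sketch. This is our simplified illustration, not the plfit.m code [89]: the Hurwitz zeta is truncated with an integral tail correction, and a minimum tail size of 50 points is imposed (a guard absent from the original algorithm):

```python
import numpy as np

def hurwitz_zeta(g, kmin, terms=10_000):
    """Truncated Hurwitz zeta sum with an integral tail correction (approximate)."""
    j = np.arange(kmin, kmin + terms, dtype=float)
    return np.sum(j ** -g) + (kmin + terms) ** (1.0 - g) / (g - 1.0)

def plfit_sketch(degrees, gammas=None):
    """For each candidate k_min, pick gamma by discrete MLE over a grid;
    return the (k_min, gamma) pair minimizing the KS distance on the tail."""
    if gammas is None:
        gammas = np.arange(1.5, 3.51, 0.01)  # the default grid of [89]
    degrees = np.asarray(degrees)
    best = (np.inf, None, None)
    for kt in np.unique(degrees):
        tail = np.sort(degrees[degrees >= kt])
        nt = len(tail)
        if nt < 50:  # guard (ours): skip statistically tiny tails
            continue
        zetas = np.array([hurwitz_zeta(g, kt) for g in gammas])
        ll = -nt * np.log(zetas) - gammas * np.log(tail).sum()  # log-likelihoods
        g_star = gammas[np.argmax(ll)]
        z0 = hurwitz_zeta(g_star, kt)
        # KS distance between the model and empirical tail CCDFs
        ks = max(abs(np.mean(tail >= v) - hurwitz_zeta(g_star, v) / z0)
                 for v in np.unique(tail))
        if ks < best[0]:
            best = (ks, kt, g_star)
    return best[1], best[2]  # (k_min_hat, gamma_hat)
```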
The problem is that there is the following crucial logical inconsistency in the algorithm: it operates under the assumption that above a certain k_min, the distribution of degrees k is a pure power law, but then it recognizes that the distribution may not be a pure power law, and tries to account for that by finding a reasonable value of k_min such that above this value the assumption would hold "approximately." If the distribution were a pure power law, then the search for this k_min would not even be necessary, since the value of k_min would with high probability be equal to the smallest value observed in the sequence. But if the distribution is not a pure power law, then such a value of k_min simply does not exist, since for any k_min the distribution of k > k_min is not a pure power law. Yet the distribution of such ks may converge to a pure power law, but only if the k_min value that the algorithm finds diverges with n, and only if the slowly varying function ℓ(k) converges to a constant. Therefore, for this subclass of regularly varying distributions with ℓ(k)s converging to constants, the algorithm may be consistent, yet the proof is currently lacking. If ℓ(k) does not converge to a constant, the algorithm is unlikely to be consistent, for the reasons above.
In all the regularly varying distributions considered in the previous two sections, the function ℓ(k) does converge to a constant, and indeed in all these cases the PLFit appears to eventually converge. Yet in two of these cases, double power laws and preferential attachment, its convergence is much worse than that of any of the considered EV estimators. To see why, we analyze the two components of the PLFit, KS distance minimization and likelihood maximization, separately, in Figs. 9 and 10, respectively. Figure 9 illustrates that the KS distance minimization component of the PLFit drives the values of k_min that the PLFit attempts to estimate to erroneously low values. This happens because the smaller the k_min, the smaller the deviations of the empirical CDF at degrees k right above k_min from the theoretical CDF, because if the distribution is regularly varying, there are more nodes with smaller degrees. The larger the k_min, the larger these deviations caused by the "sparser statistics" in the distribution tail, and as a consequence the KS distance grows. If k_min is set to a small value, the deviations in the tail are suppressed, as they get "squished" in the high-degree region of the CDF close to 1, cf. panels (a) and (b) in the figure.
Panel (c) in Fig. 9 shows that the plfit.m code [89], both as originally released with [18] and in its current version, cannot be used to compute the MLE values of γ for large k_min's, because it contains errors in computing the Hurwitz zeta function with the required accuracy, leading to numerical errors. Therefore we use a SciPy [99] implementation instead in Fig. 10. Figure 10 shows that if k_min is small (and it is, thanks to the KS distance minimization part of the PLFit), then the MLE component of the PLFit does very little other than trying to fit the log-log slope of the PDF evaluated at this k_min. This fault occurs for reasons similar to the KS distance fault: since there is a lot of data with degrees right above small k_min's, and since the MLE is primarily concerned with fitting as much data as possible, it tries to fit the part of the distribution with degrees k right above k_min, versus the true tail of the distribution with large ks, thus getting wrong estimates of the tail exponent.
In other words, the wrong estimates of the PLFit are due to the combination of the following two factors, related, ironically, to the two key ideas behind the PLFit: (1) the small values of k_min returned by the KS distance minimization part of the algorithm, and (2) the MLE part of the PLFit, which estimates not the tail exponent but, roughly, the log-log slope of the PDF evaluated at this k_min. If this slope is different from the slope at large k's, i.e., the true tail exponent, then the PLFit does not really fit any power-law tail, quite contrary to what its name suggests. However, if the distribution is such that at least one of these conditions is not satisfied, then the PLFit estimates are more accurate, cf. the Pareto-mixed Poisson or the γ = 2.5 preferential attachment cases in Fig. 10.
If both conditions are satisfied, which is the case with the double power law and with preferential attachment with γ = 2.1 and γ = 3.0 in Fig. 10, then the PLFit estimates of the tail exponent are quite wrong. And if they are wrong, and one then performs KS hypothesis testing using these wrong exponent estimates, the hypothesis that the degree sequence comes from a pure power law with a wrong exponent will be rejected with high probability, explaining the strange finding in [19] that preferential attachment networks are "rarely scale-free." Finally, Fig. 10 also shows that the whole idea of using the MLE to estimate tail exponents of regularly varying distributions is quite problematic to begin with, explaining why it has not been seriously explored in statistics. Indeed, for such estimation to be correct, the values of k_min must be large and diverging in the n → ∞ limit for the reasons discussed above. However, the larger the k_min, the smaller the second term in (D9). On the other hand, as a function of γ_s, the first term in (D9) grows monotonically, at a much higher rate than the linear rate of growth of the second term. Therefore, if k_min is above a certain threshold, the MLE does nothing but select the largest possible value of γ_s to maximize the likelihood via the first term. This is exactly what we see in Fig. 10, where for many instances of large k_min, the MLE-selected values of γ are the largest values within the range in which we allow the MLE to operate. The correctness of the MLE thus depends on whether there exists a "sweet-spot" range of k_min values that are neither too large nor too small. Such a range may simply not exist for some regularly varying distributions.
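This boundary-saturation behavior can be sketched in a toy setting (an assumed, extreme caricature of "sparse tail statistics", not the paper's experiment): when every surviving degree equals k_min, the discrete power-law log-likelihood -n·γ·log(k_min) - n·log ζ(γ, k_min) = -n·log[k_min^γ ζ(γ, k_min)] is strictly increasing in γ, because k_min^γ ζ(γ, k_min) decreases monotonically to 1 as γ → ∞, so a grid-restricted MLE always returns the largest exponent offered.

```python
import numpy as np
from scipy.special import zeta

k_min, n = 50, 20            # n degenerate "observations", all equal to k_min
grid = np.arange(1.5, 6.01, 0.01)

# Log-likelihood of n copies of k == k_min under P(k) ~ k^(-gamma), k >= k_min
ll = -grid * n * np.log(k_min) - n * np.log(zeta(grid, k_min))
best = grid[np.argmax(ll)]
print(best)                  # saturates at the upper end of the grid
```

Any finite grid bound simply gets returned here, which mirrors the Fig. 10 observation that for many large-k_min instances the MLE picks the largest value of γ it is allowed to consider.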
Worse, even if such a range can be proven to always exist, which is unclear at present, we have seen above that the KS distance minimization procedure is quite unlikely to be a correct, statistically consistent procedure for identifying it. At least, the KS distance minimization has not been proven to be such a procedure, and whether the required procedure exists at all is also unclear. After all, for the reasons mentioned above, even the required sweet-spot range of k_min's is quite unlikely to exist for regularly varying distributions whose slowly varying functions do not converge to constants.

TABLE I. … [42]. Each network name is followed by its KONECT code in braces. The estimators return estimates ξ̂ of ξ that are translated to γ̂ = 1 + 1/ξ̂. If ξ̂ ≤ 0, then γ̂ is set to ∞. The estimates are colored according to the definitions in Section V: 1) not power-law networks (red): at least one estimate is nonpositive, ξ̂ ≤ 0; 2) hardly power-law networks (orange): all the estimates are positive, ξ̂ > 0, and at least one estimate is ξ̂ ≤ 1/4; 3) power-law networks with a divergent second moment (green): all the estimates are ξ̂ > 1/2; and 4) other power-law networks (blue): the rest of the cases.

TABLE II. … [42]. The style and notations are the same as in Table I. The estimators and coloring are applied to the in- and out-degree sequences separately. Since the operation of the estimators is undefined on zeros, the zero entries in the in- and out-degree sequences are removed, explaining the different lengths n_in and n_out of the in- and out-degree sequences for the same network, as well as the different values of the average in- and out-degrees k̄_in and k̄_out.

TABLE III. … [42]. The style and notations are the same as in Table II. The estimators and coloring are applied to the degree sequences of nodes of types 1 and 2 (domains 1 and 2) separately.
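The coloring rules stated in the table captions amount to a simple classification of a set of tail-index estimates ξ̂, which can be sketched as follows (classify and gamma_hat are hypothetical helper names; the thresholds ξ̂ ≤ 0, ξ̂ ≤ 1/4, and ξ̂ > 1/2 are those of Section V):

```python
import math

def gamma_hat(xi_hat):
    """Translate a tail-index estimate xi_hat into an exponent estimate
    gamma_hat = 1 + 1/xi_hat; nonpositive xi_hat maps to infinity."""
    return math.inf if xi_hat <= 0 else 1.0 + 1.0 / xi_hat

def classify(xi_hats):
    """Classify a degree sequence from a collection of estimates xi_hat."""
    if any(x <= 0 for x in xi_hats):
        return "not power law"                 # red
    if any(x <= 0.25 for x in xi_hats):
        return "hardly power law"              # orange
    if all(x > 0.5 for x in xi_hats):
        return "divergent second moment"       # green
    return "other power law"                   # blue

print(classify([0.6, 0.7, 0.8]), gamma_hat(0.5))
```

The rules are checked in order, so the "other power law" (blue) class collects exactly the remaining cases where all estimates exceed 1/4 but not all exceed 1/2.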