Deep Learning-Enhanced Variational Monte Carlo Method for Quantum Many-Body Physics

Artificial neural networks have been successfully incorporated into the variational Monte Carlo (VMC) method to study quantum many-body systems. However, there have been few systematic studies exploring quantum many-body physics with deep neural networks (DNNs), despite the tremendous success DNNs have enjoyed in many other areas in recent years. One main challenge of implementing DNNs in VMC is the inefficiency of optimizing networks with a large number of parameters. We introduce an importance sampling gradient optimization (ISGO) algorithm, which significantly improves the computational speed of training DNNs in VMC. We design an efficient convolutional DNN architecture to compute the ground state of a one-dimensional (1D) SU($N$) spin chain. Our numerical results for the ground-state energies with DNNs of up to 16 layers show excellent agreement with the Bethe-Ansatz exact solution. Furthermore, we calculate the loop correlation function using the obtained wave function. Our work demonstrates the feasibility and advantages of applying DNNs to numerical quantum many-body calculations.

Introduction -Over the past few years, artificial neural networks have been introduced to study quantum many-body systems [1][2][3][4][5][6][7][8]. In their seminal paper [1], Carleo and Troyer proposed to represent quantum many-body states by a Restricted Boltzmann Machine (RBM), which contains one visible and one hidden layer. The many-body wave function is represented by the visible layer after integrating out the hidden layer. The parameters in the RBM are trained by the VMC method. Following this work, the RBM and a few other networks have been applied to study several quantum many-body systems with good accuracy [1][2][3][4][5][6][7][8][9][10][11]. So far, the networks that have been implemented in quantum physics studies are not deep and hence not powerful enough to represent more complicated many-body states. As a result, there has been no clear evidence that their performance far exceeds the more traditional state-of-the-art numerical algorithms such as quantum Monte Carlo, the density matrix renormalization group, or tensor networks. To overcome this problem, deep neural networks (DNNs) have been suggested. Theoretical studies have shown that DNNs can efficiently represent any tensor network state and most quantum many-body states, and possess distinct advantages over shallow networks [12][13][14]. In fact, DNN-based deep learning has become the most successful model for many machine learning tasks and has dominated the field since 2012. DNNs have been demonstrated to have performance comparable or superior to human experts in various tasks, such as playing Atari games [15] and Go [16,17], and manipulating robots [18,19], and have led to rapid advances in artificial intelligence.
Despite great interest, there have been relatively few works applying DNNs to quantum many-body computations [20,21]. This is perhaps due to the fact that applying DNNs to represent quantum many-body states faces two main challenges: inefficient optimization and insufficient information for the proper choice of DNN architectures. The former arises because a DNN typically contains a large number of parameters to train, while a proper choice of architecture often requires physical insight into the nature of the quantum system.
In this Letter, we propose an efficient convolutional DNN architecture to represent the ground state of quantum many-body systems. Most quantum systems consist of particles interacting with each other through finite-range interactions. Such a locally interacting character can be ideally captured by a convolutional neural network (CNN). We have developed an importance sampling gradient optimization (ISGO) algorithm within the VMC method, which significantly improves the optimization speed and hence enables us to utilize DNN architectures. Our method can take advantage of automatic differentiation, which computes the gradient updates via the backpropagation algorithm [22]. We show that our method can be parallelized and take full advantage of the acceleration provided by graphics processing units (GPUs). The ISGO method achieves at least one order of magnitude speedup when trained on GPUs [23].
For benchmarking purposes, we construct DNNs to represent the ground-state wave function of the 1D SU(N) spin chain, which has an exact solution under the Bethe ansatz. We systematically test different DNN architectures with ISGO for systems of different complexity. Our numerical results for the ground-state energies of the 1D SU(N) spin chain show excellent agreement with the exact solutions. Furthermore, we are able to compute correlation functions which are extremely difficult to obtain by the Bethe Ansatz. The convolutional DNN architecture we constructed for this work can be readily generalized to represent the ground states of other quantum many-body systems. The ISGO method can also be used to accelerate computations based on general VMC methods.
Network Architectures -We consider a homogeneous 1D SU(N) spin chain with N_site spins, which is the simplest prototypical model with SU(N) symmetry, governed by the Hamiltonian

H = J \sum_{i=1}^{N_site} P_{i,i+1},   (1)

with periodic boundary conditions, where P_{i,i+1} is the spin exchange operator which exchanges two neighboring spins: P_{i,i+1} |a_i, b_{i+1}⟩ = |b_i, a_{i+1}⟩.

FIG. 1. The architecture of a convolutional DNN with L = 8 hidden layers. The input state is encoded into a 2D tensor of shape [N_site, N_in] and fed into the input layer (represented by the leftmost pink rectangle). For value encoding N_in = 1, and for one-hot encoding N_in = N. The blue rectangles stand for the activations (feature maps) of the hidden layers. Convolution filters (the small pink rectangles) transform one hidden layer to the next. The last hidden layer (on the right) is reduce-summed and followed by a fully connected layer to give the ln Ψ output.

This model can describe the behavior of 1D strongly interacting quantum spinor gases [24][25][26][27], and has attracted significant attention both experimentally and theoretically [28][29][30][31][32][33][34][35]. Here we will use the DNN to represent the ground-state wave function of this model. A general state takes the form |Ψ⟩ = \sum_{\{s_i\}} Ψ(s_1, s_2, ..., s_{N_site}) |s_1, s_2, ..., s_{N_site}⟩, where each s_i represents one of the N spin states of the SU(N) model. The goal is to build a network that takes a basis state |{s_i}⟩ as input and computes the ground-state wave function Ψ({s_i}) such that the energy functional ⟨Ψ|H|Ψ⟩/⟨Ψ|Ψ⟩ is minimized. The first step is to encode the input basis state into a 2D tensor S_{j,β}, where the first and second indices j and β represent the spatial site and the local spin state, respectively. In this work, we consider two kinds of state encodings: value encoding, which encodes each spin state into a number, and one-hot encoding, which encodes each spin state into a one-hot boolean vector. The tensor S is fed into the DNN as input. The output of the first hidden layer, which follows
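For a VMC calculation with this Hamiltonian, one needs the local energy E_x = ⟨x|H|Ψ⟩/⟨x|Ψ⟩ of a sampled configuration, obtained by applying each exchange operator to the configuration. The sketch below is illustrative rather than the authors' code: the function name, the coupling `J`, and the sign convention are our assumptions, and a callable `log_psi` returning ln Ψ for a configuration is taken as given.

```python
import numpy as np

def local_energy(config, log_psi, J=1.0):
    """Local energy E_x = <x|H|Psi>/<x|Psi> for H = J * sum_i P_{i,i+1}
    with periodic boundary conditions. `config` is a 1D integer array of
    spin labels; `log_psi` maps a configuration to ln Psi."""
    n = len(config)
    lp0 = log_psi(config)
    e = 0.0
    for i in range(n):
        j = (i + 1) % n  # periodic neighbor
        if config[i] == config[j]:
            e += 1.0  # exchanging identical spins leaves the state unchanged
        else:
            swapped = config.copy()
            swapped[i], swapped[j] = swapped[j], swapped[i]
            # off-diagonal term: amplitude ratio Psi(swapped)/Psi(config)
            e += np.exp(log_psi(swapped) - lp0)
    return J * e
```

For a uniform trial wave function (ln Ψ = const) every ratio equals one, so the local energy reduces to J times the number of bonds, a quick sanity check on the implementation.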
immediately after the input layer, is given by

A^{[1]} = σ(S ∗ W^{[1]} + b^{[1]}),   (2)

where A^{[1]} is the activation (feature map) of the first hidden layer, ∗ denotes convolution with periodic padding, and σ(x) = max(x, 0) is the Rectified Linear Unit (ReLU) activation function, which has been demonstrated to outperform the traditional sigmoid activation for DNNs [36]. W^{[1]} is a 3D tensor of shape [K, N_in, F], where K is the convolution kernel size, F the number of channels of the hidden layer, and N_in the number of channels of the input layer, which is 1 for value encoding and N for one-hot encoding. In this work, we use the same K and F, which determine the width of the network, for every hidden layer. b^{[1]} is a bias vector of size F. The outputs of the remaining hidden layers are given by

A^{[l]} = σ(A^{[l−1]} ∗ W^{[l]} + b^{[l]}),   l = 2, ..., L,   (3)

where L is the total number of hidden layers, which determines the depth of the network, W^{[l]} is a 3D tensor of shape [K, F, F], and b^{[l]} a bias vector of size F. After the last hidden layer, the output is summed along the spatial dimension and passed through a single fully connected layer to give the final output of the network,

ln Ψ = \sum_f a_f \sum_j A^{[L]}_{j,f},   (4)

where a is a weight vector of size F. The full structure of the network is illustrated in Fig. 1. Each magenta rectangular object corresponds to a convolutional filter. We use periodic padding for each convolutional layer to enforce the periodic boundary condition. This network is fully convolutional [37], which means the architecture is compatible with different system sizes, and we can easily perform transfer learning.
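The architecture above can be sketched in plain NumPy for clarity (the authors' implementation uses TensorFlow; the function names here are illustrative). Periodic padding plus the spatial reduce-sum makes ln Ψ invariant under cyclic translations of the input, which we use as a sanity check.

```python
import numpy as np

def periodic_conv1d(A, W, b):
    """1D convolution with periodic padding (odd kernel size assumed).
    A: [N_site, C_in], W: [K, C_in, C_out], b: [C_out]."""
    n = A.shape[0]
    K = W.shape[0]
    half = K // 2
    # wrap the spatial dimension to enforce the periodic boundary condition
    Apad = np.pad(A, ((half, half), (0, 0)), mode="wrap")
    out = np.stack([np.tensordot(Apad[j:j + K], W, axes=([0, 1], [0, 1]))
                    for j in range(n)])
    return out + b

def relu(x):
    return np.maximum(x, 0.0)

def log_psi_cnn(S, weights, biases, a):
    """ln Psi for the fully convolutional network of Eqs. (2)-(4):
    stacked periodic convolutions with ReLU, a spatial reduce-sum,
    and a final dense layer with weight vector `a`."""
    A = S
    for W, b in zip(weights, biases):
        A = relu(periodic_conv1d(A, W, b))
    return float(np.sum(A, axis=0) @ a)
```

Because the network is fully convolutional, the same `weights` apply unchanged to chains of any length, which is what enables transfer learning between system sizes.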
Here, W^{[l]}, b^{[l]}, and a are the network parameters that need to be optimized. The total number of parameters is roughly KF²L.

Importance Sampling Gradient Optimization -Before introducing the ISGO method, we first revisit the conventional gradient optimization method in VMC. The wave function Ψ({w}) is encoded by the set of network parameters w ∈ {W^{[l]}, b^{[l]}, a}. In every iteration step, N_sample quantum states following the distribution P^0_x ∝ |Ψ^0_x|² are sampled using a Markov chain. Here Ψ^0 is the input wave function from the previous step, and x indexes the sampled states. The variational energy functional E_v({w}) is then computed. To minimize E_v({w}), the network parameters are updated as w ← w − α ∂_w E_v, where the "learning rate" α is a small parameter [38]. In our work, we use Adam [39], a variant of the Stochastic Gradient Descent algorithm. Here, ∂_w E_v is approximated with the N_sample samples as

∂_w E_v ≈ \sum_x I_0 (E_x − E_v) ∂_w ln Ψ^0_x,   (5)

where E_x is the local energy under Ψ^0 and I_0 = 2/N_sample. After the parameters w are updated, the wave function changes from Ψ^0 to Ψ, which serves as the input for the next iteration step, where a new set of states is sampled based on Ψ and the previously sampled states based on Ψ^0 are discarded.
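The Markov-chain sampling step described above can be sketched as a standard Metropolis walk. The sketch is illustrative: the text does not specify the proposal move, so the spin-swap proposal (which conserves the number of spins of each flavor, as required by the SU(N) model) and the function name are our assumptions.

```python
import numpy as np

def metropolis_sample(config, log_psi, n_samples, n_skip=10, rng=None):
    """Sample configurations x with probability P_x ∝ |Psi_x|^2 using a
    Metropolis chain whose proposal swaps two randomly chosen spins."""
    rng = np.random.default_rng(rng)
    x = config.copy()
    lp = log_psi(x)
    samples = []
    while len(samples) < n_samples:
        for _ in range(n_skip):  # decorrelation sweeps between stored samples
            i, j = rng.integers(len(x), size=2)
            y = x.copy()
            y[i], y[j] = y[j], y[i]
            lp_new = log_psi(y)
            # accept with probability min(1, |Psi_y|^2 / |Psi_x|^2)
            if np.log(rng.random()) < 2.0 * (lp_new - lp):
                x, lp = y, lp_new
        samples.append(x.copy())
    return np.array(samples)
```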
Inspired by the off-policy policy gradient method in Reinforcement Learning (RL) [40,41], we develop an efficient importance sampling gradient optimization (ISGO) method that utilizes the mismatched samples, as shown in Fig. 2(a). The key is to renormalize the distribution of those mismatched samples to |Ψ_x|² by multiplying the local energies and derivatives in Eq. (5) with importance sampling factors:

∂_w E_v ≈ \sum_x I_x (E_x − E_v) ∂_w ln Ψ_x,   I_x = (2/N_sample) P_x/P^0_x,   (6)

where E_x is now the local energy under the last updated wave function Ψ, and P_x = |Ψ_x|²/C, with C the normalization factor, which can be approximated by imposing (1/N_sample) \sum_x I_x/I_0 = 1. The key difference, in comparison to the conventional method, is that, within each iteration step, the network parameters w (and hence the wave functions) are updated multiple times. This enables us to use the N_sample sampled states much more efficiently. Furthermore, the update procedure can be efficiently parallelized and run on GPUs.
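The reweighting at the heart of ISGO can be expressed in a few lines. The sketch below is illustrative, not the authors' TensorFlow implementation [43]: the function name is ours, the per-sample gradients O_x = ∂_w ln Ψ_x are taken as given (in practice they come from backpropagation), and the normalization of C is done by rescaling the importance ratios to unit sample mean.

```python
import numpy as np

def isgo_gradient(log_psi0, log_psi, e_loc, o_grad):
    """One ISGO gradient estimate from samples drawn under Psi^0.
    log_psi0, log_psi: per-sample ln|Psi| under the sampling and current
    wave functions; e_loc: local energies under the current Psi;
    o_grad: per-sample gradients O_x = d(ln Psi_x)/dw, shape [N, n_params]."""
    n = len(e_loc)
    ratio = np.exp(2.0 * (log_psi - log_psi0))  # P_x / P^0_x, unnormalized
    ratio = ratio / ratio.mean()                # fixes the normalization C
    I = 2.0 * ratio / n                         # importance factors I_x
    e_v = np.sum(I * e_loc) / 2.0               # reweighted variational energy
    # gradient estimator: sum_x I_x (E_x - E_v) d ln Psi_x / dw
    return (I * (e_loc - e_v)) @ o_grad
```

When Ψ = Ψ⁰ every ratio equals one, so I_x reduces to I_0 = 2/N_sample and the estimator coincides with the conventional one; the factors only deviate from I_0 as the parameters drift away from the sampling distribution within an iteration.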
We plot the variational energies versus iteration step and wall time for a 60-site SU(2) chain with a 16-layer CNN in Fig. 2(b) and (c) [23]. As can be seen, the ISGO method converges with far fewer samples, and much faster on a GPU, than the conventional method. We also implement the ISGO method using TensorFlow with automatic differentiation [42], which allows us to try new network architectures much more easily. Our code for both the RBM and the CNN can be found in Ref. [43]. We emphasize that, although we choose a particular gradient optimization algorithm (Adam) in our work, the concept of ISGO is general and can be combined with any other optimization method.
Numerical Results for the 1D SU(N) Spin Chain -We test our DNN on the Sutherland model, the 1D homogeneous SU(N) spin chain governed by Hamiltonian (1). We pick this model for two main reasons. First, the ground-state energy of this model can be solved exactly by the Bethe Ansatz [44], which allows us to benchmark our results [23]. Second, the number of spin states N controls the complexity of the system, which allows us to systematically study the efficiency and accuracy of the DNN as the complexity of the model grows. Numerical details can be found in [23].
Figure 3 shows our main results for the ground-state energies at various N on an N_site = 60 chain, using a DNN with varying depth (i.e., number of layers L) and width (i.e., kernel size K). We tested two encoding methods for the input state. Figs. 3(a,b) correspond to the value encoding, while (c,d) correspond to the one-hot encoding. The value encoding imposes ordinality, i.e., different spin states are encoded into numbers whose average value is zero. For one-hot encoding, different spin states are encoded into vectors orthogonal to each other, and thus are not ordinal. For example, for an N = 3 system, the three spin states are encoded into the values −1/2, 0, 1/2 in value encoding, and into the vectors (1, 0, 0), (0, 1, 0), (0, 0, 1) in one-hot encoding. The one-hot encoding requires more computational resources (in terms of both memory and computational time) than the value encoding, scaling linearly with N. However, in general it yields better accuracy than the value encoding. This could be due to the fact that one-hot encoding maps each spin state to a vector, which effectively enlarges the dimension of the parameter space. Optimization in such an artificially enlarged space helps to prevent the system from getting stuck in metastable states [45].
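The two encodings can be sketched as follows. The text only gives the numeric values for N = 3, so the equally-spaced, zero-mean scale used here for general N is our assumption; the function names are illustrative.

```python
import numpy as np

def value_encode(config, N):
    """Map spin labels 0..N-1 to equally spaced zero-mean values;
    output shape [N_site, 1] (N_in = 1)."""
    vals = (np.arange(N) - (N - 1) / 2) / (N - 1)  # e.g. -1/2, 0, 1/2 for N = 3
    return vals[config][:, None]

def one_hot_encode(config, N):
    """Map each spin label to an orthogonal boolean vector;
    output shape [N_site, N] (N_in = N)."""
    return np.eye(N)[config]
```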
Figures 3(a,c) display the results from a single-layer network with varying width. As one can see, increasing the kernel size K helps bring the ground-state energy closer to the exact solution represented by the horizontal dashed lines, which indicates that, for such a shallow network, a large kernel size is necessary for capturing the long-range effects mediated by nearest-neighbor interactions. Here, one-hot encoding performs significantly better than value encoding, especially for SU(N > 2), where, no matter how large the kernel size is, the energies computed via value encoding do not converge to the exact solutions. In Figs. 3(b,d), we fix the width of the network but vary its depth by adjusting the number of layers L. Even for a relatively small kernel size K = 3, increasing L helps bring the computed ground-state energy closer to the exact result. Therefore, a DNN can capture long-range effects even with a small kernel size. For the SU(5) model (the largest N we used in the calculation), the energy does not converge to the exact solution using value encoding with kernel size 3. Simply by increasing the kernel size to 5, we can reduce the computed ground-state energy and match it with the exact solution, as shown by the black squares in Fig. 3(b). We varied the number of channels F and found that the energy results are not sensitive to F. More details can be found in the Supplemental Materials [23].
The Bethe Ansatz method can yield the energy spectrum and, in principle, the many-body wave function for exactly solvable models such as the one considered here. However, due to the complexity of a general many-body wave function, it remains a tremendous challenge to compute other useful quantities such as correlation functions. Often, advanced numerical techniques are needed for such tasks [46,47]. To further demonstrate the power of the DNN, here we show our results for the loop correlation function

S_{m,n} = ⟨(m ⋯ n)⟩,   (7)

where the expectation value is taken with respect to the ground state, and (m ⋯ n) is the loop permutation operator that permutes the spatial indices in the wave function as

(m ⋯ n) Ψ(..., s_m, s_{m+1}, ..., s_n, ...) = Ψ(..., s_n, s_m, ..., s_{n−1}, ...).   (8)

Physically, this operator puts the spin at the original n-th position to the m-th position and correspondingly moves the spins at the original i-th positions (with i = m, ..., n − 1) to their neighboring positions on the right. The loop correlation function appears in the definition of the one-body density matrix of 1D strongly interacting quantum spinor gases, whose ground state can be represented by a strong-coupling ansatz wave function [24,26,27,48,49], due to the fact that such wave functions must obey the permutation symmetry rule originating from quantum indistinguishability. For the homogeneous system we consider here, S_{m,n} = S_r with r ≡ n − m. We plot S_r for an SU(N) spin chain with N_site = 60 spins in Fig. 4(a), and its discrete Fourier transform S_k = |\sum_r S_r e^{ikr}| in Fig. 4(b). S_r and S_k characterize the correlations in real and momentum space, respectively. By applying the Jordan-Wigner transformation, an SU(N) spin chain with N_site/N spins per spin component can be mapped to a nearest-neighbor interacting N-component fermionic system with N_site/N fermions per component [49,50]. The peaks of S_k at k = ±π/N, which can be clearly seen in Fig. 4(b), correspond to the Fermi points of those fermions. These peaks lead to the singularities in the momentum distribution of strongly interacting spinor Fermi gases at the same momentum points [49,50].
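The loop permutation and its Monte Carlo estimate can be sketched directly: for samples drawn from |Ψ|², the expectation value of the operator reduces to the average amplitude ratio Ψ(permuted)/Ψ(original). The function names are illustrative, and a real-valued ln Ψ is assumed for simplicity.

```python
import numpy as np

def loop_permute(config, m, n):
    """Apply the loop permutation (m ... n): the spin at site n moves to
    site m, and the spins at sites m..n-1 each shift one site to the right."""
    out = config.copy()
    out[m + 1:n + 1] = config[m:n]
    out[m] = config[n]
    return out

def loop_correlation(samples, log_psi, r):
    """Monte Carlo estimate of S_r = <(m ... m+r)> for a homogeneous chain
    (m = 0 without loss of generality), given samples drawn from |Psi|^2."""
    vals = [np.exp(log_psi(loop_permute(x, 0, r)) - log_psi(x))
            for x in samples]
    return np.mean(vals)
```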
Conclusion and Outlook -We have constructed a DNN, combined with VMC, to study the ground state of the 1D SU(N) spin chain. The key to our work is the development of the ISGO algorithm for the optimization procedure, which can be straightforwardly applied to any type of variational wave function. This algorithm allows us to train the network efficiently, and is particularly suitable for training DNNs, which typically contain a large number of parameters. Note that VMC with the ISGO algorithm may be interpreted as an RL process if we identify the Markov-chain state trajectories in the former with the state transitions/policies in the latter. We tested the network on the 1D SU(N) spin chain model and systematically investigated its performance by varying its depth and width. We found that, when using value state encoding, as the complexity of the model increases with N, it is not sufficient just to increase the width (i.e., kernel size) of the network; one needs to add more depth to capture the long-range correlations of the quantum state. We only show numerical results computed with DNNs of up to 16 layers. We did not observe any significant benefit from using much deeper networks, up to 100 layers, on this model. This could be due to the potential problem of vanishing gradients in very deep networks (see Refs. [51,52] and references therein), which may be alleviated by using other network architectures such as ResNet [52]; we leave this for future studies. Finally, we note that another key finding of our work, which has not been discussed in previous works, is the importance of the input state encoding. We find that one-hot encoding, although it requires more computational resources, in general leads to much more accurate results than value encoding.
In conclusion, our study clearly demonstrates that it is feasible to use DNNs to represent quantum many-body wave functions and to significantly enhance the efficiency of numerical quantum many-body computations. Applying machine learning techniques to quantum many-body physics is still a young and emerging field with many open questions. We believe that such investigations will not only benefit quantum physics, but may also help us gain deeper insights into neural networks themselves.

MORE NUMERICAL DETAILS AND RESULTS
We typically use 1500 iteration steps. The initial learning rate for Adam is set to 10^{-3} for the first 100 steps and then changed to 10^{-4}. We use N_sample = 20000, except for the first 10 steps (to avoid running out of memory), and N_optimize = 100 for the first 1000 steps, after which it is set to 10. We average the network parameters over the last 100 steps when calculating physical quantities.
Besides the kernel size K and the number of layers L, we have also studied the effect of changing the number of filters F. We fix the kernel size at K = 11 with one hidden layer, and the results are shown in Fig. 5. We found that for the value encoding in Fig. 5(a), the energies are insensitive to F for the SU(2) and SU(3) models, but for the SU(4) and SU(5) models the variational energies remain far from the exact solutions. For the one-hot encoding in Fig. 5(b), all energies are insensitive to the number of filters. In Table I, we list the ground-state energies obtained from our numerical calculation with a one-layer CNN and compare them with the Bethe Ansatz (BA) exact results. One can clearly see that the one-hot encoding yields much more accurate results than the value encoding. The one-hot results are comparable to those given in Ref. [46].

COMPARISON WITH RESTRICTED BOLTZMANN MACHINES
So far, Restricted Boltzmann Machines (RBMs) remain the most widely used machine learning tool for solving quantum many-body physics. In representing wave functions, an RBM can be regarded as a one-hidden-layer neural network [54] with activation function ln(2 cosh(x)). Also, the RBM wave function in Ref. [1] uses translational symmetry, which is equivalent to a CNN with kernel size K = N_site and periodic boundary conditions. The number of channels α in Ref. [1] plays the same role as the number of filters F in our work. We plot the ground-state variational energies of a convolutional RBM (K ≤ N_site) with different K in Fig. 6, which should be compared with Figs. 3(a)(c) in the main text, where we plot the energies as a function of the convolutional kernel size K for a one-layer CNN. It can be seen that the RBM results are very similar to those of the one-layer CNN, although the two use different activation functions. In addition, Fig. 6 also shows that, just as in our work, one-hot encoding with an RBM has a significant advantage over value encoding for N > 2.
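The one-hidden-layer view of the RBM can be sketched directly: ln Ψ is a sum of ln(2 cosh(·)) activations over the hidden units (plus an optional visible-bias term). The function name and argument layout are illustrative, not the code of Ref. [1].

```python
import numpy as np

def rbm_log_psi(s, W, b, c=None):
    """ln Psi of an RBM viewed as a one-hidden-layer network with
    activation ln(2 cosh(x)). s: visible configuration (encoded values),
    W: [n_visible, n_hidden] weights, b: hidden biases,
    c: optional visible biases."""
    theta = s @ W + b                       # pre-activations of hidden units
    out = np.sum(np.log(2.0 * np.cosh(theta)))
    if c is not None:
        out += s @ c                        # visible-bias contribution
    return out
```

With all weights and biases zero, each hidden unit contributes ln(2 cosh 0) = ln 2, a convenient analytic check.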

COMPARISON OF GPU AND CPU PERFORMANCE
To compare the computational speed of the ISGO method on GPUs and CPUs, the wall time for calculations of the SU(2) model on an N_site = 60 chain with model architecture (L, F, K) = (16, 8, 3) is plotted in Fig. 2(c) in the main text. We performed simulations on a single K80 GPU with 11 GB of memory on Google Colaboratory and on two 2.30 GHz Intel(R) Xeon(R) CPUs. The ISGO method has two steps: Monte Carlo sampling and importance sampling gradient optimization. The Monte Carlo sampling step is always calculated on CPUs. The simulation with N_optimize = 1 corresponds to the conventional GO method, whose performance on the GPU (blue line) is similar to that of the ISGO method on CPUs (green line), as shown in Fig. 2(c) of the main text. Remarkably, the simulation with the ISGO method on the GPU (red line) arrives at a good variational energy in about 0.5 h, while the conventional GO method (blue line) requires about 5 h to obtain similar accuracy. The simulation with the conventional GO method on CPUs (magenta line) is the slowest due to Python overhead. The results show that simulations with the ISGO method on a GPU are at least one order of magnitude faster than the conventional GO method on any hardware. We also observe that the speedup is even greater for more complex networks.

FIG. 2. (a) The flow chart of the ISGO algorithm within the VMC method. A network is used to represent the ground-state wave function. In every iteration step, the sampler generates N_sample samples following the distribution P^0_x ∝ |Ψ^0_x|². Then the importance sampling optimizer updates the network parameters through backpropagation in a loop for N_optimize times. N_optimize = 1 corresponds to the conventional gradient optimization method. The whole process is iterated until convergence. The sampler and the optimizer share the same wave function. We compare the training curves for an N_site = 60 SU(2) spin chain using a (L, F, K) = (16, 8, 3) CNN with one-hot encoding: (b) Variational energy versus iteration steps. (c) Variational energy versus wall time on GPU/CPU. The initial learning rate for Adam is 10^{-4}.

FIG. 3. The ground-state energy for N_site = 60 SU(N) spin chains (N = 2, 3, 4, 5) using a CNN with value encoding (a)(b) and one-hot encoding (c)(d). (a) and (c) are for one layer (L = 1) with fixed channel number F = 8 and different kernel sizes K. (b) and (d) are for fixed channel number and kernel size (F = 8, K = 3) but different numbers of layers L. In (b), the black squares are for F = 8 and K = 5. The horizontal black dashed lines are exact results from the Bethe Ansatz.

FIG. 5. The one-hidden-layer CNN ground-state energies for 60-site SU(N) spin chains, where N = 2, 3, 4, 5. (a) is for value encoding. (b) is for one-hot encoding. The only varied parameter is the number of filters F. The black dashed lines are exact results from the Bethe Ansatz.

FIG. 6. The RBM ground-state energies for 60-site SU(N) spin chains, where N = 2, 3, 4, 5. (a) is for value encoding. (b) is for one-hot encoding. The only varied parameter is the kernel size K. The black dashed lines are exact results from the Bethe Ansatz.

TABLE I. Comparison of the variational energies of a one-hidden-layer CNN for N_site = 60 SU(N) spin chains with exact Bethe Ansatz (BA) results. The kernel size is K = 19. Results from both the one-hot and the value encoding are included. The uncertainties on the VMC data are smaller than 10^{-4}.