Scientific Reports | 6:34256 | DOI: 10.1038/srep34256

www.nature.com/scientificreports

A Statistical Learning Framework for Materials Science: Application to Elastic Moduli of k-nary Inorganic Polycrystalline Compounds

Maarten de Jong1,*,†, Wei Chen2,*,‡, Randy Notestine3, Kristin Persson1,2, Gerbrand Ceder1,4, Anubhav Jain2, Mark Asta1,4 & Anthony Gamst3

Materials scientists increasingly employ machine or statistical learning (SL) techniques to accelerate materials discovery and design. Such pursuits benefit from pooling training data across, and thus being able to generalize predictions over, k-nary compounds of diverse chemistries and structures. This work presents a SL framework that addresses challenges in materials science applications, where datasets are diverse but of modest size, and extreme values are often of interest. Our advances include the application of power or Hölder means to construct descriptors that generalize over chemistry and crystal structure, and the incorporation of multivariate local regression within a gradient boosting framework. The approach is demonstrated by developing SL models to predict bulk and shear moduli (K and G, respectively) for polycrystalline inorganic compounds, using 1,940 compounds from a growing database of calculated elastic moduli for metals, semiconductors and insulators. The usefulness of the models is illustrated by screening for superhard materials.

In recent years, first-principles methods for calculating properties of inorganic compounds have advanced to the point that it is now possible, for a wide range of chemistries, to predict many properties of a material before it is synthesized in the lab1. This achievement has spurred the use of high-throughput computing techniques2–5 as an engine for the rapid development of extensive databases of calculated material properties6–12.
Such databases create new opportunities for computationally-assisted materials discovery and design, providing for a diverse range of engineering applications with custom-tailored solutions. But even with current and near-term computing resources, high-throughput techniques can only analyze a fraction of all possible compositions and crystal structures. Thus, statistical learning (SL), or machine learning, offers an express lane to further accelerate materials discovery and inverse design2,5,13–27. As statistical learning techniques advance, increasingly general models will allow us to quickly screen materials over broader design spaces and intelligently prioritize the high-throughput analysis of the most promising material candidates. One encounters several challenges when applying SL to materials science problems. Although many elemental properties are available, we typically do not know how to construct optimal descriptors for each property, over a variable number of constituent elements. For instance, if one believes that some average of atomic radii is an important descriptor, there are many different averages, let alone possible weighting schemes, that one might investigate. This challenge may be reduced by placing restrictions on the number of constituent elements or types of chemistries or structures considered, but such restrictions reduce the generalizability of the learned predictor. Materials science datasets are often also smaller than those available in domains where SL has an established history. This requires that SL be applied with significant care in order to prevent over-fitting the model.

1Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, CA 94720, USA. 2Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. 3Computational and Applied Statistics Laboratory, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA. 4Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. †Present address: Space Exploration Technologies, 1 Rocket Rd, Hawthorne, CA 90250, USA. ‡Present address: Department of Mechanical, Materials and Aerospace Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA. *These authors contributed equally to this work. Correspondence and requests for materials should be addressed to M.d.J. (email: maartendft@gmail.com). Received: 27 June 2016. Accepted: 08 September 2016. Published: 03 October 2016.

Over-fitting
leads to predictions that are less generalizable to new data than anticipated28, such that predictions are less accurate than expected. At the same time, smaller datasets challenge us to use the available data as wisely as possible. This may include leveraging observations related to the smoothness of the underlying physical phenomenon, and the use of an appropriate risk criterion, rather than partitioning the available data into distinct training and test sets. For SL to have the greatest impact on materials discovery and design, we must pursue techniques that make maximal use of the available data. This requires approaches that are capable of systematically pooling training data across, and are thus capable of generalizing predictions over, k-nary compounds of diverse chemistries and structures. The successful application of SL requires the selection of an appropriate set of descriptor candidates. In materials science problems, the candidates must be capable of both “uniquely characterizing”22 a diverse range of compounds, and sufficiently explaining the diversity of the phenomenon being learned. Thus, the selection of descriptor candidates is a crucial and active field of investigation within materials science (e.g. refs 14, 22 and 23), as the field endeavors to develop general models with high predictive accuracy. Previous work in materials science has included both categorical descriptors (e.g. refs 23 and 24) and continuous descriptors (e.g. refs 22 and 23). Although both types of descriptors may be legitimately used in SL, special care should be taken when using categorical descriptors, as each such descriptor essentially (i.e., unless there is sufficient smoothing across cells) partitions the space of compounds into disjoint cells, which quickly increases the degrees of freedom and thus the risk of over-fitting the model.
SL applications should always include descriptor candidates suggested by known, scientifically relevant relationships22,23. But in order to construct models that accurately generalize across diverse datasets, such candidates will typically need to be augmented with additional descriptor candidates, capable of bridging across the simplifying assumptions that divide less generalizable models. Without these additional candidates, attempts to learn more general models will be stifled, as it will be impossible to discover new, unexpected relationships. Here we introduce the use of Hölder means, also known as generalized or power means, as an ordered approach to explicitly constructing descriptor candidates from variable length numeric lists. Hölder means describe a family of means that range from the minimum to maximum functions, and include the harmonic, geometric, arithmetic, and quadratic means29. This paper advances previous work by constructing descriptor candidates as Hölder means, which, to the best of our knowledge, has not previously been done in the field of materials science. Having discussed the construction of descriptor candidates, we now introduce gradient boosting machine local polynomial regression (GBM-Locfit), which is a SL technique that we developed to leverage the available data as wisely as possible. Energy minimization problems often enforce smoothness in the functions mapping useful descriptors to outcomes. Statistical learning techniques may exploit such smoothness, when present, in order to produce models that are as accurate as possible for a fixed amount of training data; such considerations are more important when working with smaller training datasets than with larger datasets. GBM-Locfit utilizes multivariate local polynomial regression, as implemented in Locfit30, within a gradient boosting machine (GBM) framework31. 
Local polynomial regression performs a series of weighted regressions within a moving window, with a weight function that gives greatest weight to observations near the center of the window, producing a smooth curve that runs through the middle of the observations32,33. GBM uses a gradient descent algorithm to iteratively assemble a predictor while minimizing an appropriate, typically squared error, loss function31. Our approach enforces more smoothness in the functions mapping descriptors to outcomes than traditional tree-based GBM methods, which we suggest is appropriate for this and many other materials science problems. Additionally, the enforced smoothness helps minimize boundary bias (i.e., when the solution is flat over some peripheral region of the space of descriptors), which can be problematic with tree-based techniques when the data has sparsely populated tails. We believe GBM-Locfit will be advantageous for many materials science problems where datasets are of modest size, the underlying physical phenomenon is reasonably smooth and sparse regions have been carefully studied and are of particular interest. To illustrate both our GBM-Locfit approach and the use of descriptor candidates constructed as Hölder means, we predict the elastic bulk and shear moduli (K and G, respectively) of k-nary inorganic polycrystalline compounds. These moduli govern the stress-strain relations of isotropic materials within the linear-elastic range and are central to governing the mechanical behavior of materials in diverse contexts spanning geophysics to structural engineering. In addition, elastic constants are known to correlate with a wide range of other materials properties, including ductility and hardness34–37 and thermal conductivity38–40. Further, the single-crystal elastic constants are a direct measure of the bonding strength and directionality in a material, and are thus widely employed in the development of theoretical models for interatomic forces. 
Due to the importance of these properties, extensive efforts have been devoted to developing theoretical models of elastic moduli, relating their magnitude to structural and electronic properties such as atomic density, coordination, cohesive energy and Fermi energy27,41–46. But all of these models consider specific subsets of chemistries or structures, limiting their use for predicting the elastic properties of a diverse range of materials. A recent investigation employing nonparametric regression24 and categorical descriptors considered elastic constants for a diverse range of materials, but the results fail to generalize to new data (see Supplementary Figures S6 and S7). In this paper, we demonstrate the application of the SL framework described above to develop broadly applicable models for K and G, expressed in terms of a few descriptors that are either currently tabulated or easily computed. We demonstrate how such models can be used to enable materials discovery, by screening the hardness of over 30,000 compounds to identify superhard inorganic compounds that are thermodynamically stable or weakly metastable at a temperature of 0K. Our training dataset consists of 1,940 inorganic compounds from the Materials Project’s growing database of elastic constants constructed using first-principles, quantum mechanical calculations based on Density Functional Theory (DFT)12. The outline of this paper is as follows. In the methods section, we detail our SL framework, which includes safeguards against over-fitting our models. Then the predictive models for the elastic moduli are described in the
results section, including an overview of the descriptors and the prediction accuracy. In addition, we present a screening process for superhard materials and a DFT validation. In the discussion section, we examine known issues with the accuracy of the predictions and conclude with a summary of the main advances presented in this work.

Methods

We begin our methods section with some background on local polynomial regression, which was introduced to the statistics literature by Stone32 and Cleveland33. Loader30 provides a general, yet thorough discussion of local regression, as well as implementation details of the Locfit software. Simply put, multivariate local regression produces a smooth surface that runs nominally through the middle of the observations. Thus, given response variables, y, and predictor variables, x, Locfit estimates a regression function, η, which relates these quantities, where ε is noise (assumed to be independent and identically distributed with zero mean and finite variance):

y = \eta(x_1, x_2, \ldots, x_m) + \varepsilon \qquad (1)

Globally, no strong assumptions are made concerning the form of η, the underlying function being estimated. Locally, at each fitting point, the underlying function is assumed to be smooth, such that Taylor's theorem allows the behavior to be described with a low order polynomial. Specifically, a once-differentiable function can be approximated locally by a linear function, and more generally, a k-times differentiable function can be approximated locally by a kth-order polynomial. In order to make local estimates of η, one must select a bandwidth, or smoothing window, and an appropriate smoothing kernel or weight function. Appropriate weight functions, such as Locfit's default tricubic weight function, give greatest weight to observations near the center of the smoothing window and zero weight to observations outside the window.
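To make this concrete, the estimate at a single fitting point can be sketched as a tricube-weighted linear regression. This is a minimal illustration of ours, not Locfit's actual code, and the function names are hypothetical:

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 inside the window, zero outside."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3)**3, 0.0)

def local_linear_estimate(x, y, x0, bandwidth):
    """Estimate eta(x0) as the intercept of a weighted linear regression
    centered at x0; tricube weights make observations near the center of
    the smoothing window dominate the fit."""
    w = tricube((x - x0) / bandwidth)
    X = np.column_stack([np.ones_like(x), x - x0])  # centered design matrix
    A = X.T @ (w[:, None] * X)                      # X' W X
    b = X.T @ (w * y)                               # X' W y
    beta = np.linalg.solve(A, b)
    return beta[0]  # intercept = local estimate of eta at x0
```

Evaluating this at a grid of fitting points and joining the intercepts traces out the smooth regression curve.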
The local estimate of η, at each fitting point, is the intercept of the local regression centered at the fitting point, and these local estimates combine to produce a smooth estimate of the underlying function. We are interested in estimating smooth functions, because energy minimization problems often enforce smoothness in the functions mapping useful descriptors to outcomes. In this paper, we distinguish between composition and structural descriptors. Composition descriptors, such as average atomic radius and weight, are calculated from elemental properties and only require knowledge of a compound’s composition. Structural descriptors, such as cohesive energy and volume per atom, require knowledge of a compound’s specific structure (in addition to composition), and may be determined experimentally or calculated using DFT. We seek composition descriptors that generalize over k-nary compounds, but do not have a priori knowledge of how to combine the various elemental properties to construct descriptors that are optimal for our specific, yet very general problem. Thus we construct composition descriptors as a series of weighted Hölder means, rely upon Locfit to capture any necessary non-linearities, and rely upon model selection techniques and our GBM framework to select which descriptors are most useful at each iteration, and for each specific problem. Because GBM implements a version of the least absolute shrinkage and selection operator (LASSO)47, our approach has similarities to the statistical learning approach advocated by Ghiringhelli et al.22, but is less reliant upon sufficient, a priori, scientific insight and may thus be applied to more general problems. In equation (2), μp(x) represents the Hölder mean μ, to the power p, of the property x, taken over i values, with associated weights wi 29. Equation (3) gives the Hölder mean when p equals zero. 
Hölder means describe a family of generalized means that range from the minimum function (when p = −∞) to the maximum function (p = ∞). An example would be calculating the cubic mean (p = 3) of atomic radii over all constituent elements in a particular composition, where the weights would be the molar quantities of each element.

\mu_p(x) = \left( \frac{\sum_{i=1}^{n} w_i x_i^p}{\sum_{i=1}^{n} w_i} \right)^{1/p}, \quad (p \neq 0) \qquad (2)

\mu_0(x) = \exp\left( \frac{\sum_{i=1}^{n} w_i \ln(x_i)}{\sum_{i=1}^{n} w_i} \right) \qquad (3)

In this work, we only consider the Hölder means with integer power values between negative and positive four, which include the well-known harmonic mean (p = −1), geometric mean (p = 0), arithmetic mean (p = 1), and quadratic or Euclidean mean (p = 2). We construct these Hölder-based composition descriptors for each of the eight elemental properties listed in Table 1 (upper), using elemental properties from pymatgen48. We also consider the structural descriptors listed in Table 1 (lower); most of these descriptors are obtained directly or post-processed from a single density functional theory (DFT) calculation per compound. The cohesive energy per atom, Ec, is estimated from DFT by subtracting the atom-in-a-box energies of the constituent elements from the formation energy of each compound. Following a Voronoi tessellation49 of each unit cell, atomic coordinations, nearest-neighbor bond lengths, and bond angles between adjacent neighbors are calculated for each site. Additional structural descriptors are then constructed as Hölder means of these quantities over all sites; please see Supplementary Table SI for a full list of all investigated descriptors. Our GBM-Locfit implementation uses established model selection techniques, including 10-fold cross-validation and a conservative risk criterion, to determine which descriptors are the most useful for predicting K and G, without over-fitting the training data. The GBM framework iteratively assembles a predictor, P, as a sum of Locfit-smoothed weak learners, η_i.
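Equations (2) and (3) translate directly into code. As a minimal sketch (ours, for illustration only), with the weights taken as molar quantities when averaging an elemental property over a composition:

```python
import math

def holder_mean(x, w, p):
    """Weighted Hölder (power) mean of values x with weights w, per
    equations (2) and (3): p = -1, 0, 1, 2 give the harmonic, geometric,
    arithmetic, and quadratic means, respectively."""
    wsum = sum(w)
    if p == 0:
        # Geometric mean: exponentiated weighted mean of logarithms, eq. (3).
        return math.exp(sum(wi * math.log(xi) for wi, xi in zip(w, x)) / wsum)
    # General case, eq. (2).
    return (sum(wi * xi**p for wi, xi in zip(w, x)) / wsum) ** (1.0 / p)

# Cubic mean (p = 3) of atomic radii for a hypothetical A2B composition,
# weighted by molar quantities (the radii here are illustrative only):
cubic_mean_radius = holder_mean([1.45, 0.66], [2, 1], 3)
```

Because the Hölder mean is nondecreasing in p, sweeping p from −4 to 4 yields an ordered family of descriptor candidates for each elemental property.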
At each iteration, GBM selects the smoothed weak learner candidate that leads to the greatest reduction in the size of the residual, ∆_i, but attenuates each weak learner by the
learning rate, λ, as shown in equation (4). Both of our final models use a learning rate of 5% and limit the level of interaction to 3 descriptors, D_j, meaning that Locfit is only run with three descriptors at a time, to create each smoothed weak learner candidate.

P = \sum_{i=1}^{N} \lambda \eta_i, \qquad \eta_i = \mathrm{Locfit}(\Delta_{i-1}, D_j, D_k, D_l) \qquad (4)

Although it is possible that the number of iterations required to achieve a given prediction error could be reduced by tuning Locfit's smoothing parameters, we have opted to use Locfit's default smoothing parameters and rely on the GBM method to provide an appropriate amount of flexibility to the fitted model. The one exception is that Locfit's degree of local polynomial is set to linear for all models, rather than using the default setting of quadratic. Our GBM-Locfit implementation utilizes all of the available data for training and relies on a conservative risk criterion to limit the number of iterations, rather than an explicitly partitioned test dataset, to avoid over-fitting the model to the data. As summarized graphically in Fig. 1, we perform 10-fold cross-validation (CV), using 90% of the data to select the weak learner that minimizes the squared error loss function and the remaining 10% of the data to estimate the mean and standard deviation of the out-of-sample squared errors, for each iteration and fold. After this process is completed for each fold and for a large number of iterations, the prediction errors are calculated as the mean (over folds) out-of-sample squared error for each iteration, and the standard errors of the prediction errors are estimated from the standard deviations of the out-of-sample squared errors for each iteration.
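The iterative assembly in equation (4), together with the conservative risk criterion that limits the number of iterations, can be sketched as follows. This is a simplified stand-in of our own devising, not the authors' implementation: a one-descriptor tricube kernel smoother replaces Locfit (the actual models use three descriptors per weak learner), and all names are hypothetical:

```python
import numpy as np

def tricube_smoother(x, r, bandwidth=0.3):
    """Fit a tricube-kernel smoother to residuals r over descriptor x;
    returns a function evaluating the smoothed fit at new points."""
    def fit(x_new):
        u = np.abs(np.subtract.outer(x_new, x)) / bandwidth
        w = np.where(u < 1.0, (1.0 - u**3)**3, 0.0)
        s = w.sum(axis=1)
        s[s == 0.0] = 1.0                 # guard against empty windows
        return (w * r).sum(axis=1) / s
    return fit

def gbm_smooth(X, y, n_iter=100, lam=0.05, bandwidth=0.3):
    """Gradient boosting with smooth weak learners: at each iteration, fit
    a smoother to the current residual on every descriptor, keep the one
    giving the greatest residual reduction, attenuated by the learning rate."""
    pred = np.full(len(y), y.mean())      # initial predictor: the mean
    for _ in range(n_iter):
        resid = y - pred                  # gradient of the squared-error loss
        best_sse, best_step = np.inf, None
        for j in range(X.shape[1]):       # weak learner evaluation loop
            fitted = tricube_smoother(X[:, j], resid, bandwidth)(X[:, j])
            sse = np.sum((resid - lam * fitted) ** 2)
            if sse < best_sse:
                best_sse, best_step = sse, lam * fitted
        pred = pred + best_step           # add best weak learner to model
    return pred

def iteration_threshold(cv_mse, cv_se):
    """Conservative risk criterion: the first iteration whose cross-validated
    prediction MSE falls below the minimum MSE plus its standard error."""
    cv_mse = np.asarray(cv_mse)
    i_min = int(np.argmin(cv_mse))
    cutoff = cv_mse[i_min] + cv_se[i_min]
    return int(np.argmax(cv_mse < cutoff))   # first index under the cutoff
```

In the full procedure, `n_iter` would be fixed by running the boosting loop inside 10-fold cross-validation, applying `iteration_threshold` to the resulting per-iteration error curves, and then refitting on all of the data.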
The risk criterion determines the iteration threshold as the first iteration for which the prediction mean squared error (MSE) is less than the sum of the minimum prediction MSE (over all iterations) and the standard error of the prediction MSE at that minimum50, which has been shown to be conservative51. Please see Supplementary Fig. S1 for example performance curves. A more commonly used, but unconservative, risk criterion is to simply establish the iteration threshold at the minimum prediction MSE (without adding one standard error), but this seems overly optimistic, particularly when the sample size is small (relative to the number of descriptor candidates) or the prediction MSE curve lacks a distinct minimum. After the iteration threshold is determined, a final GBM-Locfit model is fit using 100% of the data, but limiting the number of iterations to the previously established iteration threshold, to avoid over-fitting the model to the data. By limiting the number of GBM iterations, we inherently limit the number of weak learners, since each iteration adds one weak learner term to the predictor, as in equation (4).

Results

We demonstrate GBM-Locfit by learning the Voigt-Reuss-Hill (VRH) averages52 of the elastic bulk and shear moduli (K and G, respectively) which characterize polycrystalline compounds. More specifically, we learn log(K) and log(G) to avoid having the squared error loss function severely overweight the higher moduli materials. We present our predictions graphically for K and G in Fig. 2, by comparing the VRH moduli from our DFT training set12 with those learned by our GBM-Locfit method. Our best four-descriptor models for log(K) and log(G) are summarized in Table 2.
None of our models with more than four descriptors have significantly better predictive accuracy than these four-descriptor models, based

Symbol   Description
Gn       group number in periodic table
M        atomic mass
R        atomic radius (empirical)
Rn       row number in periodic table
Tb       boiling temperature
Tm       melting temperature
X        electronegativity
Z        atomic number
Ec       cohesive energy per atom
Ef       formation energy per atom
Eg       band gap
Eh       energy above hull per atom
ρ        density
log(V)   log of volume per atom
Vc       Voronoi-based site coordination
Vl       Voronoi-based site bond lengths
Vθ       Voronoi-based site bond angles

Table 1. Overview of descriptor candidates. Descriptor candidates for both moduli include composition descriptors constructed as Hölder means and geometric and arithmetic standard deviations of eight elemental properties (upper) and structural descriptors from DFT and subsequent post-processing (lower).
Figure 1. GBM-Locfit implementation consists of two distinct steps. First, the iteration threshold is determined per the risk criterion, by running the GBM loop within a 10-fold cross-validation loop, in order to estimate the prediction mean squared error and associated standard error for each iteration. Second, the final GBM-Locfit model is fit with 100% of the data, while limiting the number of GBM iterations to the iteration threshold. This approach utilizes all of the available data for training, gives equal consideration to each compound, and avoids over-fitting the model to the data.

Figure 2. Predictions. Comparison of DFT training data with GBM-Locfit predictions for K (a) and G (b). Training set consists of 65 unary, 1091 binary, 776 ternary, and 8 quaternary compounds.

Model  Rank  Descriptor  Underlying property  RI (%)
K      1     log(V)      volume per atom      46.6
K      2     μ1(Rn)      row number           24.5
K      3     Ec          cohesive energy      19.4
K      4     μ−4(X)      electronegativity     9.5
G      1     Ec          cohesive energy      37.0
G      2     log(V)      volume per atom      35.9
G      3     μ−3(Rn)     row number           13.8
G      4     μ4(X)       electronegativity    13.3

Table 2. GBM-Locfit model summaries. Descriptor rank and relative influence (RI) for our best four-descriptor models for K and G. Composition descriptors are constructed as Hölder means to the power p, μp(x), of property x.