This study evaluates the gains from avoiding data-dependent specification search on an estimation sample, in an application to discrete choice models. We use data splitting, in which the total available sample is randomly split into two or more sub-samples: the first (specification) sub-sample is used for specification search, and the second (estimation) sub-sample is used to obtain clean estimates of the model chosen on the specification sub-sample according to a set criterion. We estimate 14 binary logit models of the adoption of conservation tillage, corresponding to the major sub-watersheds of the Upper Mississippi River Basin. For each sub-watershed model, we use the specification sub-sample to choose the explanatory variables that yield the highest number of correct predictions, provided that the estimated coefficients conform to economic theory. To evaluate the gains from avoiding specification search on the estimation sub-sample, we follow Gong (1986)[8] and calculate the expected excess error, a measure of excess optimism about model fit on the specification sample. We find that the excess optimism varies across sub-watersheds and tends to be larger for sub-watersheds with smaller samples.
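
The data-splitting procedure described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the logit is fitted by plain gradient ascent, the specification search is an exhaustive search over variable subsets, the sign-conformity check against economic theory is omitted, and the gap between the apparent and held-out correct-prediction rates stands in for the expected excess error of Gong (1986)[8]. All function names, the 50/50 split, and the synthetic data are assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def _index(beta, row):
    """Linear index: intercept plus slopes times regressors."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def fit_logit(X, y, steps=300, lr=0.5):
    """Fit a binary logit by gradient ascent on the mean log-likelihood.
    Returns [intercept, slope_1, ..., slope_k]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for row, yi in zip(X, y):
            z = max(-30.0, min(30.0, _index(beta, row)))  # avoid overflow
            resid = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, v in enumerate(row):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def hit_rate(beta, X, y):
    """Share of observations whose predicted class (p >= 0.5) equals y."""
    hits = sum((_index(beta, row) >= 0) == (yi == 1) for row, yi in zip(X, y))
    return hits / len(y)

def take(X, cols):
    """Keep only the listed columns of X."""
    return [[row[c] for c in cols] for row in X]

def split_and_evaluate(X, y, seed=0):
    """Random 50/50 split (an assumption), search on the first half only,
    then evaluate and re-estimate the chosen model on the second half."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    Xs, ys = [X[i] for i in idx[:half]], [y[i] for i in idx[:half]]
    Xe, ye = [X[i] for i in idx[half:]], [y[i] for i in idx[half:]]

    best = None  # (apparent hit rate, columns, coefficients)
    for size in range(1, len(X[0]) + 1):
        for cols in combinations(range(len(X[0])), size):
            beta = fit_logit(take(Xs, cols), ys)
            rate = hit_rate(beta, take(Xs, cols), ys)
            if best is None or rate > best[0]:
                best = (rate, cols, beta)

    apparent, cols, beta_spec = best
    holdout = hit_rate(beta_spec, take(Xe, cols), ye)
    return {
        "columns": cols,
        "apparent_hit_rate": apparent,
        "holdout_hit_rate": holdout,
        "excess_optimism": apparent - holdout,  # stand-in measure only
        "clean_coefficients": fit_logit(take(Xe, cols), ye),
    }

def make_synthetic_data(n=120, k=3, seed=1):
    """Hypothetical data: only the first regressor affects adoption."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-1.5 * r[0])) else 0 for r in X]
    return X, y
```

Calling `split_and_evaluate(*make_synthetic_data())` returns the chosen columns, both correct-prediction rates, their difference, and the coefficients re-estimated on the estimation sub-sample. Because the search never touches the estimation sub-sample, those coefficients are free of the selection effect that inflates the apparent fit.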

