Avoiding biases from data-dependent specification search: an application to a tillage choice model

This study evaluates the gains from avoiding data-dependent specification search on the estimation sample, in an application to discrete choice models. We use data splitting, in which the total available sample is randomly split into two or more sub-samples: the first (specification) sub-sample is used for specification search, and the second (estimation) sub-sample is used to obtain clean estimates of the model chosen on the specification sub-sample according to a set criterion. We estimate 14 binary logit models of the adoption of conservation tillage, one for each of the major sub-watersheds of the Upper Mississippi River Basin. For each sub-watershed model, we use the specification sub-sample to choose the explanatory variables that yield the highest number of correct predictions, provided that the estimated coefficients conform to economic theory. To evaluate the gains from avoiding specification search on the estimation sub-sample, we follow Gong (1986) and calculate the expected excess error, a measure of excess optimism about model fit on the specification sample. We find that the excess optimism varies across sub-watersheds and tends to be larger for the sub-watersheds with smaller samples.
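
The data-splitting procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names (`fit_logit`, `error_rate`, `take`, `split_sample_demo`), the simulated variables, and the gradient-ascent fitting routine are all assumptions made for the example. The search criterion here is only the misclassification rate; the paper's additional requirement that coefficient signs conform to economic theory is omitted. The "excess error" computed below is a simple hold-out analogue (error on the estimation sub-sample minus apparent error on the specification sub-sample), which is not necessarily the estimator of expected excess error used in Gong (1986) or in the paper.

```python
import math
import random
from itertools import combinations


def fit_logit(X, y, steps=150, lr=0.5):
    """Fit a binary logit by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(X[0]) + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            grad[0] += yi - p
            for j, v in enumerate(row):
                grad[j + 1] += (yi - p) * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


def error_rate(beta, X, y):
    """Share of observations whose predicted class (p > 0.5) is wrong."""
    wrong = 0
    for row, yi in zip(X, y):
        z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
        wrong += int((z > 0) != (yi == 1))
    return wrong / len(X)


def take(X, cols):
    """Keep only the candidate variables listed in cols."""
    return [[row[j] for j in cols] for row in X]


def split_sample_demo(seed=0, n=400, k=4):
    rng = random.Random(seed)
    # Synthetic data (assumption): only the first two of k variables matter.
    X = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
    y = [int(rng.random() < 1 / (1 + math.exp(-(0.8 * r[0] - 0.6 * r[1]))))
         for r in X]

    # Random split into specification and estimation sub-samples.
    idx = list(range(n))
    rng.shuffle(idx)
    spec, est = idx[: n // 2], idx[n // 2:]
    Xs, ys = [X[i] for i in spec], [y[i] for i in spec]
    Xe, ye = [X[i] for i in est], [y[i] for i in est]

    # Specification search uses the specification sub-sample only:
    # pick the variable subset with the fewest misclassifications.
    best = None
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            beta = fit_logit(take(Xs, cols), ys)
            err = error_rate(beta, take(Xs, cols), ys)
            if best is None or err < best[0]:
                best = (err, cols, beta)
    apparent, cols, beta_spec = best

    # Hold-out error of the chosen rule, and clean re-estimation of the
    # chosen specification on the untouched estimation sub-sample.
    holdout = error_rate(beta_spec, take(Xe, cols), ye)
    beta_clean = fit_logit(take(Xe, cols), ye)
    return {
        "chosen": cols,
        "apparent_error": apparent,
        "holdout_error": holdout,
        "excess_error": holdout - apparent,
        "clean_beta": beta_clean,
        "n_spec": len(spec),
        "n_est": len(est),
    }


if __name__ == "__main__":
    print(split_sample_demo())
```

A positive `excess_error` in this sketch would indicate that the fit reported on the specification sub-sample is optimistic relative to the fit on data not used in the search. Running the function with a smaller `n` is one way to explore how this gap behaves in small samples.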

Publication Type: Conference Paper/Presentation
Series Statement: Selected Paper

 Record created 2017-04-01, last modified 2020-10-28
