Sample Data

Here is a list of data repositories that can be used to test methods and to deepen your understanding of the statistical tools available to you. The repositories typically suggest methods applicable to each dataset and give a brief background on what the data represents and where it was obtained.

Free Data Repositories:

  • UCI Machine Learning Repository – UC Irvine’s Machine Learning Repository hosts hundreds of sample datasets, some of which link to papers that use them. The datasets are geared towards machine learning algorithms, so many of them are classification problems.
  • UCI KDD – UCI’s Knowledge Discovery in Databases Archive. Good datasets for data mining, separated by task and application.
  • DataMarket – A repository of government datasets for several countries. The data here tends to be time series.
  • StatLib – List of dataset repositories maintained by Carnegie Mellon.
  • Professor J. Burkardt of Florida State University hosts a plethora of academically inclined datasets for many statistical methods.
  • Wharton Research Data Services (WRDS) – University of Pennsylvania’s Wharton School of Business hosts this data repository. It contains feeds from data vendors such as Compustat, CRSP, and many others. These datasets are business- or economics-oriented. If you are a graduate student at San Diego State, you can request an account from WRDS. Some highlights include:
    • CRSP – Historical price and performance metrics for stocks, bonds, and more.
    • Compustat – Database of financial and market information on tracked companies, maintained by S&P Capital IQ.
    • Bank Regulatory – Provides accounting data for banks, bank holding companies, and other entities in the finance field.
    • Federal Reserve Bank – Provides historical information on exchange rates and interest rates.
  • Freddie Mac – Freddie Mac, the common name for the Federal Home Loan Mortgage Corporation, publishes some datasets on home mortgages and its house price index.
  • Bureau of Labor Statistics – Publishes data on employment rates, indices, inflation, and other items they track. This is a good source for complementary data.
  • R datasets Package – R ships with a number of datasets ready for use, and packages that implement specialized types of statistical analysis often include their own example datasets. The datasets package, included in the vanilla installation of R, contains several sets useful for demonstrating and learning. Individual datasets can be loaded with the data() command. For example, data(cars) loads the cars dataset into the workspace as a data frame.
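
As a minimal sketch of the data() workflow described in the last item, the following R session lists what the datasets package provides and then loads and inspects cars. It assumes only a standard R installation; the variable name available is illustrative and not part of any package.

```r
# List the names of the datasets bundled with the base 'datasets' package.
# data(package = ...) returns an object whose $results matrix has an "Item" column.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load the cars dataset into the current workspace as a data frame.
data(cars)

# Inspect the structure, the first few rows, and summary statistics.
str(cars)
head(cars)
summary(cars)
```

The same pattern should work for datasets bundled with other installed packages by naming the package explicitly, as in data(name, package = "pkg"), where name and pkg are placeholders for the dataset and package you want.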