Datasets for Fun and Profit

May 29, 2019

Typically, a Data Scientist would be interested in a customer’s data. However, such data is not always available, and it is often insufficient. This happens, for example, when studying new techniques or application domains (maybe without an actual customer), when trying to improve existing models with new features, or even when starting a project for a customer who has not yet managed to extract his or her own data. In such circumstances, public (both free and commercial) datasets might be convenient. Here’s a compilation of a few I have recently found to be potentially useful for my present projects. Some I discovered by myself, and some I got from other people’s compilations. Furthermore, note that by dataset I mean data that is programmatically accessible (e.g., CSV files, relational databases, APIs), not, say, a collection of arbitrarily formatted Excel spreadsheets, let alone PDFs.

(December 31st, 2019 note: I am updating this list as I find or remember more datasets, particularly some relatively obscure ones.)

General Datasets

UC Irvine Machine Learning Repository: A traditional repository of data for Machine Learning researchers and practitioners. Provides all sorts of data, from biomedical to financial. Interestingly, also classifies datasets by the type of data (e.g., categorical, numeric, time series) and learning task (e.g., classification, regression).

Kaggle: Well-known for their role in competitions, Kaggle datasets can be applied to other purposes. Large companies often provide internal (anonymized) data for competitions, which can be quite valuable to outsiders. For example, the famous Netflix prize data is here.

KDD Cup: Data from the annual KDD Cup. “KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining”. In particular, there is useful data for churn analysis and other CRM problems.

Google Dataset Search: A dataset search engine.

Awesome Public Datasets: A growing and cooperative list of many datasets, indexed by topic.

DataHub: “thousands of datasets for free and a Premium Data Service for additional or customised data with guaranteed updates.”

Time Series Data Library: An R package with many classic time series datasets. I believe this used to be a collection of CSV files, which unfortunately has been converted into an R package. However, to use it with other tools (e.g., a Python stack), we can still access the data directly by downloading the relevant files from its GitHub repository.

UCR Time Series Classification Archive: Another compilation of time series data. This one has a quirk too: we need a password to open the file containing the data! Let me help you: it is simply the word “someone”. The authors were annoyed that they did not get credit for their work, so they added this hurdle to force users to read the documentation (and find the password). Remember, readers: cite them if you use their data. By the way, I am seeing a disturbing pattern in these time series people.

Czech banking behavior (or my alternative version): Real anonymized Czech bank transactions, account info, and loan records released for PKDD’99 Discovery Challenge. Also known as the “Berka dataset”. This is a very rare type of dataset, since banks typically keep these data as closely guarded secrets. Furthermore, I provide some basic analyses and visualizations with respect to transactions here. If you’ve never seen this kind of behavior, it is instructive.

Economic and Financial Data

Quandl: Many kinds of alternative and core economic data. Has both free and commercial offers. It seems to be the most accessible among its competitors (see below) because it charges different prices for personal and commercial use.

Global Financial Data: Alternative and core financial data. Commercial.

Quantopian Data: Quantopian is a quantitative finance technology company that aims to make the approach accessible to any programmer. To this end, they provide a number of different data sources.

FactSet: Another source of alternative and core economic data.

B3 (Brazilian stock exchange): Brazilian stock market historical data, containing very detailed transaction information (e.g., each individual trade). The history goes back only a few months (~8) from the present.

The Economist Data: What to expect from one of the best (if not the best) newspapers in the world? Data, of course. In particular, includes the famous Big Mac Index data. So now you can easily check whether you should buy or sell your local currency! By the way, did you know that they have an audio version of their content? It is great for not feeling like one is wasting time at the gym. But I digress: you can’t really use it to do audio data processing.

World Bank Open Data: Has a focus on global development data.
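
The Big Mac Index mentioned under The Economist Data compares burger prices across countries to judge whether a currency looks over- or undervalued against the dollar. Here is a minimal sketch of that arithmetic, assuming the usual raw-index definition (implied rate = local price / US price, compared against the market exchange rate). The function name and all prices below are hypothetical illustrations, not values taken from The Economist's files.

```python
def big_mac_valuation(local_price, us_price, exchange_rate):
    """Return (implied_ppp_rate, valuation) against the US dollar.

    local_price   -- Big Mac price in local currency
    us_price      -- Big Mac price in US dollars
    exchange_rate -- market rate, in local currency units per dollar
    """
    # Exchange rate at which the burger would cost the same in both countries.
    implied_ppp = local_price / us_price
    # Negative means the local currency looks undervalued, positive overvalued.
    valuation = (implied_ppp - exchange_rate) / exchange_rate
    return implied_ppp, valuation


# Hypothetical numbers: a burger costs 20.0 locally and 5.0 in the US,
# while the market rate is 5.0 local units per dollar.
implied, valuation = big_mac_valuation(20.0, 5.0, 5.0)
print(f"implied rate: {implied:.2f}, valuation: {valuation:+.0%}")
```

With these made-up inputs the implied rate is below the market rate, so the local currency would look undervalued. The real dataset can be fed through the same function once its column names are checked.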

Consumer Data

Consumer data is most often used for market segmentation. Nevertheless, it can be used to study all kinds of consumer behaviors, and these data providers strive to develop industry-specific offers.

Acxiom: Anonymized consumer data from multiple sources, mainly for marketing purposes. For example, at the time of writing they have special data packages for “Valentines Day” and “St. Patrick’s Day”. Commercial.

Neoway: Consumer data for the Brazilian market. Besides data itself, has related services such as the “next best customer” search. Commercial.

Government Data

Many governments are adopting open data practices.

U.S. Government’s open data

European Data Portal

UK Government’s open data

Brazil Government’s open data

Geographic Imagery

These are more sophisticated than one might imagine at first. For instance, besides visible light, many satellites (as well as planes and drones) also capture non-visible parts of the spectrum, which allows the calculation of important indicators (e.g., NDWI, the Normalized Difference Water Index). Agriculture is one noteworthy application.
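
To make the NDWI example concrete, here is a minimal sketch in plain Python. It assumes the McFeeters formulation, (green - NIR) / (green + NIR), which highlights open water; another common variant (Gao's) uses near-infrared and short-wave infrared bands instead. The function names and the tiny 2x2 reflectance grids are hypothetical, and real imagery would normally be read with a raster library.

```python
def ndwi(green, nir):
    """NDWI for one pixel, McFeeters variant: (green - nir) / (green + nir)."""
    total = green + nir
    if total == 0:
        return None  # no signal in either band, index is undefined
    return (green - nir) / total


def ndwi_grid(green_band, nir_band):
    """Apply ndwi pixel by pixel to two equally shaped 2D lists."""
    return [
        [ndwi(g, n) for g, n in zip(green_row, nir_row)]
        for green_row, nir_row in zip(green_band, nir_band)
    ]


# Hypothetical reflectance values for a 2x2 image.
green_band = [[0.30, 0.10], [0.00, 0.20]]
nir_band = [[0.10, 0.30], [0.00, 0.20]]

for row in ndwi_grid(green_band, nir_band):
    print(row)
```

Values fall between -1 and 1. Under this variant, positive values tend to indicate water and negative values vegetation or bare soil.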

DigitalGlobe: High-resolution satellite imagery. Commercial.

EOS Landviewer: Provides 10 free satellite images. Commercial options include subscriptions and the purchase of specific images, particularly high-resolution ones.

AirScout: High-resolution geographic imagery captured using airplanes. I know they can get 5cm/pixel or better resolution. This is typically used for applications that require detailed images of specific areas, since it would be very expensive to cover large sections of Earth’s surface. Commercial.

Agrodata: A small Brazilian broker specialized in finding and selling customized imagery solutions, including some of the above. A convenient one-stop shop, particularly for agricultural needs. Commercial.

Commerce and Retail

Yelp Open Datasets: A selection of Yelp’s businesses, customer reviews and more.

Other

Microsoft Research Geolife: Detailed GPS-tracked movement of 182 individuals over 3 years.