Free datasets

Looking for public datasets? Below are some datasets we provide open access to. You are free to use them for whatever you like. Some can also be integrated directly into Sajari. For example, first name and last name combinations can be detected automatically in unstructured free text if samples such as those below are loaded into your Sajari engine.
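To give a feel for the name detection idea outside of Sajari, here is a minimal Python sketch. It is not how the Sajari engine works internally. It assumes the first and last name lists have already been loaded into lowercase sets, and the function name `find_name_pairs` and the tiny sample sets are hypothetical stand-ins for the full CSV files.

```python
import re

def find_name_pairs(text, first_names, last_names):
    """Return (first, last) pairs where a known first name is
    immediately followed by a known last name in the text."""
    # Split into word-like tokens, keeping apostrophes and hyphens in names.
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    pairs = []
    for first, last in zip(tokens, tokens[1:]):
        if first.lower() in first_names and last.lower() in last_names:
            pairs.append((first, last))
    return pairs

# Hypothetical samples; in practice these would come from the CSV files.
first_names = {"jane", "john"}
last_names = {"smith", "doe"}

sample = "Please forward the resume to Jane Smith before John arrives."
print(find_name_pairs(sample, first_names, last_names))
```

A lone "John" is ignored here because no known last name follows it, which is the point of matching on combinations rather than single words.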

We've also added a list of great public datasets below, which we update constantly. You can also check out our blog, where we regularly publish data shakedowns and analysis.

Sajari datasets for download

First names CSV - Over 5,000 common first names

Last names CSV - Over 80,000 common last names

World universities - SQL dump of over 16,000 universities, including names, locations, websites, etc.

Industry list - Curated list of industries in JSON format

Action words - Words indicating that some form of "action" has occurred. Useful for text analysis.

Information gain for terms in job advertisements - Using ~600,000 job advertisements, the popularity and information gain of (unordered) trigrams were calculated along with an IDF popularity score. More on Information Gain

Positive and negative user interactions sample - Approx. 100,000 learning points from job-to-resume matching using Sajari. Negative ratios (again for unordered trigrams) generally indicate that the IDF score for a term is overvalued in the match score calculation.
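For readers who want to see what IDF and information gain look like for unordered trigrams, here is a rough Python sketch. It is not the pipeline used to build the datasets above. It assumes each document carries a binary label (for example a positive or negative interaction), treats an unordered trigram as the sorted tuple of three consecutive tokens, and uses a hypothetical function name, `score_trigrams`, with a toy four-document corpus.

```python
import math
from collections import Counter

def unordered_trigrams(tokens):
    """Set of trigrams where word order inside the trigram is ignored."""
    return {tuple(sorted(tokens[i:i + 3])) for i in range(len(tokens) - 2)}

def entropy(pos, neg):
    """Binary entropy (in bits) of a positive/negative count pair."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def score_trigrams(docs, labels):
    """Return {trigram: {"idf": ..., "info_gain": ...}} for a labelled corpus."""
    n = len(docs)
    grams_per_doc = [unordered_trigrams(d.lower().split()) for d in docs]
    doc_freq = Counter(g for grams in grams_per_doc for g in grams)
    pos_total = sum(labels)
    neg_total = n - pos_total
    base = entropy(pos_total, neg_total)  # label entropy before any split
    scores = {}
    for gram, df in doc_freq.items():
        pos_with = sum(1 for grams, y in zip(grams_per_doc, labels)
                       if y and gram in grams)
        neg_with = df - pos_with
        # Expected label entropy after splitting on "document contains gram".
        cond = (df / n) * entropy(pos_with, neg_with) \
            + ((n - df) / n) * entropy(pos_total - pos_with, neg_total - neg_with)
        scores[gram] = {"idf": math.log(n / df), "info_gain": base - cond}
    return scores

# Toy corpus (assumed data): 1 = positive interaction, 0 = negative.
docs = ["senior python developer", "python senior developer",
        "junior sales assistant", "sales assistant junior"]
labels = [1, 1, 0, 0]
scores = score_trigrams(docs, labels)
print(scores[("developer", "python", "senior")])
```

In this toy corpus the trigram separates the two labels perfectly, so its information gain equals the full label entropy, while its IDF depends only on how many documents contain it.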

Public datasets

Stop words - multilingual collection of stopwords. Very useful for search engine design! :)

Data.Gov - The US Government is slowly opening up many datasets to the public

Google Word2Vec - Words and phrases converted into multidimensional vectors that retain some meaning.

KDnuggets - About 50 great data resources for download here. Everything from Enron emails to NASDAQ data.

Reddit Datasets - Requests and discussion around various datasets

AWS Public data - Human genome, Google N-grams, etc.

Hacker News - Dataset discussion thread and links

Open Science Data Cloud - Some overlap with AWS public data

Quandl - A search engine spanning many datasets, particularly financial and economic

UC Irvine Machine Learning - repository of many datasets for machine learning

Data NSW - NSW Government datasets (they also take requests)

World Bank - Economics, growth, health, pollution and more

Freebase - Community curated database of places and things

Million Song Dataset - a freely available collection of metadata for 1 million contemporary popular music tracks

Apple iTunes data - Apps, music, etc. from the Apple store.

NFL play by play - Play by play data for 2002-2013 seasons for download (don't ask!).

Common Crawl - public crawl data covering almost 5 billion web pages

NASA Earth Exchange - earth science data including land temperature, atmospheric information and climate change data

Australian Government Data - more than 3,600 datasets from more than 140 Australian government organisations.

US EPA - Environmental datasets covering pollution, etc.

SMS Spam - Sample SMS spam data

Facial recognition - Data for facial recognition technology

Stanford teaching - Examples for statistical learning

Princeton's WordNet - Lexical database for the English language

Stanford Large Network - Great for network analysis like social networks, etc

Google's web corpus annotated with Freebase concepts - Useful for entity annotation ("entitisation")

Google's public data search - Search for public data sets here

Google's public transit data feeds - e.g. buses and trains

Enquire for more details on our upcoming public data sets.
