Looking for public data sets? Below are some datasets we provide open access to. You are free to use them for whatever you like. Some can also be integrated directly into Sajari: for example, first name and last name combinations can be detected automatically in unstructured free text if samples such as those below are loaded into your Sajari engine.
We've also added a list of great public data sets below, which we update constantly. You can also check out our blog, where we regularly do data shakedowns and analysis.
First names CSV - Over 5,000 common first names
Last names CSV - Over 80,000 common last names
World universities - SQL dump of over 16,000 university names, location, web site, etc.
Industry list - Curated list of industries in JSON format
Action words - Words indicating some form of "action" has occurred. Useful for text analysis
Information gain for terms in job advertisements - Using ~600,000 job advertisements, the popularity and information gain of (unordered) trigrams were calculated, along with an IDF popularity score. More on Information Gain
Positive and negative user interactions sample - Approx. 100,000 learning points from job-to-resume matching using Sajari. Negative ratios (again for unordered trigrams) generally indicate that the IDF score for a term is overvalued in the match score calculation.
Stop words - multilingual collection of stopwords. Very useful for search engine design! :)
Data.Gov - The US Government is slowly opening up many datasets to the public
Google Word2Vec - Words and phrases converted into multidimensional vectors that retain some meaning.
KDnuggets - About 50 great data resources available for download. Everything from Enron emails to NASDAQ data.
Reddit Datasets - Requests and discussion around various datasets
AWS Public data - Human genome, Google N-grams, etc.
Hacker News - Dataset discussion thread and links
Open Science Data Cloud - Some overlap with AWS public data
Quandl - A search engine spanning many datasets, particularly financial and economic
UC Irvine Machine Learning - repository of many datasets for machine learning
Data NSW - NSW Government data sets (they also take requests)
World Bank - Economics, growth, health, pollution and more
Freebase - Community curated database of places and things
Million Song Dataset - a freely available collection of metadata for 1 million contemporary popular music tracks
Apple iTunes data - Apps, music, etc. from the Apple store.
NFL play by play - Play by play data for 2002-2013 seasons for download (don't ask!).
Common Crawl - public data of almost 5 billion web pages
NASA Earth Exchange - earth science data including land temperature, atmospheric information and climate change data
Australian Government Data - more than 3,600 datasets from more than 140 Australian government organisations.
USA EPA - Environmental data sets covering pollution, etc.
SMS Spam - Sample SMS spam data
Facial recognition - Data for facial recognition technology
Stanford teaching - Examples for statistical learning
Princeton's WordNet - Lexical dataset for the English language
Stanford Large Network - Great for network analysis of social networks, etc.
Google's web corpus annotated with Freebase concepts - Entitisation
Google's public data search - Search for public data sets here
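
As a rough illustration of how the first name and last name lists above could be used to spot name combinations in free text, here is a minimal Python sketch. It is not how Sajari does it internally. The function names `load_names` and `find_name_pairs` are made up for this example, and the loader assumes each CSV row has the name in its first column, which may not match the actual files.

```python
import csv
import re


def load_names(path):
    """Load a set of lowercased names from a CSV file.

    Assumption: the name is in the first column of each row.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {
            row[0].strip().lower()
            for row in csv.reader(f)
            if row and row[0].strip()
        }


def find_name_pairs(text, first_names, last_names):
    """Return (first, last) tuples for adjacent words found in the two sets."""
    # Split the text into simple word tokens (letters, apostrophes, hyphens).
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    pairs = []
    # Walk over each pair of neighbouring tokens.
    for first, last in zip(tokens, tokens[1:]):
        if first.lower() in first_names and last.lower() in last_names:
            pairs.append((first, last))
    return pairs


# Tiny in-memory samples stand in for the CSV files here.
first_names = {"jane", "john"}
last_names = {"smith", "nguyen"}
text = "Please forward the resume from Jane Nguyen to john smith today."
print(find_name_pairs(text, first_names, last_names))
```

With the real files you would replace the two sample sets with calls to `load_names`, pointing at wherever you saved the downloads. A plain adjacent-word lookup like this will produce false positives for words that are both common names and ordinary vocabulary, so treat it as a starting point only.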