Google data set search - Search engine specifically for public data sets
Newsroom summaries - 1.3 million articles and author written summaries
Microsoft open data - Microsoft research open data sets
Stop words - multilingual collection of stopwords. Very useful for search engine design! :)
Data.Gov - The US Government is slowly opening up many datasets to the public
Google Word2Vec - Words and phrases converted into multidimensional vectors that retain some meaning.
KDnuggets - About 50 great data resources for download here. Everything from Enron emails to NASDAQ data.
Reddit Datasets - Requests and discussion around various datasets
AWS Public data - Human genome, Google N-grams, etc.
Hacker News - Dataset discussion thread and links
Open Science Data cloud - Some overlap with AWS public data
Quandl - A search engine spanning many datasets, particularly financial and economic
UC Irvine Machine Learning - repository of many datasets for machine learning
Data NSW - NSW Government data sets (also takes requests)
World Bank - Economics, growth, health, pollution and more
Freebase - Community curated database of places and things
Million Song Dataset - a freely available collection of metadata for 1 million contemporary popular music tracks
Apple iTunes data - Apps, music, etc. from the Apple store.
NFL play by play - Play by play data for 2002-2013 seasons for download (don't ask!).
Common Crawl - public data of almost 5 billion web pages
NASA Earth Exchange - earth science data including land temperature, atmospheric information and climate change data
Australian Government Data - more than 3,600 datasets from more than 140 Australian government organisations.
USA EPA - Environmental data sets such as pollution, etc.
SMS Spam - Sample SMS spam data
Facial recognition - Data for facial recognition technology
Stanford teaching - Examples for statistical learning
Princeton's WordNet - Lexical dataset for the English language
Stanford Large Network - Great for network analysis like social networks, etc.
Google's web corpus annotated with Freebase concepts - Entitisation
Google's public data search - Search for public data sets here
First names CSV - Over 5,000 common first names
Last names CSV - Over 80,000 common last names
World universities - SQL dump of over 16,000 university names, location, web site, etc.
Industry list - Curated list of industries in JSON format
Action words - Words indicating some form of "action" has occurred. Useful for text analysis
Information gain for terms in job advertisements - Using ~600,000 job advertisements, the popularity and information gain of (unordered) trigrams were calculated along with an IDF popularity score. More on Information Gain
Positive and negative user interactions sample - Approx 100,000 learning points from job to resume matching using Sajari. Negative ratios (again for unordered trigrams) generally indicate the IDF score for a term is overvalued in match score calculation.
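
The last two entries refer to IDF scores and information gain for unordered trigrams. A minimal Python sketch of how such scores could be computed is shown here. It is not the method used to build those datasets: the class labels, the smoothed IDF formula, the whitespace tokenisation and the toy job ads are all assumptions made for illustration.

```python
import math
from collections import Counter


def unordered_trigrams(text):
    """Return the set of unordered trigrams (sorted 3-word tuples) in a text."""
    words = text.lower().split()  # assumption: naive whitespace tokenisation
    return {tuple(sorted(words[i:i + 3])) for i in range(len(words) - 2)}


def entropy(counts):
    """Shannon entropy in bits of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def idf(term, docs):
    """Smoothed inverse document frequency, log(N / (1 + df)).

    This is one common variant, chosen here as an assumption.
    """
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))


def information_gain(term, docs, labels):
    """Reduction in label entropy from knowing whether `term` is present."""
    base = entropy(list(Counter(labels).values()))
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without_term = [l for d, l in zip(docs, labels) if term not in d]
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:
            weight = len(subset) / len(labels)
            conditional += weight * entropy(list(Counter(subset).values()))
    return base - conditional


# Hypothetical toy corpus: four job ads with made-up category labels.
ads = [
    "senior python developer wanted",
    "python developer senior role",
    "registered nurse night shift",
    "night shift registered nurse",
]
labels = ["tech", "tech", "health", "health"]
docs = [unordered_trigrams(ad) for ad in ads]

term = ("developer", "python", "senior")
print(term, idf(term, docs), information_gain(term, docs, labels))
```

Because the trigrams are sorted before counting, "senior python developer" and "python developer senior" map to the same term, which is what "unordered" is taken to mean in this sketch.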