Sajari uses many different types of feature extraction to create structure from unstructured text. Some of these techniques are outlined below.
Pattern matching looks for specific patterns in unstructured text. Email addresses, phone numbers and dates are examples that follow very specific expression patterns. Sajari can very efficiently extract a variety of patterns from unstructured data.
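As a minimal sketch of this kind of pattern matching, the example below uses regular expressions to pull email addresses, phone numbers and dates out of free text. The patterns and the extract_patterns helper are simplified assumptions for illustration (international phone numbers with a leading "+", ISO-style dates only), not the rules Sajari actually uses.

```python
import re

# Simplified, illustrative patterns (assumptions, not Sajari's actual rules).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # International numbers only: "+", country code, then 2-4 digit groups.
    "phone": re.compile(r"\+\d{1,3}(?:[\s-]\d{1,4}){2,4}"),
    # ISO-style dates only, e.g. 2015-03-14.
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_patterns(text):
    """Return a dict mapping each pattern name to the list of matches found."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}


if __name__ == "__main__":
    sample = "Contact jane@example.com or +61 2 5550 1234 before 2015-03-14."
    print(extract_patterns(sample))
```

Real-world data needs broader patterns (local phone formats, written-out dates and so on), but the overall shape stays the same: a named set of expressions applied to each document.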
Phrase matching looks for specific pre-determined phrases in unstructured text. The list of phrases can be added to Sajari in CSV format, so any custom taxonomy can be used. Taxonomies can also be very large: Sajari routinely uses taxonomies on the order of 1 million phrases.
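One way to sketch taxonomy-driven phrase matching is shown below. It assumes a hypothetical two-column CSV layout (phrase, category) and stores each phrase as a tuple of tokens in a dictionary, so a lookup costs the same however large the taxonomy grows. The function names and the CSV layout are assumptions for illustration, not Sajari's actual import format or implementation.

```python
import csv
import io
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def load_taxonomy(csv_text):
    """Read 'phrase,category' rows into a dict keyed by the phrase's tokens."""
    taxonomy = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        phrase, category = row[0].strip(), row[1].strip()
        taxonomy[tuple(tokenize(phrase))] = (phrase, category)
    return taxonomy


def match_phrases(text, taxonomy):
    """Scan left to right, preferring the longest taxonomy phrase at each position."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in taxonomy), default=0)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            hit = taxonomy.get(tuple(tokens[i:i + n]))
            if hit:
                found.append(hit)
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return found


if __name__ == "__main__":
    taxonomy = load_taxonomy("machine learning,skill\npython,skill\nsales manager,job title\n")
    print(match_phrases("Sales manager with Python and machine learning experience.", taxonomy))
```

Because every candidate is a hash lookup, the cost per document depends on the document length and the longest phrase, not on the number of phrases in the taxonomy.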
Phrase matching is incredibly useful for tasks such as automatic tagging of documents. The example above shows how "skills", "job titles", etc. can be extracted from resumes and jobs. This is useful for display purposes, and the extracted entities can also be used as a component in custom matching algorithms. In the example of resume-job matching, the cosine similarity of "skills" between a resume and a job description is very useful in predicting a match score.
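To make the cosine similarity idea concrete, here is a small sketch that treats each document's extracted skills as a bag of terms and compares the two count vectors. The function name and the sample skill lists are invented for illustration; a production system might weight terms (for example by rarity) rather than use raw counts.

```python
import math
from collections import Counter


def cosine_similarity(skills_a, skills_b):
    """Cosine of the angle between two skill-count vectors (0.0 to 1.0)."""
    va, vb = Counter(skills_a), Counter(skills_b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no skills extracted from one side
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    resume_skills = ["python", "sql", "sales"]
    job_skills = ["python", "sql", "go"]
    print(round(cosine_similarity(resume_skills, job_skills), 3))
```

With two of three skills shared on each side, the score is 2/3. A score like this can then be one input among several to an overall match score.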
Proxy phrase matches
Proxies are similar to phrase matches, except that the detected phrase is not added to the document. Instead, it acts as a "proxy" for a different entity, which is added in its place. An example is "the bay area", which proxies to "San Francisco, USA" with lat=37.7833 and lng=-122.4167. In this case, even though the text contains neither a specific location match nor a latitude-longitude combination, the location can still be derived.
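A proxy table can be sketched as a plain mapping from a trigger phrase to the entity that should be attached instead. The PROXIES dictionary, its field names and the apply_proxies helper below are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical proxy table: trigger phrase -> entity to attach to the document.
# Longitude is negative because San Francisco lies west of the prime meridian.
PROXIES = {
    "the bay area": {"location": "San Francisco, USA", "lat": 37.7833, "lng": -122.4167},
}


def apply_proxies(text, proxies=PROXIES):
    """Return the proxied entities whose trigger phrase appears in the text."""
    lowered = text.lower()
    entities = []
    for phrase, entity in proxies.items():
        # Whole-phrase match so that partial words do not trigger a proxy.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            entities.append(entity)
    return entities


if __name__ == "__main__":
    print(apply_proxies("Looking for engineers in the Bay Area."))
```

Note that the trigger phrase itself is never returned; only the entity it stands for is attached.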
Machine Learning classifiers
Classifiers are very useful for automatically categorizing input documents into groups. In the above example, "Sales" in the professional summary section is a Naive Bayes classification prediction (from approximately 30 classes) for this input document. Naive Bayes is not the only classifier we use, but it performs very well with unstructured text, so we use it frequently.
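For readers unfamiliar with Naive Bayes, the following is a minimal multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a teaching sketch under simple assumptions (whitespace-style tokenization, two invented classes, a handful of invented training sentences) and is not Sajari's implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def train(self, text, label):
        self.class_docs[label] += 1
        for tok in tokenize(text):
            self.word_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict(self, text):
        """Return the class with the highest log posterior, or None if untrained."""
        total_docs = sum(self.class_docs.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("managed key accounts and exceeded sales quota", "Sales")
    nb.train("closed deals with enterprise customers", "Sales")
    nb.train("built backend services in python", "Engineering")
    nb.train("designed distributed systems and apis", "Engineering")
    print(nb.predict("exceeded quota closing enterprise deals"))
```

Working in log space avoids numeric underflow on long documents, and the same structure extends to many classes simply by training with more labels.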
Classifiers are not only great for grouping documents; they are also incredibly useful for creating match algorithms. Both the prediction accuracy and the contribution of each classifier to the match score can be measured.