Under the Hood with Text Analytics

Opinions are all over the map about what the state of the art in text analytics really is: how entity extraction actually operates, and to what extent it is semantic, statistical, or a hybrid of machine learning and natural language technologies.

Today we will talk about that, but first let me return to a relevant point from a previous post. Companies have a competitive drive, and the idea is that analytics will allow you to do a better job than your competitors. I am talking not just about text analytics but also about data mining and BI products. So there is a drive to make all of this more effective and efficient.

A related area is that we do have mandates. I am talking about so-called e-discovery. This was the change to the Federal Rules of Civil Procedure, which regulate American court procedures, that mandates the discovery of electronically stored information; it came into effect in December of 2006. Other countries have roughly equivalent mandates that similarly compel the adoption of text analytics.

With that in mind, think about the basics of the different approaches. There is semantics versus statistics, and what do we mean by that? Well, rather than semantics, let me say linguistics. The idea is that whatever language you and I are speaking, even if it is something other than English, it has a certain grammar. It has parts of speech that are inferable from the grammar and also from what I referred to earlier as word morphology.

So when I say ‘I like’ versus ‘he likes,’ like and likes are two different forms. Morphology, the study of forms, recognizes that like and likes are two different forms of the same word. So we have the concept of stemming, which gets rid of plurals; it gets rid of noun declensions and verb conjugations to find the roots. Stemming is part of so-called lemmatization, forming a canonical version of a given term. So for instance, John and John Smith are the same person in the context of our conversation right now, so I could lemmatize those to John Smith.
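To make stemming versus lemmatization concrete, here is a minimal sketch using the NLTK library (my choice for illustration; the article names no particular toolkit). The stemmer just strips suffixes, while the lemmatizer uses a vocabulary and the part of speech:

```python
# A minimal sketch of stemming vs. lemmatization with NLTK
# (assumes `pip install nltk`).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for form in ["like", "likes", "liked", "studies"]:
    print(form,
          "-> stem:", stemmer.stem(form),
          "| lemma:", lemmatizer.lemmatize(form, pos="v"))
# Both reduce 'like' and 'likes' to the same root, 'like'.
# The difference shows on 'studies': the stemmer blindly strips the
# suffix ('studi'), while the lemmatizer returns the real word 'study'.
```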

So these are all little bits of linguistic analysis that go into text analytics. When I understand the subject, the verb, and the object of a clause in a sentence, then I can understand that the subject is an entity. The verb connects that entity to an object, and that lets us extract machine-processable meaning from the text.
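As an illustration of that clause-level analysis, here is a hedged sketch using spaCy's dependency parser (again my choice of library, not the article's); the example sentence is invented:

```python
# A sketch of subject-verb-object extraction with spaCy (assumes
# `pip install spacy` and `python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ford acquired the supplier.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c.text for c in token.children if c.dep_ == "nsubj"]
        objects = [c.text for c in token.children if c.dep_ == "dobj"]
        if subjects and objects:
            print(subjects, token.lemma_, objects)
# -> ['Ford'] acquire ['supplier']
# The nominal subject ('Ford') is our candidate entity, and the verb
# links it to the direct object ('supplier').
```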

So what are entities? We have the concept of named entities, which are things you might look up in a list: a list of proper names of individuals, companies, geographic locations, and so on. But we also have pattern-based entities. If you are a Java or Python programmer, you understand the concept of regular expressions; they are basically patterns. So for instance, a number that appears as three digits, then two digits, then four digits is recognizable as the pattern for a Social Security number in the United States. So we have pattern-based entity recognition.
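That Social Security number pattern translates directly into a regular expression. A minimal Python sketch (the sample text is made up):

```python
# Pattern-based entity recognition with a regular expression for the
# US Social Security number format, XXX-XX-XXXX.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

text = "Her file lists 123-45-6789 as the identifier."
for match in SSN_PATTERN.finditer(text):
    print("possible SSN:", match.group())
# -> possible SSN: 123-45-6789
# A production extractor would also validate ranges (e.g. the area
# number cannot be 000), but the pattern alone finds the candidates.
```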

Statistical approaches are also capable of applying some lexicons, maybe lists of basic names, and of doing some stemming, some very shallow types of linguistics. But from there you go on to look at the juxtaposition of words. The basic statistical approach to understanding a text is not to try to break down the parts of speech but rather to take the whole text as a so-called bag of words, and then see what that bag of words contains.
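A bag of words is easy to show in code: throw away word order and keep only token counts. A minimal sketch, with a crude lowercase tokenizer and an invented sentence:

```python
# A minimal bag-of-words sketch: word order is discarded and only the
# counts of (lightly normalized) tokens are kept.
from collections import Counter
import re

text = "Ford announced a new Ford truck; the truck ships next year."
tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenization
bag = Counter(tokens)
print(bag.most_common(3))
# -> [('ford', 2), ('truck', 2), ('announced', 1)]
```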

So for instance, let's talk about a real-world example that people may understand better: search engine optimization. You might look for the terms that occur most frequently in a document, and those are likely to indicate the theme of that document. When I say term, I don't mean just a single word or even a single stemmed word; I might mean a few words that occur together.
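Those multi-word terms are usually captured as word n-grams. A sketch of counting bigrams, building on the bag-of-words idea above (the text is again invented):

```python
# Counting multi-word terms: a "term" can be a phrase like
# "text analytics", not just one word, so we count word n-grams.
from collections import Counter
import re

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

text = ("Text analytics extracts meaning from text. "
        "Good text analytics blends statistics and linguistics.")
tokens = re.findall(r"[a-z]+", text.lower())
bigrams = Counter(" ".join(g) for g in ngrams(tokens, 2))
print(bigrams.most_common(2))
# -> [('text analytics', 2), ('analytics extracts', 1)]
```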

So we can use statistics or we can use linguistic patterns to try to understand text. Where machine learning comes into play is that the system learns from examples or experience to try to improve its accuracy. I don't really want to get into machine learning much more right now; I am no expert in it. But in any case, accuracy is a key point here, for one thing because when two individuals are speaking we often have misunderstandings.
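To give a flavor of learning from examples, here is a minimal sketch of a bag-of-words text classifier with scikit-learn; the library choice and the tiny labeled corpus are both my own invention for illustration:

```python
# Learning from labeled examples: a bag-of-words text classifier
# (assumes `pip install scikit-learn`).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great product, works well", "terrible, broke in a day",
               "love it, highly recommend", "awful experience, do not buy"]
train_labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)                 # learn from examples
print(model.predict(["works great, recommend it"]))  # -> ['positive']
```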

Even in human-to-human speech, we never have 100% accuracy. Accuracy often depends on both the intent of the speaker and the context of the conversation. So to give an example that I really like: what is ‘Ford’? If you are on Capitol Hill before some committee right now, Ford could actually be the name of a congressman, or it could be the name of an automobile manufacturer.

In a different context, it could be the name of a former President. In some contexts that automobile manufacturer is actually an aerospace company, and so on. So context is very important here, and these technologies try to understand the context by looking at the other contents of a given message or document, in order to disambiguate whether we are talking about Ford the President, Ford the car company, or Ford an actual car.
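One very simple way to picture that disambiguation: score each candidate sense of ‘Ford’ by how many context cues appear in the surrounding text. The cue lists and sentences below are invented, and real systems use far richer statistical features, but the idea is the same:

```python
# A toy sketch of context-based disambiguation: each sense of "Ford"
# gets a score from the context words that co-occur with it.
SENSE_CUES = {
    "Ford the car company": {"factory", "vehicle", "dealer", "motor", "recall"},
    "Ford the President":   {"president", "congress", "committee", "veto"},
}

def disambiguate(sentence):
    words = set(sentence.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    return max(scores, key=scores.get)  # highest-scoring sense wins

print(disambiguate("Ford issued a recall for the vehicle"))
# -> Ford the car company
print(disambiguate("President Ford addressed congress"))
# -> Ford the President
```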

