On Intelligence

06 Jul 2012

I want to make my computer intelligent.

Obviously this is kinda difficult, so I’m going to start by writing a program of limited intelligence.

Intelligence is very hard to define. The Wikipedia article is essentially a big list of different characteristics or attributes of intelligence. There isn’t a single, fundamental definition that encapsulates and describes everything that an “intelligent” system can do that sets it apart from one without intelligence (say, a computer). Is Google intelligent?

Intelligence as pattern-matching

Recently I found an interesting definition of intelligence by Ben Goertzel, designer of the OpenCog project:

He defines intelligence as the ability to detect patterns in the world and in the agent itself.

This would make intelligence simply pattern-matching. This fits with my idea of intelligence as extrapolation — an intelligent system spots patterns in the environment and can extrapolate from those patterns. It can learn rules and apply them to new situations.

Intelligence as prediction

The following is from a review of Jeff Hawkins’ book On Intelligence (which I really must read sometime). Hawkins discusses the differences between artificial and human intelligence, and the structure of the brain.

…predictions are made -without us being aware of it- about what will happen next. The incoming patterns are compared to and combined with the patterns provided by memory resulting in your perception of a situation. So, what you perceive is not only based on what your eyes, ears, etc tell you. In fact, these senses give you fuzzy and partial information. Only when combined with the activated patterns from your memory, you get a consistent perception.

So the brain is constantly predicting its environment, and this plays a role in perception. You don’t notice or actively perceive things that stay as expected. (Note that matching the predictions isn’t the same as not changing.)

According to his book, the brain (or more specifically the “neocortex”) has a hierarchical structure. If what is observed doesn’t match what is predicted, the information is passed to a higher level in the neocortex, “to check if the situation can be understood on a higher level”. The highest levels make predictions about more abstract things. The brain’s structure is said to mirror the nested structure of the real world.
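To make the "pass it up a level" idea concrete for myself, here is a toy sketch. It is not Hawkins’ model: the `Level` class, the `describe` functions and the door example are all made up for illustration. Each level describes an observation in its own vocabulary, and an observation only travels upwards when the level below failed to predict it.

```python
class Level:
    """One level of a toy prediction hierarchy (hypothetical, for illustration)."""

    def __init__(self, name, describe, predicted, parent=None):
        self.name = name
        self.describe = describe    # maps a raw observation to this level's vocabulary
        self.predicted = predicted  # descriptions this level currently expects
        self.parent = parent        # next level up, or None at the top

    def observe(self, observation):
        """Return the name of the lowest level whose prediction matches, else None."""
        if self.describe(observation) in self.predicted:
            return self.name
        if self.parent is not None:
            # Prediction failed here, so hand the observation to the level above.
            return self.parent.observe(observation)
        return None  # no level predicted this: the observation is novel


# The higher level only cares about the kind of object (the last word).
objects = Level("objects", lambda obs: obs.split()[-1], {"door"})
# The lower level expects one exact detail.
details = Level("details", lambda obs: obs, {"red door"}, parent=objects)

expected = details.observe("red door")     # matched at the lowest level
surprising = details.observe("blue door")  # wrong colour, but still a door
novel = details.observe("window")          # nothing predicted this at all
```

Here the red door is handled at the "details" level, the blue door is only understood one level up, and the window matches no prediction anywhere, which is the case that would draw attention.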

This might explain why computers, with their extreme precision, and approaches to AI that try to model their environment precisely are doomed to failure: we as humans do not see and hear precisely what is happening.

It also seems to explain the behaviour of children: they have not yet learnt to predict their environment accurately, so everything is novel and draws their attention.

He extends this further to say that behaviour, e.g. the motor system, is controlled by prediction.

Hawkins says that doing something is literally the start of how we do it. Remembering, predicting, perceiving and doing are all very intertwined.

This might give us a clue as to how to deal with intention.

Intelligence as compression

In Rationale for a Large Text Compression Benchmark, Matt Mahoney discusses models for compressing large amounts of text (1 GB of Wikipedia articles in this case) based on modelling language, and argues that “ideal text compression… would be equivalent to passing the Turing test for artificial intelligence”.

Language modelling can be used to predict which word is most likely to occur next. Prediction is equivalent to compression, as better predictions let us use less space to represent the word:

It is well known that text compression can be achieved by predicting the next symbol in the stream of text data based on the history seen up to the current symbol. The better the prediction the more skewed the conditional probability distribution of the next symbol and the shorter the codeword that needs to be assigned to represent this next symbol.
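A minimal sketch of that link, under my own assumptions: an ideal coder spends about -log2(p) bits on a symbol it predicted with probability p, so I can add up those costs without writing an actual compressor. The function name `code_length_bits`, the add-one smoothing and the sample sentence are my choices, not anything from Mahoney’s benchmark, and I assume the decoder already knows which characters can occur.

```python
import math
from collections import Counter, defaultdict


def code_length_bits(text, order):
    """Ideal total code length of text under an adaptive model that predicts
    each character from the previous `order` characters."""
    alphabet_size = len(set(text))  # assumption: the alphabet is known in advance
    counts = defaultdict(Counter)   # context -> counts of the symbols that followed it
    total_bits = 0.0
    for i, symbol in enumerate(text):
        context = text[max(0, i - order):i]
        seen = counts[context]
        # Add-one smoothing so unseen symbols still get a non-zero probability.
        p = (seen[symbol] + 1) / (sum(seen.values()) + alphabet_size)
        total_bits += -math.log2(p)  # better prediction -> larger p -> fewer bits
        seen[symbol] += 1            # learn from the symbol only after coding it
    return total_bits


text = "the cat sat on the mat. " * 20
uniform_bits = len(text) * math.log2(len(set(text)))  # no prediction at all
order0_bits = code_length_bits(text, order=0)         # character frequencies only
order2_bits = code_length_bits(text, order=2)         # two characters of context
```

On this (very repetitive) text the two-character model needs fewer bits than the frequency-only model, which in turn beats treating every character as equally likely. The model that predicts better is the one that compresses better.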


Mahoney also notes:

The optimal behavior of a rational agent is equivalent to compressing its observations. Essentially [Hutter] proved Occam’s Razor, the simplest answer is usually the correct answer.

It certainly seems obvious as a general principle that by finding a general rule — for example a model of language, or a rule for getting the plural of a word from its singular — a system can store just the rule, and possible exceptions, and discard the rest of the data. I assume we don’t remember the plural of every single word; only the general rule.
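As a rough illustration of "store just the rule, and possible exceptions", here is a deliberately simplified English pluraliser. The rule, the `EXCEPTIONS` table and the tiny vocabulary are my own and cover only a sliver of real English.

```python
# Irregular forms have to be memorised individually.
EXCEPTIONS = {"child": "children", "mouse": "mice", "sheep": "sheep"}


def pluralize(word):
    """Apply the stored exceptions first, then fall back to the general rule."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"


# The "raw data" the rule is meant to replace.
vocabulary = {
    "cat": "cats", "dog": "dogs", "box": "boxes", "bus": "buses",
    "church": "churches", "dish": "dishes", "child": "children",
    "mouse": "mice", "sheep": "sheep",
}

rule_reproduces_data = all(pluralize(s) == p for s, p in vocabulary.items())
unseen = pluralize("wug")  # a word that was never stored anywhere
```

The rule plus three exceptions reproduces all nine stored pairs, and it also gives an answer for a word it has never seen, which the raw table cannot do.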


Past observations can be compressed by finding a general rule to describe them all. This rule can be used to extrapolate to deal with new situations and therefore predict present observations, or even the future. The rules are constantly being refined to ensure the predictions match present observations more accurately.

Simpler rules are more likely to be correct due to Occam’s razor (often used by scientists when developing theories describing nature): “simpler explanations are, other things being equal, generally better than more complex ones”. However, more complex rules should be preferred if they give sufficiently more accurate results, while avoiding overfitting.
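One crude way to put numbers on that trade-off is to count how much you have to write down: the rule itself, plus every observation the rule gets wrong. This is only in the spirit of minimum description length. The cost model (a flat 32 bits per stored number), the function name and the three candidate rules are assumptions I made up for the example.

```python
def description_length(n_params, predict, data, bits_per_number=32):
    """Two-part cost: bits for the rule plus bits for every point it gets wrong."""
    misses = [(x, y) for x, y in data if predict(x) != y]
    rule_bits = n_params * bits_per_number
    exception_bits = len(misses) * 2 * bits_per_number  # store both x and y per miss
    return rule_bits + exception_bits


# Twenty observations that follow y = 3x + 1, except for one oddity.
data = [(x, 3 * x + 1) for x in range(20)]
data[7] = (7, 100)

# name -> (number of stored parameters, prediction function)
candidates = {
    "constant": (1, lambda x: 1),                     # simplest, but mostly wrong
    "linear": (2, lambda x: 3 * x + 1),               # slightly more complex
    "lookup table": (2 * len(data), dict(data).get),  # memorises every point
}

scores = {name: description_length(k, f, data) for name, (k, f) in candidates.items()}
best = min(scores, key=scores.get)
```

With these costs the linear rule wins: it is more complex than the constant but pays for itself many times over in accuracy, while the lookup table is perfectly accurate and no shorter than the data it memorises.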

Finally, a hierarchical structure is useful to separate low-level from higher-level, more abstract predictions. If lower-level components’ predictions don’t match observations, the information is passed up the structure to higher levels.

This isn’t really a definition, and it’s quite unsatisfactory as one. But it’s interesting!