
Here's a program that takes a piece of text as input and builds up syllable, word and phrase knowledge from it. It builds an internal representation of the text using pairings of letters, pairings of these pairings and so on, in such a way that it can (eventually) learn to properly segment and store the text at word, phrase, sentence and discourse levels.
It can then use this knowledge to 'comprehend' future texts, parsing the text into the internal language model constructed.
It also has an ability to handle variations of known structures at different levels, from character-level spelling errors through word-level slot/frame ideas to higher-level notions of variants of phrases within a discourse.
The most interesting aspect of the program, though, is that given conversation-based training data such as film scripts or IRC chat-logs, it can conduct a conversation based on this knowledge, generating responses by fitting the recent history of the chat to the best-matching conversation-script it knows, allowing for adaptations at character, word, phrase and higher levels.
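As a rough illustration of the pairing idea described above, here is a minimal sketch in C. It is not taken from the program's source: it assumes a simple "merge the most frequent adjacent pair" rule, and the names `pair_node`, `merge_once`, `learn`, `expand`, `MAX_SYMS` and `FIRST_NODE` are all made up for this example. It also stops as soon as no pair occurs twice, which is a simplification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SYMS   1024  /* hypothetical limit on input length and node count */
#define FIRST_NODE 256   /* symbol ids below this are raw characters */

/* A node is a pairing of two earlier symbols (characters or other nodes). */
struct pair_node { int left, right; };

struct pair_node nodes[MAX_SYMS];
int node_count = 0;

/* Count non-overlapping adjacent occurrences of the pair (a, b). */
static int count_pair(const int *seq, int n, int a, int b)
{
    int i, c = 0;
    for (i = 0; i + 1 < n; i++)
        if (seq[i] == a && seq[i + 1] == b) { c++; i++; }
    return c;
}

/* Replace the most frequent repeated pair with a new node.
   Returns the new sequence length, or n if no pair occurs twice. */
int merge_once(int *seq, int n)
{
    int best_a = -1, best_b = -1, best = 1, i, out = 0, id;

    for (i = 0; i + 1 < n; i++) {
        int c = count_pair(seq, n, seq[i], seq[i + 1]);
        if (c > best) { best = c; best_a = seq[i]; best_b = seq[i + 1]; }
    }
    if (best_a < 0)
        return n;

    assert(node_count < MAX_SYMS);
    id = FIRST_NODE + node_count;
    nodes[node_count].left = best_a;
    nodes[node_count].right = best_b;
    node_count++;

    /* Rewrite the sequence in place, left to right. */
    for (i = 0; i < n; i++) {
        if (i + 1 < n && seq[i] == best_a && seq[i + 1] == best_b) {
            seq[out++] = id;
            i++;
        } else {
            seq[out++] = seq[i];
        }
    }
    return out;
}

/* Load text into seq and keep merging until nothing repeats. */
int learn(const char *text, int *seq)
{
    int n = (int)strlen(text), i, m;
    assert(n <= MAX_SYMS);
    for (i = 0; i < n; i++)
        seq[i] = (unsigned char)text[i];
    while ((m = merge_once(seq, n)) < n)
        n = m;
    return n;
}

/* Recursively write the text a symbol stands for into buf, starting at pos.
   Returns the new position; the caller adds the terminating NUL. */
size_t expand(int sym, char *buf, size_t pos, size_t cap)
{
    if (sym < FIRST_NODE) {
        if (pos + 1 < cap)
            buf[pos++] = (char)sym;
        return pos;
    }
    pos = expand(nodes[sym - FIRST_NODE].left, buf, pos, cap);
    return expand(nodes[sym - FIRST_NODE].right, buf, pos, cap);
}
```

Calling `learn("abababab", seq)` first pairs `a` with `b`, then pairs that pairing with itself, leaving two symbols that each expand back to `abab`. The quadratic pair counting is only there to keep the sketch short.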
The current version of the program is 4.5.
This shows how the program initially extracts word and phrase knowledge from the text 'Alice in Wonderland'.
Click here to see the initial output of the program with no prior knowledge
Click here to see output after the program has parsed the text file once.
And after 2 times, 3 times, 4 times, 6 times, 8 times, 10 times.
Ultimately after about 20 runs, the program ends up having 'memorized' the entire text, having a single 'node' representing the full story. It is this native ability to store large amounts of discourse knowledge in a compressed and easily/quickly searchable form that makes this system potential chat-bot material.
The program can then use this knowledge of the text to interpret new input, as described Here.
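To give a feel for what interpreting new input against stored knowledge might involve, here is a minimal sketch in C of greedy longest-match segmentation. It is an assumption made for illustration, not the program's actual parser: the `known` table is a hypothetical stand-in for learned nodes, and `longest_match` and `segment` are invented names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for chunks the net has already learned. */
static const char *known[] = { "the", "rabbit", "hole", "rabbit hole", " " };
#define KNOWN_COUNT (sizeof known / sizeof known[0])

/* Length of the longest known chunk that s starts with, or 0 if none. */
size_t longest_match(const char *s)
{
    size_t best = 0, i;
    for (i = 0; i < KNOWN_COUNT; i++) {
        size_t len = strlen(known[i]);
        if (len > best && strncmp(s, known[i], len) == 0)
            best = len;
    }
    return best;
}

/* Split text into chunks and write them to out as "[chunk][chunk]...".
   Characters with no known chunk become one-character chunks.
   Returns the number of chunks written. */
int segment(const char *text, char *out, size_t cap)
{
    int chunks = 0;
    size_t pos = 0;

    assert(out != NULL && cap > 0);
    out[0] = '\0';
    while (*text) {
        size_t len = longest_match(text);
        int w;
        if (len == 0)
            len = 1;
        w = snprintf(out + pos, cap - pos, "[%.*s]", (int)len, text);
        if (w < 0 || (size_t)w >= cap - pos)
            break;  /* output buffer full: stop early */
        pos += (size_t)w;
        text += len;
        chunks++;
    }
    return chunks;
}
```

With the table above, `segment("the rabbit hole", out, sizeof out)` prefers the longer stored chunk `rabbit hole` over the shorter `rabbit`, so higher-level knowledge wins wherever it applies.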
Click Here to see an example conversation, given a 5 megabyte text file - built from various film/TV scripts available on the Web - as training.
A similar chat, but with tracing enabled, is shown Here.
The program is freely available as an x86 Linux binary, a Windows 95/98/NT executable or C source code (all are less than 100K to download). A Perl version is also available.
The algorithms to implement the virtual neural net, the heart of the meme machine, are very simple, using lots of weird and fiddly recursion to get the job done, so the whole code is less than 2500 lines of C.
Click Here to download the source code. It will compile on most UNIX systems that have the gcc compiler (it is known to build and run properly on Linux, HPUX, Sun and Irix boxes), as well as on Windows 95/98/NT using the DJGPP compiler. The file is in the standard zip format.
See the README file for how to compile and use it. The unpackaged source can be viewed here.
The best O/S to use with the program is Linux though, especially when working with very large networks. Click Here to download a pre-built, ready-to-go version for Linux PCs.
If you have Windows 95/98/NT and want a ready-to-use version of the program, click Here for information on downloading and usage.
The Perl version of Meme is available Here. Thanks to Phil Perry for this.
There are some ramblings on how the program works Here, but the main points are:
The next stage for the program is to experiment with large training-sets and net constructs. I'm feeding a 512 megabyte net as much IRC log-data and Usenet traffic as I can get hold of. At this level, a network with literally millions of nodes representing words, phrases, jokes, idioms, short stories (language memes in general) develops. The results will be interesting...
Development is heading towards improving the chatbot side of the software; for now it provides a virtual knowledge-net and a set of primitives to experiment with in that direction.
Brian Smith (brian@wintermute.demon.co.uk)