Graphipedia, Context and knowledge

Motivation – The Grand Plan?

Rather large caveat: the approach below simply extracts links and builds a graph database from each page’s title and the links within it. That gives you the neighbours of any one page, and nothing more.

However, when I started I was after what links and WHY. The WHY is by far the more challenging part, discussed further down.

Quite simply I was studying something completely different, gene expression programming, and realised I knew nothing about neural networks. So I backtracked and started on these, to find I didn’t have a deep grasp of stats and probability. I’d been doing Coursera courses and kept seeing similar key concepts bandied about: Markov, Monte Carlo, percolation, model thinking and many more. Things I’ve half a grasp of.

I’d seen Graphipedia on GitHub before and I like *concept mapping*. Neo4j stores the relationships AND the item – a perfect concept mapper!

I want a way to automate the tedious aspects of building contextual understanding: finding out what I don’t know and need to know in order to understand this new thing. You could say I wanted an automatic ignorance mapper!

Can I get a tool that will tell me what concepts I need in order to understand other concepts, and what other concepts knowing the first concept helps me to understand? (Yep, I really did this with that sentence.)

I burnt a lot of bandwidth and decided to get ALL of Wikipedia, from here:

Wikipedia Data Dump

Given that I’m allergic to wheel invention, I performed Google fu and found this handy Java program:

Graphipedia – Original Author

Graphipedia – Source

Building Graphipedia requires Maven – go here for this, then run the command mvn install in the folder you extracted Graphipedia to. I had to set my JAVA_HOME correctly to allow Maven to run.

This blog helped

There are two stages to the import task:

Extracting links

java -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks wiki.xml towiki-links.xml

Importing them to the graph database

java -Xmx3G -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

These stages took about twenty minutes on an i7 9xxx with 32 GB of RAM and an SSD.

Now the fun part: you end up with about 7 GB of Neo4j graph database that needs a little persuasion to load up. I had numerous issues with garbage collection blocking and memory performance.

One notable point: the only type of relationship is Link. That’s all, along with a single type of node, a page, which has one property: its title.

So to take this further, more steps would be needed. For example, predicting the type of link using machine learning would potentially yield a graph with meaningful relations. However, this is not a trivial problem if we are only relying on the content of Wikipedia.
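To make that data model concrete, here is a minimal Python sketch of the same idea: pull the [[wiki links]] out of each page and keep one untyped link per pair of titles. This is not Graphipedia’s code (which is Java and writes to Neo4j); the function names, the regular expression and the two sample pages are my own assumptions for illustration, and real wikitext has many more edge cases (templates, namespaces, redirects).

```python
import re
from collections import defaultdict

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures the target only.
# Simplification (assumption): namespaces, templates and redirects are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links(wikitext):
    """Return the distinct page titles linked from a chunk of wikitext, in order."""
    seen = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if title and title not in seen:
            seen.append(title)
    return seen

def build_graph(pages):
    """pages: dict of title -> wikitext. Returns title -> set of linked titles."""
    graph = defaultdict(set)
    for title, text in pages.items():
        for target in extract_links(text):
            graph[title].add(target)  # one untyped "Link" relationship
    return graph

def neighbours(graph, title):
    """Pages that `title` links to, plus pages that link to `title`."""
    outgoing = set(graph.get(title, ()))
    incoming = {src for src, targets in graph.items() if title in targets}
    return outgoing | incoming

# Hypothetical sample pages, not taken from the real dump.
pages = {
    "Neural network": "A [[Perceptron]] is trained with [[Backpropagation|backprop]].",
    "Perceptron": "See [[Neural network]] and [[Linear classifier#Definition]].",
}
graph = build_graph(pages)
print(sorted(neighbours(graph, "Perceptron")))
```

Note that nothing here records WHY two pages are linked, which is exactly the limitation described above.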

MATCH (p0:Page {title: 'Neural Networks'}) RETURN p0

[Image: Screenshot 2014-03-23 00.46.07]

Finally, this graph database has significant performance issues, partly to do with my own level of knowledge of course. Being 7.8 GB with only one type of relationship probably doesn’t help either.

However, here are the settings I’ve used to get it going from the Neo4j properties:


# Enable this to be able to upgrade a store from an older version

# Enable this to specify a parser other than the default one.

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true"
keep_logical_logs=7 days

Neo4j Performance

The JVM settings I’ve altered raise the maximum heap available before garbage collection occurs:


Before anyone says it, I’m not a Java developer so please look here if you need proper help with this aspect:

Java Flags

Final Thoughts

I’m intent on building a simple knowledge differential engine and then using this to develop an intelligent personal learning agent. Both aspects will require lengthy investigation into the world of AI and machine learning. It’s a recursive project, as I started it BECAUSE I knew that there’s lots that I don’t know that I don’t know. (Yep, I did it again!)

The concept is this: touch the concepts you want to know about and you’re presented with various related concepts; you then indicate how much you think you know about each. Other assessment modes may be adapted over time, as there is a lot of existing work to examine on this.

Ultimately, the software learns enough about what you know to provide efficient, relevant concept maps to assist you in building your own mental cognitive map of any complex area.

The act of learning isn’t something that should be automated but the act of finding a map for the territory is perhaps one that can be made far more efficient, leaving the learner with just the act of getting to grips with the relevant stuff.

Solving this originally meant a convenience tool for me in learning about AI and other complex topics.

Now, I see that to do it, I’ll have to know a fair bit of practical AI, stats/probability, parsing and all that, graph databases and all sorts of intermediate items.

It’s a little bit recursive!

Concept mapping I find to be superior to the later, popularised mind mapping because, like Neo4j, you are storing the relationship type between concepts. In mind mapping, everything springs from a central organising concept whereas in concept mapping, such a mind map would be merely a node.

I’m going back to the drawing board with this because ideally this needs to be an on demand agent that pulls what it needs, when it needs it, intelligently. I’m expecting to have the opportunity to use evolutionary programming like gene expression programming, neural networks, F# and a few other tools to realise this little concept. However, I also expect that I’ve underestimated the complexity of the task by about 2 orders of magnitude!

Self reflectivity, the Knowledge Mapping Agent and the Shape of Knowledge

Genius requires mandatory hard work.
Most hard work offers no guarantee of genius results.
The vision I have here, in part, I’d like to think of as a way to sway the odds a little.

Imagine having a constant feedback tool that actually allows you to watch your real world understanding grow AS you study, in real time, by showing you the ever-changing map of your knowledge – perhaps in glorious, living 3D!

In the simplest case, the Knowledge Mapping Agent (KMA) merely seeks to ask questions and find out what you know compared with your own stated goal of gaining expertise in a subject area. It will attempt to build a knowledge map – a differential map – of concepts that will bridge your understanding to the new material, piece by piece.

  • To understand tensors, you need to understand vectors, scalars, co-ordinate systems and so on.
  • To understand calculus you need some algebra, know what a function is, understand ratios / gradients and what a graph is.
  • To understand the idea of a design pattern in programming, you need to understand some oop, methods/routines/functions, abstractions, simple control structures and so on.
  • To know why DNA cannot be used by the cell for its various functions due to lacking tertiary structure, you need to understand primary, secondary, tertiary protein structures, what a protein is, what a cell is and so on.

Each of the points above can be imagined as a little graph of concepts – a concept map, an idea decades old, pre-dating the popular mind map and considerably closer, in my view, to how a brain may store knowledge.

The KMA would need to discover the shape of the knowledge you understand, partially understand and don’t understand. This would allow it to present an ideal and efficient path for understanding the missing concepts.
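As a rough illustration of what an “efficient path” could mean, the sketch below orders the concepts you are missing so that every prerequisite comes before the thing that needs it (a depth-first topological ordering). It is only a sketch under assumptions: the function name is hypothetical, the prerequisite data is hand-written from the tensor example above, and the claim that vectors depend on scalars and co-ordinate systems is my own guess for the sake of the example.

```python
def learning_path(prerequisites, goal, known):
    """Order the unknown concepts needed for `goal` so prerequisites come first.

    prerequisites: dict mapping a concept to the list of concepts it depends on.
    known: set of concepts the learner already understands (these are skipped).
    """
    path, visiting = [], set()

    def visit(concept):
        if concept in known or concept in path:
            return
        if concept in visiting:
            raise ValueError(f"circular prerequisite involving {concept!r}")
        visiting.add(concept)
        for dep in prerequisites.get(concept, []):
            visit(dep)
        visiting.discard(concept)
        path.append(concept)  # appended only after all of its prerequisites

    visit(goal)
    return path

# Hand-written, hypothetical prerequisite map based on the tensor example.
prerequisites = {
    "tensors": ["vectors", "scalars", "co-ordinate systems"],
    "vectors": ["scalars", "co-ordinate systems"],
}
path = learning_path(prerequisites, "tensors", known={"scalars"})
print(path)  # ['co-ordinate systems', 'vectors', 'tensors']
```

The hard part, of course, is not the ordering but obtaining a trustworthy prerequisite map and an honest picture of what is already known.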

Let’s take this a little further:

Imagine that knowledge in a brain can be given a shape, using data visualisation. It’s not impossible to imagine this being, at least for some types of knowledge, a literal topographical shape in the brain.

However, here, somehow we’ve gained a data map of what you do or do not understand about a subject. What if somehow we could take this a lot further?

Skipping some very difficult obstacles for now, we could potentially learn the shape of knowledge in any one mind and even abstract it.

This would be modelling on an incredible yet detailed scale.

We’re not done yet:

  • For example, what does a virtuoso violinist’s mind look like as an abstracted knowledge map? Or a Nobel prize winner’s?
  • Can such a knowledge shape be induced in other minds, in whole or part?
  • Is there a way of boiling that shape down to the essence of what is required to attain similar understanding?
  • What about modelling groups?
    • Star football players?
    • Research groups?
    • Tacticians?
    • High performance software engineers?
  • Can we recreate these entities?
  • If we can get such a map, can we also GENERATE such a map?
  • Can we make hybrid maps?
  • Reflexively, can we even breed them?
    • Having abstracted out a data representation of the sum of all knowledge in a single mind or team of minds, can we actually operate on them in a way to breed hyper versions of the same?

Humanity gained its intellect in part through a capacity to self reflect: we can conceive of and experience the self, and we learn through feedback from the world, through interaction. Can we learn faster from deliberate, strategic self reflection? Yes: evaluation and reflection are long established steps in learning strategies.

So, with a tool that maps our knowledge, we can also examine the growth of that knowledge, allowing us to reflect on the effectiveness of HOW we are understanding new ideas. We can plainly learn to tell the difference between the illusion of understanding and actually having it.

Biofeedback is used to allow folks to learn to control heart rate, blood pressure and other normally unconscious activities; would live feedback of our learning process allow us to finally see the effects our learning strategies are having on acquiring expertise and knowledge? By seeing, day by day, the visual representation of our growing map, we could also correlate the quality and speed of growth with our methods, proving and disproving different approaches continuously.

As a final point, such abstracted knowledge mapping would also make for incredibly rich AI fodder.

The Knowledge Mapping Agent (KMA) itself – a proposal:

The tool could start from quite humble beginnings. One developer has produced something called Graphipedia that basically imports a Wikipedia dump into Neo4j, a graph database. Google have a knowledge graph project of their own.

What we’d be doing is creating an intelligent agent, the KMA, that gets to know you. In the first instance it’s for a simple task: you want to read up on a specific subject for a meeting or piece of writing. The KMA would require some goals: what subject area, books, references and so on. The more specific the better.

It would examine the concepts involved (the narrower the scope, the simpler the problem, of course) and generate a knowledge map. From this, the simplest implementation may simply ask you to rate your current understanding of various key nodes in the map on a numerical scale. Important concepts tend to have many more connections, so nodes could easily be ranked using common algorithms. You are then presented with what the agent thinks are the most important concepts.
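One of the simplest “common algorithms” for this is a plain degree count: rank each concept by how many links touch it. The sketch below assumes the map is available as a list of (source, target) pairs; the function name and the sample edges are hypothetical, and a real implementation might prefer something like PageRank, which also weighs who is doing the linking.

```python
from collections import Counter

def rank_by_degree(edges, top=None):
    """Rank concepts by how many link ends touch them (a simple degree count)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Sort by degree descending, then by name so ties come out deterministically.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]

# Hypothetical edges for illustration only.
edges = [
    ("Neural network", "Perceptron"),
    ("Neural network", "Backpropagation"),
    ("Perceptron", "Linear classifier"),
    ("Backpropagation", "Gradient descent"),
    ("Neural network", "Gradient descent"),
]
print(rank_by_degree(edges, top=2))  # [('Neural network', 3), ('Backpropagation', 2)]
```

The top of that list is what the agent would ask you to rate first.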

A differential map of your current perceived knowledge and the new material is produced. This difference map at first may only be a fine guide as to what you should spend some time on to gain the new knowledge in the most efficient way. However, the full vision I have is that the KMA actually offers ways to gain this knowledge, feeding constant assessment back into its own knowledge and using this to improve its understanding of what you do know and how you learn it.
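In its crudest form, the differential map is just a comparison between what the new material requires and how you rated yourself. The sketch below assumes a 0 to 5 self-rating scale and a threshold of 3 for “good enough”; both numbers, the function name and the sample ratings are assumptions of mine for illustration, not a worked-out design.

```python
def differential_map(required, ratings, threshold=3):
    """Return the required concepts whose self-rating falls below `threshold`.

    required: iterable of concept names the new material depends on.
    ratings: dict of concept -> self-assessed score (0 = unknown, 5 = fluent).
    Concepts with no rating are treated as unknown (score 0).
    """
    gaps = {}
    for concept in required:
        score = ratings.get(concept, 0)
        if score < threshold:
            gaps[concept] = threshold - score  # how far short the learner is
    # Largest gaps first, so the weakest areas surface at the top.
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical learner preparing for calculus.
required = ["algebra", "functions", "gradients", "graphs"]
ratings = {"algebra": 4, "functions": 2, "graphs": 3}
print(differential_map(required, ratings))  # [('gradients', 3), ('functions', 1)]
```

Self-ratings are only perceived knowledge, so a gap list like this is a starting guide rather than a measurement.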

This active training aspect will rely on a lot of techniques in use now or in development in various learning platforms. In addition, simple technologies like Kinect, Leap Motion, hand writing and voice recognition as well as the new material technologies that allow you to make circuits out of common objects would all be valuable assets.


The real driving vision underneath this is that of a lifetime intelligent agent that learns about you and learns how to help YOU learn most effectively. This is a very tricky, sensitive area.

I have two major criticisms of this vision, however. The first: research skills, critical thinking and knowledge discrimination are crucial mental faculties that take years to develop, and the intelligent knowledge mapper must somehow not become a substitute for these skills.

The agent should be a valuable assistant in acquiring skills and knowledge; its boundary is that it stops short of acquiring the knowledge FOR you. The second criticism: because it learns how you learn best, it could be argued that you would get no practice in handling less useful learning methods and their pitfalls.

Both my critiques here need to be answered via implementation – if the methods of instruction and constant feedback are well built and appropriate, then at the least the instruction should match the best of what is out there today on the internet.

One aspect of my first criticism bears closer scrutiny. I’m concerned about research skill and critical thinking – would this knowledge mapping agent reduce this? I don’t believe so – right now I’m doing various on-line courses, everything I could want to know is available to me at the click of a button. The courses give me inspiration and context for even considering concepts I never knew existed. One course in particular, mathematical thinking, has a text book – a very cheap, slim, textbook that should take 5 minutes to read. It’s all there for me.

So far, I’ve understood a third of it.

Availability of knowledge and a clear map of what to study and why have done nothing to make the act of actually doing the learning easier. They’ve merely made getting AT the materials in question easier. The latter was inevitable; students of the future may indeed exchange skills at library checkout and research for skills at pure knowledge acquisition, where knowledge is only considered acquired if it is thoroughly understood, can be acted upon and is properly integrated with existing knowledge.

That’s the point of the knowledge mapping agent. To make this process clearer and far more reflective – you can see your mental maps grow as you work through the processes offered.

At this point I’ve hinted at the processes of presenting and testing knowledge acquisition. This stage is critical and in many ways harder than the differential map problem.

Simple multiple choice or adaptive multiple choice quizzes alone won’t cut it. They may assist with aspects of assessment for sure, but how does a question you have a 1 in 5 chance of guessing correctly prove attention and acquisition? At the moment they’re a gamble: a correct answer only gives a probability of understanding.

If the goal here is to produce semantic richness, native understanding, conceptual fluidity and ultimately expertise then both the presentation and the constant assessment have to be similarly deep and rich.
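The gambling element can at least be quantified. Assuming every question has five equally likely options and the learner guesses blindly and independently (both simplifying assumptions, as is the function name), the chance of reaching a pass mark by luck is a binomial tail:

```python
from math import comb

def chance_of_passing_by_guessing(questions, pass_mark, options=5):
    """Probability of scoring at least `pass_mark` correct by pure guessing."""
    p = 1 / options
    return sum(
        comb(questions, k) * p**k * (1 - p) ** (questions - k)
        for k in range(pass_mark, questions + 1)
    )

# A single five-option question is guessed correctly 20% of the time.
single = chance_of_passing_by_guessing(questions=1, pass_mark=1)
# Needing 6 out of 10 by luck alone is far less likely.
ten = chance_of_passing_by_guessing(questions=10, pass_mark=6)
print(f"{single:.3f} {ten:.4f}")
```

More questions shrink the role of luck, but that only rules out blind guessing; it still says nothing about the depth of understanding behind a correct answer.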

This last point is probably a pedagogical holy grail!

  • I’ve always believed that so-called genius is created by a combination of fate or luck, and hard work. Hard work is mandatory, but luck dictates whether you are working hard on the right things, the ones that will result in the greatest work you could ever achieve. Luck is a far, far more potent factor than many admit to. Examine the distribution of wealth and success in the world and tell me that all very wealthy people merely worked hard. All geniuses, however, I certainly believe did. My vision in reality is to sway the odds of any one person discovering the right things to be learning, in the right way, to get the greatest success.