Graphipedia, Context and Knowledge

Motivation – The Grand Plan?

Rather large caveat: the approach below simply extracts links and builds a graph database from page titles and the links within each page, so what you get is the neighbours of any one page.

However, when I started I was after what links and WHY. That second aspect is much more challenging – discussed further down.

Quite simply, I was studying something completely different, gene expression programming, and realised I knew nothing about neural networks. So I backtracked and started on those, only to find I didn’t have a deep grasp of stats and probability. I’d been doing Coursera courses and kept seeing similar key concepts bandied about: Markov, Monte Carlo, percolation, model thinking and many more – things I’ve only half a grasp of.

I’d seen Graphipedia on GitHub before and I like *concept mapping*. Neo4j stores the relationships AND the item – a perfect concept mapper!

I want a way to automate the tedious aspects of building contextual understanding – finding out what I don’t know and need to know in order to understand this new thing. You could say I wanted an automatic ignorance mapper!

Can I get a tool that will tell me what concepts I need in order to understand other concepts, and what other concepts knowing the first concept helps me to understand? <<| Yep, I really did write that sentence.

I decided to get ALL of Wikipedia – burning a lot of bandwidth in the process – from here:

Wikipedia Data Dump

Given that I’m allergic to wheel invention, I applied some Google-fu and found this handy Java program:

Graphipedia – Original Author

Graphipedia – Source

Building Graphipedia requires Maven – go here for that, then run mvn install in the folder you extracted Graphipedia to. I had to set JAVA_HOME correctly before Maven would run.

This blog helped

There are two stages to the import task:

Extracting links

java -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks wiki.xml towiki-links.xml

Importing them to the graph database

java -Xmx3G -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

These stages took about twenty minutes on an i7 9xxx with 32 GB of RAM and an SSD.

Now the fun part: you end up with about 7 GB of Neo4j graph database that needs a little persuasion to load up. I had numerous issues with garbage-collection blocking and memory performance.

One notable point: the type of relationship is Link. That’s all – along with the type of node being a Page with one property, its title.
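Given that schema, a quick sanity check with Cypher confirms the import worked – just node and relationship counts, which are slow on a store this size but reassuring:

MATCH (n:Page) RETURN count(n)

MATCH ()-[r:Link]->() RETURN count(r)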

So to take this further, more steps would be needed. For example, predicting the type of each link using machine learning could potentially yield a graph with meaningful relations. However, this is not a trivial problem if we rely only on the content of Wikipedia.

MATCH (p0:Page {title: 'Neural Networks'}) RETURN p0
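That just returns the page node itself. To pull back its actual neighbours, queries along these lines do the job – what a page links to (candidate prerequisites) and what links to it (things it helps to explain). The exact title string depends on what’s in your dump:

// Pages 'Neural Networks' links to – candidate prerequisites
MATCH (p:Page {title: 'Neural Networks'})-[:Link]->(out:Page) RETURN out.title LIMIT 25

// Pages linking to 'Neural Networks' – things it helps to explain
MATCH (in:Page)-[:Link]->(p:Page {title: 'Neural Networks'}) RETURN in.title LIMIT 25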

Finally, this graph database has significant performance issues – partly down to my own level of knowledge, of course. Being 7.8 GB with only one type of relationship probably doesn’t help either.

However, here are the settings I’ve used to get it going, from the Neo4j properties file:

neostore.nodestore.db.mapped_memory=150M
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=10M

# Enable this to be able to upgrade a store from an older version
allow_store_upgrade=true

# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true"
keep_logical_logs=7 days

Neo4j Performance

The one JVM setting I’ve altered is the maximum heap size:

 -Xmx1G

Before anyone says it, I’m not a Java developer, so please look here if you need proper help with this aspect:

Java Flags

Final Thoughts

I’m intent on building a simple knowledge differential engine and then using it to develop an intelligent personal learning agent. Both aspects will require lengthy investigation into the world of AI and machine learning – it’s a recursive project, as I started it BECAUSE I knew that there’s lots that I don’t know that I don’t know <| *Yep, I did it again!*

The concept is: you touch the concepts you want to know about, you’re presented with various related concepts, and you indicate how much you think you know about each. Other assessment modes may be added over time, as there is a lot of existing work to examine on this.

Ultimately, the software learns enough about what you know to provide efficient, relevant concept maps that assist you in building your own cognitive map of any complex area.

The act of learning isn’t something that should be automated, but the act of finding a map for the territory can perhaps be made far more efficient, leaving the learner with just the job of getting to grips with the relevant stuff.

Solving this originally meant a convenience tool for me in learning about AI and other complex topics.

Now, I see that to do it, I’ll have to know a fair bit of practical AI, stats/probability, parsing and all that, graph databases and all sorts of intermediate items.

It’s a little bit recursive!

Concept mapping I find to be superior to the later-popularised mind mapping because, like Neo4j, you are storing the type of relationship between concepts. In mind mapping, everything springs from a central organising concept, whereas in concept mapping such a mind map would be merely a node.
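As a toy Cypher sketch of that distinction – the Concept label and REQUIRED_FOR relationship type here are invented for illustration, nothing Graphipedia itself produces:

// In a concept map, the edge itself carries the meaning
CREATE (prob:Concept {title: 'Probability'})
CREATE (bayes:Concept {title: 'Bayes theorem'})
CREATE (prob)-[:REQUIRED_FOR]->(bayes)

That typed edge expresses something a mind map’s purely hierarchical branches can’t.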

I’m going back to the drawing board with this because ideally it needs to be an on-demand agent that pulls what it needs, when it needs it, intelligently. I expect to use evolutionary approaches like gene expression programming, neural networks, F# and a few other tools to realise this little concept. However, I also expect that I’ve underestimated the complexity of the task by about two orders of magnitude!