Robert Cloud

The Semantic Web and Addressing a Scholar’s Needs

Potential Solutions?

Initially I thought that one would have to write a custom application to address the needs of scholars and remedy the inadequacies of bibliographic management systems.  However, I have been playing with wikis recently and I believe that there is great potential there for managing a scholastic workflow.

In recent days I have been working frantically on learning the Cocoa development APIs and I am learning to appreciate their elegance.

There are tools which make it even easier than it has to be.  I considered C my favorite language for years before being overtaken by a Python craze.  Since then I have tried to do everything in scripting languages because it is easy and in my opinion, enjoyable.  Python has an enormous community and it is especially strong in the sciences.

Unfortunately, Python and PyObjC are rather weak for Cocoa development, and even the developers suggest just using Objective-C, which is not particularly hard, especially if you use Xcode.  I have never used Visual Studio, so the experience with a powerful IDE is quite novel.  The code completion and suggestions, as well as the dynamic analysis of code, are very nice.  And yet, I often do prefer to use the shell.

When using the shell, it is easy to write small experimental apps.  I am using MacRuby and F-Script, which provide fantastic exploratory features.  Unfortunately, the MacRuby irb shell is frankly lacking: it crashes frequently if I make an error, and it doesn’t really provide the tools and help that I have grown accustomed to with IPython, which is tremendously stable.

MacRuby does have the advantage of being directly tied to the Cocoa APIs rather than having to create a bridge.  And one can quite easily put their classes and application code in a file and then run it by invoking MacRuby.

F-Script is also quite nice because, first of all, it takes the place of a Cocoa NSApplication, so you don’t have to instantiate one.  You can directly create windows, but you don’t even have to do that: your drawing code shows up in the terminal directly if you want, so it is extremely easy to try new things.

Semantic MediaWiki

But I digress… I meant this post to be about Semantic MediaWiki (SMW), which I have found to be an extremely powerful tool.  We are all likely familiar with MediaWiki, which powers Wikipedia.  SMW adds semantic modeling to the content: once you create properties and add them to pages, you can do things that are not possible with traditional Wikipedia.  There are examples of this on the site.  One would be querying for the largest city which has a female mayor, something that you cannot do in Wikipedia.  However, if you add properties to mayor, female, and cities, you can construct a query that returns this information, which can then be sorted by a property to give you a list.

Category pages in Wikipedia, like cities ranked by population, have to be manually updated.  With SMW, our query pulls the data dynamically and the list updates automatically.  This is a major benefit.
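
To make the idea concrete, here is a minimal Python sketch of what a property-based query does.  It is not SMW code: the page names, property names, and population figures are all made up for illustration, and the helper `cities_with_female_mayor` is hypothetical.

```python
# Toy model of semantic properties: each page is a dict of property -> value.
# Every page, name, and number below is invented for illustration only.
pages = {
    "Springfield": {"category": "City", "population": 120_000, "mayor": "Alice Example"},
    "Shelbyville": {"category": "City", "population": 450_000, "mayor": "Bob Example"},
    "Capital City": {"category": "City", "population": 300_000, "mayor": "Carol Example"},
    "Alice Example": {"category": "Person", "gender": "female"},
    "Bob Example": {"category": "Person", "gender": "male"},
    "Carol Example": {"category": "Person", "gender": "female"},
}


def cities_with_female_mayor(pages):
    """Return (city, population) pairs, largest population first."""
    results = []
    for name, props in pages.items():
        if props.get("category") != "City":
            continue
        # Follow the "mayor" property to the mayor's own page.
        mayor = pages.get(props.get("mayor"), {})
        if mayor.get("gender") == "female":
            results.append((name, props["population"]))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


ranked = cities_with_female_mayor(pages)
print(ranked[0])  # the largest city whose mayor page is tagged female
```

The point is that the ranking is recomputed from the properties every time it runs, so nothing has to be maintained by hand when a population or a mayor changes.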

I have created an SMW wiki and several properties related to programming.  If I am writing about a class, I have properties such as superclass, and I can create a page with an inline query for all pages whose superclass property matches that class, which gives me a dynamic page listing all of the subclasses of a class.  This is powerful.

At the footer of an SMW page, all of the information defined by properties is listed, which gives a nice overview.  You can also export your semantic data to the semantic web, though this is a feature I have not tried yet.

Constructing a wiki or a subwiki for scholarship in the humanities would be straightforward.  If you have a page for every article you read and your annotations to it, you could have properties for Author, Argument, and Paradigm, to give a few examples, and you wouldn’t have to create and curate the index pages for them yourself.  If you labeled things consistently, it would be quite easy to find all the articles written by a specific author, that make a specific argument, or that fall under a given paradigm.
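
As a sketch of how such a lookup might be written, the small Python helper below assembles the wiki text for an SMW inline query (`#ask`).  The property names Author, Argument, and Paradigm come from the example above; the helper name `build_ask_query` and the sample author value are assumptions of mine, and the resulting text would still have to be placed on a wiki page for SMW to evaluate it.

```python
def build_ask_query(conditions, printouts=(), sort=None):
    """Assemble SMW inline-query wiki text from property conditions.

    conditions: dict of property name -> required value
    printouts:  property names to show as extra columns
    sort:       optional property name to sort the results by
    """
    # Conditions placed side by side are combined (all must hold).
    parts = ["".join(f"[[{prop}::{value}]]" for prop, value in conditions.items())]
    parts.extend(f"?{prop}" for prop in printouts)
    if sort:
        parts.append(f"sort={sort}")
    return "{{#ask: " + " |".join(parts) + " }}"


query = build_ask_query(
    {"Author": "Thomas Kuhn"},          # sample value, purely illustrative
    printouts=["Argument", "Paradigm"],
    sort="Paradigm",
)
print(query)
```

Generating the query text from a script is only worthwhile if the property names are applied consistently across article pages, which is the same discipline the wiki itself depends on.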

Secondly, you can edit the properties through a front end of forms, which lets you fill out all of the properties in a natural way without having to know much about wiki editing.

I am personally running a Debian virtual machine on my secondary iMac (first-generation Intel) and it seems to be very responsive.  I only have 2 GB of RAM installed in the machine and the VM is taking up around 600 MB.

The SMW distribution I am using is called SMWplus, which includes many features that make SMW quite a bit easier to manage, including a data explorer with graphical creation of properties, categories, and instances (pages), and it allows me to create inline queries graphically.  It has a WYSIWYG editor, which is great for some, but I prefer the classic wiki text editor.  I tried to install Halo and other features in a MAMP-based wiki installation but I simply couldn’t get it to work.  The virtual machine is quite easy to install and get up and running: just run it and it gives the IP address of the wiki to connect to.  It is also portable.  I intend to give my work to my supervisor, and all I would have to give him is the virtual hard drive, which he could run in VMware Player.

There are a number of research papers on the topic of semantic wikis demonstrating their use and the theory behind them.  I used Sente to gather many of them and will be posting reviews as we continue this blog.  I am quite interested in the topic of digital information management, and have been using DevonThink Pro Office daily, but I feel that while it is great for inserting information and searching for it, it is lacking in presentation and in making annotations to the data.  That is where wikis shine: they are very readable and generate a table of contents automatically.  DevonThink has wiki-style links allowing you to create a web of knowledge, but it is a difficult process and does not generate a ToC for your pages, which is indeed quite important.  I suffered some agony trying to make DevonThink better at accumulating my own personal notes, but I don’t believe that is the purpose of the application.

I do still love it and will continue to use it as a repository for my information, as the search features are most advanced.  I will be using the wiki for presentation and keeping track of my personal notes.

Also to be considered is DokuWiki, which I was using earlier.  The advantages here are, first, that it stores your data in plain text files, allowing them to be easily backed up or synced to a remote server.  There are also quite a number of useful plugins, easily installed with a plugin manager.  SMW+ has something similar but it is certainly not as useful.

On my todo list is to find a mechanism for syncing local note taking to my wiki.  The company that makes SMW+ also sells a Microsoft Office connection tool, which, while I hate using Word, may be a solution to this, or I could find a way to script uploading to the wiki.
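
One way the scripted route might look is sketched below in Python.  It only prepares the form fields for a MediaWiki Action API `action=edit` request and does not contact any server.  The endpoint address, page title, and token value are placeholders, and a real script would first have to log in and fetch an edit token from the wiki.  Whether a given SMW+ release exposes the API in exactly this form is an assumption I have not verified.

```python
from urllib.parse import urlencode

# Placeholder address (documentation IP range), not a real wiki.
API_URL = "http://192.0.2.10/mediawiki/api.php"


def build_edit_request(title, text, token, summary="Synced from local notes"):
    """Return the URL-encoded POST body for a MediaWiki 'edit' API call.

    The token is placed last so that a truncated request cannot be
    mistaken for a complete, authorized edit.
    """
    fields = {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": text,
        "summary": summary,
        "token": token,
    }
    return urlencode(fields).encode("utf-8")


# Hypothetical page title and token, for illustration only.
body = build_edit_request("Notes/Kuhn 1962", "Local note text", token="PLACEHOLDER+\\")
print(body.decode("utf-8"))
```

Sending `body` as a POST to `API_URL` (for example with `urllib.request`) would be the next step, once login and token handling are in place.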

Some Inefficiencies in Digital Research Workflows and Ideas to solve them

There is a great problem that modern scholars have to face, and that is coming to grips with the massive amounts of information that flow into our lives every day.  Flow is the optimal word to describe this, and it is generally not a tidy process; the flow is often rather gushing and out of control.  We get email, RSS feeds, and Google Scholar alerts, and coupling these with our natural intellectual wandering, we can quickly gather a vast amount of information.

For instance, a few days ago I wanted to read an article on some Romantic Era scholars and statesmen.  Naturally I fired up Sente and used a script that I have to gain access to the university library databases and quickly found about 20 journal publications that are all quite relevant and interesting.  Unfortunately I did not have the time to read over them then, and this, being only a hobby of mine, was stacked onto the pile of interesting information that should be read but hasn’t been.  Couple this with several dozen books and this todo list becomes quite overwhelming.  For other researchers, particularly graduate students, reading these papers is not a luxury but a task that they must do on a much larger scale.  Especially in the humanities, where Ph.D. programs take seven years or more, the expectations continue to grow regarding what one should know and master before being given a degree.  With the wide availability of research materials, the barrier to entry is lower but the expectations are higher than they were fifty years ago.

One problem in the current digital research workflow is that the materials distributed are still based on the publication paradigm that evolved throughout the 19th and 20th centuries.  Ideas are spread and shared in the format of books and articles, and while journal publications have made the leap to the digital world quite well, with the distribution of PDF files from sources such as JSTOR, getting information out of a book typically requires a trip to the library.  For now, I will concentrate on articles, but book sources bring their own set of problems.

Coming to Grips with Information and Incorporating it into your own Worldview

Storing Articles

One problem in particular that I have noticed in the literature review process is the storage of PDF journal articles.  There are a number of ways to do this.  You could, most simply of all, create a hierarchy of folders on your hard disk and manage them by a process you define.  This was probably the best way before the numerous paper management and bibliographic software packages came into existence.  You can still manage them on your hard disk in a defined way and then index them into a bibliographic management solution.  I frequently wipe my computer due to a tendency to experiment with unstable software and kernel functions, so a managed solution (i.e., storing the PDFs inside of a bundle for a particular application) tends to work best for me.  For general purposes, I have found that one can dump one’s ebooks and PDFs into DevonThink and this works quite well.  DevonThink has the best searching that I have found in any similar application and has the additional plus of fuzzy logic, which can find similar articles or phrases in articles quite well.

As for bibliographic managers, I have tried Mendeley, BibDesk, and Sente.  Mendeley is quite nice because it 1) is free, 2) stores your documents in the cloud, providing a backup solution, and 3) allows for searching in the text layer OR the annotation layer of the document.  This is quite powerful, but is initially misleading in its capabilities because Mendeley annotations are not standard PDF annotations, meaning that they will not be seen if one were to open the PDF with e.g. Preview.  One can, in fairness, write out the PDF annotations, which would then make them viewable, but then not searchable.

Sente is my preferred bibliography manager because it is phenomenally easy to add new sources to the library.  The integrated web browser has plugins for all the sites that I like to visit, and one can quickly and easily get a reference along with its associated PDF attached.  Unfortunately, Sente does not have search functionality for the text layer of the PDF (though the team says they are working on it and it will be in a future release).

BibDesk is also free and does include some functionality for adding new sources through the web; however, I have found that it is rather limited, it frequently does not add all the fields of a BibTeX entry, and copying a full reference from Sente to BibDesk has proven difficult.

What I do currently is to download new journal articles with Sente, store the reference in the Sente library, and then have the references stored outside of the bundle in a single-level folder.  I then index the articles with DevonThink, adding OCR when required.  This works quite well.

Annotating Articles

This is one issue that I have struggled with for a while and I think it is primarily because I have been trying to incorporate my iPad into an annotation workflow.  I have also been mainly annotating longer works, ebooks on programming primarily, which are information dense and require slow reads and frequent note taking.  If one takes notes directly on the PDF with either DevonThink or Preview, the notes are written to the annotation layer and cannot be searched.  Furthermore, if one opens the annotated PDF with Acrobat, it can be seen that Preview makes several different annotations with one pass of the highlight over a paragraph.  This would make extracting annotations difficult.  The best practice on a Mac for straightforward PDF annotation would seem at present to be to use Skim.  Then, if wanted, you can extract the annotations and put them in your DevonThink database where they would be searchable.  See:

I have written a similar program with Cocoa and the Quartz framework that does extract the notes and highlights from a general non-Skim PDF, but I am dubious about the value of such a program.  It can, however, be found at this link on the DT forums.

Annotating to a Flat File

One commonly suggested means of getting around the problem of annotations not being searchable is to just use the annotation template provided in DevonThink.  There you would have a document (RTF) with a link back to the original PDF where you can put your annotations, extractions, etc.

Incorporating Annotations into Worldview

One problem I have with the annotation workflow is incorporating the ideas of an article into my own writings, thought process, and further research.  When I annotate an article in Skim or create a flat file, I am left with that.  It is then up to me to try to make it relevant to my life and mind.  If I just have a flat file with the annotations sitting in my database, I have the burden of coming back to the information, and that task may be put off until I am writing a paper or have some other need for the information.  I believe there is a better solution to make the information relevant, but it requires a new way of viewing the information.

The Smallest Quantum of Intellectual Thought

The research paper and the book are the traditional mechanisms of sharing information in scholarly societies, but they are not the only means, and if one really thinks about it, they are really just a way to encapsulate intellectual ideas.  What are the intellectual ideas that are important?  Those are the arguments that make up the scholastic ecosystem.  These arguments, described by Kuhn in The Structure of Scientific Revolutions, can either serve the process of buttressing an existing paradigm or can undermine it in the slow process of beginning a scientific revolution.  Note that scientific revolutions and paradigms are just as relevant (probably more so) in the humanities as they are in e.g. physics.

Journal articles have served as the means of communicating arguments, but I would argue that their continued relevance can be attributed largely to tradition and inertia.

Because we are accosted by so much information every day, I think we need a better system to extract the relevant information, make our own comments, and incorporate it into our own writings for our eventual paradigm shift 😉

A proposed system would be this:
  • Extract Relevant Information from Article
    • Intention
      • Whether they are supporting an existing paradigm or are working on a new one
    • Salient Points
      • Thesis
      • Relevant Quotes
    • References
      • Go over their references, especially the important ideas contained.
  • Add One’s own Annotations
    • Comment on the references, their strength, the overall power of the argument, any aesthetic appeal
  • Incorporate the above into a meta-outline of the argument
    • Immediately take the relevant portions of the article and one’s own annotations and incorporate them into a document that outlines the argument on a large scale.  There will be few articles in a field with immediacy to one’s own work that warrant such an independent document.  Usually you will be incorporating many such articles into the same document.  Software needs to be developed to support such a task, or one can use a wiki.  A large-scale argument might have leaves to other argument pages with their own annotations, structure, and references.
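
The outline above could be modeled roughly as follows.  This is only a sketch in Python under my own assumptions: the class names `ArticleNote` and `ArgumentOutline`, their field names, and the sample entries are all hypothetical placeholders, not an existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleNote:
    """What is extracted from one article, plus one's own comments."""
    citation: str
    intention: str                                   # e.g. "supports" or "challenges" a paradigm
    thesis: str
    quotes: list = field(default_factory=list)       # relevant quotes
    references: list = field(default_factory=list)   # important cited works
    annotations: list = field(default_factory=list)  # one's own remarks


@dataclass
class ArgumentOutline:
    """A meta-outline of an argument, with leaves to sub-arguments."""
    title: str
    notes: list = field(default_factory=list)
    sub_arguments: list = field(default_factory=list)

    def add(self, note):
        self.notes.append(note)

    def render(self, depth=0):
        """Return the outline as indented text lines."""
        indent = "  " * depth
        lines = [f"{indent}{self.title}"]
        for note in self.notes:
            lines.append(f"{indent}  - {note.citation} ({note.intention}): {note.thesis}")
        for child in self.sub_arguments:
            lines.extend(child.render(depth + 1))
        return lines


# Placeholder content only; no real articles are described here.
root = ArgumentOutline("Example large-scale argument")
child = ArgumentOutline("Example sub-argument")
root.sub_arguments.append(child)
root.add(ArticleNote("Author A (placeholder)", "supports", "Placeholder thesis"))
child.add(ArticleNote("Author B (placeholder)", "challenges", "Another placeholder thesis"))

lines = root.render()
print("\n".join(lines))
```

Each `ArgumentOutline` corresponds to what would be one wiki page, and the `sub_arguments` list plays the role of links to other argument pages.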

When one has this document that provides a survey of the argument along with buttressing points and where it fits in the field, I think it is an immediately useful thing and can help direct one’s research.