Here is a great paper on the case for the use of NLP in Semantic Web applications.

CNL_Reportv7.pdf (application/pdf Object)
I finally got around to looking at MicroFo...s today (which I will henceforth refer to as MFs, if you know what they are, keep reading, if you don't know what they are, good!), and was amazed at how redundant it is. It's primarily RDFa watered down. But then something else struck me... it is actually being USED, by a growing number of software developers, web publishers, and end users. It seemed to catch fire rather quickly (relative to the growth and development of the W3C endorsed Semantic Web).

As I continue to offer my response to Dan Grigorovici's blogs on the business and marketing aspect of the Semantic Web (or more accurately, the lack thereof from his perspective), I would like to turn attention to the growing attraction to MFs over RDFa as one of the sources of the problem. According to the Wikipedia page, MFs is a grassroots effort that gained popularity and then support from a corporate sponsor. It's not a standard, and it is steered by a very loose-knit community; essentially, an MF begins its life as a wiki entry. Question: What drives an end-user's decision to "roll his own" versus adopting an off-the-shelf solution? Answer: Familiarity. Neurologists know that the task of creating a new synaptic connection in the brain requires many times more resources and energy than reusing an existing neural pathway. So reuse, leveraging existing knowledge and practices, is something the brain always seeks. It's often far easier for a developer who is proficient in HTML/CSS to build a site from his own personal library of (familiar) templates than to be given an existing site to remodel. And with MFs, and the Web community at large, it appears that there is less friction involved in going from HTML --> HTML+semantics, and easily seeing the relation, than in one day having an RDF Primer and SPARQL specification dumped in your lap and being told to go from QuadStore --> triples --> HTML+semantics. It's obvious that the Semantic Web vision is "right", because the world is obviously demanding it, but the world never wants to DO right. We must ensure that we arrive at the right destination, and get there the right way. If I set out for the store and drop all my money on the way there, then what good is it if I make it to my destination?

I think that the momentum building around MFs indicates that the W3C's Semantic Web problem is largely one of PR and public image. If you are a good spin-doctor, I recommend the Semantic Web as a qualified client. I would go so far as to agree with Dan's arguments and Hank Williams's suggestion that if you are a semantic web startup, you should drop the association with the Semantic Web in any public or investor-facing collateral for your product or company. I really wish someone had taken the RDFa spec, dropped the RDFa label, slapped the name MFs on it, stood up a wiki and discussion group, and proposed it from the grassroots level.

Keep in mind that by rebranding the Semantic Web, we mean only a cosmetic overhaul that leaves untouched the core tenets we have, as a community, chosen as necessary to uphold: use dereferenceable URIs; publish, use and reuse RDF vocabularies; reuse others' URIs for naming things when possible; among others. It's important that we hold true to these principles, and that we are not blown about by every wind of doctrine that surfaces regarding how the Semantic Web should manifest, or compromise the vision we've laid out. I, for one, have decided not to support MFs in any of my software, nor publish it on any of my sites. As a community, we have our view, and the rest of the world seems to have its view; let us stick to our view, and not give way to these factions. Many of us are involved in work to RDFize the existing world, as Kingsley Idehen puts it, to "build the house around the users", as opposed to dragging the user into the house. This is strategic and a form of compromise, but it can only go so far. The Semantic Web will need solidarity if it is to be preserved. Multiple standards split resources and focus, impede interoperability and reuse, and these splits will only weaken our foundations. Only by having one unified set of standards, emanating from one unified community, can we hope to build anything substantial on top of the Semantic Web.

Sources:
Why does everything suck?: Killing Ontologies/OWL In The Semantic Web?
Dan Grigorovici, AOL exec, semantic web evangelist, and good friend of mine, is doing a series of articles for SemanticWeb.com. In it, he makes a call to action to the semantic web community, and admonishes us to buckle down on the PR aspect of the semantic web (or the lack thereof). I'd like to attempt to offer some responses to the issues he addresses.

I was having a conversation with a few of the "usual suspects" of the Semantic Web evangelical crowd, and it was mentioned that one of the problems we face is how to make money in a medium that is not eyeballs-driven. Because the semantic web is a technology whose users are mostly machines, the catch-all monetization strategy of Web 1.0, advertising, does not apply. Someone then took the words out of my mouth by saying something to the effect that you don't have to have people looking at a web page to deliver an ad to them; the ad can be delivered across other mediums, SMS, etc. The problem is that current advertising is obtrusive. I am reminded of how one time I had a really big headache, and I went to the drug store to buy aspirin, and found myself in the aisle asking "Now, what's the name of that 'I have a headache this big' medicine?". A classic example of a good product that was offered to me at a bad time. So an improvement that is needed is the injection of "context" into the equation: being able to deliver to a user a product, service or opportunity at the most opportune and relevant moment, based on their current need, time, and place. At the time I saw that commercial (btw, the brand is Excedrin), I was a small child and probably had never had a headache. But when I finally entered the market for it, I was unable to find/recall it.

I added that user behavior, interests, etc can be collected (by permission) and used to drive more intelligent referrals for purchase decisions. People are always looking for better advice before buying, case in point is an experience I had with a crummy airline. Had I had a service that could have made a quality recommendation on my airline (taking price, quality preference, and other factors together), I would have literally saved hundreds of dollars.

So one business model that will, IMHO, be a bread-and-butter source of revenue for Semantic Web companies will be in "cooking" triples that describe users and the things that interest them, to provide the knowledge/intelligence needed to fuel the next generation of recommendation services. These services will connect customers to the things they want and need with laser-like precision, and will deliver these laser-beam recommendations unobtrusively across a myriad of channels. Companies will pay a premium to have their products and services delivered by such services.

Stay tuned for the next micro-post, where I continue to offer monetization suggestions for Semantic Web startups.
A lot of people have requested the presentation "Emergent Data and Semantics From Social Collaboration", prepared by Soren Auer and myself, for the Linked Data Planet 2008 Spring conference, so I have placed it online. (For now, as you read, please refer to the slides, I'll get some images posted soon).

In it, I expound on the trend towards a more High Resolution Web, or High Definition Web, where machines are able to see a richer description of people, places and things. Here is a bit of my notes from that talk for those who could not attend.

When we talk about Social Collaboration, as social creatures, sharing is a natural component of our evolutionary adaptation. The internet provided an infrastructure to connect computers, and the WWW provides the means of performing this inherent human behavior of sharing, mainly of documents, across that connection. One of the greatest contributions the WWW made was that I could open a text editor, write something, then instantly share it asynchronously with someone across the world. But the document web limits sharing, i.e. the Social part of the WWW. Here are a couple of analogies that help illustrate this notion:

The Teller that Couldn't Tell:
- Suppose you deposit money, then you request a withdrawal, and the following conversation ensues:

You: I'd like to withdraw $20.00 please
Teller: Let me search for that, I’ll be right back... Ok, I found $10 that may be yours [or] I found $5 of the $20 you requested. Instead of telling me the amount you deposited, can you tell me what you had on when you deposited it, that may help me cross reference and find your deposit better?

Ridiculous, right? What's hindering the teller from delivering exactly what you requested?
- No matter if it’s an entry posted to my blog, or a link sent to your email, or a dissertation in a PDF, or a web page, the only reason we have a notion of “search” and “results found” is that documents are inadequate data containers that wind up suppressing the information we intend to share. Whether it is the WWW, email, blogs, delicious bookmarks, etc., the document always loses important parts of the data we place in it. Because of this, the document must be searched for and found again.

The Powerless Boss:
Suppose you have a boss who has a collection of many thousands of photos stored on your PC. He asks you one day to find a certain photo he took at a conference, and he describes the photo in vivid detail. The problem is, you have this incredibly low resolution monitor, the figures in the photos are blurred beyond recognition, and you can’t make out any of the people’s faces. How on earth will you find the photo he's interested in? So you begin creating alternative heuristics for finding the photo. You think "he said he took it alongside three people, there are a few with four human shaped objects, I can try to determine which one is him by cross referencing and narrow down…, well, he also took one that day at the podium, thankfully there’s only one with a human shaped form at a podium looking thing… and it’s shaped like and is the same color as the blob in this photo… one of these three is most likely him." So you email him the candidates, he prints them and selects the correct one, then says “Thanks so much, now I need the photo of me discussing the market data powerpoint slide”. Based on his feedback, you make a note that says “The tall purple blob in these photos is the Boss”. But then you explain to him, "Hold on boss, all the detail you provide in your request is useless to me" (then you explain to him the situation)... "you’ll have to speak in terms of colors and blobs (i.e. please dumb down your request)".

He says: “Hmm, ok, the picture I want should have a tall, slender, dark blob left of center, and three smaller blobs to the right, because by that time two of the panelists had not gotten there yet”. Two photos match, you send, boss prints and selects the correct one from what you gave him, and you use that good guess to improve the heuristics in your little book. Your monitor’s terrible resolution introduces a tremendous pain for your boss, but gives you great job security, because of the tremendous value your book of heuristics now offers.

But now, let's take a look at what happens the moment your boss increases the resolution of your monitor:
  • Your book of heuristics becomes worthless
  • Your boss can now fire you anytime and hire anyone else to retrieve his photos
  • Most importantly, your boss can now request a photo from 1000s by describing the photo he wants in vivid detail, and can be fairly certain that he will receive the photo he requests (if the photo exists), so he can say things like "I need some photos for my homepage, get me all photos of me taken when I still had a beard, and taken outdoors wearing no suit, at my home, or taken at a bar with anyone I know"
Now think of the description of a resource as a photo of that resource, and each statement (triple) involving that resource as a pixel that makes up the photo. Because documents were the atomic unit of information, the web had a really, really, really low resolution, and Google held a very valuable book of heuristics. As we increase the resolution of the web, the emphasis on "search" will evaporate.
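
To make the pixel analogy a little more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the photo identifiers, the property names, and the `resolution` and `lookup` helpers are hypothetical, not part of any real vocabulary or tool. It treats each triple as one "pixel" and answers a request by matching a description instead of searching.

```python
# Each (subject, predicate, object) triple is one "pixel" in a resource's description.
# All identifiers and property names below are invented for illustration.
TRIPLES = {
    ("photo1", "depicts", "boss"),
    ("photo1", "takenAt", "home"),
    ("photo1", "bossHasBeard", "yes"),
    ("photo2", "depicts", "boss"),
    ("photo2", "takenAt", "conference"),
    ("photo2", "bossHasBeard", "no"),
    ("photo3", "depicts", "boss"),
    ("photo3", "takenAt", "bar"),
    ("photo3", "bossHasBeard", "yes"),
}


def resolution(subject, triples=TRIPLES):
    """Number of statements ("pixels") known about a subject."""
    return sum(1 for s, _, _ in triples if s == subject)


def lookup(description, triples=TRIPLES):
    """Return the subjects for which every (predicate, object) pair holds."""
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if all((s, p, o) in triples for p, o in description)
    )


# "Get me all photos of me taken when I still had a beard."
bearded = lookup([("depicts", "boss"), ("bossHasBeard", "yes")])
print(bearded)  # ['photo1', 'photo3']
```

The more triples a photo has, the more precisely a description can pick it out, which is the sense in which resolution replaces the book of heuristics.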

The Trend Towards a Web in HD

What we’re moving towards with the Linked Data Movement, and the Semantic Web movement at large, is what can be described as a High Definition Web (i.e. Web 3.0, where each version increment roughly corresponds to a decade). The web has always been about describing things.

Web 1.0 contained statements where documents referred to nouns and you only had one verb, isSomehowRelatedTo. The anchor tag is a reference to the relationship isSomehowRelatedTo. If you think of information (i.e. a statement) as a pixel, then in Web 1.0, if a document only contained one hyperlink, the pixels that made up its photo were few; or it may have had many inbound and outbound links, but because each link meant the same thing, it had no color (i.e. the link had no distinction).

Web 2.0 introduced subjects of several new noun types, the same monolithic verb isSomehowRelatedTo, and an object whose type is an ambiguous term, i.e. a tag. Web 2.0 increased the number of pixels just slightly, but still offered no real color.

Web HD completes this transition by offering all named entities as subjects and direct objects, and any relationship as verb. Web HD is like having a life-like photograph of a thing: we can say this is a person, we can describe their phenotype, their genotype, likes, dislikes, social relationships…, each statement can now offer distinctly different information (so you have this wide range of color), and because you have this rich and inexhaustible vocabulary, the number of pixels in the photograph explodes.
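
As a rough illustration of "pixels" versus "color", the sketch below writes a few statements in each style and counts them. The sample statements and the `pixels` and `colors` helpers are assumptions made up for this example; only the verb isSomehowRelatedTo comes from the talk notes above.

```python
# Statements are (subject, verb, object) tuples. The sample data is invented.
web_1_0 = [
    ("pageA", "isSomehowRelatedTo", "pageB"),
    ("pageA", "isSomehowRelatedTo", "pageC"),
]

# Web 2.0: new kinds of subjects, same verb, objects are ambiguous tags.
web_2_0 = web_1_0 + [
    ("photo42", "isSomehowRelatedTo", "tag:conference"),
    ("bookmark7", "isSomehowRelatedTo", "tag:rdf"),
]

# Web HD: named entities as subjects and objects, any relationship as verb.
web_hd = [
    ("alice", "type", "Person"),
    ("alice", "knows", "bob"),
    ("alice", "likes", "jazz"),
    ("alice", "worksFor", "acme"),
    ("photo42", "depicts", "alice"),
    ("photo42", "takenAt", "conference"),
]


def pixels(statements):
    """How many statements make up the picture."""
    return len(statements)


def colors(statements):
    """How many distinct verbs (kinds of link) the picture uses."""
    return len({verb for _, verb, _ in statements})


for name, web in [("1.0", web_1_0), ("2.0", web_2_0), ("HD", web_hd)]:
    print(name, pixels(web), "pixels,", colors(web), "colors")
```

In this toy data the first two webs have a single color no matter how many links are added, while every new verb in the third adds one.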
I'm blogging live from the LDP conference, and have seen some very exciting technologies and heard some excellent presentations of the linked data vision. In my talk tomorrow, I discuss the differences between today's web (Web 1.0 & 2.0), which is primarily a web of opaque documents and the simple "isRelatedTo" links between them, versus tomorrow's web vision, which offers links between granular semantic (i.e. non-ambiguous references to self-described) concepts. Thus, instead of the document (and links between them) being the atomic unit of information, the database becomes the container.

But Kingsley today demoed something I had not thought a lot about... what if you make the document the container of these richer, semantic statements. RDFa is a standard for embedding RDF into HTML documents. But take a look at Kingsley's keynote presentation (which is a Powerpoint document), or rather, the linked data embedded in it. This graph allows you to explore the slides in the presentation, the concepts it discusses, resources and photos it contains, people related to it and the concepts it mentions, etc.
Next week is the first annual Linked Data Planet conference, which will be held in NY. I was really excited when I first heard about this, and excited about attending, because two of the keynotes are visionaries whom I have been wanting to hear speak for such a long time but haven't yet had the opportunity: Kingsley Idehen and Tim Berners-Lee. I'm really excited about this particular event, because it puts a concentrated focus on the momentum building around Linked Data, which is one of the chief byproducts of the Semantic Web. I believe that this event will mark a critical turning point for the Semantic Web movement.

I will also be doing a talk on Dbpedia, Ontowiki, and Cypher, and a new service called Cynapse. In addition, I will have a demo of some of the latest Cypher features and improvements both in presentation and in the exhibition.
I saw a post this morning about Peter Norvig's remarks a few months ago about his perceptions of NL, and how it's all but useless in providing value to web search. The post resurfaced during this weekend's buzz over Powerset. Here's my reply.

I believe the discussion around Powerset and its potential suitors is on a misguided trajectory. A few months ago, Peter Norvig stated that NL provides only marginal advances over the state-of-the-art keyword search technologies, and that keyword lookup is actually more natural for users than NL questions and phrases. As an NLP advocate in general, and a die-hard advocate of knowledge-driven NLP, I am amazed to find myself in perfect and absolute agreement with Norvig's assertions. A simple and concise list of keywords is the most suitable interface for search and retrieval of text documents from the WWW.

But the focus on document search as the future of information retrieval is itself a fallacy. Google's blind spot, and potential undoing, is the insurgent linked data web, or web of data, or semantic web, or web 3.0 (pick your favorite), which has been heralded by Tim Berners-Lee. This vision will allow the web to consist primarily of structured databases composed of graphs linked together by dereferenceable, non-ambiguous URIs. For the data contained in any segment of this global graph, and the schema encapsulating the data, the convenience of having a consistent model for data exploration, and the notion of a fixed domain of discourse to guide UI designers, will become a thing of the past. The user will no longer "search for a page using keywords", but will instead "look up an entry by description". Any one "lookup" may span dozens of domains of knowledge/ontologies/schema, and will yield result sets of such breadth and heterogeneity as would defy any attempt at achieving the GUI consistency of Google's ranked list of links. People using this global graph will search not for pages deemed relevant to a bag of words based on the consensus of the crowd. Instead, users will look up people, places and things, and links and relations of varying complexity between them, using unambiguous references to those entities. In order to perform these laserbeam-like lookups, users will demand to leverage the interface they have spent a lifetime mastering, a UI that is no less natural (in the task of expressing relationships between things) than the natural language user interface (NUI), where noun phrases and named entities will allow users to make reference to a set of URIs as expansive as the NL lexicon itself, while verbs, adjectives, relational nouns, prepositions and modifiers will offer users a broad and rich set of operators for describing the links between those URIs.
There is a time and a place for every purpose under heaven, and I believe this is the proper place for NL technologies. NL and the SW shall evolve together, and each will symbiotically facilitate the critical mass adoption of the other.

I believe every contributor to NL should be involved in a project which seeks to fuse the semantic web with NL.
I'd like to announce the 1.2 release of Cypher and the availability of an updated user guide. This release is also accompanied by the release of Cypher Web Service, which allows Cypher to run as a RESTful web service. A public demo of the web service is available.

Below is the change log:

feature enhancements since 1.1.8
- added report.xml file which is reported when cypher.report.html is true, file contains an xml version of the report in report.csv
- added cypher.report.html.refresh to control refresh time in HTML interface
- added cypher.input.files to control whether cypher.input.dir directory will be crawled
- added cypher.output.commit to control if RDF output will be loaded into cypher.repository.output
- added cypher.output.format to specify serialization for RDF output
- added cypher.http.base used for namespace of minted URIs and also the URL of the Cypher web service
I was looking at the traffic logs for monrai.com, and saw quite a few visitors from the Google results for sparql software, so I took a look. Turns out, my site is the first result, and also appears among the next few results. My question is, how did that happen?? I also took a look at Google Trends for these search terms, but there isn't enough data on them to show up on its radar.

I don't really know a lot about SEO, but something I did must have worked, because there are so many other projects, software, etc that are far more popular than Cypher, Cyparkler, etc. I would tell you to just Google sparql software, but... :) Just goes to show, that Google page ranking isn't 100% accurate 100% of the time.

Here are a few of the (truly) most popular:
Openlink Virtuoso
Sesame
Redland
Jena
Longwell
I'm pleased to announce that after a lot of hard work and input from users and developers, the Monrai Cypher beta is a few days from being released. It is based on Sesame 2 and features an entirely new lexicon and framenet based on RDF. It will be accompanied by a hosted service which allows users to set up their own Cypher instances online, eliminating the need to stand up your own server. So stay tuned for the upcoming announcement.

Many thanks to Openlink and their team for all the support they have given to the project.
I've been a bit skimpy on the posts lately as I'm currently kicking off a new semantic web venture. I'm very excited about this venture, first because nothing is more thrilling to me than launching a new startup, and second because the service is something that I could find myself using today and quite frequently, as could a lot of other people I know. Couple that with the fact that it's all based on semantic web technologies and concepts, and now you understand why it's managed to command so much of my attention. Expect a beta release announcement in the next 30 days. Also, I've noticed that the interest in Cypher is still very steady, and I'm starting to see the first wave of developers who are learning the Cypher techniques and experimenting with this stuff in the lab. Please feel free to contact me with any technical questions and I can usually find a second to help you out :))

Stay tuned...
Here is a comment I posted to Nova Spivack's blog concerning Radar Networks' recent announcement about the scalability of their semantic database indexing technology:
For those of you who don't know, part of our system is a homegrown distributed grid server architecture for massive-scale semantic search. It's not the end-product, but it's something we need for our product. It's kind of our equivalent of Google's backend -- only semantically aware. Like Google, our distributed server architecture is designed to scale efficiently to large numbers of nodes and huge query loads. What's hard, and what's new about what we have done, is that we've accomplished this for much more complex data than the simple flat files that Google indexes.
I've reposted my comments here for archive purposes:

A few weeks ago, I blogged about how little confidence I had in centralized approaches to semantic web database building. Giovanni Tummarello (dbin.org) wrote a great paper on the subject, and let me tell you, it's one challenging undertaking. The main challenge facing any centralized approach is what's known as the computational burden problem:
"On the WWW, the interaction is based on HTTP requests/replies that in the great majority of the cases will be of limited impact on the server (e.g serving a file). This means that, disregarding anomalous cases, both the computational resources and network traffic required by a HTTP request are bounded. On the contrary, “requests” on the semantic web are naturally expressed in query languages and, given the graph nature of RDF structured information, the complexity of execution is not bounded a priori as it is a function of the query type as well as the quantity and the structure of the data. In other words, whoever would decide to offer the ability to answer “arbitrary questions” on a SW, would easily open himself to “denial of service” situations even in the ideal, good faith usage."
Creating a centralized database that solves the computational burden problem is one of the holy grails of the semantic web. My hat goes off to you and your team for tackling and solving this problem. I always predicted that P2P networks were the only feasible solution. Giovanni's approach is to periodically synchronize each peer's database, but only within small peer groups; once the data has been downloaded, the query is sent to the local database, thus limiting the "damage" to the user's local resources. The obvious drawback is that no one peer has 100% visibility across the entire distributed database. So if the answer to a particular SPARQL query happens to exist in triples spread across separate peers, and I haven't synced with each of those peers or I'm not in those peers' groups, then I'm just up the creek. The ideal repository would be centralized and would accept SPARQL with the speed and scalability of Google, which (correct me if I'm wrong) sounds to me like what you guys have achieved. Again, my jaw has dropped. For example, this will have serious ramifications for my work with Cypher, as my major Achilles' heel is the lack of a centralized repository of shared lexical descriptions (in RDF) collected from across the semantic web. If your service/framework could crawl, collect and most importantly "cook" RDF lexical descriptions (as the last item is what's lacking in current services like Swoogle), and if it can serve Cypher results for arbitrary SPARQL queries over the metadata of lexical entries, then you've just sped up natural language processing for the Semantic Web by about 5 years!
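
The visibility drawback can be shown with a toy Python sketch. This is not Giovanni's actual protocol or dbin's implementation; the `Peer` class, the group numbers, the sample triples, and the `authors_of_topic` join are all hypothetical and only illustrate why a query answered from a locally synced store can miss data held by peers outside the group.

```python
class Peer:
    """A peer with a local triple store that only syncs within its own group."""

    def __init__(self, name, group, triples):
        self.name = name
        self.group = group
        self.local = set(triples)

    def sync(self, peers):
        """Copy triples from peers in the same group into the local store."""
        for other in peers:
            if other is not self and other.group == self.group:
                self.local |= other.local


def authors_of_topic(store, topic):
    """Two-pattern join: ?who authorOf ?doc . ?doc about <topic>"""
    docs = {s for s, p, o in store if p == "about" and o == topic}
    return sorted(s for s, p, o in store if p == "authorOf" and o in docs)


a = Peer("a", group=1, triples=[("alice", "authorOf", "doc1")])
b = Peer("b", group=1, triples=[("doc1", "about", "rdf")])
c = Peer("c", group=2, triples=[("bob", "authorOf", "doc2")])
d = Peer("d", group=2, triples=[("doc2", "about", "rdf")])
peers = [a, b, c, d]

before = authors_of_topic(a.local, "rdf")   # the join spans two peers, so nothing yet
a.sync(peers)
after = authors_of_topic(a.local, "rdf")    # alice is found, bob lives in group 2

# A hypothetical centralized store sees every triple.
everything = set().union(*(p.local for p in peers))
central = authors_of_topic(everything, "rdf")
print(before, after, central)  # [] ['alice'] ['alice', 'bob']
```

The query cost stays on peer a's own machine, which is the benefit; the missing answer is the price.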
A friend sent me a link to the IKVM framework a few weeks ago, and as the week wound down, I was finally able to look more into it. For you Microsofties who build on .NET, and for the Java developers looking to interoperate with the Microsoft development world, IKVM looks to be a great solution. It provides a VM implemented in .NET, and Java core class libraries implemented in .NET. The payoff is that .NET applications can leverage Java libraries, and vice versa. There are of course other ways of interoperating, but this approach really allows for tight integration, which is sometimes necessary in an integration project. There's no support for AWT/Swing, but I'm guessing 99.9% of the developers looking at this don't care. There is a potential project coming up in which I may get to use this stuff in at least a prototype environment, so I plan to post the results and experience.
Nova Spivack's new venture, Radar Networks, is finally preparing to reveal the new and highly secretive (Web 2.0/Semantic Web/Mashup/PIM ???) project they've been working on for the last few years. I am really excited to hear they've gotten so far along in development, and am anticipating hearing just what this new technology platform they're building is. More importantly, what will be its impact on the Semantic Web (and ergo Cypher):
...something happened that changed my mind about this recently. I had lunch with my friend Munjal Shah, the CEO of Riya, who has an investor, Peter Rip, in common with me. Listening to Munjal tell his stories about how he has blogged so openly about Riya's growth, even from way before their launch, and how that has provided him and his team with amazingly valuable community feedback, support, critiques, and new ideas, really got me thinking. Maybe it's time Radar Networks started telling a little more of its story? It seems like the team at Riya really benefitted from being so open. So although, we're still in stealth-mode and there are limits to what we can say at this point, I do think there are some aspects we can start to talk about, even before we've launched. And besides that our story itself is interesting -- it's the story of what it's like to build and work in a deep-technology play in today's venture economy.
Good to hear another Semantic Web company has found backing in the venture capital community. I'll be staying tuned.
The creator of PingtheSemanticWeb.com has a post about a new Firefox plugin for detecting RDF on the web:

One of the new comer is the Semantic Radar wrote by Uldis Bojars. This plug-in for FireFox will notify you if it finds a FOAF, SIOC or DOAP RDF document on the web pages your surf.

The characteristic of semantic web documents is that they are not intended for humans, but for software agents (like search engines crawlers, personal agent software like Web Feed Readers, etc). The consequence is that humans do not see these documents, so no body really knows that the Semantic Web is growing and growing on the current Web.

This is the purpose of this new Semantic Radar: unveiling the Semantic Web to humans.

The Semantic Radar: much more than that

This plug-in is much more than that. Effectively, each time it detects one of these semantic web documents, it will notify PingtheSemanticWeb.com web service.

This is where the interaction between semantic web services and applications are starting to emerge. Now Web browsers will detect semantic web documents and notify a web service acting as a central repository for semantic web documents
I had the thought to extend Cypher to query the PingtheSemanticWeb.com service to detect Cypher datasets, and to notify when it has loaded new datasets created by the user. My question is, is there a way for my software to detect only the RDF documents it is concerned with ( i.e. Cypher dataset documents)? If so, I think developing a simple ontology that can be used to wrap Cypher dataset documents into, basically to point to their location on the web and other metadata, then having Cypher to download the datasets would be an excellent project.
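
One way this kind of detection could work is sketched below in Python. It assumes that pages advertise their RDF through `<link>` elements with `type="application/rdf+xml"` (the common autodiscovery convention for FOAF files), and it assumes, purely hypothetically, that Cypher dataset documents would be marked with a `title` of "Cypher dataset". That title convention, the `RdfLinkFinder` class, and the example URLs are inventions for illustration; nothing here is a feature of Semantic Radar or PingtheSemanticWeb.com.

```python
from html.parser import HTMLParser


class RdfLinkFinder(HTMLParser):
    """Collect (title, href) pairs for <link> elements that point at RDF/XML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        is_rdf = (attributes.get("type") or "").lower() == "application/rdf+xml"
        if is_rdf and attributes.get("href"):
            self.links.append((attributes.get("title") or "", attributes["href"]))


def find_rdf_links(html, title=None):
    """Return RDF links in a page, optionally keeping only those with a given title."""
    finder = RdfLinkFinder()
    finder.feed(html)
    if title is None:
        return finder.links
    return [link for link in finder.links if link[0] == title]


PAGE = """
<html><head>
<link rel="meta" type="application/rdf+xml" title="FOAF" href="http://example.org/foaf.rdf" />
<link rel="meta" type="application/rdf+xml" title="Cypher dataset" href="http://example.org/numbers.rdf" />
<link rel="stylesheet" type="text/css" href="style.css" />
</head><body></body></html>
"""

print(find_rdf_links(PAGE))
print(find_rdf_links(PAGE, title="Cypher dataset"))
```

A small ontology describing dataset documents, as proposed above, would be a sturdier filter than a title string, since the check could then be made on the RDF itself.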
Marcus Hutter has announced that a 50K purse will go to the developer of an algorithm which can compress the first 100MB of Wikipedia better than its predecessors:
Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 100MB of Wikipedia better than your predecessors, you(r compressor) likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs.
If anyone wins the prize using Cypher's impeccable pattern-matching capabilities, we'll humbly accept your gratitude :)
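
For a feel of what "file sizes are hard numbers" means, here is a small Python sketch that compares three general-purpose compressors from the standard library on the same input. It is only a toy: the sample text and the `compressed_sizes` helper are made up, it does not follow the prize's actual rules, and it has nothing to do with Cypher.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of the input under a few standard compressors."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


# Highly repetitive sample text; real prose compresses far less well.
sample = ("The Semantic Web links people, places and things. " * 200).encode("utf-8")

for name, size in compressed_sizes(sample).items():
    print(f"{name:>5}: {size} bytes")
```

A compressor that models the text better produces a smaller number, which is the measurable stand-in for "smarter" that the prize relies on.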

More from Ebiquity Blog.
A new Cypher release is available. This is a bug fix release:

Version 0.7.2

Fixes: from 0.7.1

-- Hardcoded reference to smonroe login for Sesame server now removed

With previous versions, users had to create a Sesame account to match the account in the Cypher config file. This fix allows users to change the config file to match their own Sesame login info.

Update:

There was also a hardcoded reference to the two default Sesame repositories which was also found and fixed.
I ran across a centralized RDF search engine, Swoogle. From the site:

Swoogle has a collection of over 1M error-free RDF documents collected from the Web and an additional ~700K documents that have embedded RDF, are malformed but appear to be RDF, or are no longer accessible. We’ve intentionally limited the number of simple RSS and FOAF documents in the current collection.

A centralized database has obvious benefits. In an ideal world, a Google would crawl RDF documents and serve up queries through one central interface. But RDF isn't HTML, nor does SPARQL lend itself to any sort of straightforward keyword mappings. Building a centralized database to process billions of open-ended queries per day is a mammoth undertaking. It appears that Google, which is perhaps the only company on the planet with enough imagination, incentive, and expertise to effectively build such a centralized database, is also the company that is most skeptical about the viability of the Semantic Web. The Semantic Web may also pose inherent threats to Google, which has built its empire on algorithms that attempt to address the deficiencies of the unstructured World Wide Web.

I therefore believe that the path of least resistance for bootstrapping the Semantic Web will be a P2P network, or at the very least, a hybrid of the centralized and P2P approaches. Swoogle seems like a great first attempt, and I'll be watching out for progress made by this and other centralized attempts, but I'd sooner bank on distributed P2P approaches.

The industry research firm Gartner has announced its Emerging Technologies Hype Cycle for 2006, which analyses the maturity, impact and adoption speed of 36 technologies and trends over the next ten years. Among this year’s themes of technologies eliciting significant momentum is the Semantic Web. The list includes new or heavily hyped technologies, where organisations may be uncertain as to which will have most impact on their business.
A new release of Cypher is available. This is a feature enhancement release. Now Cypher can generate the integer representation of any arbitrary natural language number:

Version 0.7.0

Enhancements: from 0.6.9

-- added new NumberTranscoder_LITERAL; allows natural language numbers to generate an integer representation. The integer is wrapped in an RDF literal of type xsd:nonNegativeInteger or xsd:negativeInteger, making it consumable by semantic web applications.

-- added new number pattern grammar example to exploit number transcoder

There are also a couple of new grammar definition files which cover natural language numbers in English, e.g. Five hundred twenty eight million five. Extending them to cover numbers in other languages shouldn't be a problem. The extended example dataset covers numbers up to trestrigintillion (10^102 I think, but correct me if I'm wrong). Since so many people have been waiting for an online demo, I plan to set up the number transcoder as an interim online demo, especially since the input set in this case is finite.

I will post a more detailed explanation of the new dataset, most likely in an article on the main Monrai website. In the meantime, try starting Cypher and entering: Your Name is some long number, for example Chris is twenty two thousand forty nine. Then look at the output file. There should be an owl:sameAs triple near the top, and one object should be the number you said. The BE verb is set to output an owl:sameAs triple, but you can easily change it to set the subject's age (e.g. myonto:age). Also, conjunctions are not covered by the number patterns I wrote, so nine hundred and two won't match, but nine hundred two will match. I leave as an exercise for the user the task of extending the example number pattern grammar to cover conjunctions.
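To make the idea concrete, here is a minimal Python sketch of the kind of mapping the number transcoder performs. It is not Cypher's grammar or code: the function names (words_to_int, to_literal) and the small word tables are assumptions made for illustration, and it only covers scales up to trillion. Like the example grammar, it deliberately rejects conjunctions such as "and".

```python
# Illustrative sketch only; not Cypher's NumberTranscoder_LITERAL.
# Function names and word tables are hypothetical.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {
    "thousand": 10**3, "million": 10**6,
    "billion": 10**9, "trillion": 10**12,
}


def words_to_int(text):
    """Convert an English number phrase (no conjunctions) to an int."""
    total = 0   # sum of the groups already closed by a scale word
    group = 0   # value of the current group (below one thousand)
    for word in text.lower().replace("-", " ").split():
        if word in UNITS:
            group += UNITS[word]
        elif word in TENS:
            group += TENS[word]
        elif word == "hundred":
            group *= 100
        elif word in SCALES:
            total += group * SCALES[word]
            group = 0
        else:
            # "and" lands here, so "nine hundred and two" is rejected
            raise ValueError(f"unrecognized number word: {word!r}")
    return total + group


def to_literal(n):
    """Wrap a non-negative integer as a typed RDF literal (N-Triples style)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return f'"{n}"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>'


if __name__ == "__main__":
    n = words_to_int("twenty two thousand forty nine")
    print(n, to_literal(n))
```

Supporting conjunctions in this sketch would only take skipping the word "and"; in Cypher itself the equivalent change belongs in the number pattern grammar.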

Natural language numbers are normally spoken as opposed to written/typed, so speech recognition systems are probably a more appropriate use case for this dataset.

Have fun!
Remember the Open Mind Project? Well, I recently heard that a group at MIT has taken that commonsense database and created a .NET explorer as well as a Natural Language Processing framework. Here's more from the site:
The ConceptNet knowledgebase is a semantic network presently available in two versions: concise (200,000 assertions) and full (1.6 million assertions). Commonsense knowledge in ConceptNet encompasses the spatial, physical, social, temporal, and psychological aspects of everyday life. Whereas similar large-scale semantic knowledgebases like Cyc and WordNet are carefully handcrafted, ConceptNet is generated automatically from the 700,000 sentences of the Open Mind Common Sense Project – a World Wide Web based collaboration with over 14,000 authors.
There's a lot of talk in the docs about it using Microsoft IronPython, which is an implementation of Python for .NET. In my opinion, such common sense databases are akin to an RDF instance database. So while these types of databases don't explicitly offer the type of information Cypher needs to perform language processing, Cypher could be used to populate and query these databases using plain language. In addition, some data, such as type hierarchies, can be extracted from these sources to help build lexicons. You can expect more Cypher support of such common sense resources as they continue to gain momentum.
Cycorp has released OpenCyc 1.0. The Cyc system is a database of common sense assertions (e.g. rain is wet, grass is found outdoors). A couple of years back, I wrote a Cyc microtheory transcoder as a sort of toy application for Cypher. The system translated natural language descriptions, phrases and questions into CycL microtheories and queries. But I couldn't get enough people interested to justify the work. Looks like I might be blowing the dust off that old code.

Here's more on the announcement from the OpenCyc website:

Release 1.0 of OpenCyc includes:

  • The entire Cyc ontology containing hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an upper ontology whose domain is all of human consensus reality.

  • English strings corresponding to all concept terms, to assist with search and display.

  • A compiled version of the Cyc Inference Engine and the Cyc Knowledge Base Browser.

  • Documentation and self-paced learning materials to help users achieve a basic- to intermediate-level understanding of the issues of knowledge representation and application development using Cyc.

  • A specification of CycL, the language in which Cyc (and hence OpenCyc) is written.

  • A specification of the Cyc API for application development.


I'm trying to find an RDF view of the Cyc database that actually exposes the knowledge using RDF semantics. If anyone knows of one, please let me know.
There's a new Cypher release available:

Fixes: from 0.6.8

-- dynamic addition of FOAF entry for proper nouns not already entered in the database
Today, I ran across Slashfacet, a generic browser for heterogeneous semantic web repositories. The browser works on any RDFS dataset without any additional configuration. The interface controls change depending on the data being viewed. Reminds me a lot of Dbin. Here's the paper.
I was testing the latest example dataset release, and discovered the following input didn't produce output:

Tom Hanks stars in The Terminal

After investigation, I noticed there was no word sense for 'star' that accounted for the 'in' preposition-object construction. So I added it, and now the following works fine:

Tom Hanks stars in the Terminal --> RDF
the movies that star Tom Hanks --> SeRQL
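
As a rough picture of what those two mappings mean, here is a small Python sketch in which the statement becomes one stored triple and the noun phrase becomes a pattern query over the same store. This is not Cypher's actual output: the ex: identifiers, the starsIn predicate, and the helper functions are all hypothetical, and a real setup would hold the RDF in a Sesame repository and issue a SeRQL query against it.

```python
# Illustrative sketch only; identifiers and helpers are hypothetical.

triples = set()  # tiny in-memory stand-in for an RDF repository


def add_triple(subject, predicate, obj):
    """Store one (subject, predicate, object) statement."""
    triples.add((subject, predicate, obj))


def match(subject=None, predicate=None, obj=None):
    """Return stored triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "Tom Hanks stars in The Terminal" --> one statement
add_triple("ex:Tom_Hanks", "ex:starsIn", "ex:The_Terminal")

# "the movies that star Tom Hanks" --> a query with the movie left open
movies = [o for (_, _, o) in match(subject="ex:Tom_Hanks", predicate="ex:starsIn")]

if __name__ == "__main__":
    print(movies)
```

The point is that both inputs use the same word sense of 'star': the statement fills the object slot, while the question leaves it as the variable to be returned.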

As a side note, the datasets for the two 'movie' examples covered in the Cypher User Manual page are still on the way. I discovered a bug in how nominal clausal modifiers that are missing both the verb and subject are processed. This affects the pattern Actresses who played in movies with Tom Hanks. As a quick hack, however, I just treated the noun phrase as having one clausal modifier with two prepositional phrases, and it parses fine. In actuality, though, the last prepositional phrase is attached to the noun head movies: movies with Tom Hanks. And this is actually an abbreviation for: movies that are cast with Tom Hanks. The big difference is that the framework works best when frame slots are filled by clauses (i.e. verb lexemes), not nouns. By expanding the noun phrase's prepositional phrase into a clause, we can now govern its semantics by just referencing a verb. So, instead of adding a new feature to the movie lexeme to cover each possible prepositional phrase complement, we just find a verb which governs the semantics, in effect reusing other lexemes. A better explanation of this is on the way. Please be patient as I update the lexicon definition language to address this phenomenon.