Linked Data Blog Aggregator

DBpedia BlogDBpedia 3.3 released

We are pleased to announce the release of DBpedia 3.3. This release is based on Wikipedia dumps of May 2009.

The new release includes the following improvements over DBpedia 3.2:

1. more accurate abstract extraction
2. labels and abstracts in 80 languages
3. several infobox extraction bugfixes
4. new links to Dailymed, Diseasome, Drugbank, Sider, TCM
5. updated Open Cyc links

You can find the datasets here, and the rdf files here. The dataset is available to be queried at our Sparql endpoint.

After eight long months without DBpedia release (due to a lack of Wikipedia dumps), today’s release will bring us up to speed again, and we will release DBpedia datasets much more often in the future.

Frederick GiassonRelease of structWSF, conStruct and the Community Web Site

The last few months have been challenging in term of amount of work to get done, in focusing on deliverables and in getting ready for the release of conStruct and structWSF sources codes, documentations, tutorials, web sites and demos.

I am now really happy to be able to finally announce the release of both software code sources along with a new development community website where users and developers can exchange ideas about these two news projects.

The biggest milestone of the last months is now behind us. However, this is just the beginning of everything!

I think that many things have been written about these two projects already. I don’t want to write any tutorial at this point. So the only thing I will do right now is to point you the more relevant documentation, web sites, blog posts and demos about each project. The next step will be to write about specific use cases, features, etc.

Community Web Site

The community Web site is a place where developers and users of structWSF and conStruct can meet to talk about both projects, to report bugs and issues, to submit new enhancements, to find tips and tricks, etc.

I would suggest you to create a new user profile on the community Web site if you are interested in communicating with other members.

structWSF

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is fully RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

conStruct

conStruct is a distro of the Drupal framework that aims to set a new standard in data integration and as a structured content system (SCS). With conStruct, you can let your data and its structure drive your applications. You can easily interoperate your diverse internal information with public content on the Web. And you can leverage a platform designed from the ground up for knowledge management and collaboration.

AI3:::Adaptive Information (Mike Bergman)structWSF: A Framework for Collaboration Networks

structWFS

An Innovative, Distributed, Scalable Design with Dataset Access Rights

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via separate ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is fully RESTful in design and is based on HTTP and Web protocols and open standards, conforming to what is known as a Web-oriented architecture. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. It also has direct interfaces to the Virtuoso RDF triple store and the Solr faceted, full-text search engine.

This post follows the release of the alpha version of the open source structWSF code on the OpenStructs Web site. It is available for download under Apache 2 license.

But, Wait! There’s More!

These baseline capabilities are useful enough. But there is another foundation to structWSF that is quite innovative and exciting: Its explicit design to support collaboration networks. It is this aspect that is the focus of this current article.

The collaboration design is a result of the needs of the Bibliographic Knowledge Network (BibKN or BKN) [1]. BibKN has as one of its express purposes creating a network of collaborators in math and statistics, ranging from the individual researcher to departments and universities and various virtual organizations (VOs) representing different communities of interest. Moreover, this nucleus of researchers also has external collaborators ranging from major publishers to software and service providers of various sizes from around the globe.

Thus, one key requirement of the BKN project was to design an infrastructure responsive to this broad spectrum of interests, locations and organizations. And, besides questions of varying scale, locale and distribution, there was also the need to combine public and private data. In some cases, initial work products need to be kept within its sponsoring groups before being made public. Sometimes external publishers want to segregate network members by whether they are already paid subscribers or not. And, most importantly, the project had a mandate to create an easy and open framework for encouraging incipient collaborators and curators to add and take ownership of new datasets.

Boiled down, these requirements represent a completely fluid spectrum of scales, access rights, virtual groups and distributed locations. These requirements were daunting indeed to establish a workable and responsive framework. But, what has resulted from this mandate — structWSF — is a generalized solution that has applicability to collaboration within any knowledge network.

Four Exemplar Deployment Modes

BibKN anticipates and is to include four exemplar types of participants on the network (or “nodes’, which are not to be confused with the different meaning of node in Drupal):

  • VO Nodes – these “virtual organization” nodes are the main collaboration portals and are generally based on a content management system (CMS) [2]. VO nodes may also be publishers or consumers of datasets
  • Gateways – connections to existing external content in the native data formats of the publisher, which are converted and made available to the network in BKN-compliant forms [3]
  • Hubs – aggregate suppliers of useful datasets in BKN-compliant data formats, most often BibJSON [4], and
  • Individual dataset contributors and clients, generally located on a desktop machine.

Each of these nodes exposes its data to the rest of the network via a structWSF Web services framework. Each structWSF installation provides an access point and endpoint to the network. Through these installations, data is converted to “canonical” form for use by other nodes on the network with common tools and services provided.

In conceptual, form, then, the network can be represented as follows:

structWSF Data Model Relationships

Each node has a structWSF instance, the common network denominator, shown in blue.

A key aspect of each structWSF installation is dataset registration and access authorization. Only users with proper authorization may access or exercise certain privileges such as write or updates for a given dataset.

The other core Web services provided with structWSF are the CRUD functional services (create - read - update - delete), import and export, browse and search, and a basic templating system [see (3) in the next figure]. These are viewed as core services for any structured dataset. The current alpha release supports CSV, TSV, RDF/XML, RDF/N3 and XML, with JSON forthcoming shortly.

Rights: The Intersection of Web Service, Dataset, Group, Role and CRUD

The controlling Web service in structWSF is the Authentication/Registration WS [see (2) in the figure below]. The current alpha version of structWSF uses registered IP addresses as the basis to grant access and privileges to datasets and functional Web services. Later versions will be expanded to include other authentication methods such as OpenID, keys (à la Amazon EC2), foaf+ssl or oauth. A secure channel (HTTPS, SSH) could also be included.

A simple but elegant system guides access and use rights. First, every Web service is characterized as to whether it supports one or more of the CRUD actions. Second, each user is characterized as to whether they first have access rights to a dataset and, if they do, which of the CRUD permissions they have [see (4, 5)]. We can thus characterize the access and use protocol simply as A + CRUD.

structWSF Data/WS Access

Thereafter, a mapping of dataset access and CRUD rights (see below) determines whether users see a given dataset and what Web services (”tools”) are presented to them and how they might manipulate that data. When expressed in standard user interfaces this leads to a simple contextual display of datasets and tools. For example, under standard search or browse activities the user would only see results sets drawn from the datasets for which they have access. Similarly, users only see the tools that their CRUD rights allow.

At the Web service layer, these access values are part of the GET request. The system, however, is designed to more often be driven by user and group management at the CMS level via a lightweight plug-in or module layer.

Because a CMS may employ its own access system and protocols, the potential combinations can become quite large. Let’s take for an example a VO node in the BibKN scenario which layers Drupal (via the conStruct modules) over the structWSF framework. By including the additional third-party contributed Drupal module of Organic Groups, we also now add an entire dimension of group access to the standard roles access in the base Drupal [5]. So, in this scenario, we theoretically have these potential access and rights combinations:

  • By dataset
  • By Web service (tool) and whether that tool can potentially support create, read (access), update or delete [CRUD] operations
  • By user role (for example, administrator, owner, curator, contributor, unregistered)
  • By group (for example, SuperWhizBangs, SortOfOKs, Clueless, RockStars).

Since the group and user role categories can be quite extensive, the combinatorial result of these options can also be quite large.

Nonetheless, as a general proposition, these access and rights dimensions can capture most any reasonable use case.

Patterned Profiles Aid Management

One way to ease the management of these choices at the UI level is to create a series of access patterns or templates — called profiles — to which a newly registered dataset can be assigned. While the Drupal site owner could go in and change or tweak any of the individual assignments, the use of such profiles simplify the steps needed for the majority of newly registered datasets (Pareto assumption).

For instance, consider these possible profile patterns:

  • Profile: Public (standard) — this profile is for a dataset intended for broad public access
  • Profile: Registered — this profile is for datasets that are limited to registered users of a VO node (possibly as a way to prevent spam or to encourage membership or participation)
  • Profile: Curated — this profile is where a specific group or groups (which themselves can be flexibly determined and assigned) has curation rights for the dataset, or
  • Profile: Internal — this profile is for internal (private) datasets where only a specific group or groups may access or modify. In some instances, an internal dataset might be the profile type while the dataset is under development, with the profile shifting to a broader access category once completed.

We can now expand this concept for a given dataset by adding the dimension of user type or category. Four categories of users can illustrate this user dimension:

  • O = Owner (the original registrar of the dataset; often possibly the “owner” of the VO node, but not necessarily so)
  • G = Group member (a registered user who is a member of a specific group)
  • R = Registered user (an authorized VO node user with a Drupal login and password)
  • P = Public (anonymous user)

(Of course, with a multitude of groups, there are potentially many more than four categories of users.)

To illustrate how we can collapse this combinatorial space into something more manageable, let’s look at what one of the profile cases noted above — that is the Public profile — can now be expressed as a pattern or template. In this example, the Public profile means that owners and some groups may curate the data, but everyone can see and access the data. Also note that export is a special case, which could warrant a sub-profile.

We also need to relate this Public profile to a specific dataset. For this dataset, we can characterize our “possible” assignments as described above as to whether a specific user category (O, G, R and P as noted above) has available a given function (), gets permission rights to that function by virtue of the assigned profile (), or whether that function may also be limited to a specific group or groups () or not.

Thus, we can now see this example profile matrix for the Public profile for an example dataset with respect to the available structWSF Web services:

Data Access Matrix

Note, of course, that these options and categories and assignments are purely arbitrary for our illustrative discussion. Your own needs and circumstances may vary wildly from this example.

Matrices such as this seem complex, but that is why profiles can collapse and simplify the potential assignments into a manageable number of discrete options. The relevant question, with a quick answer, is for you to assemble profiles responsive to your own specific circumstances.

And, of course, if your pre-packaged profiles need to be tweaked or adjusted for a particular circumstance, the CMS enables all assignments to be accessed in individual detail.

A Powerful Vision

Via this design, knowledge and collaboration networks can be deployed that support an unlimited number of configurations and options, all in a scalable, Web-accessible manner. The data that is accessed is automatically expressed as linked data. This same framework can be layered over in situ existing data assets to provide data federation and interoperable functionality, all responsive to standard enterprise concerns regarding data access, rights and permissions.

This is not science fiction, and this is not complex. When combined with its data mixing and conversion potentials [3], we can now see emerging a general framework that enables access and interoperability to virtually any data source and for virtually any purpose, with permissions and rights built in, anywhere and everywhere across the Web.

These are exciting prospects that were not possible until Web-oriented architectures with structured RDF data came to the fore. There are no longer any barriers to the powerful vision of complete data access and interoperability without disrupting existing assets.

And the mere thought of that, is, disruptive, indeed.

Note: The alpha version of structWSF and its related conStruct modules are somewhat raw or incomplete in some ways. A few of the functions expressed in this posting have not yet been released in these code bases.

[1] BibKN is a project to develop a suite of tools and services to encourage formation of virtual organizations in scientific communities of various types. The project started in September 2008 with funding by the NSF Cyber-enabled Discovery and Innovation (CDI) Program. The major participating organizations are the American Institute of Mathematics (AIM), Harvard University, Stanford University and the University of California, Berkeley. Research support to BibKN has come in part from NSF Award 0835851.
[2] structWSF is actually combined with the conStruct structured content system and Drupal for the delivery of the VO nodes.
[3] See the earlier posting on, structWSF: A Framework for Data Mixing, for discussion about structWSF data formats.
[4] BibJSON is the standard, human-readable and editable data exchange format used within the BKN project. It has a standard attribute vocabulary geared to bibliographic material and is based on the JSON (JavaScript Object Notation) data notation.
[5] Though the specifics may differ, including the modules and add-ins, other leading CMS systems provide similar functionality.

AI3:::Adaptive Information (Mike Bergman)structWSF: A Framework for Data Mixing

Random Colour Swirl photo courtesy from PD Nathan at Photobucket

Interoperable Naïve Data Structs, Datasets and Canonical RDF

As I noted in my review of SemTech 2009, one of the key themes of the conference was data federation. Unfortunately, data federation has been a term a bit out of vogue for a while. (Though I still think it best captures the space.)

The current vernacular has been pushing forward an alternative: data mixing. One of the larger product pushes at the conference was by Zepheira for its new Freemix service and product. Freemix is a hosted service largely built around the Exhibit data display application, aided by some tools to make creating an exhibit easier. Exhibit is an attractive presentation system; for nearly three years AI3’s own Sweet Tools dataset listing of semantic Web and -related tools has been presented via Exhibit.

Freemix looks promising and is now being offered in beta. But one thing caught my ear when listening to the company’s announcement: they are not yet able and ready to show the “data mixing” part of the system. Its release is apparently being delayed until later this year because of the difficulties encountered.

This post coincides with the release of the alpha version of the structWSF code on the OpenStructs Web site. It is available for download under Apache 2 license.
We’ll be blogging a few more times in the coming days regarding other possible uses and applications for this platform-independent Web services framework.

What is Data Mixing and Why is it So Hard?

As a new term there is no “official” definition of data mixing. However, I think we can consider it as generally equivalent to the older data federation concept.

Data federation is the bringing together of data from heterogeneous and often physically distributed data sources into a single, coherent view. Sometimes this is the result of searching across multiple sources, in which case it is called federated search. But it is not limited to search. Data federation is a key concept in business intelligence and data warehousing and a driver behind master data management (MDM).

As I first wrote about data federation about five years ago [1]:

Data federation first became a research emphasis within the biology and computer science communities in the 1980s. At that time, extreme diversity in physical hardware, operating systems, databases, software and immature networking protocols hampered the sharing of data.
Yet it is easy to overlook the massive strides in overcoming these obstacles in the past two decades.

The Internet and its TCP/IP and Web HTTP protocols and XML standards in particular, have been major contributors to overcoming respective physical and syntactical and data exchange heterogeneities. The current challenge is to resolve differences in meaning, or semantics, between disparate data sources. Your “glad” may be someone else’s “happy” and you may organize the world into countries while others organize by regions or cultures.

Resolving semantic heterogeneities is also called semantic mediation or data mediation. Though it displays as a small portion of the pyramid above, resolving semantics is a complicated task and may involve structural conflicts (such as naming, generalization, aggregation), domain conflicts (such as schemas or units), or data conflicts (such as synonyms or missing values). Researchers have identified nearly 40 distinct types of possible semantic heterogeneities [2].

Ontologies provide a means to define and describe these different worldviews. Referentially integral languages such as RDF (Resource Description Framework) and its schema implementation (RDF-S) or the Web ontological description language OWL are leading standards among other emerging ones for machine-readable means to communicate the semantics of data.

Fortunately, we have climbed most of this data federation pyramid. The stumbling block now are the semantics. This is made all the harder when we place too much burden on the data transmission or “packet” itself. In other words, does exchange also carry with it the burden of meaning? The rest of this post tries to explain what I mean by this and how it relates to our new structWSF Web services framework.

Is it Apples or Oranges?

Not to pick on any one thing or any individuals, but three recent threads on semantic Web-related mailing lists help illustrate in various ways some interesting mindsets. While there is much on each of these threads of other value, I’m only focusing on a narrow topic from each based on my thesis at hand.

And, what is that thesis? It is simply that we too often mix instance record and attribute assertions with schema representations and world views. And, when we do, we sometimes make mountains out of molehills (or mix apples and oranges to completely mix metaphors).

Example 1: Squeezing RDF into JSON

JSON (JavaScript Object Notation) is a data notation or syntax, easily created and widely used for current Web apps. It has a rather simple syntax for representing attribute-value pairs. Many useful tools and parsers for the serialization exist.

In keeping with his general and broad criticisms of how the semantic Web standards and approaches have been promulgated by the W3C to date, John Sowa most recently expressed his ideas in a posting to the ontolog-forum mailing list under the heading of ‘Semantic Systems’ [3]. In this thread, John proposes:

1. The recommended exchange form for RDF will become JSON. Any JSON documents that are limited to triples can use the old XML-based RDF form, but they can also use the more compact and more general full JSON.

Then, in a subsequent posting to that thread he notes:

5. The W3C made a major blunder with a one-size-fits-all approach that tried to use a document tagging language as a knowledge representation language. The result was the *worst* notation for logic ever invented.

Finally, he goes on to note in a further post:

JSON could be used as an alternative to XML for the syntax, but the lack of a standard semantics for JSON means that it could *not* be used as a replacement for RDF *unless* an official standard were adopted for mapping RDF to and from a particular subset of JSON whose semantics was defined in Common Logic.

All of this John proposes in the spirit of:

The goal of my proposal is nothing less than a total *integration* of the Semantic Web methodologies with the methodologies that have been used in the traditional software development community [3].

I find common ground with a couple of the ideas in this proposal. First, accepted formats like JSON should have a prominent place in data exchange. Second, leveraging methodologies used in the traditional community is definitely a good thing.

But John, while suggesting reuse of existing traditions, is also paradoxically recommending a wholesale replacement for RDF. He is also positing a single exchange standard (JSON). And, he stops tantalizingly short of recognizing an important truth that I’m sure he knows: simple instance record assertions and representations — the essence of data exchange — can and should be viewed separately from schema representations.

As I have noted in my earlier naïve data ’structs’ series, there are in fact scores of existing data transfer formats that have been adopted by their communities — and are likely to remain popular within those communities for some time — that can play a similar role to JSON. So long as the role of data exchange is kept to the assertions (”metadata”) about instances, many formats can play in the sandbox.

The role of RDF may or may not reside with data exchange. To conflate and equate RDF and JSON is to reduce the power of keeping instance record representations separate from schema and world view representations. John’s basic sensibilities, I think, could be more effectively promoted by not posing ‘either-or’ strawmen and recognizing that data exchange formats will ALWAYS be diverse and heterogeneous.

Observation: Existing and emerging data ’structs’ useful to data exchange will remain manifest in format and diversity; data exchange imperatives are a different matter from schema and knowledge representation.

Example 2: RDFa is Not ‘Expressive’ Enough

Somewhat in contrast to this thread was a different one by Martin Hepp, editor of the excellent Good Relations ontology, on the LOD (linked open data) mailing list [4]. This thread, which sensibly questions how difficult it is for mere mortals to configure an Apache server to support publishing RDF, reached further into the realm of RDFa as a document annotation language.

As Hepp states,

The reason is that, as beautiful the idea is of using RDFa to make a) the human-readable presentation and b) the machine-readable meta-data link to the same literals, the problematic is it in reality once the structure of a) and b) are very different. For very simple property-value pairs, embedding RDFa markup is no problem. But if you have a bit more complexity at the conceptual level and in particular if there are significant differences to the structure of the presentation (e.g. in terms of granularity, ordering of elements, etc.), it gets very, very messy and hard to maintain.

Further discussion in this thread elaborates the interest in having the documents in which the RDFa is embedded carry much more schema-level information.

Like the Sowa case, this raises the question of where to draw the line. Should embedded metadata in documents carry complex schema information as well? So, we now shift the focus from data exchange to schema representation.

I think this is really unnecessary since it is quite easy in RDFa to refer to a separately specified schema. By, in this case, conflating metadata transfer and exchange with schema, the bar has been raised unnecessarily high.

If we need to capture schema and world views, fine, let us do so directly and succinctly. Then, let our document metadata (in this case using RDFa) make attribute assertions about that “payload” simply and cleanly. The Web certainly does not need individual documents carrying with them entire schema representational views of the world.

Observation: Data exchange, even based on RDF (via RDFa), is best kept to the assertions of facts and attributes.

Example 3: Mixing Vocabularies

In a microformats context, Thomas Lörtsch posed some questions on mixing vocabularies [5] and how they should be interpreted. This caused an involved discussion of intent and possible implications and best practices, with discussants including Brian Suda, Peter Mika, Ben Ward and others. It also led to the start of a useful wiki page on how objects should be represented in Web pages when multiple microformats can be invoked.

For quite some time microformats, I think, have gotten the “mix” just about right. They have created well-reasoned attributes for distinct instance types and seek to keep their embedding of that information simple in existing documents. Some advocate while others question the rigor of the microformat structure; that is not the topic here.

What is interesting about this thread is that it evolved to discuss the implications and best practices when an author posts a document with more than one microformat. How do these vocabularies relate? How should we, as “consumers” of the document, parse the vocabularies?

Yahoo!’s SearchMonkey service has recognized microformats for some time, and its questions regarding interpretation and best practices in the thread were natural. But the interesting point that seemed to come out of this thread is that users will post microformats as they wish. While care and standards in the design of the microformats can help reduce confusion and conflict, it can not guarantee it. The final responsibility for proper ingest and processing likely resides with the aggregators and publishers that consume such data.

So, here, too, we have another case of asserting metadata and embedding for data exchange in a slightly different native format than RDF. Huzzah!

Observation: Standards setters and consuming agents (often aggregators, publishers or search engines) should take lead responsibility for best practices and processing attribute data, realizing that original authors and developers may not fully comply.

Revisiting the ABox and TBox Split

structWFSThese examples are a bit of a long way around the barn to reinforce what we have been arguing for some time: the need for a proper split between the ABox (assertions related to instances) and the TBox (concept relationships, schema and world views) [6]. This has been a pretty constant theme in our writing, ranging from first introductions, to its relation to description logics, relationships to existing data ’structs’, and explicit discussion of ABox and TBox roles in a four-part series.

One of the key points throughout this writing is that an ABox-TBox mindset provides a context and rigor for looking at questions such as our three examples above. In all three cases, I argue, the seeming conundrums result from lacking this mindset. Once this mindset is applied, the respective roles of various data formats, RDF, schema and the like naturally fall into place.

Of course, the Web is also a dirty and chaotic place where niceties of design and best practices are routinely ignored or unknown or purposefully rejected. So be it. This is reality. This reality needs to be accommodated. But good design can help overcome it and work to establish resilient, flexible architectures.

Of course, even though this might be good design, there is no ability to enforce such distinctions across the Web. However, insofar as key implementators are concerned (standards writers, major publishers, tools developers, industry experts, and the like) we can put in place better approaches. This mantra is at the heart of all that Structured Dynamics does — including the structWSF Web services framework, just released as open source code.

A General Data Mixing Model

So, now we can finally turn our attention to the structWSF Web services framework, more broadly described here.

There are a number of perspectives and contexts to view this structWSF framework. In this posting, we take the boundary conditions of data formats and data exchange [7]. The key question for this perspective is: given the realities noted above, what is an adaptive framework for data mixing on the Web? Our schematic answer to this question is below:

structWSF Data Model Relationships

The basic design has two key data considerations. First, all structWSF tools and Web services and schema work from the canonical RDF data model. It is the hub and common denominator for all structWSF installations. We are able to design and optimize generic tools and services (including converters) around this canonical framework.

Second, we assume most everything in the outside world to be non-compliant with this canonical model, with the data representations often naïve and incomplete. Converters (also known as translators or RDFizers) are an essential bridge to this external world, and need to be designed for re-use and extensibility.

Where the outside world is compliant, they conform to the structWSF APIs or are themselve structWSF installations. In these cases, direct data exchange and access with permission rights occurs at a dataset level (not shown).

The Naïve Part of the Spectrum

Converters are themselves bona fide Web services at the structWSF level. (Only a few are presently included in the alpha release.) While some may be one-off converters (sometimes off-the-shelf RDFizers), and often devoted to large volume external data sources, it is also helpful to emphasize one or more “standard” naïve external formats. A “standard” external format allows for a more sophisticated converter and enables specific tools to be more easily justified around the standard naïve format.

As noted above, this “standard” is often JSON or a derivative of JSON. But, just as readily, the common ‘naïve’ format could be SQL from relational databases or another format common to the community at hand. In many ways, because the emphasis of data exchange is on the ABox and instance records and assertions (and attribute extensions), the actual format and serialization is pretty much immaterial.

Emphasizing one or a few naïve external formats allows more tools and services to be cost-effectively developed for those formats. And, even though the format(s) chosen for this external standard may lack the expressiveness of RDF (and, ultimately, OWL), because the burden is principally related to data exchange, this layer can be readily optimized for the deployment at hand.

Besides import converters it is also important to have export services for the more broadly used naïve external formats. In fact, some structWSF services can be devoted to data cleanup or attribute (property) or object reconciliation (including disambiguation as a possibility). In this manner, structWSF installations could also improve the authority and trustworthiness of standard data in the wild.

Another common service for this naïve data is to give it unique URI identifiers and to make it Web-accessible, thus turning it into linked data.

The RDF Canonical Data Model

Such generic services are possible because the “highest common denominator” for the system is the canonical RDF model. Because it is the consistent basis for tools and services, once a converter is available and the external information schema is mapped to the internal structure, all existing tools and services are available for re-use. Moreover, this system and its datasets are now ready for sharing with other structWSF instances, within the enterprise or beyond.

Thus, we begin to see a network of canonical “hubs” in a sea of heterogeneity, the interoperation of which is facilitated by a structWSF framework at every network node. This design is discussed more in the next part of this series.

Some, such as Sowa noted above, would prefer a grounding in common logic (CL) as opposed to RDF. Our choice to use RDF is based on the simplicity and understandability of the data model, plus the richness of languages and standards from the W3C that surround the framework.

Even here, however, the RDF basis of structWSF need not be the final word. Because of a keen intent to keep all designs and ontologies used by structWSF firmly grounded in description logics, it is possible for the structWSF basis to be converted to other languages and frameworks such as CL that can be expressed in DL.

Bringing it Back to Data Federation

Data mixing — or more preferably, data federation — has as its heart the premise of heterogeneous and distributed data sources. It implicitly acknowledges differences in syntax, semantics and serializations.

The design and architecture of structWSF is similarly premised. While each of us may prefer one model or one format over others, we must interoperate in the real world. And that world, for many understandable and immutable reasons, will retain its diversity. Accepting this reality is a first step to adaptive design.

So, we control what we can control, and we adapt to what else exists. We have chosen RDF as the canonical data model that we can control and have embedded it in a Web services framework that is Web-based and scalable; in other words, a fully compliant Web-oriented architecture. These are the conceptual foundations to structWSF.

To be sure, structWSF in its current alpha release is quite raw in many areas and incomplete in others. But we will continue to work on it — and invite your participation to do the same — such that it can fulfill its destiny as a data federation framework for the Web.


[1] I first wrote about this while at BrightPlanet; a page is still up on that Web site with the text above. I have re-caste this material in various ways since.
[2] I have previously written on the “40 sources” of data heterogeneity. See here, for example.
[3] See http://ontolog.cim3.net/forum/ontolog-forum/2009-06/msg00210.html and continue to follow the noted thread.
[4] See the thread, ‘ .htaccess a major bottleneck to Semantic Web adoption,’ at http://lists.w3.org/Archives/Public/public-lod/2009Jun/0341.html and continue to follow this thread.
[5] See http://microformats.org/discuss/mail/microformats-discuss/2009-June/012985.html and continue to follow the ‘mixing vocabularies’ thread.
[6] This is our working definition of the ABox and TBox in specific reference to description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[7] For functionality, download, documentation or other direct materials on structWSF, please see OpenStructs.org and its related resources. There is also a Drupal instantiation of the system called conStruct, also available for download.

Orri ErlingVirtuoso loads 110,500 triples-per-second on LUBM 8000

LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso.

The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 6667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.

The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.

The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.

In comparison, Bigdata reported 200K triples-per-second for the first 8000 LUBM universities on a 15 blade box. We expect to do about that much on one new dual Xeon board; we’ll publish this when this is done.

We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.

DBpedia Blog3sat TV magazine features Linked Data and DBpedia

The 3Sat computer magazine ‘neues‘ has broadcasted a feature about Linked Data and DBpedia and the roles both efforts are playing in the evolution of the Web into a medium for the publication and linkage of data.

See:

Background information:

Blog Data Space (Kingsley Idehen)Linked Data Rules Simplified

As a compliment to the most recent Linked Data Design Issues note by TimBL, I would like to add this subtle tweak to the enumerated rules:

  1. Identify or Name things using HTTP URIs
  2. Describe things using the RDF metadata model
  3. Increase link data mesh density on the Web by linking (referring) to things in other data spaces using their HTTP URIs.

If you perform the steps above, on any HTTP network (e.g. World Wide Web), you implicitly bind the Names/Identifiers of things to negotiable representations of their metadata (description) bearing documents.

Also note, you can create and deploy the resulting RDF metadata using any of the following approaches:

  1. RDFa within (X)HTML documents
  2. N3, Turtle, TriX, RDF/XML etc. based documents
  3. Programmatically generated variants of 1&2.

Related

AI3:::Adaptive Information (Mike Bergman)Data-driven Applications with conStruct

conStruct LogoThe slides for our conStruct announcement at SemTech 2009 have been posted on Slideshare.

The slideshow, Data-driven Apps with conStruct, have much on the architecture and benefits of conStruct, from the context of the Bibliographic Knowledge Network (BKN) project. The slides came from my talk on “BKN: Building Knowledge through Communities, and Communities through Knowledge.”

Enjoy!

AI3:::Adaptive Information (Mike Bergman)‘Must See’ SemWeb

Zotero Bibliographic Plug-in

SemTech 2009 is Now the Gold Standard

OK, I admit it. I’m a dweeb and a suck-up. I have just returned from the Semantic Technology Conference in San Jose (CA) and I could not be more impressed. For real semantic Web action, this was the place to be. And, I’m sure, it will continue to be so if its leadership stays intact for some time to come.

I know, it is really not my style to applaud others when I could be patting myself on the back. But, hey, this was such a remarkable meeting in so many ways that I feel I have to break from precedent.

In terms of dislcosure let me be clear: I hope I get invited back to speak again next year (only this time in a bigger forum. Hint. Hint.) But, even if not invited to speak, me and my company will be there with bells on for this simple reason: this is the semantic Web confab that matters.

Unlike others that have noted specific talks, etc., I will not do so. Not because there were not talks that warrant such visibility; indeed, there were a tremendous number. My biggest regret of the meeting is that I could only taste a portion of all of the talks because so much was going on. But, rather than singling out any specific talks, I’d like to comment more thematically.

The conference organizers reported that more than 500 paper suggestions were put forward for what ended up being about 150 speaking slots. This was in addition to many tutorials, keynotes and many rump meetups. My best guess is that there were on the order of 1200 in total attendance. It was a packed five days or so. San Jose and the facilities were excellent.

The Real Message

When tech shows reportedly are down on average 40% or so from prior years, it is pretty remarkable to have one exceed its prior attendance, which SemTech 2009 apparently did. There were real customers, real use cases and real interest at every turn and in every conversation. Many hard problems were brought forward; some without acceptable solutions yet.

The real theme I kept hearing was: data federation, data federation, data federation. The potential for semantic Web technologies via the RDF data model and OWL and ontologies for finally breaking down the barriers between data silos was hammered and probed. The timing, I think, could not have been better than to have received shortly before the conference the timely PricewaterhouseCoopers technology quarterly report on linked data in the enterprise that I reviewed a couple of weeks back.

I think we can safely say that the advances from linked data in the past couple of years have been huge enablers and eye-openers to these prospects. But I also had the sense that the discourse is now moving beyond linked data as practiced so far. Web identifiers and Web access, I think, have won the argument. It is now time to move on to real data, interoperability and efficient tools and build-out.

Making the Pragmatic Prominent

To be sure there were discussions of more consumer-oriented apps and search. But the major energy and action seemed to center on the enterprise.

The idea is how can RDF bring us leverage, not replace what already exists. After 30 years of frustration, how can we finally solve the data federation problem? How can we remove the historical brittleness of applications and report writers? How can we actually begin to extract business intelligence from the massive data assets we already have at hand?

Asking enterprises to junk what they have for promises and prayers will not cut it. The winning strategy, and the challenge I kept hearing was: How can we layer on semantic technologies and RDF to bridge our existing data stores? How can we leave our RDBMs in place while gaining the goodness of ontologies and semantics?

We clearly see all around us the power of open source and the withering of proprietary apps and approaches. But, much data and information will remain private and needs to have access and rights restrictions. What answers does semantic technologies offer in these areas?

Then, as suppliers in this brave new world of open source and low software rents, what is the winning business model? Tom Tague helped articulate the importance of revenue models and options in his keynote; it resonated with already ongoing discussions in the hallways.

I’ve been in this space for more years than I care to admit. My observation from prior years is that some new “big thing” is identified, given blessing and push by the industry analysts (always with a new acronym), and then hyped like hell. Maybe it is the current challenged economic climate, but it feels like those days are over. For good.

Hype will not open wallets anymore. Case studies and real warts will help bring confidence that something truly different is at hand with semantic technologies. Our central challenge as suppliers to this market is to respond to today’s pragmatic imperatives. We must demonstrate more with less and faster. We must emphasize leverage and re-use. We must respect the trillions in already sunk IT assets.

Why This Matters to You

I think this matters much for three different communities.

For enterprises, I think it means that it is time for pilots and engagements. Both the market and the suppliers can not move this space forward rapidly without meaningful engagement. We’re ready, and it seems like many of you are as well. Push it with your bosses; we’ll deliver.

For the linked data community, where do you go next? I, too, heard some of the criticisms about too much “ontology.” But such discussion risks wasting the gains already achieved. If we do not listen and respond to the market’s imperatives and voice, we will become irrelevant. Let’s accept linked data as a tremendously helpful step in an ongoing progression, but continue to mature.

For some of the more established semantic technology providers, we have to make it simpler and faster. Expensive ontology development, too, will become irrelevant if we are indeed going to replace conventional software development with data-driven apps. Fortunately, I saw much, much exciting in this space and really had my eyes opened to tremendous innovation.

Outside of the venue, I heard from some of my prior Silicon Valley colleagues that this was the most constrained VC situation they have seen in decades. Funds may exist, but capital calls are not being made. What little powder there is, is being kept dry to triage existing investments. It is a good thing capital requirements for new start-ups have declined so much in recent years, because VCs are unlikely to fund the gap. And, aside from some big, prominent initiatives like data.gov or health care digitization, most savvy observers would bet that US and EU funding will also begin drying up in the coming years.

All of this can sound like bad news, but I think it is an opportunity: As technologists and suppliers, we must be relentlessly revenue focused and deliver what the market is demanding: more with less faster while preserving existing investments.

Five Stars: The Craft of Conference Organizing

The organizers from Wilshire Conferences and their entire staff did an absolutely tremendous job. Tony Shaw, Eric Franzon, Steve Bastasini and Eric Hoffer (I know Eric, you were only pinch hitting), plus the many on-site staff, were uniformly professional and unobtrusive. Sally Khudairi on PR and the A/V and registration crew were also excellent.

I once had responsibility for an annual technical meeting that averaged more than 2000 attendees and 150 exhibitors and I appreciate how many moving parts there are behind the scenes. Things work when nothing gets noticed. My guess is few noticed any issues or problems at this conference.

The stated aim on the intro slide to each session was to educate, and the agenda certainly achieved that. A/V was professional; time was kept; coffee did not run out; wi-fi glitches were quickly solved.

Sure, like any business, there is some pay-to-play in such conferences. Big sponsors get more slots and visibility. This reality, however, was also well balanced with new voices and innovative presenters. My “to do” notes and contacts resulting from the conference will take quite some time to work through.

One of the things I really appreciated was how the time slots and composition of talks and activities were varied each day. I have not attended a meeting before that did such a good job of mixing the schedule up to keep things feeling fresh over so many intense days.

Much, thanks, folks, for a conference exceedingly well done.

Last Thoughts

If I had to note a quibble I guess it would be to start the conference with more challenges and innovations. While the tutorials are very helpful, the first opening talks, I hope, could not be quite so introductory in nature. I think things are maturing fast. But, I could be wrong. First-time attendees from the marketplace should probably guide how such events start ramping up the engine.

As I noted, I and Structured Dynamics will be back next year, and hopefully contribute in more ways as well. The venue for SemTech 2010 has changed to the Hilton in San Francisco on June 21 - 25.

So, start saving for your travel budget now. This is “must see” semantic Web. And I look forward to seeing you there in a year!

Christian BeckerPublic notice: I do not make balances

1910_beckerbalance_small2
What looked like cleverly personalized spam was indeed a honest request for a manual of an antique scale that went by my name:

June 18,09

To whom it may concern:
I have recently obtained a Christian Becker balance. Design #: 165996. Ser. #: A 8628. Style: AB5. cap. 200G
I need to know what all the knobs are and how to adjust, etc.
I need the instruction manual for the balance. Is such available or can you make a copy of one for me. I’ll be glad to pay for copying.
If no manual available, do you know of someone in the St. Louis, Mo. area who repairs them that might help me?
Thank you for your help.

Sure thing, I sent him the manual (”prepared as a service to laboratories worldwide”). Gotta love antique scientific instruments ;)

AI3:::Adaptive Information (Mike Bergman)Two New Products Show the Promise of ‘Data-driven Apps’

Drupal Logo

Structured Dynamics Announces Drupal and Web Services Frameworks at SemTech 2009 Conference

After six months of dedicated effort, we are pleased to announce two new products: conStruct, which is a set of modules for bringing structured (RDF) content capabilities to Drupal and structWSF, the platform-independent Web services framework that underlies it.

There has been some promising effort to expose RDF data from Drupal for some time, and expressing internal data within Drupal as RDFa is being implemented by others as part of the upcoming version 7 of Drupal. These are exciting prospects that we wholeheartedly applaud. In fact, they will also be great sources to our products noted below.

However, our innovation looks through the other end of the telescope: Our new conStruct structured content system (SCS) enables external structured data to actually ‘drive the application‘. We think Drupal is the perfect host to demonstrate this new paradigm of ‘data-driven apps’.

The conStruct Drupal module makes the connections between existing Drupal capabilities and the structWSF Web services framework. structWSF provides a standard suite of Web services, an innovative means to access and manage datasets, and the hooks to underlying structured data stores and full-text search engines.

Combined with the existing efforts to expose RDF from Drupal, we think these two new products now promise a two-way highway for structured data thorugh Drupal.

structWSF

structWFS

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is fully RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and a document of resultsets (if the query result is not null). Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. The framework is open source (Apache 2 license) and designed for extensibility.

conStruct SCS

conStruct Logo

conStruct SCS is a structured content system that extends the basic Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

Users and groups can flexibly access and manage any or all datasets exposed by the system depending on roles and permissions. Report and presentation templates are easily defined, styled or modified based on the underlying datasets and structure. Collaboration networks can readily be established across multiple installations and non-Drupal endpoints. Powerful linked data integration can be included to embrace data anywhere on the Web.

conStruct provides Drupal-level CRUD (create - read - update - delete), data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF. Depending on roles and permissions, a given user may or may not see specific datasets or tools within the Drupal interface. Search and browse results are similarly sequestered depending on access rights. There is a core conStruct project on Drupal, with the additional optional modules of structDisplay and structOntology coming soon thereafter.

Two Accompanying Web Sites

In addition to the products themselves, two different Web sites accompany our announcement, both based on Drupal.

OpenStructs.org is dedicated to the platform-independent offerings. All OpenStructs tools are premised on the canonical RDF (Resource Description Framework) data model. OpenStructs tools either convert existing data structures to RDF, extract structure from content as RDF, or manage and manipulate RDF. All OpenStructs tools and approaches are compliant with existing open standards from the W3C. The intent is to achieve maximum data and software interoperabililty.

The main software distribution from OpenStructs is structWSF. Over time, OpenStructs is also meant to be the distribution point for user-generated “structs” in data display templating, and data extractors and converters, in addition to additional Web services compliant with the structWSF framework.

conStruct SCS is a knowledge site dedicated to conStruct and provides demos and sandboxes for the system. It accompanies the actual project sites on Drupal itself.

Unveiled at SemTech 2009

We unveiled and demoed the two products yesterday at the 2009 Semantic Technology Conference in San Jose, California. I did so during my talk on, “BKN: Building Communities through Knowledge, and Knowledge Through Communities.” SemTech 2009 is a premier semantic Web event, which has been steadily growing and now exceeds 1000 attendees.

structWSF has been under development by Structured Dynamics for some time. Its linkage and incorporation within the Drupal system has more recently been supported by the Bibliographic Knowledge Network.

BKN is a major, two-year, NSF-funded project jointly sponsored by the University of California, Berkeley, Harvard University, Stanford University, and the American Institute of Mathematics, with broad private sector and community support. BKN is developing a suite of tools and infrastructure for citations and bibliographies within the mathematics and statistics domain based on semantic technologies for professionals, students or researchers to form new communities.

An alpha version of structWSF will released for download from the OpenStructs (http://openstructs.org) Web site on June 30. The conStruct system will be released at the same time under GPL license. See its home site at http://constructscs.com or within the Drupal module system (http://drupal.org/project/construct).

Some Early Observations

Structured Dynamics has as its mission to assist enterprises and non-profit organizations and projects to adopt Web-accessible and interoperable data. These are our first product offerings geared to address this mission.

The basic premise is that the data itself becomes the application. Via structured, linked data and a combination of products and Web services, information in any form and from any source can now be integrated and made interoperable. Linked data is based on open standards to interconnect any form of relevant information on the Web — on demand and in context.

One of the most exciting aspects of the overall architecture behind these two products is their suitability to support distributed collaboration, across diverse and definable datasets, all supported by sensitivity to role-based data and tools (Web services) access.

We’ll be speaking much more on these topics now that we have this foundation in place. We also, of course, have much to learn about the deployment and use case potentials of these frameworks.

These two products signal our (SD’s) commitment to open source. We hope some of you also see the promise in these frameworks to provide an adaptive infrastructure to linked and structured data. We welcome your participation!

Frederick GiassonstructWSF and conStruct websites unveiled

I am proud to announce the release the websites of two of our products to come: structWSF and conStruct. Both products will be available in open source under the Apache 2 license. Mike just unveiled and demoed the two projects in his talk at SemTech 2009.

As we describe them on Structured Dynamics‘ website:

structWSF

structWSF is a platform-independent Web services framework for accessing and exposing structured  RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is fully RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. Support for additional datastores and engines is planned. The design also allows other specialized systems to be included, such as analysis or advanced inference engines.

The framework is open source (Apache 2 license) and designed for extensibility. structWSF and its extensions and enhancements are distributed and documented on the OpenStructs Web site.

conStruct

conStruct SCS is a structured content system that extends the basic Drupal content management framework. conStruct  enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

Users and groups can flexibly access and manage any or all datasets exposed by the system depending on roles and permissions. Report and presentation templates are easily defined, styled or modified based on the underlying datasets and structure. Collaboration networks can readily be established across multiple installations and non-Drupal endpoints. Powerful linked data integration can be included to embrace data anywhere on the Web.

Depending on roles and permissions, a given user may or may not see specific datasets or tools within the Drupal interface. Search and browse results are similarly sequestered depending on access rights.

conStruct provides Drupal-level CRUD (create - read - update - delete), data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF. It also provides a system for additional tools additions and expansions for this structured data. conStruct SCS is built on the platform-independent structWSF Web services framework.

Like Drupal and structWSF, conStruct is free and open source (GPL license). Versions of conStruct SCS are planned to adopt it to other content management systems (CMS).

Next

The alpha version of the code with all the proper documentation will be released later this summer. Everybody will be able to contribute to the project by enhancing/developing the core code or by extending it with new modules and web services. Stay tuned!

DBTune BlogAnd another fun BBC SPARQL query

This query returns BBC programmes featuring artists originating from France (this is just a straight adaptation of the last query in my previous post).

The results are quite fun! Apparently, the big French hits on the BBC are from Jean-Michel Jarre, Air, Modjo, Phoenix (are they known in France? I've only heard of them in the UK) and Vanessa Paradis.

Note that the tracklisting data we expose in our RDF just goes back a couple of months, so that might explain why the list is not bigger.

Wikier.org Blog (Sergio Fernandez)SDoW2009

After a shot holidays in Sardinia, I’m back to announce the 2nd International Workshop on Social Data on the Web (SDoW2009), that will be held at Washington in October, co-located with the 8th International Semantic Web Conference (ISWC2009). We’ve just send out the CfP, contributions are welcomed until July 24th.

SDoW2009

I’m very proud to co-chair for the second consecutive year this workshop with John, Uldis and Alex. Last year in Karlsruhe we enjoyed an amazing day, let see what we can get this year.

Blog Data Space (Kingsley Idehen)BBC Linked Data Meshup In 3 Steps

Situation Analysis:

Dr. Dre is one of the artists in the Linked Data Space we host for the BBC. He is also referenced in music oriented data spaces such as DBpedia, MusicBrainz and Last.FM (to name a few).

Challenge:

How do I obtain a holistic view of the entity "Dr. Dre" across the BBC, MusicBrainz, and Last.FM data spaces? We know the BBC published Linked Data, but what about Last.FM and MusicBrainz? Both of these data spaces only expose XML or JSON data via REST APIs?

Solution:

Simple 3 step Linked Data Meshup courtesy of Virtuoso's in-built RDFizer Middleware "the Sponger" (think ODBC Driver Manager for the Linked Data Web) and its numerous Cartridges (think ODBC Drivers for the Linked Data Web).

Steps:

  1. Go to Last.FM and search using pattern: Dr. Dre (you will end up with this URL: http://www.last.fm/music/Dr.+Dre)
  2. Go to the Virtuoso powered BBC Linked Data Space home page and enter: http://bbc.openlinksw.com/about/html/http://www.last.fm/music/Dr.+Dre
  3. Go to the BBC Linked Data Space home page and type full text pattern (using default tab): Dr. Dre, then view Dr. Dre's metadata via the Statistics Link.

What Happened?

The following took place:

  1. Virtuoso Sponger sent an HTTP GET to Last.FM
  2. Distilled the "Artist" entity "Dr. Dre" from the page, and made a Linked Data graph
  3. Inverse Functional Property and sameAs reasoning handled the Meshup (augmented graph from a conjunctive query processing pipeline)
  4. Links for "Dr. Dre" across BBC (sameAs), Last.FM (seeAlso), via DBpedia URI.

The new enhanced URI for Dr. Dre now provides a rich holistic view of the aforementioned "Artist" entity. This URI is usable anywhere on the Web for Linked Data Conduction :-)

Related (as in NearBy)

Blog Data Space (Kingsley Idehen)Understanding the BBC's Virtuoso Powered Linked Data Space

The BBC's recently announced Linked Data space for Programmes and Music data, joins a growing list of immediately useful "Virtuoso Powered" linked data spaces, driving the burgeoning Web of Linked Data. Others include: DBpedia, Bio2RDF, NeuroCommons etc (the click friendly version of the LOD-Cloud diagram reveals a snapshot of other Virtuoso driven linked data spaces).

Why is it important?

As a leading media organization, the BBC's use of Linked Data provides a clear beacon to other media players re. the imminence of a serious Linked Data induced sector inflection. In a nutshell, every Web Site has to evolve into a Linked Data Space: a location on the Web that provides granular access to discrete data items in line with the core principles of the Linked Data meme.

Remember, the essence of the Linked Data meme is simply this: you reference data items and access their metadata, in variety of formats via a single HTTP based URI. This approach to Web data publishing is compatible with any HTTP aware user agent (e.g., your Web Browser or tools & applications that provide abstracted access to HTTP).

How Do I use it?

There a number of very powerful things available to end-users and developers alike.

End-Users:

The most powerful feature of our variant of the BBC's Linked Data Space is the exposure of Faceted Find (think Search++ and beyond). Thus, you can go the the home page of the service and commence data discovery and exploration via any of the following interfaces:

  • Full Text Search Tab -- type in a full text pattern and then experience Linked Data Entity Ranking as opposed to Page Ranking
  • URI Lookup (By Label) Tab -- type in part of a URI and let the system auto-complete by looking up Entity Labels
  • URI Lookup (Raw String Pattern) Tab -- type in part of a URI and let the system auto-complete by looking up the raw URI
  • OpenLink Data Explorer Service -- "deceptively simple" Linked Data explorer and Data Mesher (simply type in a URI or Text pattern, then view the data via a myriad of entity type specific viewer tabs).

Once you are comfortable with at least one of the items above, you can exploit the system further by performing any of the following:

Information Architects & Developers

Disambiguated Search (aka. Search++ or Find)

In line with the time-tested "embrace and extend" pattern, we provide Full Text search capability, but unlike Google, Yahoo!, Bing and other search engines, we don't use use "Page Rank" algorithm to sort results; instead, we use an "Entity Rank" algorithm since we are dealing with an RDF based Graph model DBMS where links exist between entities across instance data and data dictionary (vocabularies, schemas, ontologies) boundaries. In addition, when you get results (by clicking "show values" or "show values with distinct counts") that list entities associated with a full text search pattern, we take a quantum leap beyond search engines by allowing you to use "Entity Type" and/or "Entity Properties" (all of these have HTTP URIs too) to set your own context for what you seek.

Much more to come in the form of BBC specific demo queries and tutorials :-)

Related

  • Live LOD Cloud Cache instance that combines BBC data with other data sets from the LOD Cloud (in a single Virtuoso RDF DBMS hosting 5 Billion+ triples & counting)

DBTune BlogBBC SPARQL end-points

We recently announced on the BBC backstage blog the availability of two SPARQL end-points, one hosted by Talis and one by OpenLink. These two companies aggregated the RDF data we publish at http://www.bbc.co.uk/programmes and http://www.bbc.co.uk/music. This opens up quite a lot of fascinating SPARQL queries. Talis already compiled a small list, and here are a couple I just designed:

  • Give me programmes that deal with the fictional character James Bond - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:person 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • GIve me artists that were featured in the same programme as the Foo Fighters - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?artist2 ?label2
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?artist2 .
  ?artist2 rdfs:label ?label2 .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  FILTER (?t1 != ?t2)
}
  • Give me programmes that featured both Al Green and the Foo Fighters (yes! there is one result!!) - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
  • All programmes that featured an artist originating from Northern Ireland - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label ?artistlabel ?dbpmaker
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker .
  ?maker rdfs:label ?artistlabel .
  ?maker owl:sameAs ?dbpmaker .
  ?dbpmaker dbprop:origin <http://dbpedia.org/resource/Northern_Ireland> .
  ?event1 po:time ?t1 .
  ?t1 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

(Note that we just need the owl:sameAs in the above query as the Talis end-point doesn't support inference)

Let us know what kind of query you can come up with this data! :-)

AI3:::Adaptive Information (Mike Bergman)Ontology Best Practices for Data-driven Applications: Part 3

Structured Dynamics LLC

Ontologies as the ‘Engine’ for Data-Driven Applications

In my Intrepid Guide to Ontologies from a couple of years back, I noted that “Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web.” And, for sure, if one starts to peruse the current discussions ranging from the Ontolog Forum to major academic symposia (not meaning to single anyone out), it is clear that the idea of developing “ontologies” is often freighted with much weight, hot air, and (by implication) cost.

This is both a shame because, firstly, it is unnecessary and not often true. And, secondly, because the whole pragmatic idea of what an ontology is and what it can do has often gotten lost in the shuffle.

To be sure, there have been massive standards efforts and EU-funded mega-projects devoted to ontologies. There are certainly cases where coordination of specific domains such as petroleum or integration with a complicated supplier base such as in airline manufacture warrant these massive, complicated ontology development efforts.

But, from my vantage, these extremes overshadow the vast majority of more prosaic, pragmatic applications of ontologies. Remember, ontologies are merely a means of describing a conceptual view of the world [1]. If one defines that “world” within focused and appropriate scope, it is surprising (we believe) how much mileage can be extracted from these suckers.

As we see a breakthrough of interest in semantic Web and linked data principles applied to the enterprise, as wonderfully described in the recent seminal PricewaterhouseCoopers quarterly Technology Forecast, also notably with a prominent focus on ontologies, I think it is time to direct all guns on prior bad assumptions and bad anecdotes. To wit:

  • Ontology development need not be a comprehensive, self-contained definition of a “big picture.” Ontologies can be focused, limited, and grow and change as needed
  • Ontology development need not be expensive. Whoever is selling six-figure ontology development to businesses ought to be taken out and shot. Start small and focused; frankly, a simple spreadsheet taxonomy or quick conversion of existing XML or metadata or vocabulary standards is A-OK to get started
  • Ontology development is not massive and static: rather, it is small and flexible and incremental as more is brought in and more is learned
  • Ontology development is not some imperative for conceptual “truth”; rather, it is a very adaptable means for stating, testing and refining stuff
  • Ontology development is certainly no massive relational schema; by its nature it is malleable with nary a whiff of “lock-in”, and
  • Most importantly, ontology development is a way of “driving” applications and user interfaces and reports.

In fact, it is the last point that no one is discussing today, but it is the most important of the lot: Ontologies, properly crafted, can be the ‘engines’ for data-driven applications.

It is this latter point that is a true paradigm shift and one of the most exciting prospects of ontologies.

Manifest Uses

Ontologies, for sure, are a formal representation of conceptual relations, a “world view.” But, that world view need not carry with it the freight of trying to describe all human knowledge. It can (and should) be restricted to an understandable scope (domain) and purpose. In that vein, what does such a “world view” need?

Let’s first talk about scope. We don’t need a “global ontology” that is accepted by everybody on Earth. What we need are focused ontology(ies) for describing things within a given problem space (whose data may reside in a single dataset or aggregate of datasets). We need to communicate how this system describes the things within its domain and how it understands the concepts and attributes associated with its problem space and data. This communication is published as the ontology. Rather than a global, comprehensive schema, we simply need these well constructed bricks, one by one.

Then, the ontology itself needs to be understandable and manageable. Ontologies should be readable by machines, but too many see ontologies solely through the lens of machines. I believe that to be a mistake. While importantly needing to be designed for machine ingest, I believe the real purpose of ontologies is for humans. How do we label things? How do we describe and define things? How do we find things? How do we organize things? How can these understandings be brought before us in the software that we use?

These types of questions lead us to the pragmatic and pull us back from the abstract. If we keep foremost the simple idea that ontologies are merely structures for how to organize (schema) and describe (vocabularies) our problem space at hand, then we can actually get on with cutting the bull and getting real stuff done.

Let’s take as an example our structWSF Web services framework that I will be announcing and demoing for the first time at SemTech 2009 next week. We developed a simple and flexible ontology to describe what a “Web services framework” should be. Then, we developed and implemented the software to make it happen. This means that an ontology development task can be seen as a specification task, too.

Pragmatic Applications of Ontologies

So, OK, what do these exhortations mean? Without respect to any particular scope or domain, let me then list below some important functional areas to which ontologies — properly and pragmatically designed — can contribute.

Conceptual Relationships

The traditional lens for viewing ontologies is as a means to express conceptual relationships. We agree.

However, ontologies need not have large and nuanced predicate (relationship) vocabularies in order to be useful. Relatively simple but powerful structures with hierarchical or part-of relationships can be very effectively employed for inferencing or faceted searching. From a pragmatic standpoint, let’s first agree on what “things” (nouns) there are in our domains, then let’s worry about how they relate (verbs).

The idea here has long been known as successive approximation: Let’s first get ourselves into the right country, then right province, then right city, then right neighborhood, then right house, and then right room. Only then should we worry about the condition of the paint or the age of the floors.

Endless harangues about “true” conceptual relations are a hindrance, not a help, to this perspective. It is much better (and faster, cheaper and more pragmatic) to put forward simple but coherent relationships than to worry about what all of that “really” means. From a business perspective, isn’t being able to utilize the assertion that the hip bone is connected to the thigh bone more important than having to await a full explication of all of the muscles and ligaments and tendons that might comprise them?

Once such simple relationship structures are embraced, then amazing inferencing power comes to the fore. If one searches on thigh bone, inferencing can also bring forward the hip because of its relationships to the leg.

Integrating Instance Data

OK, so now at least we have a coherent scaffolding of concepts and their straightforward relationships. That is, is concept A a “bigger” one (class or super set) than concept B? (Other simple relations could be substituted.) If so, we now see a bit of an organizing “world view” begin to emerge.

So, now we begin to bring in external data. But that data and its schema describe themselves differently. In one realm it is “foo”; in the other, “bar”.

While this different terminology for the same “thing” or related things may not be known at the outset, it is discoverable. And, when discovered, it is quite easy to associate the idea or concept of “foo” in one dataset to “bar” in another. In this manner, through learning and accretion, we are able to associate more and more similar things to one another.

We did not need to begin with some global, cosmic view to begin relating this data to one another. We only needed the right framework and structure that allowed this association to evolve as the learning occurred.

And, oh, by the way: this very same process is akin to documenting the organization’s institutional memory.

Orienting to Other Knowledge and Domains

Being able to relate and “classify” or “organize” some things to other things also means that we are now beginning to create a roadmap for how “stuff” in a broad sense relates to other “stuff.” For example, if I develop a detailed understanding of the hip bone, I can now bring that body of knowledge into the context above to relate this new information to the thigh and to the leg.

Frankly, at this juncture, while perhaps ultimately important, it is helpful merely to know that Domain A (hip) is somehow related to Domain B (thigh). Think of the issue more like trying to get into the right map vicinity on the globe, and not whether individual streets intersect.

Again, the mindset here is one of letting ontologies and their concepts get related knowledge bases into the same ballpark. Whether we are trying to match Little League ball players with Major League ballplayers is beside the point: accept that both are playing baseball, then decide the importance and specifics of the relationship in a later step.

Again, “ballpark” is more helpful than no connection whatsoever. Silly statements about “ontological commitments” really mean nothing. Ontologies, like any other tool, can play different roles at different times. When helping to get like-related things into the same ballpark, ontologies are easy and quite effective.

(As an aside, it is useful to note here that our efforts with the UMBEL upper-level subject ontology are solely premised on this “roadmap” purpose. In and of itself, UMBEL is not a very complicated explication of the world. But it does provide a comprehensive set of 20,000 subject concepts for orienting quite disparate datasets and information to one another. This very same approach could be replicated and then applied to the granularity of individual domains, kind of like zooming in on Google Maps, to provide similar benefits at smaller scale for domain-specific roadmaps. In fact, that is a common approach we apply in our own client ontologies, which we then also make sure we tie into UMBEL for global orientation.)

Mapping to Other Schema

OK, so with this foundation now built, we can next raise the bar a bit further. Once one begins to express these “world views” formally as an ontology, even with reduced ambitions as presented above, one still ends up with a formal specification of that conceptualization. And, that means, we now also have a basis with standard languages for mapping two disparate or separately developed ontologies to one another.

This is powerful. Through such mapping, we end up, in the memorable phrase of my colleague, Fred Giasson, “exploding the domain.”

Moreover, we also have found a means for stitching together datasets with disparate schema to one another. Voilà: We now have met the Holy Grail of data interoperability.

In my opinion, this is the money shot from all of this effort. But, again, if we set the deployment threshold to the unrealistic levels that some ontology pundits suggest, this payoff is unlikely to happen. We are not trying to state absolute, universal truth about anything nor to be unrealistically comprehensive. All we are trying to do is make defensible assertions that one portion of a world view is similar or related to a portion of a different world view.

Now, does that sound that scary? No, of course not. It is merely a reasonable and pragmatic means for relating two structures together.

A key aspect to this mapping ability is to enrich the description of our concepts with what we call “semsets.” Semsets are a listing of related terms and phrases that provide synonyms, aliases, jargon and related context for alternative ways to describe or bound the concept at hand. This terminological “grist” is the basis for relatively straightforward natural language processing techniques to suggest matches between concepts in different ontologies (which might also be combined with other ontology components such as preferred labels, descriptions or structural placements in the schema).

Like many of the points above, these semsets can be built incrementally and over time as new jargon and terminology is discovered.

Linked Data, with Federated and Comprehensive Data

These techniques of mapping datasets and their ontology structures can be leveraged still further with the proper application of linked data practices. Via linked data, we place our data into Web-accessible (HTTP) networks and give them Web-scalable identifiers (URIs). This means we can now integrate and interoperate with much external public Web data and break down our own internal data silos.

Our instance records can be fleshed out with supplementary sources to provide more comprehensive attibutes and characterizations. Uniformity of treatment and coverage is promoted. Data interoperability is finally at hand.

A key best practice to this, of course, needs to be the recognition that not all data or information is public and not all users have the same roles or should have the same access to different sets of data. Thus, to embrace global mechanisms for data interoperability, there must also be local methods for enforcing access, privacy and confidentiality.

Properly designed ontologies can fulfill this requirement, as well. By organizing information into datasets and setting profiles for access and CRUD (create - read - update - delete) rights, an effective environment for data sharing and federation is established.

Context- and Instance-sensitive Data Display

To this point, we have taken almost an exclusively data- or schema-centric view of ontologies. But, as structures, pure and simple, their structural nature can be exploited in other ways. It is here, frankly, that less is spoken of the potential for ontologies than in the more “conceptual” areas noted above.

The first of these new areas is in instance-sensitive data display. Each instance record is associated with an instance type in a governing ontology. Detecting this type means that context-sensitive display templates can then be invoked.

Detecting that something refers to a city, for example, can invoke a template providing a map, population figures, area size, city governance method and the like. In contrast, detecting an instance as a camera might invoke an entirely different display template focusing on product features or price or store and purchasing locations. Such instance-type displays are common; they are known as “infoboxes” within Wikipedia articles, as one example.

But this power of data display templates can be generalized further. What if we detect our instance represents a camera but do not have a display template specific to cameras? Well, the ontology and simple inferencing can tell us that cameras are a form of digital or optical products, which more generally are part of a product concept, which more generally is a form of a human-made artifact, or similar.

By tracing this inferencing chain from the specifc to the more general we can “fall back” until a somewhat OK display template is discovered, even in the absence of the better and more specific one. Then, if we find we are trying to display information on cameras frequently, we only need take one of the more general, parent templates and specifically modify it for the desired camera attributes. We also keep presentation separate from data so that the styling and presentation mode of these templates is also freely modifiable.

This parallel set of display structures to the domain ontology provides a highly reusable and leveraged data presentation framework. For 30 years organizations have struggled with report generators and all sorts of complicated systems for responsive reporting and data display. When driven by ontologies, this challenge is greatly simplified.

Driving User Interfaces

The careful reader of the above will note that our ontologies now have a number of interesting characteristics, all of which can be leveraged within the user interface. For example, we have:

  • Human-readable labels for our “things”
  • Alternative labels in our semsets that can characterize those same “things”
  • A readable description of each “thing”
  • An organized and logical schema for how each “thing” relates to other things.

This very information, when indexed in a supplementary full-text search engine with faceting capabilities (such as the Solr engine we use), can be leveraged in the user interface for these types of desired UI capabilities:

  • Attribute labels and tooltips
  • Navigation and browsing structures and trees
  • Menu structures
  • Auto-completion of entered data
  • Contextual dropdown list choices
  • Spell checkers
  • Online help systems
  • Etc.

This is absolutely mindblowing power!

We can now design generic tools that do patterned functions. Then, based on the data at hand and the ontologies that describe them, we can now see completely modified and tailored interfaces. And all of this is done without modifying a single line of application code!

Applications in this brave new world now consist of assembling a proper suite of generic tools, and then spending the bulk of our time on describing and characterizing our data via ontologies and refining templates for displaying or reporting the types of specific instances within our current problem space.

Conclusions

All of the points made above are doable and being done today. Properly designed ontologies can readily deliver all of the aspects noted above. Later parts in this ongoing series will address many of those aspects in greater detail.

Ontologies are not magic. Properly done — an important emphasis — ontologies are the pivot point for faster and more adaptable ways of doing business. A simple, pragmatic mindset can help.

Our perspective is that ontologies are really the “flour that gets backed into the cake”. While viewable and definable as their own structures, properly constructed ontologies actually should exist everywhere within applications and contribute everywhere to applications. This is what we mean by “data-driven applications.”

To be sure, we are suggesting a paradigm shift from 30 years of IT frustrations: schema no longer must be fragile; reports no longer must be costly and delayed; and data can finally be made interoperable.

We will continue to give you our best thinking on these topics over the coming weeks and how they might be important to you.

Sound too good to be true? Read the material above again. And, then, we welcome getting your call.

This post is part of an occasional AI3 series on ontology best practices.

[1] As used in knowledge representation or information science, ‘ontology’ is most often defined using Tom Gruber’s “explicit specification of a conceptualization.” See Thomas R. Gruber, 1993. “A Translation Approach to Portable Ontology Specifications,” in Knowledge Acquisition 5(2): 199-220; see http://tomgruber.org/writing/ontolingua-kaj-1993.pdf.

AI3:::Adaptive Information (Mike Bergman)One Twenty Years of Solitude

Olde Tyme Fax

Once a Pioneering ‘Telecommuter’, I Have Now Worked from Home for Two Decades

We had a party this week to celebrate my daughter and her boyfriend’s career move to Seattle. It was a great time with many reminiscences of our life here in Iowa City over the past decade. Then it struck me: almost to a day I had now been working from a home office for 20 years!

Wow. I had not really been paying attention. That realization in turn brought back its own memories, and caused me to reflect back on my two decades of working from home.

Why and the Enabler

I left my position as director of energy research at the American Public Power Association in early June 1989. We were just a month away from my son’s birth and had decided we did not want to raise our children in Washington, DC. The District at that time was totally dysfunctional and had earned the moniker of “Murder Capital of America.” While we loved our home in Barnaby Woods (Chevy Chase) DC and our neighbors, we wanted a smaller and safer community with more connectedness.

My wife, at that time a post doc at the National Institutes of Health in Bethesda, also was committed to her profession and career. The Washington area was unlikely to be an immediate prospect for her to find a permanent position. Indeed, our generation was just coming to grips with the new challenges of two professional families: There needs to be career choices and flexibility between the partners. Some professions, like lawyering or doctoring or sales or computer programming, have much locational flexibility. Others, such as bench scientists in biology, as for my wife, less so. In academia, position openings occur at their own place and time.

I had also been climbing my way in a corporate and office environment for more than a dozen years and was ready for my own career change. As a professional, I had never been my own boss and wanted to see how I could fare in the consulting and entrepreneurial worlds. By doing so, I could also bring flexibility to my wife’s locational options for her career.

At that time, without a doubt, the enabling technology for my new career shift was the fax machine. The ability to interact with clients with documents in more-or-less real time was pivotal. My first fax machine was a Sanyo thermal model (can’t recall the model now; it is long gone and they have since gotten out of that business). I recall buying cartoons of thermal fax rolls frequently and the copies that faded in the file cabinet drawers.

Of course, the phone and plane were also pivotal. In the early years my monthly phone bills were astronomical and I flew around 100,000 miles per year. But, it was the fax that really enabled me to cut the locational knot. But how strange: from thousands upon thousands of faxed pages in the early years to only a few per year today! The Web, of course, has really proven to be the true enabler over the past decade.Mike's Home Office

Early Lessons

Prior to shifting to a home office it took me about 45 min to commute or bike to APPA. If taking public transportation, I had to walk to the local bus stop, transfer at the Metro subway station, and then walk the remaining distance. While this gave me time to read most of the the morning Washington Post, I knew that by eliminating this commute I could save 90 min a day for new productivity.

What first surprised me, though, was the fact that I also no longer needed to keep my office computer and home computer synchronized. Since, like most ambitious professionals, I also worked some in the evenings, I had overlooked the time it took to keep files and documents synchronized. I was saving about another 30 min per day in digital transfers between home and office. With this new choice to work from home, I was saving 2 hrs per day!

I only had a home office in DC for a short period before we moved to Montana. Since Montana, I have had only two further offices (homes).

My early experience in DC suggested I wanted a more dedicated office space, so we did so for the home we designed in Montana. I was also able to design reserved office space again for here in Iowa City. (The DC home office and the interim one in South Dakota prior to Iowa were converted bedrooms, definitely not recommended!)

Planning office space in advance means you can tailor the space to your work habits. For me, I want lots of natural light, a view from the windows, and lots of desk and whiteboard space. I also needed room for office equipment (copiers in the early years, fax, printers and the like) and file cabinets. When in Montana, I designed up and had built my own office furniture suite that makes me feel I’m commanding the bridge of the Starship Enterprise (see pictures of my current office).

Teaching myself and the kids that office time and office space were fairly sacrosanct was important, too. Sure, it was helpful to be around for the kids for boo-boos and emergencies and dedicated kid’s time, and to be able to be there for home repairs and the like, but for the most part I tried to treat my office as a separate space and to have the kids do so, too. Frankly, since my family has grown up with no other experience than Dad working from home, it has always felt natural and been a matter of course.

A real key is to be able to shut the office door and return to normal home life in the rest of the house. And, of course, the need for the opposite is also true. It is probably the case that I spend more time in my home office than most regular professionals do in their organization’s office, but my career has always been my passion anyway and not what I consider to be a job.

In the early years I was fairly unusual, I think, for working from home. I certainly gave many local talks and was frequently invited by service organizations to speak over lunch on the experience of “telecommuting”. Today, working from home is no longer unusual and the Internet technology and support to do so makes it a breeze.Mike's Home Office

I have been able to run both consulting and software development companies from my home office over the years. I have seen the gamut of meetings ranging from with developers before massive whiteboards in my home basement to running and coordinating 20-person companies in their own office space with investors and Boards.

With a willingness to travel, it seems like all organizational possibilities are now open to the home worker. For quality of life and other reasons, the fact that today many larger knowledge organizations offer remote office centers and commuting flexibility speaks volumes to how far “telecommuting” has really come.

How much difference two decades can make!

Today’s Thoughts

I personally could never return to a standard office setting. For me, the home office with its flexibility and productivity and ability to find contemplative time simply can not be beat.

I really welcome what is happening in online meeting software and other Web apps that are reducing the need for travel and face-to-face meetings. For while the technology and culture has improved markedly to support working from home over the past two decades, the pain and hassle of travel has only worsened.

I have transitioned from a million-miler frequent flyer to a rooted house plant. I try to chose my travel venues carefully and when I do travel I try to do so for longer periods to absorb the shocks. It is perhaps a too frequent refrain, but it is just a damn shame how getting a meal, being treated with pleasure and courtesy, having some legroom, and getting a drink are air travel amenities of a now bygone era.

There are now many, many more of us (you) who work from home and it really is no longer a topic of conversation. A quick search tells me perhaps 5 million or more US workers predominantly work from home, with some 15% of all workers doing so on occasion. In the predominant professional and business services, financial activities, and education and health services, this percentage can reach as high as 30% of workers now doing paid work from home to one degree or another.

Of course, personality, job requirements, and physical space may not make working from home a good choice for you. But if you have not tried it and it sounds interesting, by all means: Try it!

For twenty years, it has been a great choice for me and for my family. This is indeed a nice 20th anniversary to celebrate!

Christian BeckerMarbles released on SourceForge

marbles-logoI’m pleased to announce the release of Marbles on SourceForge.

Marbles is a server-side application that formats Semantic Web content for XHTML clients using Fresnel lenses and formats. Colored dots are used to correlate the origin of displayed data with a list of data sources, hence the name.
By performing all formatting, data retrieval and storage activities on the server side rather than on a potentially thinly equipped client, the view generation can touch on large amounts of data and requests can be answered relatively quickly. Marbles provides display and database capabilities for DBpedia Mobile.

Data is retrieved from multiple sources and integrated into a single graph that is persisted across user sessions. When provided with the URI of a resource to display, Marbles tries to dereference it. In parallel, it queries Sindice and Falcons for datasources that contain information about the given resource, and Revyu for reviews. In a similar manner as the Semantic Web Client Library, Marbles follows specific predicates found in retrieved data such as owl:sameAs and rdfs:seeAlso in order to gain more information about a resource and to obtain human-friendly resource labels.

Thanks to Eli Lilly and Company for supporting the open-sourcing of Marbles in part by a research grant.

AI3:::Adaptive Information (Mike Bergman)PWC Dedicates Quarterly Technology Forecast to Linked Data

Zotero Bibliographic Plug-in

Major Report Signals the Emergence of Linked Data into the Enterprise

PricewaterhouseCoopers (PWC) has just published a major 58-pp report on linked data in the enterprise.  The report features insightful interviews with many industry practitioners as well as PWC’s own in-depth and thorough research. I think this report is a most significant event:  it represents the first mainstream recognition of the potential importance of linked data and semantic Web technologies to the business of data interoperability within the enterprise.

This entire issue is uniformly excellent and well-timed.  PWC has done a superb job of assembling the right topics and players. The report has three feature articles interspersed with four in-depth interviews. The target audience is the enterprise CIO with much useful explanation and background. Applications discussed range from standard business intelligence to energy and medicine.

The emphasis on the linked data aspect is a strong one.  PWC puts twin emphases on ontologies and the enterprise perspective (naturally).  This is a refreshing new perspective for the linked data community, which at times could be accused a bit of being myopic with regard to:  1) open data only; 2) instance records (RDF and no OWL, with little discussion of domain or concept ontologies); and 3) sometimes a disdain for the business perspective (as opposed to the academic).

PWC has done a great job of getting beyond some of the community’s own prejudices in order to couch this in CIO and enterprise terms. This signals to me the transition from the lab to the marketplace, with all of its consequent challenges and advantages.

In short:  Bravo!  This is a very good piece and will, I think, put PWC ahead of the curve for some time to come.

I was very pleased to have the opportunity to review earlier drafts of this major report.  After reading a couple of my recent papers on Shaky Semantics and the Advantages and Myths of RDF, with the latter cited in the piece, I had a chance to have a fruitful dialog with one of the report’s editors, Alan Morrison, who is a manager in PWC’s Center for Technology and Innovation (CTI). He kindly solicited my comments and incorporated some suggestions.

The report also lists 14 various semantic technology vendors and service providers.  I’m pleased to note that PWC included our small Structured Dynamics firm as part of its listing. Other vendors listed include Cambridge Semantics, Collibra, Metatomix, Microsoft, OpenLink Software, Oracle, Semantic Discovery Systems, Talis, Thomson Reuters, TopQuadrant and Zepheira, with the selected service providers of Radar Networks and AdaptiveBlue.

This report is easy — but important — reading.  I personally enjoyed the insights of Frank Chum of Chevron, a new name for me. I encourage all in the field to read and study the entire report closely. I think this report will be an important milestone for the semantic Web in the enterprise for quite some time to come.

After a brief sign-up, the 58-pp report is available for free download.

Christian Becker¡DBpedia @ Wikimania 2009!

This just got in: I will be at Wikmedia’s Wikimania conference in Buenos Aires to talk about DBpedia and an ongoing mapping collaboration. Now on to learning a few bits of Spanish…

Orri ErlingComparing Virtuoso Performance on Different Processors

Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.

Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.

The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.

All times reported are real times, starting from the start of the first client and ending with the completion of the last client.

Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.

Test Platform Load
(seconds)
Run
(seconds)
GHz / cores / threads
1 Amazon EC2 Extra Large
(4 virtual cores)
340 42 1.2 GHz? / 4 / 1
1 Amazon EC2 Extra Large
(4 virtual cores)
305 43.3 1.2 GHz? / 4 / 1
2 1 x dual-core AMD 5900 263 58.2 2.9 GHz / 2 / 1
3 2 x dual-core Xeon 5130 ("Woodcrest") 245 35.7 2.0 GHz / 4 / 1
4 2 x quad-core Xeon 5410 ("Harpertown") 237 18.0 2.33 GHz / 8 / 1
5 2 x quad-core Xeon 5520 ("Nehalem") 162 18.3 2.26 GHz / 8 / 2

We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances costs 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards with 667MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.

Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We'll see another time with faster memories.

The operating systems were various 64 bit Linux distributions.

We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.

We tried some RDF operations on the two last systems:

operation Harpertown Nehalem
Build text index for DBpedia 1080s 770s
Entity Rank iteration 263s 251s

Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.

For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.

Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.

  1. On the host OS of the Nehalem system —
    n cpu% rows per second
    1 query 503 906,413
    2 queries 1263 1,578,585
    3 queries 1204 1,566,849
  2. In a VM under Xen, on the Nehalem system —
    n cpu% rows per second
    1 query 652 799,293
    2 queries 1266 1,486,710
    3 queries 1222 1,484,093
  3. On the host OS of the Harpertown system —
    n cpu% rows per second
    1 query 648 1,041,448
    2 queries 708 1,124,866

The CPU percentages are as reported by the OS: user + system CPU divided by real time.

So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.

We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.

The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.

The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got was at 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be over 50% over a 667 MHz Core 2-based Xeon.

Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.

If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.

AI3:::Adaptive Information (Mike Bergman)Ontology Best Practices for Data-driven Applications: Part 2

Structured Dynamics LLC

The Fundamental Importance of Keeping an ABox and TBox Split

It is perhaps not surprising that the first substantive post in this occasional series on ontology best practices for data-driven applications begins with the importance of keeping an ABox and TBox split. Structured Dynamics has been beating the tom-tom for quite a while on this topic. We reiterate and expand on this position in this post.

The Relation to Description Logics

Description logics (DL) are one of the key underpinnings to the semantic Web. DL are a logic semantics for knowledge representation (KR) systems based on first-order predicate logic (FOL). They are a kind of logical metalanguage that can help describe and determine (with various logic tests) the consistency, decidability and inferencing power of a given KR language. The semantic Web ontology languages, OWL Lite and OWL DL (which stands for description logics), are based on DL and were themselves outgrowths of earlier DL languages.

Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. It is this construct for which Structure Dynamics generally reserves the term “ontology”.

The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (or individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles.

Natural and Logical Work Splits

TBox and ABox logic operations differ and their purposes differ. TBox operations are based more on inferencing and tracing or verifying class memberships in the hierarchy (that is, the structural placement or relation of objects in the structure). ABox operations are more rule-based and govern fact checking, instance checking, consistency checking, and the like. ABox reasoning is generally more complex and at a larger scale than that for the TBox.

Early semantic Web systems tended to be very diligent about maintaining these ‘box’ distinctions of purpose, logic and treatment. One might argue, as Structured Dynamics does, that the usefulness and basis for these splits has been lost somewhat more recently.

Particularly as we now see linked data become more prevalent, these same questions of scale and actual interoperability are posing real pragmatic challenges. To help aid this thinking, we have re-assembled, re-articulated and in some cases added to earlier discussions of the purposes of the TBox and ABox:

TBox TBox < — > ABox ABox
  • Definitions of the concepts and properties (relationships) of the controlled vocabulary
  • Declarations of concept axioms or roles
  • Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property
  • Equivalence testing as to whether two classes or properties are equivalent to one another
  • Subsumption, which is checking whether one concept is more general than another
  • Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept)
  • Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts
  • Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox
  • Infer property assertions implicit through the transitive property
  • Entailments, which are whether other propositions are implied by the stated condition
  • Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept
  • Knowledge base consistency, which is to verify whether all concepts admit at least one individual
  • Realization, which is to find the most specific concept for an individual object
  • Retrieval, which is to find the individuals that are instances of a given concept
  • Identity relations, which is to determine the equivalence or relatedness of instances in different datasets
  • Disambiguation, which is resolving references to the proper instance
  • Membership assertions, either as concepts or as roles
  • Attributes assertions
  • Linkages assertions that capture the above but also assert the external sources for these assignments
  • Consistency checking of instances
  • Satisfiability checks, which are that the conditions of instance membership are met

As the table shows, the TBox is where the reasoning work occurs, the ABox is where assertions and data integrity occurs, and knowledge base work in the middle (among other aspects) requires both. We can reflect these work splits via the following diagram:

TBox- and ABox-level work

This figure maps the work activities noted in the table, with particular emphasis on the possible and specialized work activities at the interstices between the TBox and ABox.

The Split Should Feel Natural

Whether a single database or the federation across many, we have data records (structs of instances) and a logical schema (ontology of concepts and relationships) by which we try to relate this information. This is a natural and meaningful split: structure and relationships v. the instances that populate that structure.

Stated this way, particularly for anyone with a relational database background, the split between schema and data is clear and obvious. While the relational data community has not always maintained this split, and the RDF, semantic Web and linked data communities have not often done so as well, this split makes eminent sense as a way to maintain a desirable separation of concerns.

The importance of description logics — besides its role as a logical underpinning to the semantic Web enterprise — is its ability to provide a perspective and framework for making these natural splits. Moreover, with some updated thinking, we can also establish a natural framework for guiding architecture and design. It is quite OK to also look to the interaction and triangulation of the ABox and TBox, as well as to specialized work that is not constrained to either.

For example, identity evaluation and disambiguation really come down to the questions of whether we are talking about the same or different things across multiple data sources. By analyzing these questions as separate components, we also gain the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means the work can be done by optimized search engines with built-in faceting (such as the excellent open-source Solr application).

These distinctions feel obvious and natural. They arise from a sound grounding in the split of the ABox and the TBox.

The Re-cap of Key Reasons to Maintain the TBox - ABox Split

So, to conclude this part in this occasional series, here are some of the key reasons to maintain a relative split between instances (the ABox) and the conceptual relationships that describe a world view for interpreting them (the TBox):

  • We are able to handle instance data simply. The nature of instance “things” is comparatively constant and can be captured with easily understandable attribute-value pairs
  • We can re-use these instance records in varied and multiple world views (the TBox). World views can be refined or approached from different perspectives without affecting instance data in the slightest
  • We can approach data architectural decisions from the standpoints of the work to be done, leaving open special analysis or tasks like disambiguation or full-text search
  • Ontologies (as defined by SD and focused on the TBox) are kept simpler and easier to understand. Inter-dataset relationships are asserted and testable in largely separate constructs, rather than admixed throughout
  • Relatedly, we are thus able to use ontologies to focus on the issues of mappings and conceptual relationships
  • Instance records can often be kept in situ, especially useful when incorporating the massive amounts of data in existing relational databases
  • Instance evaluations can be done separately from conceptual evaluations, which can help through triangulation in such tasks as disambiguation or entity identification
  • It is easier to convert simple data structs to the instance record structure, aiding interoperability (a subject for a later part in this series)
  • We provide a framework that is amenable to swapping in and out different analysis methods, and
  • It is easier for broader input when the task is adding and refining attributes rather than  internally consistent conceptual relationships.

Here is a final best practice suggestion when these ABox and TBox splits are maintained: Make sure as curators that new attributes added at the instance level are also added with their conceptual relationships at the TBox level. In this way, the knowledge base can be kept integral while we simultaneously foster a framework that eases the broadest scope of contributions.

Christian BeckerDBpedia Mobile and Marbles featured in Semantic Web for Dummies

Semantic Web for Dummies

The Semantic Web is so mainstream now: Oracle’s Jeff Pollock has actually published a Semantic Web for Dummies book - and it has a whopping 2 pages on DBpedia and DBpedia Mobile! Kudos for what looks like a much-needed overview.

DBTune BlogYahoo Hackday 2009

C

We went to the Yahoo Hackday this week end, with a couple of people from the C4DM and the BBC. Apart from a flaky wireless connection on the Saturday, it was a really great event, with lots of interesting talks and interesting hacks.

On the Saturday, we learned about Searchmonkey. I tried to create a small searchmonkey application during the talk, but eventually got frustrated. Apparently, Searchmonkey indexes RDFa and eRDF , but doesn't follow <link rel="alternate"/> links towards RDF representations (neither does it try to do content negotiation). So in order to create a searchmonkey application for BBC Programmes, I needed to either include RDFa in all the pages (which, hem, was difficult to do in an hour :-) ) or write an XSLT against our RDF/XML representations, which would just be Wrong, as there are lots of different ways to serialise the same RDF in an RDF/XML document.

We also learned about the Guardian Open Platform and Data Store, which holds a huge amount of interesting information. The license terms are also really permissive, even allowing commercial uses of this data. I can't even imagine how useful this data would be if it were linked to other open datasets, e.g. DBpedia, Geonames or Eurostat.

I got also a bit confused by YQL, which seems to be really similar to SPARQL, at least in the underlying concept ("a query language for the web"). However, it seems to be backed by lots of interesting data: almost all of Yahoo services, and a few third-party wrappers, e.g. for Last.fm. I wonder how hard it would be to write a SPARQL end-point that would wrap YQL queries?

Finally, on Saturday evening and Sunday morning, we got some time to actually hack :-) Kurt made a nice MySpace hack, which does an artist lookup on MySpace using BOSS and exposes relevant information extracted using the DBTune RDF wrapper, without having to look at an overloaded MySpace page. It uses the Yahoo Media Player to play the audio files this page links to.

At the same time, we got around to try out some of the things that can be built using the linked data we publish at the BBC, especially the segment RDF I announced on the linked data mailing list a couple of weeks ago. We built a small application which, from a place, gives you BBC programmes that feature an artist that is related in some way to that place. For example, Cardiff, Bristol, London or Lancashire. It might be bit slow (and the number of results are limited) as I didn't have time to implement any sort of caching. The application is crawling from DBpedia to BBC Music to BBC Programmes at each request. I just put the (really hacky) code online.

And we actually won the Backstage price with these hacks! :-)

This last hack illustrates to some extent the things we are investigating as part of the BBC use-cases of the NoTube project. Using these rich connections between things (programmes, artists, events, locations, etc.), it begins to be possible to provide data-rich recommendations backed by real stories (and not only "if you like this, you may like that"). I mentioned these issues in the last chapter of my thesis, and will try to follow up on that here!

AI3:::Adaptive Information (Mike Bergman)Ontology Best Practices for Data-driven Applications: Part 1

Structured Dynamics LLC

Concepts and an Introduction to this Occasional Series

Structured Dynamics is plowing virgin ground in how linked, structured data — powered by the flexible RDF data model — can establish new approaches useful to the enterprise. These approaches range from how applications are architected, to how data is shared and interoperated, and to how we even design and deploy applications and the data themselves.

At the core of this mindset is the concept of ‘data-driven apps‘, with their underlying structure based on ontologies. Over the coming weeks, I will be posting a series of best practices for how these ontologies can be designed, constructed and employed, and how they can shift the paradigm from static and inflexible applications to ones that are driven by the underlying data.

So, as the introduction to this occasional series, it is thus useful to define our terms and viewpoints. Clearly the two key concepts are:

  • Data-driven applications — this concept means the use of generic tools, applications and services that shape themselves and expose capabilities based on the structure of their underlying data. Generic means reusable. Unlike inflexible report writers or static tools of the past, these applications present functionality and are contextual based on the structure of the underlying data they serve. The data-driven aspects results from proper construction of the ontologies that describe this underlying data
  • Ontologies — ontologies have been something of a teeth-grinding concept for a couple of years, having been appropriated from their historical meaning of the nature of being (”ontos”) in philosophy to describe “shared conceptualizations” in computer science and knowledge engineering [1]. For its purposes, Structured Dynamics more precisely defines ontologies as the relationships of the concepts and domains embodied in the underlying things or instances described by the data. Under this approach, ontologies based on RDF become a structural representation of the data relationships in graph form. But, in addition, we also define ontologies to mean the proper description of these concepts, so as to supply the context, synonyms and aliases, and labels useful to human use and understanding.

We therefore put a fairly high threshold of construction and design on our ontologies. These imperatives provide the rationale for this series.

One complementary aspect to our design is the importance to get data in any form or serialization converted to the canonical RDF data model upon which the ontologies define and describe the data structure.  Though crucial, this aspect is not discussed further in this series.

Now, of course, when someone (me) has the chutzpah to posit “best practices” it should also be clear as to what end. Ontologies may be used for many things. Others may have as the aim completeness of domain capture, wealth of predicates, reasoning or inference. In our sense, we define “best practices” within our focus of data interoperability and data-driven apps. Your own mileage may vary.

In no particular order and with likely new topics to emerge, here is the current listing of what some of the other parts in this occasional series will contain:

  • Intro (concepts)
  • ABox - TBox split
  • Architecting (modularizing) ontologies into categories (e.g., UI/display of information; domains/instances; admin/internal apps)
  • Definition of a standard instance record vocabulary (ABox)
  • Role of an instance record vocabulary for universal struct ingest
  • Selection of core external ontologies and re-use
  • A deeper exploration of the data-driven application
  • Initial ontology building and techniques
  • Specific UI items suitable to be driven by ontologies (a listing of 20 or so items)
  • Techniques for mapping to external ontologies
  • Dataset interoperability and the myth that OWL is only useful for real-time reasoning, and
  • OWL mapping predicates, importance of class mappings, and OWL 2.

The idea throughout this series is to document best practices as encountered. We certainly do not claim completeness on these matters, but we also assert that good upfront design can deliver many free backend benefits.

If there is a particular topic missing from above that you would like us to discuss, please fire away! In any event, we will be giving you our best thinking on these topics over the coming weeks and how they might be important to you.


[1] Michael K. Bergman, 2007. An Intrepid Guide to Ontologies, May 16, 2007. See http://www.mkbergman.com/?p=374.

AI3:::Adaptive Information (Mike Bergman)Structured Dynamics Presenting at SemTech ‘09

2009 Semantic Technology Conference

First Unveiling of the Bibliographic Knowledge Network, New Product Announcement from SD

Well, it was just about six months ago that Fred Giasson and I announced our new company, Structured Dynamics. After a relatively quiet period and much laboring at the workbench, I will be presenting our new efforts at the 2009 Semantic Technology Conference, June 14th - 18th, at the Fairmont Hotel, in San Jose, California.

I will be speaking on, “BKN: Building Communities through Knowledge, and Knowledge Through Communities,” on Tuesday, June 16, during the 11:45 AM - 12:45 PM last morning session.

BKN — the Bibliographic Knowledge Network — is a major, two-year, NSF-funded project jointly sponsored by the University of California, Berkeley, Harvard University, Stanford University, and the American Institute of Mathematics, with broad private sector and community support [1]. Though initially nucleating around mathematics and statistics, each node in the network is a Web site or dataset distribution hub dedicated to a specific topic or field of knowledge. The project itself is developing a suite of tools and infrastructure based on semantic technologies for professionals, students or researchers to form new communities, and — with a single-click — to share and leverage expertise.

Besides presenting the BKN efforts for this first time, which includes an innovative and open Web services framework for collaboration, I will also be using the occasion of my talk to announce our new semantic Web and linked data RDF framework for content management systems. We’re pretty excited about these advances.

This is my first time at this premier semantic Web event, which has been steadily growing and now exceeds 1000 attendees. The agenda is most impressive; it will be difficult to decide which of the many excellent talks to choose from for each session.

If you will be at the meeting and would like to get together, drop me a note and we can schedule a time. I hope to see you there!


[1] Research supported by NSF Award 0835851, Bibliographic Knowledge Network.

Blog Data Space (Kingsley Idehen)Library of Congress & Reasonable Linked Data

While exploring the Subject Headings Linked Data Space (LCSH) recently unveiled by the Library of Congress, I noticed that the URI for the subject heading: World Wide Web, exposes an "owl:sameAs" link to resource URI: "info:lc/authorities/sh95000541" -- in fact, a URI.URN that isn't HTTP protocol scheme based.

The observations above triggered a discussion thread on Twitter that involved: @edsu, @iand, and moi. Naturally, it morphed into a live demonstration of: human vs machine, interpretation of claims expressed in the RDF graph.

What makes this whole thing interesting?

It showcases (in Man vs Machine style) the issue of unambiguously discerning the meaning of the owl:sameAs claim expressed in the LCSH Linked Data Space.

Perspectives & Potential Confusion

From the Linked Data perspective, it may spook a few people to see owl:sameAs values such as: "info:lc/authorities/sh95000541", that cannot be de-referenced using HTTP.

It may confuse a few people or user agents that see URI de-referencing as not necessarily HTTP specific, thereby attempting to de-reference the URI.URN on the assumption that it's associated with a "handle system", for instance.

It may even confuse RDFizer / RDFization middleware that use owl:sameAs as a data provider attribution mechanism via hint/nudge URI values derived from original content / data URI.URLs that de-reference to nothing e.g., an original resource URI.URL plus "#this" which produces URI.URN-URL -- think of this pattern as "owl:shameAs" in a sense :-)

Unambiguously Discerning Meaning

Simply bring OWL reasoning (inference rules and reasoners) into the mix, thereby negating human dialogue about interpretation which ultimately unveils a mesh of orthogonal view points. Remember, OWL is all about infrastructure that ultimately enables you to express yourself clearly i.e., say what you mean, and mean what you say.

Path to Clarity (using Virtuoso, its in-built Sponger Middleware, and Inference Engine):

  1. GET the data into the Virtuoso Quad store -- what the sponger does via its URIBurner Service (while following designated predicates such as owl:sameAs in case they point to other mesh-able data sources)
  2. Query the data in Quad Store with "owl:sameAs" inference rules enabled
  3. Repeat the last step with the inference rules excluded.

Actual SPARQL Queries:

Observations:

The SPARQL queries against the Graph generated and automatically populated by the Sponger reveal -- without human intervention-- that: "info:lc/authorities/sh95000541", is just an alternative name for < xmlns="http" id.loc.gov="id.loc.gov" authorities="authorities" sh95000541="sh95000541" concept="concept">, and that the graph produced by LCSH is self-describing enough for an OWL reasoner to figure this all out courtesy of the owl:sameAs property :-).

Hopefully, this post also provides a simple example of how OWL facilitates "Reasonable Linked Data".

Related

AI3:::Adaptive Information (Mike Bergman)A General Web-oriented Architecture (WOA) for Structured Data

Structured Dynamics LLC

An Abstract Web Services Framework Aids Broad Applicability

Structured Dynamics‘ product and software architecture is oriented to the Web. It emphasizes maximum flexibility, minimum “lock-in” and complete adaptability. This piece describes this architecture and how these aims are being met.

Design Objectives

Structured Dynamics is committed to what is known as a Web-oriented architecture (WOA) [1], which can be defined as:

WOA = SOA + WWW + REST

Nick Gall describes WOA as based on the architectural foundations of the Web, and is characterized by “globally linked, decentralized, and uniform intermediary processing of application state via self-describing messages.”

WOA is a subset of the service-oriented architectural style, wherein discrete functions are packaged into modular and shareable elements (”services”) that are made available in a distributed and loosely coupled manner. WOA uses the representational state transfer (REST) architectural style defined by Roy Fielding in his 2000 doctoral thesis; Fielding is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

REST and WOA stand in contrast to earlier Web service styles that are often known by the WS-* acronym (such as WSDL, etc.). WOA has proven itself to be highly scalable and robust for decentralized users since all messages and interactions are self-contained (convey “state”).

Structured Dynamics abstracts its WOA services into simple and compound ones (which are combinations of the simple). All Web services (WS) have uniform interfaces and conventions and share the error codes and standard functions of HTTP. We further extend the WOA definition and scope to include linked data, which is also RESTful. Thus, our WOA also sits atop an RDF (Resource Description Framework) database (”triple store”) and full-text search engine.

These Web services then become the middleware interaction layer for general access and querying (”endpoints”) and for tying in external software (”clients”), portals or content management systems (CMS). This design provides maximum flexibility, extensibility and substitutability of components.

Architecture & Components

Here is the basic overview of the architecture and its components. Each of its major components is described in turn as keyed by number:

General Structured Dynamics WOA Architecture

The Web Services Framework Middleware

The core to the system is the Web services middleware layer, or WS framework (WSF). Structured Dynamics is preparing this framework [see (1)] as a separately available open source package under Apache 2 license (soon to be released). It provides all of the components shown as items (2) to (5) in the diagram.

WSF is the abstraction layer that provides the Web service endpoints and services for external use. It also provides the direct hooks into the underlying RDF triple stores and full-text search engines that drive these services. At initial release, these pre-configured hooks will be to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted text search engine (via HTTP) [2]. However, the design also allows other systems to be substituted if desired or for other specialized systems to be included (such as an analysis or advanced inference engine).

Authentication/Registration WS

The controlling Web service in WSF is the Authentication/Registration WS [see (2)]. The initial version uses registered IP addresses as the basis to grant access and privileges to datasets and functional Web services. Later versions may be expanded to include other authentication methods such as OpenID, keys (à la Amazon EC2), foaf+ssl or oauth. A secure channel (HTTPS, SSH) could also be included.

Other Core Web Services

The other core Web services provided with WSF are the CRUD functional services (create - read - update - delete), import and export, browse and search, and a basic templating system [see (3)]. These are viewed as core services for any structured dataset.

In initial release, the import and export formats will likely include TSV, RDF/XML, RDF/N3, RDF/Turtle, XML and possibly JSON.

Datasets, Access and Use Rights

A simple but elegant system guides access and use rights. First, every Web service is characterized as to whether it supports one or more of the CRUD actions. Second, each user is characterized as to whether they first have access rights to a dataset and, if they do, which of the CRUD permissions they have [see (4, 5)]. We can thus characterize the access and use protocol simply as A + CRUD.

Thereafter, a simple mapping of dataset access and CRUD rights determines whether a user sees a given dataset and what Web services (”tools”) are presented to them and how they might manipulate that data. When expressed in standard user interfaces this leads to a simple contextual display of datasets and tools.

At the Web service layer, these access values are set parametrically. The system, however, is designed to more often be driven by user and group management at the CMS level via a lightweight plug-in or module layer.

A Structured Data Foundation

Fundamentally, this “data-driven application” works because of its structured data foundation [see (6)]. Structured Dynamics employs an innovative design that exposes all RDF and record aspects to full-text search and is able to present contextual (”faceted”) results at all times in the interface [2]. In addition, the Virtuoso universal server provides RDF triple store and related structured data services.

The actual “driver” for the structured data system is one or more schema (”ontologies”) setting all of these structured data conditions. These ontologies are also managed by the triple store. The definition of these ontologies is specified in such a way with accompanying documentation to enable new scopes or domains to easily drive the system.

Interactions with CMSs and External Clients

As described by the diagram so far, all interactions with the system have been mediated either by Web service APIs or external endpoints, such as SPARQL.

For external clients or any HTTP-accessible system [see (10)], this is sufficient. Programmatically, external clients (software) may readily interact with the WS and obtain results via parametric requests.

However, the framework is also designed to be embedded within existing content management systems (CMSs). For this purpose, an additional layer is provided.

The architecture of the system can support interactions with standard open source CMSs or app frameworks such as Ruby on Rails, Django, Joomla!, WordPress, Drupal or Liferay, as examples [see (9)].

CMS interaction first occurs via specific modules or plug-ins written for that system [see (7)]. These are very lightweight wrappers that conform to the registry and hooks of the host CMS system. The actual modules or plug-ins provided are also geared to the management style of the governing CMS and what it offers [see (8)]. Each module or plug-in wrapper is a packaging decision of how to bound the WSF Web services in a configuration appropriate to the parent CMS.

This design keeps the actual tie-ins to the CMS as a very thin wrapper layer, which can embrace an open source licensing basis consistent with the host CMS. Because all of the underlying functionality has been abstracted in the WSF framework, licensing integrity across all deployment layers is maintained while allowing broad CMS interoperability. The design also allows networks to be established of multiple portals or nodes with different CMSs, perfect for broad-scale collaboration. User choice and flexibility is retained to the max.

In this design, the CMS retains its prominence and visibility (and, indeed, the standard admin and licensing basis). The WSF, specific Web services, and structured data backend remain largely invisible to the user.

Benefits of the Design

This design has manifest benefits, some of which include:

  • Broad suitability to a diversity of deployment architectures, operating systems and scales
  • Substitutability of underlying triple stores and text engines
  • Substitutability of preferred content management systems
  • Access and use of Web service endpoints independent of CMS (external clients)
  • Performant Web-oriented architecture (WOA) design
  • Common, RESTful interface for all Web services and functions in the framework
  • Easy registration of new Web services, inclusion with authorization system
  • Ability to share and interoperate data independent of client CMSs or portals
  • Use of the common lingua franca of RDF and its general advantages [3].

A Note on Tools and Philosophy

Structured Dynamics has the twin philosophies of using the best tools available yet also to give its customers and clients full choice. For instance, SD believes the Virtuoso system to be the best RDF triple store with superior functionality. Our internal benchmarks affirm its performance. Virtuoso is our standard first recommendation for a performant triple store.

Yet our architecture and design is not dependent on this application, nor indeed any other application. Deployment environments, customer preference, or pre-existing installations sometimes warrant the use of certain tools or applications. Large collaboration networks necessarily spring from diversity. It is thus critical that SD’s designs and architectures be tool-neutral and allow swapping and substitution. This is a major reason for the WOA and other RESTful design aspects of our Web services framework.

SD brings particular strengths in architecture, proper splits and design that separate ABox and TBox functionalilty [4], and ontology use and development. All of our designs are meant to be as tool neutral as possible, and we are always seeking the best of class in open source tools for any category.

Over the coming weeks and months Structured Dynamics will be rolling out packages and distribution sites for access to this framework and components built around this philosophy [5]. Stay tuned!


[1] See http://www.mkbergman.com/?cat=153.
[2] Frédérick Giasson, 2009. RDF Aggregates and Full Text Search on Steroids with Solr, April 29, 2009. See http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/.
[3] Michael K. Bergman, 2009. Advantages and Myths of RDF, white paper from Structured Dynamics LLC, April 22, 2009, 13 pp. See http://www.mkbergman.com/wp-content/themes/ai3/files/2009Posts/Advantages_Myths_RDF_090422.pdf.
[4] As per our standard use:

Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] This Structured Dynamics design has been aided in part as a result of research supported by NSF Award 0835851, Bibliographic Knowledge Network. The general Web services design and architecture is based on the system already developed for the UMBEL Web services.

Orri ErlingSocial Web Camp (#5 of 5)

(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.

By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload.

Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts twitting and re-twitting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.

There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.

There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.

I participated in discussions on security and privacy and on mobile social networks and context.

For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.

There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic's Data Patrol should be a part of the social web infrastructure of the future.

People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.

In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.

For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.

Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.

There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.

Orri ErlingWeb Science and Keynotes at WWW 2009 (#4 of 5)

(Fourth of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the panels and keynotes, in no special order.

In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world's data once the web had transformed into a database capable of answering arbitrary queries.

Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more?

I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson's words about vigilance and the price of freedom. But vigilance is harder today, not because information is not there but because there is so much of it, with diverse spins put on it.

Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal.

I'd have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of "For whosoever hath, to him shall be given." [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for knowledge.

The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example.

Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. Aside from the facts that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out.

At the "20 years of web" panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web's actual scalability to its rapid adoption and the culture of "if I do my part, others will do theirs." On the minus side, the emergence of spam and phishing were mentioned as unexpected developments.

Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and HTTP, which hypertext-wise were nothing special), there is a tipping point.

No barrier of entry, not too much modeling, was repeated quite a bit, also in relation to semantic web and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world's information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts — one half of it in the transport, the other in the applications, yet a narrow interface between the two.

This then begs the question of content- and application-aware networks. The preponderance of opinion was for separation of powers — keep carriers and content apart.

Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. It seems to me that that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that "for the beginner, rivers are rivers and mountains are mountains, for the student these are imponderable mysteries of bewildering complexity and transcendent dimension but for the master these are again rivers and mountains." One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not and knows to communicate concerning these as fits the situation.

There is no fixed formula for saying where complexities and simplicities fit, relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: The x86 instruction set, TCP/IP, SQL, to name a few. These are lucky breaks, it is very hard to say beforehand where these will emerge. Object oriented people would like to see such everywhere, which just leads to problems of modeling.

There was a keynote from Telefonica about infrastructure. We heard that the power and cooling cost more than the equipment, that data centers ought to be scaled down from the football stadium and 20 megawatt scale, that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. For Telefonica, this is about efficiency of bulk delivery, for us this is more about virtualized query-able dataspaces. Both will be distributed but issues of separation of powers may keep the two roles of network with storage separate.

In conclusion, the network being the database was much more visible and accepted this year than last. The linked data web was in Tim B-L's keynote as it was in the opening speech by the Prince of Asturias.

Orri ErlingShort Recap of Virtuoso Basics (#3 of 5)

(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.

Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.

If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogenous DBMS.

On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.

Installation is simple, just one exe and one config file. There is a huge amount of code in installers — application code and test suites and such — but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.

Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.

To condense further:

  • Scalable Delivery of Linked Data
  • SPARQL and SQL
    • Arbitrary RDF Data + Relational
    • Also From 3rd Party RDBMS
  • Easy Deployment
  • Standard Interfaces

Orri ErlingSearch at WWW 2009 (#2 of 5)

(Second of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There was a workshop on semantic search plus a number of papers and of course keynotes from Google and Yahoo.

A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.

The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.

Giovanni Tummarello presented Sig.ma, a service using Sindice's RDF index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources into our browser too. Fresnel lenses are another thing to look at.

There was a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community based ranking model but has issues of privacy. Another point was that with local processing and personal scale data volumes various kinds of brute force processing were feasible that would cost a lot for the web scale. Much can be done web scale but it must be done cleverly, not with a shell script and not so ad hoc.

As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based SQL-like framework. One could do an SQL GROUP BY on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their critique of map reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.

Some of our future plans were confirmed by what we saw, for example as concerns:

  • Interactively selecting sources for search, showing the graphs, then interactively refining
  • More social networks, more network analysis, and more work on social recommendation
  • Real time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting micro-formats from results
  • Using entity extraction

These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others, details will be worked out while we work on the items we implement by ourselves.

Frederick GiassonRDF Aggregates and Full Text Search on Steroids with Solr

Preamble

As I explained in my latest blog post, I am now starting to talk about a couple of things I have been working on in the last few months that will lead to a release, by Structured Dynamics, in the coming months. This blog post is the first step into that path. Enjoy!

Introduction

I have been working with RDF, SPARQL and triple stores for years now. I have created many prototypes and online services using these technologies. Having the possibility to describe everything with RDF, and having the possibility to index everything in a triple store that you can easily query the way you want using SPARQL, is priceless. Using RDF saves development and maintenance cost because of the flexibility of store (triple store), the query language (SPARQL), and associated schemas (ontologies).

However, even if this set of technologies can do everything, quickly and efficiently, it is not necessarily optimal for all tasks you have to do. As we will see in this blog post, we use RDF for describing, integrating and managing any kind of data (structured or unstructured) that exists out there. RDF + Ontologies are what we use as the canonical expression of any kind of data. It is the triple store that we use to aggregate, index and manage that data, from one or multiple data sources. It is the same triple store that we use to feed any other system that can be used in our architecture. The triple store is the data orchestrator in any such architecture.

In this blog post I will show you how this orchestrator can be used to create Solr indexes that are used in the architecture to perform three functions that Solr has been built to perform optimally: full-text search, aggregates and filtering. So, while a triple store can perform these functions, it is not optimal for what we have to do.

Overview

The idea is to use the RDF data model and a triples store to populate the Solr schema index. We leverage the powerful and flexible data representation framework (RDF), in conjunction with the piece of software that lets you do whatever you want with that data (Virtuoso), to feed a carefully tailored Solr schema index to optimally perform three things: full-text search, aggregates and filtering. Also, we want to leverage the ontologies used to describe this data to be able to infer things vis-à-vis these indexed resources in Solr. This leverage enables us to use inference on full-text search, aggregates and filtering, in Solr! This is quite important since you will be able to perform full text searches, filtered by types that are inferred!

Some people will tell me that they can do this with a traditional relational database management system: yes. However, RDF + SPARQL + Triple Store is so powerful to integrate any kind of data, from any data sources; it is so flexible that it saves precious development and maintenance resources: so money.

Solr

What we want to do is to create some kind of “RDF” Solr index. We want to be able to perform full-text searches on RDF literals; we want to be able to aggregate RDF resources by the properties that describe them, and their types; and finally we want to be able to do all the searches, aggregation and filtering using inference.

So the first step is to create the proper Solr schema that will let you do all these wonderful things.

The current Solr index schema can be downloaded here. (View source if simply clicking with your browser.)

Now, let’s discuss this schema.

Solr Index Schema

A Solr schema is composed of basically two things: fields and type of fields. For this schema, we only need two types of fields: string and text. If you want more information about these two types, I would refer you to the Solr documentation for a complete explanation of how they work. For now, just consider them as strings and texts.

What interests us is the list of defined fields of this schema (again, see download):

  • uri [1] - Unique resource identifier of the record
  • type [1-N] - Type of the record
  • inferred_type [0-N] - Inferred type of the record
  • property [0-N] - Property identifier used to describe the resource and that has a literal as object
  • text [0-N] (same number as property) - Text of the literal of the property
  • object_property [0-N] - Property identifier used to describe the resource where the object is a reference to another resource and that this other resource can be described by a literal
  • object_label [0-N] (same number as object_property) - Text used to refer to the resource referenced by the object_property

Full Text Search

A RDF document is a set of multiple triples describing one or multiple resources. Saying that you are doing full-text searches on RDF documents is certainly not the same thing as saying that you are doing full-text searches on traditional text documents. When you describe a resource, you rarely have more than a couple of strings, with a couple of words each. It is generally the name of the entity, or a label that refers to it. You will have different numbers, and sometimes some description (a short biography, or definition, or summary, as examples). However, except if you index an entire text document, the “textual abundance” is quite poor compared to an indexed corpus of documents.

In any case, this doesn’t mean that there are no advantages in doing full-text searches on RDF documents (so, on RDF resource descriptions). But, if we are going to do so, let’s do so completely, and in a way that meets users’ expectations for full-text document search.  By applying this mindset, we can apply some cool new tricks!

Intuitively the first implementation of a full-text search index on RDF documents would simply make a key-value pair assignment between a resource URI and its related literals. So, when you perform a full-text search for “Bob”, you get a reference on all the resources that have “Bob” in one of the literals that describe these resources.

This is good, but this is not enough. This is not enough because this breaks the more basic behavior for any users that uses full-text search engines.

Let’s say that I know the author of many articles is named “Bob Carron”. I have no idea what are the titles of the articles he wrote, so I want to search for them. With the system exposed above, if I do a search for “Bob Carron”, I will most likely get back as a result the reference to “Bob Carron”, the author person. This is good, but this is not enough.

On the results page, I want the list of all articles that Bob wrote! Because of the nature of RDF, I don’t have this “full-text” information of “Bob” in the description of the articles he wrote. Most likely, in RDF, Bob will be related to the articles he wrote by reference (object reference with the URIs of these articles), i.e., <this-article> <author> <bob-uri>. As you can notice, we won’t get back any articles in the resultset for the full-text query “Bob Carron” because this textual information doesn’t exist in the index at the level of the articles he wrote!

So, what can we do?

A simple trick will beautifully do the work. When we create the Solr index, what we want is to add the textual information of the resources being referenced by the indexed resources. For example, when we create the Solr document that describes one of the articles written by Bob, we want to add the literal that refers to the resource(s) referenced by this article. In this case, we want to add the name of the author(s) in the full-text record of that article. So, with this simple enhancement, if we do a search for “Bob Carron”, we will now get the list of all resources that refers to Bob too! (articles he wrote, other people that know him, etc).

So, this is the goal of the “object_property” and “object_label” fields of the Solr index. In the schema above, the “object_property” would be “author” and the “object_label” would be “Bob Carron”. This information would belong to the Solr document of the Article 1.

Full Text Search Prototype

Let’s take a look at the prototype running system (see screen capture below).



The dataset loaded in this prototype is Mike’s Sweet Tools. As you notice in the prototype screen, many things can be done with the simple Solr schema we published above. Let’s start with a search for the word “test”. First, we are getting a resultset of 17 things that have the “test” word in any of their text-indexed fields.

What is interesting with that list is the additional information we now have for each of these resultsets that come from the RDF description of these things, and the ontologies that have been used to describe them.

For example, if we take a look at Result #4, we see that the word “test” has been found in the description of the Ontology project for the “TONES  Ontology Repository” record. Isn’t that precision far more useful than saying: the word “test” has been found in “this webpage”? I’ll let you think about it.

Also, if we take a look at Result #1, we know that the word “test” has been found in the homepage of the Data Converter Project for the”Talis Semantic Converter” record.

Additionally, by leveraging this Solr index, we can do efficient aggregates on the types of the things returned in the resultset for further filtering. So, in the section “Filter by kinds” we know what kinds of things are returned for the query “test” against this dataset.

Finally, we can use the drop-down box at the right to do a new search (see screenshot), based on the specific kind of things indexed in the system. So, I could want to make a new search, only for “Data specification projects” with the keyword “rdf”. I already know from the user interface that there are 59 such projects.

All this information comes form the Solr index at query time, and basically for free by virtue of how we set up the system. Everything is dynamically aggregated and displayed to the user.

However, there are a few things that you won’t notice here that are used:  1) SPARQL queries to the triple store to get some more information to display on that page; 2) the use of inference (more about it below), and; 3) the leveraging of the ontologies descriptions.

In any case, on one of SD’s test datasets of about 3 million resources, such a page is generated within a few hundred milliseconds: resultset, aggregates, inference and description of things displayed on that page.  This same 3 million resources that returns results in a few hundred milliseconds did so on a small Amazon EC2 server instance for 10 cents per hour. How’s that for performance?!

Aggregates and Filtering on Properties and Types

But, we don’t want to merely do full-text search on RDF data. We also want to do aggregates (how many records has this type, or this property, etc.) and filtering, at query time, in a couple of milliseconds. We already had a look at these two functions in the context of a full-text search. Now let’s see it in action in some dataset prototype browsing tools that uses the same Sweet Tools dataset.

In a few milliseconds, we get the list of different kind of things that are indexed in a given dataset. We can know what are the types, and what is the count for each of these types. So, the ontologies drive the taxonomic display of the list of things indexed in the dataset, and Solr drives the aggregation counts for each of these types of things.

Additionally, the ontologies and the Virtuoso inference rules engine are used to make the count, by inference. If we take the example of the type “RDF project”, we know there are 49 such projects. However, not all these projects are explicitly typed with the “RDF project” type. In fact, 7 of these “RDF project” are “RDF editor project” and 6 are “RDF generator project”.

This is where inference can play an important role: an article is a document. If I browse documents, I want to include articles as well. This “broad context retrieval” is driven by the description of the ontologies, and by inference; this is the same thing for these projects; and this is the same thing for everything else that is stored as structured RDF and characterized by an ontology.

The screenshot above shows how these inferences and their nestings could present themselves in a user interface.

Once the user clicks on one of these types, he starts to browse all things of that type. On the next screenshot below, Solr is used to add filters based on the attributes used to describe these things.

In some cases, I may want to see all the Projects that have a review. To do so, I would simply add this filter criteria on the browsing page and display the “Projects” that have a “review” of them. And thanks to Solr, I already know how many such Projects have reviews, right before even taking a look at them.

Note, then, on this screenshot that the filters and counts come from Solr.  The list of the actual items returned in the resultset comes from a SPARQL query, and the name of the types and properties (and their descriptions) come from the description of the ontologies used.

This is what all this stuff is about: creating a symbiotic environment where all these wonderful systems live together to do the effective management of the structured data.

Populating the Solr Index

Now that we know how to use Solr to perform full-text searches, and the aggregating and filtering of structured data, one question still remains: how do we populate this index? As stated at above, the goal is to manage all the structured data of the system using a triple store and ontologies. Then it is to use this triple store to populate the Solr index.

Structured Dynamics uses the Virtuoso Open Source as the triple store to populate this index for multiple reasons. One of the main ones is for its performance and its capability to do efficient basic inference. The goal is to send the proper SPARQL queries to get the structured data that we will index in the Solr schema index that we talked about above. Once this is done, all the things that I talked about in this blog post become possible, and efficient.

Syncing the Index

However, in such a setup, we have to keep one thing in mind: each time the triple store is updated (a resource is created, deleted or updated), we have to sync the Solr index according to these modifications.

What we have to do is to detect any change in the triple store, and to reflect this change into the Solr index. What we have to do is to re-create the entire Solr document (the resource that changed in the triple store) using the <add /> operation.

This design raises an issue with using Solr: we cannot simply modify one field of a record. We have to re-index the entire description of the document even if we want to modify a single field of any document. This is a limitation of Solr that is currently addressed in this new feature proposition; but it is not currently available for prime time.

Another thing to consider here is to properly sync the Solr index with any ontology changes (at the level of the class description) if you are using the inference feature. For example, assume you have an ontology that says that class A is a sub-class-of class B. Then, assume the ontology is refined to say that class A is now a sub-class-of class C, which itself is a sub-class-of class B. To keep the Solr index synced with the triple store, you will have to perform all modifications that affect all the records of these types. This means that the synchronization doesn’t only occur at the level of the description of a record; but also at the level of the changes in the ontologies used to describe those records.

Conclusion

One of the main things to keep in mind here is that now, when we develop Web applications, we are not necessarily talking about a single software application, but a group of software applications that compose an architecture to deliver a service(s). In any such architecture, what is at the center of it is Data.

Describing, managing, leveraging and publishing this data is at the center of any Web service. It is why it is so important to have the right flexible data model (RDF), with the right flexible query language (SPARQL), and the right data management system (triple store) in place. From there, you can use the right tools to make it available on the Web to your users.

The right data management system is what should be used to feed any other specific systems that compose the architecture of a Web service. This is what we demonstrated with Solr; but it is certainly not limited to it.

Blog Data Space (Kingsley Idehen)Linked Data & Identity

A person, organization, place, idea, subject matter topic/heading, and other real world things possess "identity" -- that is, a constellation of characteristics that distinguish them from any other identity. Associated with this abstraction can be a label used as a reference, or "identifier". This is the distinction between a thing and the name of the thing.

section from IETF's Domain Keys spec. (paraphrased by me)

.

The Linked Data meme is based on the use of HTTP based URIs as reference / identifier labels associated with the "identity abstraction" referred to above. Thus, when you de-reference (request information about) an HTTP based URI you ultimately end up with a resource URL that exposes the "constellation of characteristics" mentioned above, in a representation negotiated at request time -- between an HTTP client and server e.g., (X)HTML, JSON, XML, RDF/XML, N3, Turtle, Trix, others :-)

Related

Blog Data Space (Kingsley Idehen)What is the Linked Data Meme about?

The act of using URIs to "refer to" (reference) Web addressable data objects. It's also the act of using the same URI to de-reference the description of a referenced data object; in this case, the representation of the description is negotiated by a Web client and/or Web server. Thus, you can access the description of a data object via data representation formats such as: JSON, XML, (X)HTML, RDF/XML, N3, Turtle, TriX etc.

Note: In proper Web parlance, a data object is referred to as a resource.

Simple example (using DBpedia)

In the Linked Data realm, If you want to make a reference to the Linked Data meme in a blog post, you are better off using the resource URI: http://dbpedia.org/resource/Linked_Data, instead of the Web page URL: http://dbpedia.org/page/Linked_Data, which is the address of a physical document (an information conveying artifact) that at best visually presents the negotiated representation of a resource description.

Why is this valuable?

In the simplest sense, you only have one focal point for referencing (referring to) and de-referencing (retrieving data about) a given Web resource. It protects you from the impact of Web document location changes (amongst many other things).

Remember, a single URI is a conduit into a realm where the identity, access, representation, presentation, and storage of a resource (data object) are completely distinct. It's the mechanism for conducting data across network, machine, operating system, dbms engine, application, and service (API) boundaries. Thus, without "linked data meme" prescribed URI referencing and de-referencing, we are simply back to "business as usual" re. the industry at large, where networks, operating systems, dbms engines, applications, and services (APIs) become the basis for "data lock-in" and silo construction.

Going forward

Take a second to think about the profound virtues of the ubiquitous Web of Linked Document URLs that we have today, and then apply that thinking to the burgeoning Web of Linked Data URIs, that has just turned corner and heading in everyone's direction at full blast.

Note to "Social Media" players: Who you know isn't the canonical object of sociality. What you are i.e., your description and the data objects it exposes, are real objects of your sociality :-)

Related

Orri ErlingLinked Data at WWW 2009 (#1 of 5)

(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.

The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.

To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.

People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.

Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.

We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in SQL and SPARQL.

To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.

Entry costs into relatively high end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. Like Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.

A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.

I answered as follows, which apparently cannot be repeated too much:

First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for database, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogenous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.

Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.

There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entity LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).

We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.

As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.

For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.

Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.

For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.

Blog Data Space (Kingsley Idehen)Simple Explanation of RDF and Linked Data Dynamics

What is RDF?

The acronym stands for: Resource Description Framework. And that's just what it is.

RDF is comprised of a Data Model (EAV/CR Graph) and Data Representation Formats such as: N3, Turtle, RDF/XML etc.

RDF's essence is about: "Entities" and "Attributes" being URI based, while "Values" may be URI or Literals (typed or untyped) based.

URIs are Entity Identifiers.

What is Linked Data?

Short for "Web of Linked Data" or "Linked Data Web".

A term coined by TimBL that describes an HTTP based "data access by reference pattern" that uses a single pointer or handle for "referring to" and "obtaining actual data about" an entity.

Linked Data uses the deceptively simple messaging scheme of HTTP to deliver a granular entity reference and access mechanism that transcends traditional computing boundaries such as: operating system, application, database engines, and networks.

How are Linked Data & RDF Related?

Linked Data simply mandates the following re. RDF:

  • URIs should be HTTP based so that you can "refer to" (Reference) an Entity, its Attributes, or URI based Attribute values via the Web (infact any HTTP based network e.g., Intranets and Extranets)
  • URIs should also be HTTP based so that you can use them to de-reference resource descriptions via the Web (or Intranets and Extranets).

Note: by Entity I am also referring to: a resource (Web parlance), data item, data object, real-world object, or datum.

Linked Data is also about, using URIs and HTTP's content negotiation feature to separate: presentation, representation, access, and identity of data items. Even better, content negotiation can be driven by user agent and/or data server based quality of service algorithms (representation preference order schemes).

To conclude, Linked Data is ultimately about the realization that: Data is the new Electricity, and it's conductors are URIs :-)

Tip to governments of the world: we are in exponential times, the current downturn is but one side of the "exponential times ledger", the other side of the "exponential times ledger" is simply about unleashing "raw data" -- in structured form -- into the Web, so that "citizen analysts" can blossom and ultimately deliver the transparency desperately sought at every level of the economic value chain. Think: "raw data ready" whenever you ponder about "shovel ready" infrastructure projects!

Blog Data Space (Kingsley Idehen)Take N: Yet Another OpenLink Data Spaces Introduction

Problem:

Your Life, Profession, Web, and Internet do not need to become mutually exclusive due to "information overload".

Solution:

A platform or service that delivers a point of online presence that embodies the fundamental separation of: Identity, Data Access, Data Representation, Data Presentation, by adhering to Web and Internet protocols.

How:

Typical post installation (Local or Cloud) task sequence:

  1. Identify myself (happens automatically by way of registration)
  2. If in an LDAP environment, import accounts or associate system with LDAP for account lookup and authentication
  3. Identify Online Accounts (by fleshing out profile) which also connects system to online accounts and their data
  4. Use Profile for granular description (Biography, Interests, WishList, OfferList, etc.)
  5. Optionally upstream or downstream data to and from my online accounts
  6. Create content Tagging Rules
  7. Create rules for associating Tags with formal URIs
  8. Create automatic Hyperlinking Rules for reuse when new content is created (e.g. Blog posts)
  9. Exploit Data Portability virtues of RSS, Atom, OPML, RDFa, RDF/XML, and other formats for imports and exports
  10. Automatically tag imported content
  11. Use function-specific helper application UIs for domain specific data generation e.g. AddressBook (optionally use vCard import), Calendar (optionally use iCalendar import), Email, File Storage (use WebDAV mount with copy and paste or HTTP GET), Feed Subscriptions (optionally import RSS/Atom/OPML feeds), Bookmarking (optionally import bookmark.html or XBEL) etc..
  12. Optionally enable "Conversation" feature (today: Social Media feature) across the relevant application domains (manage conversations under covers using NNTP, the standard for this functionality realm)
  13. Generate HTTP based Entity IDs (URIs) for every piece of data in this burgeoning data space
  14. Use REST based APIs to perform CRUD tasks against my data (local and remote) (SPARQL, GData, Ubiquity Commands, Atom Publishing)
  15. Use OpenID, OAuth, FOAF+SSL, FOAF+SSL+OpenID for accessing data elsewhere
  16. Use OpenID, OAuth, FOAF+SSL, FOAF+SSL+OpenID for Controlling access to my data (Self Signed Certificate Generation, Browser Import of said Certificate & associated Private Key, plus persistence of Certificate to FOAF based profile data space in "one click")
  17. Have a simple UI for Entity-Attribute-Value or Subject-Predicate-Object arbitrary data annotations and creation since you can't pre model an "Open World" where the only constant is data flow
  18. Have my Personal URI (Web ID) as the single entry point for controlled access to my HTTP accessible data space

I've just outlined a snippet of the capabilities of the OpenLink Data Spaces platform. A platform built using OpenLink Virtuoso, architected to deliver: open, platform independent, multi-model, data access and data management across heterogeneous data sources.

All you need to remember is your URI when seeking to interact with your data space.

Related

  1. Get Yourself a URI (Web ID) in 5 Minutes or Less!
  2. Various posts over the years about Data Spaces
  3. Future of Desktop Post
  4. Simplify My Life Post by Bengee Nowack

Displacement Activities (Tom Heath)The Semantic Web is the Cake…but the Technologies are not the Layers

My last post about the relationship between Linked Data, the Semantic Web and the Semantic Web technology stack seemed to create more debate and disagreement than clarity. Not to be discouraged by this, I’ve been giving some more thought to analogies that may help to illuminate the the relationship between different concepts in the Semantic Web space. This got me thinking about the Semantic Web layer cake.

The layer cake diagram is probably one of the most used and abused images associated with the Semantic Web vision. In the diagram, each technology or concept in the Semantic Web stack is a layer, with Crypto providing some kind of irregular icing down one side. I’d like to propose a different interpretation of the Semantic Web as a cake.

In my view, the technologies aren’t layers in the finished cake, they’re the raw ingredients that must be mixed and baked to make the cake that is the Semantic Web itself. URIs are the grains of flour; an ingredient that is essential but by itself rather bland, and lacking form and coherence. RDF triples are the egg that can bind together this URI flour. This cake is taking shape, but it’s lacking flavour. In the Semantic Web cake these flavourings, such as FOAF, SIOC, or the Programmes Ontology, are concocted on a base of RDFS and OWL. Simple cakes based on one or two flavours can be very tasty, but for our Semantic Web cake to be truly delicious we want a wide range of flavours, with some dominating others in different parts of the cake.

Once we’ve baked our cake, by putting our RDF data online according to the Linked Data principles, we’ll probably want to decorate it. Perhaps some icing or cherries on top, in the form of inferred RDF triples, would make it even more delicious. With such an appealing data cake it’s inevitable that people will want to consume it, but we have to make sure that everyone can have a slice rather than letting a big data gluton run off with the cake and deprive everyone else of this treat. We need some sort of knife; preferably one like SPARQL, that allows people to help themselves to the parts of the cake they like best. Will the cake baked and decorated, and will all the tools in place, it’s time to invite some friends round (maybe even some agents) and start consuming.

(I see that Jim Hendler’s keynote at ESWC2009 will talk about the layer cake; I’m intrigued to see how he choses to serve up the analogy).

AI3:::Adaptive Information (Mike Bergman)UMBEL Now Included in SearchMonkey

SearchMonkey

SearchMonkey’s Recommended Vocabularies a Useful Resource

I am pleased to report that UMBEL is now included as one of the recommended vocabularies for the Yahoo! SearchMonkey service. Using SearchMonkey, developers and site owners can use structured data to enhance the value of standard Yahoo! search results and customize their presentation, including through “infobars“. SearchMonkey is integral to a concerted effort by Yahoo! to embrace structured data, RDF and the semantic Web.

SearchMonkey was first announced in February 2008 with a beta release in April and then public release in May with 28 supported vocabularies. Then, last October, an additional set of common, external vocabularies were recommended for the system including DBpedia, Freebase, GoodRelations and SIOC. At the same time, some further internal Yahoo! vocabularies and standard Web languages (e.g., OWL, XHTML) were also added.

This is the first vocabulary update since then. Besides UMBEL, the AB Meta and Semantic Tags vocabularies have also been added to this latest revision. (There have also been a few deprecations over time.)

A recommended vocabulary means that its namespace prefix is recognized by SearchMonkey. The namespaces for the recommended vocabularies are reserved. Though site owners may customize and add new SearchMonkey structure, they must be explicitly defined in specific DataRSS feeds.

Structured data may be included in Yahoo! search results from these sources:

  • Yahoo! Index — the core Yahoo! search data with limited structure such as the page’s title, summary, file size, MIME type, etc. This structure is only provided by Yahoo!
  • Semantic Web Data — including microformats and RDF data embedded in the host page
  • Data Feed — A feed of Yahoo! native DataRSS provided by a third party site
  • Custom Data Service — Any data extracted from an (X)HTML page or web service and represented within SearchMonkey as DataRSS.

As a recommended vocabulary, UMBEL namespace references can now be embedded and recognized (and then presented) in Yahoo! search results.

The Current Vocabulary Set

Here are the 34 current vocabularies (plus five deprecated) recognized by the system:

Prefix Name Namespace
abmeta AB Meta http://www.abmeta.org/ns#
action SearchMonkey Actions http://search.yahoo.com/searchmonkey/action/
assert SearchMonkey Assertions (deprecated) http://search.yahoo.com/searchmonkey/assert/
cc Creative Commons http://creativecommons.org/ns#
commerce SearchMonkey Commerce http://search.yahoo.com/searchmonkey/commerce/
context SearchMonkey Context (deprecated) http://search.yahoo.com/searchmonkey/context/
country SearchMonkey Country Datatypes http://search.yahoo.com/searchmonkey-datatype/country/
currency SearchMonkey Currency Datatypes http://search.yahoo.com/searchmonkey-datatype/currency/
dbpedia DBPedia http://dbpedia.org/resource/
dc Dublin Core http://purl.org/dc/terms/
fb Freebase http://rdf.freebase.com/ns/
feed SearchMonkey Feed http://search.yahoo.com/searchmonkey/feed/
finance SearchMonkey Finance http://search.yahoo.com/searchmonkey/finance/
foaf FOAF http://xmlns.com/foaf/0.1/
geo GeoRSS http://www.georss.org/georss#
gr GoodRelations http://purl.org/goodrelations/v1#
job SearchMonkey Jobs http://search.yahoo.com/searchmonkey/job/
media SearchMonkey Media http://search.yahoo.com/searchmonkey/media/
news SearchMonkey News http://search.yahoo.com/searchmonkey/news/
owl OWL ontology language http://www.w3.org/2002/07/owl#
page SearchMonkey Page (deprecated) http://search.yahoo.com/searchmonkey/page/
product SearchMonkey Product http://search.yahoo.com/searchmonkey/product/
rdf RDF http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs RDF Schema http://www.w3.org/2000/01/rdf-schema#
reference SearchMonkey Reference http://search.yahoo.com/searchmonkey/reference/
rel SearchMonkey Relations (deprecated) http://search.yahoo.com/searchmonkey-relation/
resume SearchMonkey Resume http://search.yahoo.com/searchmonkey/resume/
review Review http://purl.org/stuff/rev#
sioc SIOC http://rdfs.org/sioc/ns#
social SearchMonkey Social http://search.yahoo.com/searchmonkey/social/
stag Semantic Tags http://semantictagging.org/ns#
tagspace SearchMonkey Tagspace (deprecated) http://search.yahoo.com/searchmonkey/tagspace/
umbel UMBEL http://umbel.org/umbel/sc/
use SearchMonkey Use Datatypes http://search.yahoo.com/searchmonkey-datatype/use/
vcal VCalendar http://www.w3.org/2002/12/cal/icaltzd#
vcard VCard http://www.w3.org/2006/vcard/ns#
xfn XFN http://gmpg.org/xfn/11#
xhtml XHTML http://www.w3.org/1999/xhtml/vocab#
xsd XML Schema Datatypes http://www.w3.org/2001/XMLSchema#

In addition, there are a number of standard datatypes recognized by SearchMonkey, mostly a superset of XSD (XML Schema datatypes).

What is emerging from this Yahoo! initiative is a very useful set of structured data definitions and vocabularies. These same resources can be great starting points for non-SearchMonkey applications as well.

For More Information

There is quite a bit of online material now available for SearchMonkey, with new expansions and revisions also accompanying this most recent release. As some starting points, I recommend:

DBTune BlogBrands, series, categories and tracklists on the new BBC Programmes

I just posted a small article on the BBC Radio Labs blog about the new features of the BBC Programmes website. Hopefully that makes some sense and highlights some of the things we've been working on over the last six months! Spoiler: lots of nice nice RDF :-)

Frederick GiassonStarting of a New Era

More than three months ago I announced the creation of Structured Dynamics LLC. Since then I stayed mute on my blog because I was too busy doing research, software and business development for this new venture with Mike. However things are moving fast here; and now is the time to start talking about some of the technical stuff we have been working on for about five months now.

I am not announcing anything right now, and won’t for the next couple of months neither. As you are probably expecting, we are working on some products and services. I can say that everything will be released in the open source domain under the Apache 2 license. However I won’t say anything else about this thing.

But, what I will do in the following days, weeks and months is to start talking about the underlying technologies of this system and about methods we used to solve some of our problems. There won’t be any particular order, so I will only talk about interesting stuff we found when it pleases me. We think we are making some useful advances, especially on the architectural and design side, and look forward to sharing what we are learning with you.

Stay tuned!

AI3:::Adaptive Information (Mike Bergman)Open Source as an Antidote to the ‘Agency Society’

From http://steynian.wordpress.com/

The Financial Meltdown May Suggest Some Lessons for IT

John Bogle, the retired founder of The Vanguard Funds and a long-standing clarion voice of financial good sense, observes that we have moved from an “ownership society” to an “agency society” in the past 50 years.

In an article yesterday by Gretchen Morgenson in the NYT [1], Bogle points to the loss of fiduciary responsibility as one of many factors enabling the recent financial meltdown. In moving from an “ownership society” to one of agency where external parties handle many of our affairs, we not only have passed the buck, but we have done so to parties that have too often not looked after our own best interests.

In the financial realm, this is called “care of duty” and it is part and parcel of an agent’s fiduciary responsibility. Fiduciary may sound fancy, but it simply means that a party you entrust is indeed trustworthy to look after your interests. So many have failed us, and we have failed ourselves to oversee and ensure this duty is being met.

I suspect the same failure and lack of oversight has been true for quite some time in IT, as well. We hire consultants to write the specs, select the vendors, and then oversee delivery. An “agent society” has predictably led to failure rates for major IT projects that all of us often quote as exceeding 60%.

Shame, shame.

A significant — but not often publicized — advantage of open source software is that we, as users, are now completely free to inspect and test the basis of the code. This is goodness. And, it removes the distance that is inherent to “agency”.

Now, to square the circle and make sure we are indeed getting what we expect and pay for, we also have the responsibility as consumers to inspect and verify. The beauty of open source is that what runs our systems is now transparent.

But, whether it works or not, or whether we are getting what we paid for, is still our responsibility. Open source or not, some things never change . . . . It is just that with open source, it is a whole lot easier.

Tag: We’re it.


 [1] Gretchen Morgenson, “He Doesn’t Let Money Managers Off the Hook,” published in the Sunday Business section of the New York Times, April 12, 2009; see http://www.nytimes.com/2009/04/12/business/12gret.html?em.

AI3:::Adaptive Information (Mike Bergman)Advantages and Myths of RDF

RDF logo

A 10th Birthday Salute to RDF’s Role in Powering Data Interoperability

There has been much welcomed visibility for the semantic Web and linked data of late. Many wonder why it has not happened earlier; and some observe progress has still been too slow. But what is often overlooked is the foundational role of RDF — the Resource Description Framework.

From my own perspective focused on the issues of data interoperability and data federation, RDF is the single most important factor in today’s advances. Sure, there have been other models and other formulations, but I think we now see the Goldilocks “just right” combination of expressiveness and simplicity to power the foreseeable future of data interoperability.

So, on this 10th anniversary of the birth of RDF [1], I’d like to re-visit and update some much dated discussions regarding the advantages of RDF, and more directly address some of the mis-perceptions and myths that have grown up around this most useful framework.

By request, this article is now available as a PDF download.

A Simple Intro to RDF

RDF is a data model that is expressed as simple subject-predicate-object “triples”. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It may sound like a kindergarten reader, but it is how data can be easily represented and built up into more complex structures and stories.

A triple is also known as a “statement” and is the basic “fact” or asserted unit of knowledge in RDF. Multiple statements get combined together by matching the subjects or objects as “nodes” to one another (the predicates act as connectors or “edges”). As these node-edge-node triple statements get aggregated, a network structure emerges, known as the RDF graph.

The referenced “resources” in RDF triples have unique identifiers, IRIs, that are Web-compatible and Web-scalable. These identifiers can point to precise definitions of predicates or refer to specific concepts or objects, leading to less ambiguity and clearer meaning or semantics.

In my own company’s approach to RDF, basic instance data is simply represented as attribute-value pairs where the subject is the instance itself, the predicate is the attribute, and the object is the value. Such instance records are also known as the ABox. The structural relationships within RDF are defined in ontologies, also known as the TBox, which are basically equivalent to a schema in the relational data realm.

RDF triples can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.

There are many excellent introductions or tutorials to RDF; a recommended sampling is shown in the endnotes [2].

Is RDF a Framework, Data Model or Vocabulary?

Well, the answer to the rhetorical question is, all three!

The RDF data model provides an abstract, conceptual framework for defining and using metadata and metadata vocabularies. See: We were able to use all three concepts in a single sentence!

The RDF model draws on well-established principles from various data representation communities. RDF properties may be thought of as attributes of resources and in this sense correspond to traditional attribute-value pairs. RDF properties also represent relationships between resources and an RDF model can therefore resemble an entity-relationship diagram. . . . In object-oriented design terminology, resources correspond to objects and properties correspond to instance variables. [1]

But, actually, because RDF is simultaneously a framework, data model and basis for building more complex vocabularies, it is both simple and complex at the same time.

It is first perhaps best to understand basic RDF as a data model of triples with very few (or unconstrained) semantics [3]. In its base form, it has no range or domain constraints; has no existence or cardinality constraints; and lacks transitive, inverse or symmetrical properties (or predicates) [4]. As such, basic RDF has limited reasoning support. It is, however, quite useful in describing static things or basic facts.

In this regard, RDF in its base state is nearly adequate for describing the simple instances and data records of the world, what is called the ABox in description logics.

RDFS (RDF Schema) is the next layer in the RDF stack designed to overcome some of these baseline limitations. RDFS introduces new predicates and classes that bound these semantics. Importantly, RDFS establishes the basic constructs necessary to create new vocabularies, principally through adding the class and subClass declarations and adding domain and range to properties (the RDF term for predicates). Many useful vocabularies have been created with RDFS and it is possible to apply limited reasoning and inference support against them.

The next layer in the RDF stack is OWL, the Web Ontology Language. It, too, is based on RDF. The first versions of OWL were themselves layered from OWL Lite to OWL DL to OWL Full. OWL Lite and OWL DL are both decidable through the first-order logic basis of description logics (the basis for the acronym in OWL DL). OWL Full is not decidable, but provides an OWL counterpart to fragmented RDF and RDFS statements that are desirable in the aggregate, with reasoning applied where possible.

OWL provides sufficient expressive richness to be able to describe the relationships and structure of entire world views, or the so-called terminological (TBox) construct in description logics. Thus, we see that the complete structural spectrum of description logics can be satisfied with RDF and its schematic progeny, with a bit of an escape hatch for combining poorly defined or structural pieces via using OWL Full [5].

However, RDF is NOT a particular serialization. Though XML was the original specified serialization and still is the defined RDF MIME type (application/rdf+xml; other serializations take the form text/turtle or text/n3 or similar), it is not necessary to either write or transmit RDF in the XML syntax.

In any event, depending on its role and application, we can see that RDF is a foundation, in careful expressions based in description logics, that lends itself to a clean expression and separation of concerns. With RDF and RDFS, we have a data model and a basis for vocabularies well suited to instance data (ABox). With RDFS and OWL, we have an extended schema structure and ontologies suitable for describing and modeling the relationships in the world (TBox). Thus, RDF is a framework for modeling all forms of data, for describing that data through vocabularies, and for interoperating that data through shared conceptualizations (ontologies) and schema.

Rationale for a Canonical Data Model

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable. Why? Simply because of 2N v N2. That is, a single reference (“canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to external sources and formats (N) only requires converters to the canonical form (2N). Without a canonical model, the combinatorial explosion of required format converters becomes N2 [6].

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like.

In most enterprises and organizations, the relational data model with its supporting RDBMs is the canonical one. In some notable Web enterprises – say, Google for example – the exact details of its internal canonical data model is hidden from view, with APIs and data exchange standards such as GData being the only visible portions to outside consumers.

Generally speaking, a canonical, internal data standard should meet a few criteria:

  • Be expressive enough to capture the structure and semantics of any contributing dataset
  • Have a schema itself which is extensible
  • Be performant
  • Have a model to which it is relatively easy to develop converters for different formats and models
  • Have published and proven standards, and
  • Have sufficient acceptance so as to have many existing tools and documentation.

Other desired characteristics might be for the model and many of its tools to be free and open source, suitable to much analytic work, efficient in storage, and other factors.

Though the relational data model is numerically the most prevalent one in use, it has fallen out of favor for data federation purposes. This loss of favor is due, in part, to the fragile nature of relational schema, which increases maintenance costs for the data and their applications, and incompatibilities in standards and implementation.

Though still comparatively young with a smaller-than-desirable suite of tools and applications support [7], RDF is perhaps the ideal candidate for the canonical data model. To understand why, let’s now switch our discussion to the advantages of RDF.

Advantages of RDF

It is surprisingly difficult to find a consolidated listing of RDF’s advantages. The W3C, the developer of the specification, first published on this topic in the late 1990s, but it has not been updated for some time [8]. Graham Klyne has a better and more comprehensive presentation, but still one that has not been updated since 2004 [4].

I believe data interoperability to be RDF’s premier advantage, but there are many, many others.

Another advantage that is less understood is that RDF and its progeny can completely switch the development paradigm: data can now drive the application, and not the other way around. Frankly, we are just at the beginning realizations of this phase with such developments as linked data and even whole applications or application languages being written in RDF [9], but I think time will prove this advantage to be game-changing.

But, there are many perspectives that can help tease out RDF’s advantages. Some of these are discussed below, with the accompanying table attempting to list these ‘Top Sixty’ advantages in a single location.

Standard, Open and Expressive

In its ten year history, RDF has spawned many related languages and standards. The W3C has been the shepherd for this process, and there are many entry locations on the World Wide Web Consortium’s Web site to begin exploring these options [10]. These standards extend from the RDF, RDFS and OWL vocabularies and languages noted above that give RDF its range of expressiveness, to query languages (e.g., SPARQL), transformation languages (e.g., GRDDL), rule languages (e.g., RIF), and many additional constructs and standards.

The richness of this base of standards is only now being tapped. The combination of these standards and the tools they are spawning is just beginning. And, because it is so easily serialized as XML, a further suite of tools and capabilities such as XPath or XSLT or XForms may be layered onto this base.

Moreover, one is not limited in any way to XML as a serialization. RDF itself has been serialized in a number of formats including RDF/XML, N3, RDFa, Turtle, and N-triples. Also, RDF’s simple subject-predicate-object data model can readily convert human-readable and easily authored instance records (subject) written in the style of attribute-value pairs (predicate-object). As such, RDF is an excellent conversion target for all forms of naïve data structs [11].

Data Interoperability

Indeed, it is in data exchange and interoperability that RDF really shines. Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.

“The semantic Web’s real selling point is URI-based data integration.”
Harry Halpin [12]

Because of this universality, there are now more than 100 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [13]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Generalized conversion languages such as GRDDL provide framework-specific conversions, such as for microformats.

Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful composition of data from different applications regardless of format or serialization.

Simple RDF structures and predicates enable synonyms or aliases to also be easily mapped to the same types or concepts. This kind of semantic matching is a key capability of the semantic Web. It becomes quite easy to say that your glad is my happy, and they indeed talk about the same thing.

What this mapping flexibility points to is the immense strengths of RDF in representing diverse schema, the next major advantage.

Schema Unbound

The single failure of data integration since the inception of information technologies — for more than 30 years, now — has been schema rigidity or schema fragility. That is, once data relationships are set, they remain so and can not easily be changed in conventional data management systems nor in the applications that use them.

Relational database management (RDBM) systems have not helped this challenge, at all. While tremendously useful for transactions and enabling the addition of more data records (instances, or rows in a relational table schema), they are not adaptive nor flexible.

Why is this so?

In part, it has to do with the structural “view” of the world. If everything is represented as a flat table of rows and columns, with keys to other flat structures, as soon as that representation changes, the tentacled connections can break. Such has been the fragility of the RDBMS model, and the hard-earned resistance of RDBMS administrators to schema growth or change.

Yet, change is inevitable. And thus, this is the source of frustration with virtually all extant data systems.

RDF has no such limitations. And, for those from a conventional data management perspective, this RDF flexibility can be one of the more unbelievable aspects of this data model.

As we have noted earlier, RDF is well suited and can provide a common framework to represent both instance data and the structures or schema that describe them, from basic data records to entire domains or world views. In fact, whatever schema or structure that characterizes the input data — from simple instance record layouts and attributes to complete vocabularies or ontologies — also embodies domain knowledge. This structure can be used at time of ingest as validity or consistency checks.

As a framework for data interoperability, RDF and its progeny can ingest all relations and terminology, with connections made via flexible predicates that assert the degree and nature of relatedness. There is no need for ingested records or data to be complete, nor to meet any prior agreement as to structure or schema.

Increment, Evolve, Extend, Adapt . . .

Indeed, the very fluidity of RDF and structures based on it is another key strength. Since a basic RDF model can be processed even in the absence of more detailed information, input data and basic inferences can proceed early and logically as a simple fact basis. This strength means that either data or schema may be ingested and then extended in an incremental or partial manner. Partial representations can be incorporated as readily as complete ones, and schema can extend and evolve as new structure is discovered or encountered.

This is revolutionary. RDF provides a data and schema representation framework that can evolve and adapt to what data exists and what structure is known. As new data with new attributes are discovered, or as new relationships are found or realized, these can be added to the existing model without any change whatsoever to the prior existing schema.

This very adaptability is what enables RDF to be viewed as data-driven design. We can deal with a partial and incomplete world; we can learn as we go; we can start small and simple and evolve to more understanding and structure; and we can preserve all structure and investments we have previously made.

And applications based on RDF work the same way: they do not need to process or account for information they don’t know or understand. We can easily query RDF models without being affected whatsoever by unreferenced or untyped data in the basic model.

By replacing the rigid relational data model with one based on RDF, we gain robustness, flexibility, universality and structural persistence over fragility.

Existing technologies such as SQL and the relational model were devised without the specific requirements of disparate, uncontrolled, large-scale integration. Though the relational model enabled us to build efficient data silos and transaction systems, RDF now enables us to finally federate them.

‘Top Sixty’ Benefits of RDF
  • A foundation based in description logics that lends itself to clean expression and separation of concerns regarding instance data (ABox) and schema structure (TBox)
  • RDF’s unique identifiers, IRIs, are Web-compatible and are Web-scalable
  • Potential use of inferencing to contextually broaden search, retrieval and analysis
  • Potential use of its structure to automatically drive applications and tools, including populating context-relevant dropdown lists and auto-completion
  • Based on open source, languages and standards
  • A comparatively complete suite of specifications including languages, schema and tools (e.g., RDF, RDF Schema, OWL, RIF, SPARQL, GRDDL, etc.)
  • A choice of a variety of serializations and notations, including RDF/XML, N3, RDFa, Turtle, N-triples, as well as possible expression in many non-RDF notations
  • Instance records in human-readable, easily authored attribute-value formats can be readily converted to the s-p-o RDF “triple” data model
  • Can capture metadata and structure from unstructured, semi-structured and structured data
  • More than 100 off-the-shelf ‘RDFizers’ exist for converting various non-RDF notations (data formats and serializations)
  • Easy and cost-effective incorporation of new datasets wherein only new attributes require a structure update; all others simply get mapped
  • Aggregate processing of disparate sources as if they came from a single source
  • A ready structure for synonym and alias matching when merging or matching datasets
  • In converting non-RDF data, the ability to bring a more formal class structure to the description of things
  • Common framework and vocabulary for representing instance data
  • Common framework and vocabulary for representing data structures and schema
  • Can describe simple data structs to complete vocabularies/ontologies to processing and inferencing rules
  • Schema can be calculated from the ingested triples; thus, can either generate schema from scratch or be used to cross-check prior schema
  • Can accept and store data with different structure in a general RDF container (e.g., all animals v a specific bird)
  • Eliminates the trade-off between good design and performance for related structure (e.g., full names v first and last names)
  • Untyped relations can still have single operations performed against them
  • More formal RDF structures (e.g., ontologies) embody domain expertise within their subject structure
  • Readily extensible with schema that are also machine readable, bringing about a high degree of automation
  • Allows data that is structured slightly differently to be stored together in the lowest common denominator of an RDF statement
  • No need for upfront schema agreement; can evolve, extend and adapt
  • Allows the schema to change independently of the data without requiring any existing data to be thrown away or padded with NULLs
  • The basic RDF model can be processed in the absence of more detailed information as a simple fact basis
  • Schema based on RDF can be extended and grown incrementally without impacting the existing datastore
  • As a corollary, development based on RDF can also be incremental, reducing the need to “design it at once” or “design it right” up front
  • RDF models and apps lend themselves to experimentation and agile development
  • Information can be gathered incrementally from multiple sources
  • Data and schema can be ingested, represented and conveyed in “partial” form
  • Structure and schema can evolve incrementally in concert with new understandings and new data
  • All prior investments in structure and schema can be maintained
  • Because of conceptual closeness to the relational data model, it is possible to represent RDF in a relational database and vice versa
  • RDF thus has the ability to take advantage of historical RDBMs and SQL query optimizations
  • Ability to create RDF “views” or wrappers over relational schema that can be queried via SPARQL
  • A common storage format based on the triple or quad; suitable for datastore hosting by relational database management systems
  • The use of untyped relations reduces the total number of relations to be handled, with operations over them only needed once
  • Relational systems can serve instance data in situ (ABox) while interoperability is provided by an RDF structural and schema layer (TBox)
  • Ability to do specialized work, such as inferencing
  • Use of a set-based semantics and queries
  • Via its SPARQL query language, easy mechanisms to drive faceted search and other browsing and viewing tools
  • Because of how RDF works it is possible to query a dataset without knowing anything about the data in advance
  • Ability to generalize selection, viewing and publishing tools driven solely from the RDF structure; as the structure changes, tools automatically reflect those changes (e.g., plug-and-play)
  • Can easily create and apply inferencing tables over RDF datastores [14]
  • The RDF graph brings all the advantages and generality of structuring information using graphs
  • A graph is, itself, a unique form of data type with unique algorithms and analytic features
  • Graphs are modular and can be readily combined or broken apart
  • Graphs can be used for scalable, parallelized information processing
  • Unique types of search and discovery can occur with RDF graphs
  • Graphs provide the ability to visualize and navigate large network structures
  • Queries are unaffected by unreferenced values in the source data
  • Emerging lingua franca of the semantic Web and ‘Web of data’
  • Strong compatibility with “linked data” based on Web access (HTTP) and IRI identifiers
  • RDF is readily adaptable to the open-world assumption (OWA)
  • Relation to the semantic Web means much global information and data can be admixed with local content
  • Across all global sources the potential for powerful data “mesh-ups” conjoining related information
  • Network effects such as shared vocabularies, shared background knowledge, collective authoring, annotating and curating, and
  • RDF is an emergent data model.

Yet, Still Kissing Cousins with the Relational Model

Despite these differences in fragility and robustness, there are in fact many logical and conceptual affinities between the relational model and the one for RDF. An excellent piece on those relations was written by Andrew Newman a bit over a year ago [15].

RDF can be modeled relationally as a single table with three columns corresponding to the subject-predicate-object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBM systems with a tweak, sometimes as “quad stores” where the fourth tuple is the graph.) Moreover, these affinities also mean that RDF stored in this manner can also take advantage of the historical learnings around RDBMS and SQL query optimizations.

Just as there are many RDFizers as noted above, there are also nice ways to convert relational schema to RDF automatically. OpenLink Software, for example, has its RDF “Views” system that does just that [16]. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF [17], focused on methods and specifications for mapping relational schema to RDF.

What is emerging is one vision whereby existing RDBM systems retain and serve the instance records (ABox), while RDF and its progeny provide the flexible schema scaffolding and structure over them (TBox). Architectures such as this retain prior investments, but also provide a robust migration path for interoperating across disparate data silos in a performant way.

Data-driven Applications

As developers, one of our favorite advantages of RDF is its ability to support data-driven applications. This makes even further sense when combined with a Web-oriented architecture that exposes all tools and data as RESTful Web services [18].

Two tool foundations are the RDF query language, SPARQL [19], and inferencing. SPARQL provides a generalized basis for driving reports and templated data displays, as well as standard querying. Utilizing RDF’s simple triple structure, SPARQL can also be used to query a dataset without knowing anything in advance about the data. This provides a very useful discovery mode.

Simple inferencing can be applied to broaden and contextualize search, retrieval and analysis. Inference tables can also be created in advance and layered over existing RDF datastores [14] for speedier use and the automatic invoking of inferencing. More complicated inferencing means that RDF models can also perform as complete conceptual views of the world, or knowledge bases. Quite complicated systems are emerging in such areas as common sense (with OpenCyc) and biological systems [20], as two examples.

RDF ontologies and controlled vocabularies also have some hidden power, not yet often seen in standard applications: by virtue of its structure and label properties, we can populate context-relevant dropdown lists and auto-complete entries in user interfaces solely from the input data and structure. This ability is completely generalizable solely on the basis of the input ontology(ies).

A Graph Representation

As the intro noted, when RDF triples get combined, a graph structure emerges. (Actually, it can most formally be described as a directed graph.) A graph structure has many advantages. While we are seeing much starting to emerge in the graph analysis of social networks, we could also fairly argue that we are still at the early stages of plumbing the unique features of graph (”network”) structures.

Graphs are modular and can be both readily combined and broken apart. From a computational standpoint, this can lend itself to parallelized information processing (and, therefore, scalability). With specific reference to RDF it also means that graph extractions are themselves valid RDF models.

Graph algorithms are a significant field of interest within mathematics, computer science and the social sciences. Via approaches such as network theory or scale-free networks, topics such as relatedness, centrality, importance, influence, “hubs” and “domains”, link analysis, spread, diffusion and other dynamics can be analyzed and modeled.

Graphs also have some unique aspects in search and pattern matching. Besides options like finding paths between two nodes, depth-first search, breadth-first search, or finding shortest paths, emerging graph and pattern-matching approaches may offer entirely new paradigms for search.

Graphs also provide new approaches for visualization and navigation, useful for both seeing relationships and framing information from the local to global contexts. The interconnectedness of the graph allows data to be explored via contextual facets, which is revolutionizing data understanding in a way similar to how the basic hyperlink between documents on the Web changed the contours of our information spaces [21].

Many would argue (as do I), that graphs are the most “natural” data structure for capturing the relationships of the real world. If so, we should continue to see new algorithms and approaches emerge based on graphs to help us better understand our information. And RDF is a natural data model for such purposes.

Open World Applications and the Semantic Web

Ultimately, data interoperability implies a global context. The design of RDF began from this perspective with the semantic Web.

This perspective is firstly grounded in the open-world assumption: that is, the information at hand is understood to be incomplete and not self-contained. Missing values are to be expected and do not falsify what is there. A corollary assumption is there is always more information that can be added to the system, and the design should not only accommodate, but promote, that fact.

As the lingua franca for the semantic Web, using RDF means that many new data, structures and vocabularies now become available to you. So, not only can RDF work to interoperate your own data, but it can link in useful, external data and schema as well.

Indeed, the concept of linked data now becomes prominent whereby RDF data with unique IRIs as their universal identifiers are exposed explicitly to aid discovery and interlinking. Whether internal data is exposed in the linked data manner or not, this external data can now be readily incorporated into local contexts. The Linking Open Data movement that is promoting this pattern has become highly successful, with billions of useful RDF statements now available for use and consumption [10].

The semantic Web and RDF is enabling the data federation scope to extend beyond organizational boundaries to embrace (soon) virtually all public information. That means that, say, local customer records can now be supplemented with external information about specific customers or products. We are really just at the nascent stages of such data “mesh-ups” with many unforeseen benefits (and, challenges, too, such as privacy and identity and ownership) likely to emerge.

At Web scales, we will see network effects also emerge in areas such as shared vocabularies, shared background knowledge, and collective authoring, annotating and curating. To be sure the traditional work of trade associations and standards bodies will continue, but likely now in much more operable ways.

Myths of RDF

Throughout the years, a number of myths have grown up around RDF. Some, unfortunately, were based on the legacy of how RDF was first introduced and described. Other myths arise from incomplete understanding of RDF’s multiple roles as a framework, data model, and basis for vocabularies and conceptual descriptions of the world.

The accompanying table lists the “Top Ten” of myths I have found to date. I welcome other pet submissions. Perhaps soon we can get to the point of a clearer understanding of RDF.

‘Top Ten’ Myths of RDF
  1. RDF is equivalent to XML — perhaps the biggest PR error in RDF’s first introduction was to tie RDF so closely with XML. RDF is a data model as described herein that has no dependence on XML and exists in abstract form separate from it
  2. RDF is written or expressed in XML — in a related way, RDF can be serialized (expressed) in many forms other than XML
  3. RDF and OWL are independent — OWL is a language grounded in RDF and a natural extension of the RDF “stack”; OWL is at the full expressiveness end of the spectrum [22, 23]
  4. RDF is a serialization — no; XML is a serialization, RDF is a data model, framework and basis for constructing vocabularies
  5. Basic RDF has no semantics — though limited and purposefully free, basic RDF in fact has extremely well considered semantics; an essential document for any practitioner is [3]
  6. RDF is too complex — it depends, right? At the level of the basic triple, RDF is extremely simple and is the best place to start learning about RDF
  7. RDF is too simple — it depends, right? At the level of OWL ontologies, RDF can capture virtually any relationship and aspect of the world; see [5] for a great start
  8. RDF is useful for “large” datasets only — the real purpose of RDF is data interoperability, which is needed any time two or more datasets are combined, regardless of size
  9. (Conversely and paradoxically), RDF is not scalable — this premise is still being tested, but we now have very large-scale experience with the government and in the Billion Triples Challenge
  10. RDF is not performant — daily we keep learning more about optimizations, query and re-write strategies, and the like. Orri Erling [24] does some of the best work around in this area and writes lucid explanations on his blog. Moreover, RDF systems are easily embedded in WOA architectures, which prove themselves daily at global Web scales.

Conclusion

Emergence is the way complex systems arise out of a multiple of relatively simple interactions, exhibiting new and unforeseen properties in the process. RDF is an emergent model. It begins as simple “fact” statements of triples, that may then be combined and expanded into ever-more complex structures and stories.

As an internal, canonical data model, RDF has advantages over any other approach. We can represent, describe, combine, extend and adapt data and their organizational schema flexibly and at will. We can explore and analyze in ways not easily available with other models.

And, importantly, we can do all of this without the need to change what already exists. We can augment our existing relational data stores, and transfer and represent our current information as we always have.

We can truly call RDF a disruptive data model or framework. But, it does so without disrupting what exists in the slightest. And that is a most remarkable achievement.

 


[1] Actually, it is just a few weeks past. The first RDF specification was published as: Ora Lassila and Ralph R. Swick, eds., 1999. “Resource Description Framework (RDF) Model and Syntax Specification,” W3C Recommendation, 22 February 1999; see http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/. Of course, RDF had been in development under various names for some time. To my knowledge, the first public explanation specific to the RDF name was by Tim Bray, “RDF and Metadata,” on XML.com, June 09, 1998; see http://www.xml.com/pub/a/98/06/rdf.html.  I’m measuring RDF’s birthday in relation to it being published as an official standard (recommendation) per the first reference.
[2] I first recommend an older introduction by Ian Davis, http://research.talis.com/2005/rdf-intro/. There is a more recent, shorter version by Davis and Tom Heath, The 30 Minute Guide to RDF and Linked Data, at http://www.slideshare.net/iandavis/30-minute-guide-to-rdf-and-linked-data. Also, Joshua Tauberer’s write-up at http://www.rdfabout.com/intro/? is quite excellent.
[3] Patrick Hayes, 2004. “RDF Semantics,” a W3C Recommendation, February 2004. See http://www.w3.org/TR/rdf-mt/.
[4] Graham Klyne, 2004. “Semantic Web and RDF,” on the Nine by Nine Web site (http://www.ninebynine.net/), 4 May 2004; see http://www.ninebynine.org/Presentations/20040505-KelvinInsitute.pdf.
[5] The soon-to-be-released recommendation of OWL 2 is best introduced through the recent: OWL 2 Working Group, eds., 2009. “OWL 2 Web Ontology Language: Document Overview,” W3C Working Draft, 27 March 2009; see http://www.w3.org/TR/owl2-overview/.
[6] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.
[7] Still, my own Sweet Tools listing of RDF and -related tools now contains nearly 800 items.
[8] The RDF Advantages Page; see http://www.w3.org/RDF/advantages.html.
[9] See, for example, Neno, the Semantic Web Programming Environment, at: http://neno.lanl.gov/Home.html; and Ripple, at http://code.google.com/p/ripple/. The developers of these systems are now combining efforts.
[10] Here are some useful starting points for RDF at the World Wide Web Consortium (W3C): Begin at the W3C’s ESW wiki. The Linking Open Data community maintains its own people and projects listings as well. Current topics are discussed on the W3C’s semantic Web mailing lists. The W3C maintains a good general semantic Web tools, with specific listings of RDF Triplestores.
[11] Michael Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” on the AI3 blog, January 22, 2009; see http://www.mkbergman.com/?p=471. And, Ibid, 2009. “‘Making Linked Data Reasonable using Description Logics, Part 4,” on the AI3 blog, February 23, 2009; see http://www.mkbergman.com/?p=478.
[12] Harry Halpin, video interview with Marcos Caceres, “GRDDL, Bridging the Interwebs?,” August 4, 2008, on StandardsSuck.org. See http://standardssuck.org/grddl-bridging-the-interwebs.
[13] See, for example, these Virtuoso RDF cartridges (http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtSponger) or listing of RDFizers (http://simile.mit.edu/wiki/RDFizers).
[14] OpenLink Software, 2009. “17.6. Inference Rules & Reasoning,”, part of the online Virutoso User Manual; see: http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html.
[15] Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; see http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html.
[16] OpenLink Software, 2009. “17.4.3. RDF Views over RDBMS Data Source,” part of the online Virutoso User Manual; see: http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews. Also see OpenLink Software, 2007. Virtuoso RDF Views – Getting Started Guide, v1.1, June 2007; see http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf.
[17] W3C, 2009. RDB2RDF Working Group Charter, revised February 24, 2009; see http://www.w3.org/2005/Incubator/rdb2rdf/WG-draft-charter/.
[18] See further my various blog posts on Web-oriented architecture (WOA).
[19] Especially recommended as an introductory tutorial is: Lee Feigenbaum, 2008. “SPARQL By Example: A Tutorial,” Sept. 17, 2008; see http://www.cambridgesemantics.com/2008/09/sparql-by-example.
[20] Many disciplines are embracing RDF. But, in biology, some exemplar projects are the Bio2RDF genomics project; the Linking Open Drug Data (LODD) initiative, which is a sub-project of the W3C’s broader Health Care and Life Sciences Interest Group (HCLSIG); the Neurocommons project; and the RDF branches of the Open Biomedical Ontologies (OBO) project and foundry.
[21] A very nice visualization of graph-driven structures in relation to information discovery and navigation is provided by Rama Hoetzlein, 2007. Quanta: The Organization of Human Knowedge: Systems for Interdisciplinary Research, a Master’s Thesis, University of California, Santa Barbara, June 2007; see http://www.rchoetzlein.com/quanta/index.htm.
[22] The original phrasing of this Myth used the term “distinct”, which Ted Thibodeau Jr rightly questioned. This myth goes to the heart of what I think is a false separation of the RDF and OWL “camps”. As the intro noted, I see a natural progression from RDF –> RDFS –> OWL, with the transition representing more precise semantics and expressiveness. Describing simple things simply, especially for linked data as mostly practiced, works well in RDF and RDFS. Once world views and conceptual schema are desired for inter-relating these things, RDFS and OWL become the better option. OWL Full (including OWL 2, see [23]) is fully grounded in RDF semantics. However, since OWL Full is not decidable, a subset of that, OWL DL, is still expressible as RDF but now consistent with description logics. This approach can provide more inferencing and reasoning power, at the slight cost of greater care in the semantics used and relationships asserted. In the end, the “distinction” between RDF and OWL is really a difference in use cases and intentions, imo.
[23] Michael Schneider, ed., 2009. OWL 2 Web Ontology Language RDF-Based Semantics, W3C Working Draft 21 April 2009; see http://www.w3.org/TR/owl2-rdf-based-semantics/.
[24] See Orri Erling’s Weblog at: http://www.openlinksw.com/weblogs/oerling/.

Joshua TaubererSPARQL OLE DB Provider

Andy Gueritz announced on the mail list for my SemWeb RDF library for .NET that he has created an OLE provider for a SPARQL endpoint that is usable in Microsoft Excel. He wrote,

In a moment of insanity (but a great learning experience), I gave myself the challenge of writing an OLE DB provider for SPARQL. It is built on top of the SemWeb libary which has saved a substantial amount of effort and also brings some powerful functionality to the table very quickly (Thanks, Joshua!)

The provider as constructed implements a readonly OLE provider that supports all four SPARQL query types and interfaces to SemWeb through COM-Callable Wrapper. It is not extensively tested yet but seems to work with most of the queries I have now put through it, and of course being built on SemWeb it is able to read both local and remote SPARQL sources.

Moral of the story: populate Excel tables with SPARQL queries.

More here.

AI3:::Adaptive Information (Mike Bergman)Great Series on REST and Resource-Oriented Architecture

Whether You Call it ROA or WOA, It Simply Works

Brian Sletten, newly with Riot Games, has just completed his four-part REST for Java developers series on JavaWorld.com.  The series, which began last October, was wrapped up this week.

The series is highly readable and a real keeper:

The printable versions are the easiest to read since you don’t have to get stuck in JavaWorld’s annoying document-splitting style.

Brian prefers to use Sam Ruby’s term of resource-oriented architecture (ROA), though I have preferred to use Nick Gall’s Web-oriented architecture (WOA) nomenclature. In any case, the series is highly informative and clearly written and is sufficiently general in Parts 1 and 4 and most of Part 3 to be of use to non-Java developers.

What is quite interesting is that Brian also makes the connection between a REST Web service style and linked data in Part 4, as I suggested a few months back. (In fact, for those familiar with REST, I recommend you start with Part 4.) Again, I think RESTful Web services in combination with RDF and linked data points to the winning and performant architecture of the foreseeable future.

Thanks for a great series, Brian!

DBpedia BlogDBpedia at the MediaWiki Developer Meet-Up (April 3.-5. in Berlin)

Chris Bizer and Christian Becker will be at the MediaWiki Developer Meet-Up  in Berlin this week-end.

So if you are also there and you are interested in DBpedia, just grap us.

If we manage to get hold of the beamer and people are interested, we might also present the following slides

http://www4.wiwiss.fu-berlin.de/bizer/pub/WikiMediaDevMeeting-DBpedia-Talk.pdf

See you at C-Base,

Chris

Orri ErlingWeb Scale and Fault Tolerance

One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related.

It has been said many times — when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers.

Approaches to Fault Tolerance

Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. Raids other than mirrored disk are not really good for databases because of write speed.

With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.

There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.

In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.

This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application's requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.

Basics of Partition Fail-Over

For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.

The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.

Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.

When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled — but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.

For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the spare does not come back up in time, the system will automatically re-enable writes on the spare but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.

If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.

As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.

Levels of Tolerance and Performance

The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby.

Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.

The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.

Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory.

Background Bulk Processing

When loading data, the system is online in principle, but query response can be quite bad. A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server's partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.

But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.

Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.

This technique applies to all data intensive background tasks — calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.

Configurations of Redundancy

Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the two first failures are guaranteed not to interrupt operations; for the third, there is a 1/10 chance that it will.

We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers' RAM when in normal operation.

There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon's Dynamo does something similar at the database level. The analogies are not exact, though.

If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, there is a probability that the partitions formerly hosted by the failed server have their second copies randomly spread over the remaining servers. This scheme equalizes load better but is less resilient.

Maintenance and Continuity

Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.

Present Status

The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.

Footnotes