Load RDF data in Solr

The Solr built-in UpdateRequestHandler supports several input data formats. It delegates the actual data loading to a specific ContentStreamLoader, chosen according to the content type of the incoming request (i.e. the Content-Type header of the HTTP request). Currently, these are the content types declared in the UpdateRequestHandler class:

  • application/xml or text/xml
  • application/json or text/json
  • application/csv or text/csv
  • application/javabin
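
To make the delegation idea concrete, here is a minimal, library-free sketch of a content-type registry. It is only a simplified model, not Solr's actual code: the class ContentTypeDispatchSketch and the Loader interface are hypothetical stand-ins for UpdateRequestHandler and ContentStreamLoader, and stripping the "; charset=..." parameter before the lookup is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Simplified model of content-type based dispatch; NOT Solr's actual code. */
final class ContentTypeDispatchSketch {

  /** Hypothetical stand-in for Solr's ContentStreamLoader. */
  interface Loader {
    String name();
  }

  // Maps a bare MIME type (lower case, no parameters) to the loader in charge of it.
  private final Map<String, Loader> registry = new HashMap<>();

  void register(String contentType, Loader loader) {
    registry.put(contentType.toLowerCase(Locale.ROOT), loader);
  }

  /** Drops parameters such as "; charset=UTF-8" (an assumption of this sketch) and looks up the loader. */
  Loader resolve(String contentTypeHeader) {
    int separator = contentTypeHeader.indexOf(';');
    String mimeType = separator < 0
        ? contentTypeHeader
        : contentTypeHeader.substring(0, separator);
    Loader loader = registry.get(mimeType.trim().toLowerCase(Locale.ROOT));
    if (loader == null) {
      throw new IllegalArgumentException("Unsupported content type: " + contentTypeHeader);
    }
    return loader;
  }

  public static void main(String[] args) {
    ContentTypeDispatchSketch dispatcher = new ContentTypeDispatchSketch();
    Loader json = () -> "json";
    Loader xml = () -> "xml";
    // Several content types can share the same loader instance.
    dispatcher.register("application/json", json);
    dispatcher.register("text/json", json);
    dispatcher.register("application/xml", xml);
    dispatcher.register("text/xml", xml);
    System.out.println(dispatcher.resolve("text/json; charset=UTF-8").name());
  }
}
```

The point of the model is that supporting a new format only requires registering one more entry; the dispatch logic itself does not change.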

So, a client has several options for sending its data to Solr; all it needs to do is prepare the data in a specific format and call the UpdateRequestHandler (usually located at the /update endpoint), specifying the corresponding content type:

curl http://localhost:8080/solr/update \
     -H "Content-Type: text/json" \
     --data-binary @/home/agazzarini/data.json
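
The same kind of request can also be prepared from Java. The following is a minimal sketch using the JDK's java.net.http API (Java 11 or later); the class name SolrUpdateRequestSketch, the helper buildUpdateRequest and the sample JSON body are hypothetical and only serve the illustration. The sketch builds the request without sending it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Builds (but does not send) an update request equivalent to the curl command. */
final class SolrUpdateRequestSketch {

  /** Hypothetical helper: POSTs the given body with the given Content-Type header. */
  static HttpRequest buildUpdateRequest(String updateUrl, String contentType, String body) {
    return HttpRequest.newBuilder(URI.create(updateUrl))
        // The Content-Type header is what the handler uses to pick a loader.
        .header("Content-Type", contentType)
        .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
        .build();
  }

  public static void main(String[] args) {
    // The URL comes from the curl example; the JSON document is made up.
    HttpRequest request = buildUpdateRequest(
        "http://localhost:8080/solr/update",
        "text/json",
        "[{\"id\": \"1\"}]");
    System.out.println(request.method() + " " + request.uri());
    System.out.println(request.headers().firstValue("Content-Type").orElse("(none)"));
  }
}
```

To actually send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running Solr instance.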

The UpdateRequestHandler can be extended, customized, and replaced; so we can write our own UpdateRequestHandler that accepts a custom format, adding a new content type or overriding the default set of supported content types.

In this brief post, I will describe how to use Jena to load RDF data into Solr, in any format supported by the Jena I/O API.
This is a quick and easy task, mainly because:

  • the UpdateRequestHandler already has the logic to index data
  • the UpdateRequestHandler can be easily extended
  • Jena already provides all the parsers we need

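So, doing that is just a matter of subclassing UpdateRequestHandler in order to override the content type registry:

import java.util.HashMap;
import java.util.Map;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFLanguages;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.UpdateRequestHandler;
import org.apache.solr.handler.loader.ContentStreamLoader;

public class RdfDataUpdateRequestHandler extends UpdateRequestHandler {
  @Override
  protected Map<String, ContentStreamLoader> createDefaultLoaders(NamedList parameters) {
    final Map<String, ContentStreamLoader> registry =
                    new HashMap<String, ContentStreamLoader>();
    final ContentStreamLoader loader = new RdfDataLoader();
    // The same loader instance serves every RDF content type known to Jena
    for (final Lang language : RDFLanguages.getRegisteredLanguages()) {
      registry.put(language.getContentType().toHeaderString(), loader);
    }
    return registry;
  }
}

With this registry in place, the handler accepts content types such as:
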
  • text/turtle
  • application/turtle
  • application/x-turtle
  • application/rdf+xml
  • application/rdf+json
  • application/ld+json
  • text/plain (for n-triple)
  • application/n-triples
  • (others)

Our RdfDataLoader will be in charge of parsing and loading the data. Note that the above list is not exhaustive: there are many other content types registered in Jena (see the RDFLanguages class).

So, what about the format of the data? Of course, it still depends on the content type of your RDF data and, most importantly, it has nothing to do with the data we usually send to Solr (i.e. SolrInputDocuments serialized in some format).

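The RdfDataLoader is a subclass of ContentStreamLoader:

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFLanguages;
import org.apache.jena.riot.lang.PipedRDFIterator;
import org.apache.jena.riot.lang.PipedRDFStream;
import org.apache.jena.riot.lang.PipedTriplesStream;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.handler.loader.ContentStreamLoader;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
// Triple lives in com.hp.hpl.jena.graph (Jena 2.x) or org.apache.jena.graph (Jena 3+)

public class RdfDataLoader extends ContentStreamLoader {

  @Override
  public void load(
            final SolrQueryRequest request,
            final SolrQueryResponse response,
            final ContentStream stream,
            final UpdateRequestProcessor processor) throws Exception {

    final PipedRDFIterator<Triple> iterator =
            new PipedRDFIterator<Triple>();
    final PipedRDFStream<Triple> inputStream =
            new PipedTriplesStream(iterator);

    // We use an executor for running the parser in a separate thread:
    // the parser pushes triples into the pipe while this thread consumes them
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    final Runnable parser = new Runnable() {
      public void run() {
        try {
          RDFDataMgr.parse(
             inputStream,
             stream.getStream(),
             RDFLanguages.contentTypeToLang(stream.getContentType()));
        } catch (final IOException exception) {
          ...
        }
      }
    };

    executor.submit(parser);
    // No further tasks will be submitted; the worker thread ends once parsing completes
    executor.shutdown();

    while (iterator.hasNext()) {
      final Triple triple = iterator.next();
      // create and populate the Solr input document
      final SolrInputDocument document = new SolrInputDocument();
      ...
      // create the update command
      final AddUpdateCommand command = new AddUpdateCommand(request);
      command.solrDoc = document;

      processor.processAdd(command);
    }
  }
}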
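
The loader relies on a producer/consumer arrangement: the parser produces triples in one thread while the calling thread consumes them and builds documents. Below is a library-free sketch of that pattern built on a BlockingQueue. It is not how Jena implements its piped iterator; the class PipedParsingSketch, the end-of-stream marker and the "doc(...)" strings are all hypothetical and stand in for the parser, the pipe and the Solr documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Producer/consumer sketch; a simplified analogy, NOT Jena's implementation. */
final class PipedParsingSketch {

  // Unique marker object, compared by identity, that signals the end of the input.
  private static final String END_OF_STREAM = new String("<end>");

  /** Feeds the statements through a bounded queue and turns each into a fake "document". */
  static List<String> consume(final List<String> statements) {
    final BlockingQueue<String> pipe = new ArrayBlockingQueue<>(16);

    // Producer: plays the role of the RDF parser running in its own thread.
    final Runnable producer = () -> {
      try {
        for (String statement : statements) {
          pipe.put(statement); // blocks when the consumer falls behind
        }
        pipe.put(END_OF_STREAM);
      } catch (InterruptedException exception) {
        Thread.currentThread().interrupt();
      }
    };

    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.submit(producer);
    executor.shutdown(); // no more tasks; the thread ends when the producer returns

    // Consumer: plays the role of the loop that creates the input documents.
    List<String> documents = new ArrayList<>();
    try {
      for (String item = pipe.take(); item != END_OF_STREAM; item = pipe.take()) {
        documents.add("doc(" + item + ")");
      }
    } catch (InterruptedException exception) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while consuming", exception);
    }
    return documents;
  }

  public static void main(String[] args) {
    System.out.println(consume(Arrays.asList("s1 p1 o1", "s2 p2 o2")));
  }
}
```

The bounded queue is the design choice that matters here: a large RDF file is never held in memory as a whole, because the producer blocks as soon as the consumer lags behind.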

That’s all! Once the request handler has been registered in Solr, we can send RDF data using a command like this:

curl http://localhost:8080/solr/store/update \
     -H "Content-Type: application/n-triples" \
     --data-binary @/home/agazzarini/triples_dogfood.nt

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java technology. He strongly loves coding and definitely likes to be considered a developer. Andrea has more than 15 years of experience in various software engineering areas, from telecommunications to banking. He has worked for several medium- and large-scale companies, such as IBM and Orga Systems. Andrea has several certifications in the Java programming language (programmer, developer, web component developer, business component developer, and JEE architect), BEA products (build and portal solutions), and Apache Solr (Lucid Apache Solr/Lucene Certified Developer).
