Apache Solr: how to shuffle results

You built a cool e-commerce portal on top of Apache Solr; customers send you their data and you index everything. Sometimes, at query time, the top results all belong to the same shop or the same brand, even if other shops or brands sell that type of product.

For instance, a search for “shirt”

  • returns 5438 results in 109 pages (60 results / page)
  • the first 118 results (the first two pages) belong to the “C0C0BABE” brand
  • starting from the 119th result, other brands appear

This could be a problem, because sooner or later the other brands will complain: the impression rate of the third page is definitely lower than that of the first page, and it looks as if your website sells only items from “C0C0BABE”.

Results need to be sorted by score; any other criterion would necessarily compromise the computed relevancy. What can we do?

A useful Solr feature in this context is Query Re-Ranking [1]. I know, it is not a new feature: it was introduced in Solr a long time ago, but I had never come across it before (“Mater artium necessitas”).

From the official Solr Reference Guide:

“Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance than just using the complex query B by itself.”

The component interface is very simple. You need to provide three parameters:

  • reRankQuery: this is the query that will be used in the re-ranking phase
  • reRankDocs: the number of top results to re-rank; Solr treats it as a minimum and may increase it during the re-ranking
  • reRankWeight: a multiplicative factor applied to the reRankQuery score of each document that matches the reRankQuery and belongs to the top reRankDocs set. For each of them, the weighted score is added to the original score of the document (i.e. the score resulting from the main query)
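To make the arithmetic concrete, here is a minimal Python sketch of the additive re-ranking described by the three parameters above. It is a simplified model, not Solr code: the `rerank` function, the document ids and all the scores are made up for illustration, and it assumes that documents outside the top reRankDocs keep their original order.

```python
def rerank(hits, rerank_scores, rerank_docs, rerank_weight):
    """Simplified model of additive re-ranking (not the Solr implementation).

    hits: list of (doc_id, original_score), already sorted by score descending.
    rerank_scores: dict doc_id -> score from the reRankQuery (absent = no match).
    """
    top, rest = hits[:rerank_docs], hits[rerank_docs:]
    # Only the top rerank_docs documents can collect the weighted bonus.
    boosted = [
        (doc_id, score + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, score in top
    ]
    # The sort is stable, so ties keep their original relative order.
    boosted.sort(key=lambda hit: hit[1], reverse=True)
    # Assumption: documents outside the re-ranked window are left untouched.
    return boosted + rest


# Made-up scores: "c" matches the reRankQuery and climbs to the top,
# "d" matches too but sits outside the re-ranked window.
hits = [("a", 3.0), ("b", 2.9), ("c", 2.8), ("d", 1.0)]
result = rerank(hits, {"c": 0.5, "d": 9.0}, rerank_docs=3, rerank_weight=1.2)
print([doc_id for doc_id, _ in result])
```

In this toy run, “c” gets 2.8 + 1.2 × 0.5 and overtakes “a” and “b”, while “d” stays where it was even though its reRankQuery score is much higher.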

It seems a perfect fit! But what about the reRankQuery? It should emulate random behaviour, like querying a field with unstructured content using random terms. And that is exactly my approach: once I had identified a candidate unstructured field in the schema (the product description, in this case), I created a copy of it in another searchable “shuffler” field, with minimal text analysis (standard tokenisation, lowercasing):

<field name="shuffler" type="unstemmed-text" indexed="true" .../>
<copyField src="prd_descr" dest="shuffler"/>

Then I configured the request handler with the re-rank parameters as follows:

<str name="rqq">
  {!lucene q.op=OR df=shuffler v=$rndq}
</str>
<str name="rq">
  {!rerank reRankQuery=$rqq reRankDocs=220 reRankWeight=1.2}
</str>

As you can see, I’ve used the plain Solr (lucene) query parser to query the “shuffler” field mentioned above. What about the $rndq parameter? That is the actual query, which should contain a (probably long) list of terms. I defined a default value like this:

<str name="rndq">
  (just top bottom button style fashion up down chic elegance ... )
</str>

What is the goal here? The default operator of the query parser has been set to OR, so the reRankQuery gives each of the first reRankDocs documents a chance to collect an additional “bonus” score if its shuffler field contains one or (better) more of the terms provided in the $rndq parameter.

The default value, of course, will always be the same, but a client can provide its own $rndq parameter with a different list of terms for each request.
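As a rough illustration of a client that sends a different term list on each request, here is a small Python sketch. The `VOCABULARY` list and the `shuffle_params` helper are hypothetical; the terms are simply borrowed from the default value above, and in practice you would pick terms that really occur in the shuffler field. It also assumes that rndq is declared among the handler defaults, so that a request parameter can override it.

```python
import random
from urllib.parse import urlencode

# Hypothetical vocabulary, borrowed from the default rndq value above.
VOCABULARY = ["just", "top", "bottom", "button", "style",
              "fashion", "up", "down", "chic", "elegance"]


def shuffle_params(user_query, how_many=5, rng=random):
    """Build request parameters with a per-request rndq override."""
    # sample() picks distinct terms, so no term is repeated in the query.
    terms = rng.sample(VOCABULARY, how_many)
    return {"q": user_query, "rndq": "(" + " ".join(terms) + ")"}


# A seeded generator is used here only to make the example repeatable.
params = shuffle_params("shirt", rng=random.Random(42))
print(urlencode(params))
```

The resulting query string can be appended to the URL of the request handler configured above; every call without a fixed seed produces a different rndq, and therefore a different “bonus” distribution among the top reRankDocs.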

As for the other parameters (reRankWeight and reRankDocs), those are the values that work for me; you should run some tests with your dataset and adjust them.
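One possible way to compare different values while testing is to measure how much of the first page is taken by a single brand. The sketch below is only an assumption about how such a check could look: `top_brand_share` is a hypothetical helper that works on the list of brand values extracted from a response, and the sample data is made up.

```python
from collections import Counter


def top_brand_share(brands, page_size=60):
    """Return (brand, share) for the most frequent brand in the first page."""
    first_page = brands[:page_size]
    if not first_page:
        return None, 0.0
    brand, count = Counter(first_page).most_common(1)[0]
    return brand, count / len(first_page)


# Made-up brand sequence, for illustration only.
sample = ["C0C0BABE"] * 45 + ["OtherBrand"] * 15
print(top_brand_share(sample))
```

A share close to 1.0 means the first page is still dominated by one brand; lower values suggest that the chosen reRankWeight and reRankDocs are mixing the results more.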



Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java technology. He strongly loves coding and definitely likes to be considered a developer. Andrea has more than 15 years of experience in various software engineering areas, from telecommunications to banking. He has worked for several medium- and large-scale companies, such as IBM and Orga Systems. Andrea has several certifications in the Java programming language (programmer, developer, web component developer, business component developer, and JEE architect), BEA products (build and portal solutions), and Apache Solr (Lucid Apache Solr/Lucene Certified Developer).
