image_pdfimage_print

ElasticSearch is annoyingly complicated at times. You can run a search and it runs the wrong results and you are not made aware of that.

This can happen when, for example, you have a nested JSON document, i.e., one JSON document inside another. This is because Lucene (i.e., ElasticSearch) query has no understanding of object hierarchy in a JSON document.

The Problem with Searching for nested JSON objects

To illustrate the problem and the solution, download this program massAdd.py and change the URL to match your ElasticSearch environment. Then run it.

Then look at loaded data. You can see from the brackets that classes is a JSON array. But the index, as we will see, does not reflect that.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search?pretty}

"_index" : "universities",
        "_type" : "universities",
        "_id" : "428b71b0b935eaf58bdf874c819f263a919da5f3",
        "_score" : 0.0,
        "_source" : {
          "firstName" : "Stephen",
          "classes" : [
            {
              "name" : "French",
              "grades" : 2
            },
            {
              "name" : "French",
              "grades" : 3
            },
            {
              "name" : "French",
              "grades" : 2
            },
            {
              "name" : "physics",
              "grades" : 6
            },
            {
              "name" : "physics",
              "grades" : 5
            }
          ],
          "lastName" : "Shakespeare",
          "school" : "Arizona State University"
        }
      }
    ]
  }

Notice the index mapping does not show the JSON array. This is not because there are no brackets []. Rather the nested word is missing.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities?pretty

...

"classes" : {
            "properties" : {
              "grades" : {
                "type" : "long"
              },
              "name" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
...

So run a search using what you would think would the logical way to find students taking physics who got a grade of 1:

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search/?pretty=true -d '{
    "query": {
                "bool" : {
                    "must" : [
                    { "match" : {"classes.name" : "physics"} },
                    { "match" : {"classes.grades" :  1} }
                    ]
            }
        }
}'

Produces this incorrect result. This is because Lucene flattens the whole document and finds both a classes.grades 1 and classes.name physics.

{
        "_index" : "universities",
        "_type" : "universities",
        "_id" : "5846c01f90d80448ab087bf4f476cd3f8ab6f683",
        "_score" : 1.621972,
        "_source" : {
          "classes" : [
            {
              "grades" : 2,
              "name" : "physics"
            },
            {
              "grades" : 2,
              "name" : "French"
            },
            {
              "grades" : 3,
              "name" : "physics"
            },
            {
              "grades" : 1,
              "name" : "math"
            }
          ],
          "firstName" : "Stephen",
          "lastName" : "Rowe",
          "school" : "University of South Carolina - Columbia"
        }
      }

Making the Index Explicit

You can create an index in ElasticSearch just by loading data. That creates it on-the-fly. But you cannot control the way the index is created if you do that. So let’s make the index creation explicit and make clear that classes is a JSON array.

First delete the index, which will also delete all the data.

curl -XDELETE http://parisx:9200/universities

Then create a new index and make it explicit that classes is an object inside of the universities object by using “type” : “nested”. Note: that we are not filling in university data, just students. So ignore those fields. (We will use university data in another post and used it in the previous post).

curl -XPUT --header 'Content-Type: application/json' http://parisx:9200/universities -d '{
    "mappings" : {
      "universities" : {
        "properties" : {
          "Address" : { "type" : "text"},
          "AdminEmail" : { "type" : "text"},
          "AdminName" : { "type" : "text" },
          "AdminPhone" : { "type" : "text" },
          "DapipId" : { "type" : "text" },
          "Fax" : { "type" : "text" },
          "GeneralPhone" : { "type" : "text" },
          "LocationName" : { "type" : "text" },
          "LocationType" : { "type" : "text" },
          "OpeId" : { "type" : "text" },
          "ParentDapipId" : { "type" : "text" },
          "ParentName" :  { "type" : "text" },
          "UpdateDate" : { "type" : "text" },
          "classes" : {
                "type" : "nested",
                "properties" : {
                   "name" : { "type" : "text"},
                   "grades" : { "type" : "integer" }
              }
            },
        "firstName" : { "type" : "text" },
          "lastName" : { "type" : "text" },
          "school" : { "type" : "text" }
            }
          }
        }
}'

Then load the data again by running massAdd.py again:

Now add the terms nested and path to the query to find students who got a grade of 1 in physics. The path is the parent object in the JSON, which in this case is classes. And use classes.name to search by class name:

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search/?pretty=true -d '{
    "query": {
        "nested" : {
            "path" : "classes",
            "score_mode" : "avg",
            "query" : {
                "bool" : {
                "must" : [
                    { "match" : {"classes.name" : "physics"} },
                    { "match" : {"classes.grades" : 1} }
                    ]
                }
            }
        }
    }
}'

results in the correct result.

{
        "_index" : "universities",
        "_type" : "universities",
        "_id" : "20ce682e16d0b88487addc9ceba2a695d57692aa",
        "_score" : 2.188588,
        "_source" : {
          "lastName" : "Shakespeare",
          "firstName" : "Julie",
          "classes" : [
            {
              "grades" : 1,
              "name" : "physics"
            },
            {
              "grades" : 6,
              "name" : "math"
            }
          ],
          "school" : "Arizona State University"
        }
      }
    ]

Automate big data workflows to simplify and accelerate your big data lifecycle

In this e-book, you’ll learn how you can automate your entire big data lifecycle from end to end—and cloud to cloud—to deliver insights more quickly, easily, and reliably.
Read the e-book ›
Last updated: 11/16/2018

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

About the author

Walker Rowe

Walker Rowe

Walker Rowe is a freelance tech writer and programmer. He specializes in big data, analytics, and programming languages. Find him on LinkedIn or Upwork.

Run and Reinvent Your Business with BMC

BMC has unmatched experience in IT management, supporting 92 of the Forbes Global 100, and earning recognition as an ITSM Gartner Magic Quadrant Leader for six years running. Our solutions offer speed, agility, and efficiency to tackle business challenges in the areas of service management, automation, operations, and the mainframe. Learn more about BMC ›