Machine Learning & Big Data Blog

ElasticSearch Nested Queries: How to Search for Embedded Documents

Walker Rowe
by Walker Rowe
3 minute read

ElasticSearch is annoyingly complicated at times. You can run a search and it runs the wrong results and you are not made aware of that.

This can happen when, for example, you have a nested JSON document, i.e., one JSON document inside another. This is because Lucene (i.e., ElasticSearch) query has no understanding of object hierarchy in a JSON document.

The Problem with Searching for nested JSON objects

To illustrate the problem and the solution, download this program massAdd.py and change the URL to match your ElasticSearch environment. Then run it.

Then look at loaded data. You can see from the brackets that classes is a JSON array. But the index, as we will see, does not reflect that.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search?pretty}

"_index" : "universities",
        "_type" : "universities",
        "_id" : "428b71b0b935eaf58bdf874c819f263a919da5f3",
        "_score" : 0.0,
        "_source" : {
          "firstName" : "Stephen",
          "classes" : [
            {
              "name" : "French",
              "grades" : 2
            },
            {
              "name" : "French",
              "grades" : 3
            },
            {
              "name" : "French",
              "grades" : 2
            },
            {
              "name" : "physics",
              "grades" : 6
            },
            {
              "name" : "physics",
              "grades" : 5
            }
          ],
          "lastName" : "Shakespeare",
          "school" : "Arizona State University"
        }
      }
    ]
  }

Notice the index mapping does not show the JSON array. This is not because there are no brackets []. Rather the nested word is missing.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities?pretty

...

"classes" : {
            "properties" : {
              "grades" : {
                "type" : "long"
              },
              "name" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
...

So run a search using what you would think would the logical way to find students taking physics who got a grade of 1:

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search/?pretty=true -d '{
    "query": {
                "bool" : {
                    "must" : [
                    { "match" : {"classes.name" : "physics"} },
                    { "match" : {"classes.grades" :  1} }
                    ]
            }
        }
}'

Produces this incorrect result. This is because Lucene flattens the whole document and finds both a classes.grades 1 and classes.name physics.

{
        "_index" : "universities",
        "_type" : "universities",
        "_id" : "5846c01f90d80448ab087bf4f476cd3f8ab6f683",
        "_score" : 1.621972,
        "_source" : {
          "classes" : [
            {
              "grades" : 2,
              "name" : "physics"
            },
            {
              "grades" : 2,
              "name" : "French"
            },
            {
              "grades" : 3,
              "name" : "physics"
            },
            {
              "grades" : 1,
              "name" : "math"
            }
          ],
          "firstName" : "Stephen",
          "lastName" : "Rowe",
          "school" : "University of South Carolina - Columbia"
        }
      }

Making the Index Explicit

You can create an index in ElasticSearch just by loading data. That creates it on-the-fly. But you cannot control the way the index is created if you do that. So let’s make the index creation explicit and make clear that classes is a JSON array.

First delete the index, which will also delete all the data.

curl -XDELETE http://parisx:9200/universities

Then create a new index and make it explicit that classes is an object inside of the universities object by using “type” : “nested”. Note: that we are not filling in university data, just students. So ignore those fields. (We will use university data in another post and used it in the previous post).

curl -XPUT --header 'Content-Type: application/json' http://parisx:9200/universities -d '{
    "mappings" : {
      "universities" : {
        "properties" : {
          "Address" : { "type" : "text"},
          "AdminEmail" : { "type" : "text"},
          "AdminName" : { "type" : "text" },
          "AdminPhone" : { "type" : "text" },
          "DapipId" : { "type" : "text" },
          "Fax" : { "type" : "text" },
          "GeneralPhone" : { "type" : "text" },
          "LocationName" : { "type" : "text" },
          "LocationType" : { "type" : "text" },
          "OpeId" : { "type" : "text" },
          "ParentDapipId" : { "type" : "text" },
          "ParentName" :  { "type" : "text" },
          "UpdateDate" : { "type" : "text" },
          "classes" : {
                "type" : "nested",
                "properties" : {
                   "name" : { "type" : "text"},
                   "grades" : { "type" : "integer" }
              }
            },
        "firstName" : { "type" : "text" },
          "lastName" : { "type" : "text" },
          "school" : { "type" : "text" }
            }
          }
        }
}'

Then load the data again by running massAdd.py again:

Now add the terms nested and path to the query to find students who got a grade of 1 in physics. The path is the parent object in the JSON, which in this case is classes. And use classes.name to search by class name:

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search/?pretty=true -d '{
    "query": {
        "nested" : {
            "path" : "classes",
            "score_mode" : "avg",
            "query" : {
                "bool" : {
                "must" : [
                    { "match" : {"classes.name" : "physics"} },
                    { "match" : {"classes.grades" : 1} }
                    ]
                }
            }
        }
    }
}'

results in the correct result.

{
        "_index" : "universities",
        "_type" : "universities",
        "_id" : "20ce682e16d0b88487addc9ceba2a695d57692aa",
        "_score" : 2.188588,
        "_source" : {
          "lastName" : "Shakespeare",
          "firstName" : "Julie",
          "classes" : [
            {
              "grades" : 1,
              "name" : "physics"
            },
            {
              "grades" : 6,
              "name" : "math"
            }
          ],
          "school" : "Arizona State University"
        }
      }
    ]

Wikibon: Automate your Big Data pipeline

Learn how data management experts throughout the industry are transforming their Big Data infrastructure for maximum business impact.
Download Now ›

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

About the author

Walker Rowe

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Tunisia. He specializes in big data, analytics, and programming languages. Find him on LinkedIn or Upwork.