Missing vs. Empty Fields

The topic of empty-ish fields comes up often because:
- input fields weren't required and were skipped (e.g. optional input fields in signup forms)
- because of human error (e.g. data entry "interns" forgot to fill out a form field)
- and last but not least, because of purposefully set empty-ish values (
null
,[]
,{}
,""
)
In Elasticsearch, these values would indicate that a given field does exist:
- Empty strings, such as
""
or"-"
- Arrays containing
null
and another value, such as[null, "foo"]
- A custom
null-value
, defined in field mapping (discussed later on)
When a doc is inserted into ES, its fields are usually indexed. The indexed value of one of those fields may not exist due to a variety of reasons:
- The field is simply not present in the source JSON
- The field in the source JSON is null —
null
or[]
or[null, null, ...?]
or{}
- The field has
"index" : false
set in the mapping - The field value was malformed and the
ignore_malformed
parameter was defined in the index mapping
These statements might still be confusing so let's look at a concrete example.
Given the following documents that contain different data types, any of which can be emptyish or even absent (non-existent):
POST myindex/_doc/1
{
"id": 1
}
POST myindex/_doc/2
{
"id": 2,
"potentially_absent_field": null
}
POST myindex/_doc/3
{
"id": 3,
"potentially_absent_field": 123
}
I want to find documents that strictly lack the potentially_absent_field
, i.e. only doc#1.
At the first glance it looks like we could reverse the exists
query and thus target only those docs that do not contain the potentially_absent_field
:
POST myindex/_search
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "potentially_absent_field"
}
}
]
}
}
}
Somewhat unexpectedly, this returns both #1 and #2! The reason for that is that doc#2's potentially_absent_field
cannot be indexed or searched. It is treated as though it has no values and is, for all intents and purposes, considered a non-existent field — just like the one in doc#1.
Having said that, we essentially have two ways to solve this:
We need an explicit placeholder value that we can target in the query in order to filter out the docs whose potentially_absent_field
is present but null
.
The mapping parameter null_value
enables us to do just that. Its value can be arbitrary but it needs to be of the same data type as the field itself. Since we're dealing with an integer field, we'll pick -1
and update the mapping as follows:
PUT myindex/_mapping
{
"properties": {
"potentially_absent_field": {
"type": "long",
"null_value": -1
}
}
}
Keep in mind that this only influences how the field is indexed and not its original value (i.e. _source
that would be retrieved in a search request).
Since this is a "static" mapping update, the already indexed documents won't be affected. So in order for this change to propagate onto the existing docs, we'll need to reindex.
The following trick forces the index to pick up on the new mapping change without any _reindex
calls: