Create readable and maintainable ingest pipelines

Applies to: Elastic Stack and Serverless

There are many ways to achieve similar results when creating ingest pipelines, which can make maintenance and readability difficult. This guide outlines patterns you can follow to make the maintenance and readability of ingest pipelines easier without sacrificing functionality.

Note

This guide does not provide guidance on optimizing for ingest pipeline performance.

When creating ingest pipelines, there are a few options for accessing fields in conditional statements and scripts. All formats can be used to reference fields, so choose the one that makes your pipeline easier to read and maintain.

| Notation | Example | Notes |
|---|---|---|
| Dot notation | ctx.event.action | Supported in conditionals and Painless scripts. |
| Square bracket notation | ctx['event']['action'] | Supported in conditionals and Painless scripts. |
| Mixed dot and bracket notation | ctx.event['action'] | Supported in conditionals and Painless scripts. |
| Field API (planned for the Elastic Stack) | field('event.action', '') or $('event.action','') | Supported in conditionals and Painless scripts. |
| Field API | field('event.action', '') or $('event.action','') | Currently supported only in Painless scripts. |

Below are some general guidelines for choosing the right option in a situation.

Field API

Applies to: Elastic Stack (planned)

The field API can be used in conditionals (the if statement of your processor) in addition to being used in the script processor itself.

Note

This is the preferred way to access fields.

Benefits

  • Clean and easy to read.
  • Handles null values automatically.
  • Adds support for additional functions like isEmpty() to ease comparisons.
  • Handles dots as part of a field name.
  • Handles dots as path separators when walking nested objects.
  • Handles special characters.

Limitations

  • Not available in all versions for conditionals.
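
For example, a minimal sketch of the field API inside a script processor; the event.action and event.category fields and the category value are illustrative:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "event": {
          "action": "logged-in"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            if (field('event.action').isEmpty() == false) {
              field('event.category').set('authentication');
            }
          """
        }
      }
    ]
  }
}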

Dot notation

Benefits

  • Concise and easy to read.
  • Supports null safe access with the ?. operator.

Limitations

  • Does not support field names that contain a . or any special characters such as @. Use bracket notation instead.

Bracket notation

Benefits

  • Supports special characters such as @ in the field name. For example, if there's a field named has@!%&chars, you would use ctx['has@!%&chars'].
  • Supports field names that contain a dot (.). For example, if there's a field named foo.bar, using ctx.foo.bar tries to access the field bar inside the object foo inside ctx, while ctx['foo.bar'] accesses the field directly.

Limitations

  • Slightly more verbose than dot notation.
  • No support for the null safe operator (?.). Use dot notation instead.

Mixed notation

Benefits

  • You can mix dot notation and bracket notation to take advantage of the benefits of both formats. For example, you could use ctx.my.nested.object['has@!%&chars']. This lets you use the ? operator on the dot notation parts while still accessing a field whose name contains special characters: ctx.my?.nested?.object['has@!%&chars'].

Limitations

  • Slightly more difficult to read.
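
For example, a minimal sketch that reads such a field safely in a processor's if condition; the has@!%&chars field name comes from the examples above, and the found target field is illustrative:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "my": {
          "nested": {
            "object": {
              "has@!%&chars": "value"
            }
          }
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "found",
          "value": true,
          "if": "ctx.my?.nested?.object != null && ctx.my.nested.object['has@!%&chars'] != null"
        }
      }
    ]
  }
}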

Conditionals

Use conditionals (if statements) to ensure that an ingest pipeline processor is only applied when specific conditions are met.

Anticipate potential problems with the data, and use the null safe operator (?.) to prevent data from being processed incorrectly.

Tip

It is not necessary to use a null safe operator for first level objects (for example, use ctx.openshift instead of ctx?.openshift). ctx will only ever be null if the entire _source is empty.

For example, if you only want data that has a valid string in a ctx.openshift.origin.threadId field:

Avoid:

ctx.openshift.origin != null
&& ctx.openshift.origin.threadId != null

  1. It's unnecessary to check both openshift.origin and openshift.origin.threadId.
  2. This will fail if openshift is not properly set because it assumes that ctx.openshift and ctx.openshift.origin both exist.

Prefer:

ctx.openshift?.origin?.threadId instanceof String

  1. Only if there's a ctx.openshift and a ctx.openshift.origin will it check for a ctx.openshift.origin.threadId and make sure it is a string.

The null safe operator returns null if any part of the path is null, and a type check such as instanceof evaluates to false for null values, so there is no reason to check whether a value is non-null before checking its type.

For example, if you only want data when the value of the ctx.openshift.eventPayload field is a string:

Avoid:

ctx?.openshift?.eventPayload != null && ctx.openshift.eventPayload instanceof String

Prefer:

ctx.openshift?.eventPayload instanceof String

When using the boolean OR operator (||), you need to use the null safe operator for both conditions being checked.

For example, if you want to include data when the value of the ctx.event.type field is either null or '0':

Avoid:

ctx.event.type == null || ctx.event.type == '0'

  1. This will fail if ctx.event is not properly set because it assumes that ctx.event exists. If it fails on the first condition, it won't even try the second condition.

Prefer:

ctx.event?.type == null || ctx.event?.type == '0'

  1. Both conditions can now be evaluated safely.

It is often unnecessary to use the null safe operator (?.) multiple times when you have already traversed the object path.

For example, if you're checking the value of two different child properties of ctx.arbor.ddos:

Avoid:

ctx.arbor?.ddos?.subsystem == 'CLI' && ctx.arbor?.ddos?.command_line != null

Prefer:

ctx.arbor?.ddos?.subsystem == 'CLI' && ctx.arbor.ddos.command_line != null

  1. Since the if condition is evaluated left to right, once ctx.arbor?.ddos?.subsystem == 'CLI' passes, you know ctx.arbor.ddos exists, so you can safely omit the ? in the second condition.

When checking if a field is not empty, avoid redundant null safe operators and use clear, concise conditions.

Avoid:

ctx?.user?.geo?.region != null && ctx?.user?.geo?.region != ''

Once you've checked ctx.user?.geo?.region != null, you can safely access ctx.user.geo.region in the next condition.

Prefer:

ctx.user?.geo?.region != null && ctx.user.geo.region != ''

To check if a string field is not empty, use the isEmpty() method in your condition. For example:

ctx.user?.geo?.region instanceof String && ctx.user.geo.region.isEmpty() == false
  1. This ensures the field exists, is a string, and is not empty.
Tip

For such checks you can also omit the instanceof String and use an Elvis operator, such as if: !(ctx.user?.geo?.region?.isEmpty() ?: true). This will only work when region is a String. If it is a double, an object, or any other type that does not have an isEmpty() method, the script will fail with a Java "function not found" error.
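
Putting it together, a minimal sketch of the not-empty check driving a processor; the user.geo.has_region target field is illustrative:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "user": {
          "geo": {
            "region": "Vienna"
          }
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "user.geo.has_region",
          "value": true,
          "if": "ctx.user?.geo?.region instanceof String && ctx.user.geo.region.isEmpty() == false"
        }
      }
    ]
  }
}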

When using the boolean OR operator (||), if conditions can become unnecessarily complex and difficult to maintain, especially when chaining many OR checks. Instead, consider using array-based checks like .contains() to simplify your logic and improve readability.

"if": "ctx?.kubernetes?.container?.name == 'admin' || ctx?.kubernetes?.container?.name == 'def'
|| ctx?.kubernetes?.container?.name == 'demo' || ctx?.kubernetes?.container?.name == 'acme'
|| ctx?.kubernetes?.container?.name == 'wonderful'
["admin","def","demo","acme","wonderful"].contains(ctx.kubernetes?.container?.name)
Tip

This example only checks for exact matches. Do not use this approach if you need to check for partial matches.
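
A minimal sketch of the contains() condition inside a full processor; the event.kind target field and its value are illustrative:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "kubernetes": {
          "container": {
            "name": "demo"
          }
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "event.kind",
          "value": "monitored",
          "if": "['admin','def','demo','acme','wonderful'].contains(ctx.kubernetes?.container?.name)"
        }
      }
    ]
  }
}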

Processors

When working with data sizes, store all values as bytes (using a long type) in Elasticsearch. This ensures consistency and allows you to leverage advanced formatting in Kibana Data Views to display human-readable sizes.

Avoid chaining several gsub processors to strip units and manually convert values. This approach is error-prone, hard to maintain, and can easily miss edge cases.

{
  "gsub": {
    "field": "document.size",
    "pattern": "M",
    "replacement": "",
    "ignore_missing": true,
    "if": "ctx?.document?.size != null && ctx.document.size.endsWith(\"M\")"
  }
},
{
  "gsub": {
    "field": "document.size",
    "pattern": "(\\d+)\\.(\\d+)G",
    "replacement": "$1$200",
    "ignore_missing": true,
    "if": "ctx?.uws?.size != null && ctx.document.size.endsWith(\"G\")"
  }
},
{
  "gsub": {
    "field": "document.size",
    "pattern": "G",
    "replacement": "000",
    "ignore_missing": true,
    "if": "ctx?.uws?.size != null && ctx.document.size.endsWith(\"G\")"
  }
}

The bytes processor automatically parses and converts strings like "100M" or "2.5GB" into their byte values. This is more reliable, easier to maintain, and supports a wide range of units.

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "document": {
          "size": "100M"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "bytes": {
          "field": "document.size"
        }
      }
    ]
  }
}
Tip

After storing values as bytes, you can use Kibana's field formatting to display them in a human-friendly format (KB, MB, GB, etc.) without changing the underlying data.

The rename processor renames a field. There are two useful flags:

  • ignore_missing: Useful when you are not sure that the field you want to rename exists.
  • ignore_failure: Ignores any failure encountered. For example, the rename processor can only rename to a non-existing field. If you already have the field abc and you want to rename def to abc, the operation will fail. See the sketch after this list.
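
A minimal sketch of both flags, reusing the abc and def field names from above; because abc already exists the rename fails, and ignore_failure lets the pipeline continue:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "abc": "existing value",
        "def": "value to rename"
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "rename": {
          "field": "def",
          "target_field": "abc",
          "ignore_missing": true,
          "ignore_failure": true
        }
      }
    ]
  }
}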

Script processor

If no built-in processor can achieve your goal, you may need to use a script processor in your ingest pipeline. Be sure to write scripts that are clear, concise, and maintainable.

All of the field access methods discussed above are applicable within the script context. Null handling remains an important aspect when accessing fields.

Tip

The field API is the recommended way to add new fields.

For example, add a new system.cpu.total.norm.pct field based on the value of the cpu.usage field. The value of the existing cpu.usage field is a number on a scale of 0-100. The value of the new system.cpu.total.norm.pct field will be on a scale from 0-1.0 where 1 is the equivalent of 100 in the cpu.usage field.

Option 1: Field API (preferred)

Create a new system.cpu.total.norm.pct field and set the value to the value of the cpu.usage field divided by 100.0.

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "cpu": {
          "usage": 90
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            field('system.cpu.total.norm.pct').set($('cpu.usage',0.0)/100.0)
          """
        }
      }
    ]
  }
}
  1. The system.cpu.total.norm.pct field expects a value between 0 and 1, not 0 and 100, so divide the original value by 100 to get the correct scale.
  2. The field API is exposed as field(<field name>). The set(<value>) method is responsible for setting the value. Inside it, $(<field name>, <fallback>) reads the value of the existing field. Lastly, we divide by 100.0. The .0 is important; otherwise Painless performs integer division and returns 0 instead of 0.9.

Option 2: Without the field API

Without the field API, there is much more code involved to ensure that you can walk the full path of system.cpu.total.norm.pct.

{
  "script": {
    "source": """
      if (ctx.system == null) {
        ctx.system = new HashMap();
      }
      if (ctx.system.cpu == null) {
        ctx.system.cpu = [:];
      }
      if (ctx.system.cpu.total == null) {
        ctx.system.cpu.total = [:];
      }
      if (ctx.system.cpu.total.norm == null) {
        ctx.system.cpu.total.norm = [:];
      }
      ctx.system.cpu.total.norm.pct = (ctx.cpu?.usage ?: 0.0)/100.0;
    """
  }
}
  1. Check whether the objects are null or not and then create them.
  2. Create a new HashMap to store all the objects in it.
  3. Instead of writing new HashMap(), use the shortcut [:].
  4. Perform the same calculation as above, this time reading the value through ctx with an Elvis fallback, and set it.
Another common task is converting a duration string such as 00:00:06.448 into a numeric event.duration field. Avoid parsing it manually:

{
  "script": {
    "source": """
       String timeString = ctx['temp']['duration'];
       ctx['event']['duration'] = Integer.parseInt(timeString.substring(0,2))*360000 + Integer.parseInt(timeString.substring(3,5))*60000 + Integer.parseInt(timeString.substring(6,8))*1000 + Integer.parseInt(timeString.substring(9,12));
     """,
    "if": "ctx.temp != null && ctx.temp.duration != null"
  }
}
  1. Accesses fields using square bracket notation instead of the more readable dot notation.
  2. ctx['event']['duration']: Assigns to a child property without ensuring the parent property exists.
  3. timeString.substring(0,2): Parses substrings manually instead of leveraging date/time parsing utilities. Note that the hours multiplier is even wrong here (it should be 3600000, not 360000), which shows how error-prone this approach is.
  4. event.duration should be in nanoseconds, as expected by ECS, instead of milliseconds.
  5. Uses redundant null checks instead of the null safe operator (?.).

This approach is hard to read, error-prone, and doesn't take advantage of the powerful date/time features available in Painless.

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "temp": {
          "duration": "00:00:06.448"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
             if (ctx.event == null) {
               ctx.event = [:];
             }
             DateTimeFormatter formatter = DateTimeFormatter.ofPattern("HH:mm:ss.SSS");
             LocalTime time = LocalTime.parse(ctx.temp.duration, formatter);
             ctx.event.duration = time.toNanoOfDay();
           """,
          "if": "ctx.temp?.duration != null"
        }
      }
    ]
  }
}
  1. Ensure the event object exists before assigning to it.
  2. Use DateTimeFormatter and LocalTime to parse the duration string.
  3. Store the duration in nanoseconds, as expected by ECS.
  4. Use the null safe operator to check for field existence.

When reconstructing or normalizing IP addresses in ingest pipelines, avoid unnecessary complexity and redundant operations.

{
  "script": {
    "source": """
        String[] ipSplit = ctx['destination']['ip'].splitOnToken('.');
        String ip = Integer.parseInt(ipSplit[0]) + '.' + Integer.parseInt(ipSplit[1]) + '.' + Integer.parseInt(ipSplit[2]) + '.' + Integer.parseInt(ipSplit[3]);
        ctx['destination']['ip'] = ip;
    """,
    "if": "(ctx['destination'] != null) && (ctx['destination']['ip'] != null)"
  }
}
  1. Uses square bracket notation for field access instead of dot notation.
  2. Unnecessarily parses each address segment into an Integer.
  3. Allocates an extra variable for the IP string instead of setting the field directly.
  4. Does not check whether destination is actually an object.
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "destination": {
          "ip": "192.168.0.1.3.4.5.6.4"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            def temp = ctx.destination.ip.splitOnToken('.');
            ctx.destination.ip = temp[0] + "." + temp[1] + "." + temp[2] + "." + temp[3];
          """,
          "if": "ctx.destination?.ip != null"
        }
      }
    ]
  }
}
  1. Uses dot notation for field access.
  2. Avoids unnecessary casting and extra variables.
  3. Uses the null safe operator (?.) to check for field existence.

This approach is more maintainable, avoids unnecessary operations, and ensures your pipeline scripts are robust and easy to understand.

It's a common mistake to explicitly remove the @timestamp field before running a date processor, as shown below:

{
  "set": {
    "field": "openshift.timestamp",
    "value": "{{openshift.date}} {{openshift.time}}",
    "if": "ctx?.openshift?.date != null && ctx?.openshift?.time != null && ctx?.openshift?.timestamp == null"
  }
},
{
  "remove": {
    "field": "@timestamp",
    "ignore_missing": true,
    "if": "ctx?.openshift?.timestamp != null || ctx?.openshift?.timestamp1 != null"
  }
},
{
  "date": {
    "field": "openshift.timestamp",
    "formats": [
      "yyyy-MM-dd HH:mm:ss",
      "ISO8601"
    ],
    "timezone": "Europe/Vienna",
    "if": "ctx?.openshift?.timestamp != null"
  }
}

This removal step is unnecessary and can even be counterproductive. The date processor will automatically overwrite the value in @timestamp with the parsed date from your source field, unless you explicitly set a different target_field. There's no need to remove @timestamp beforehand—the processor will handle updating it for you.

Removing @timestamp can also introduce subtle bugs, especially if the date processor is skipped or fails, leaving your document without a timestamp.
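
A minimal sketch of the same parsing without the remove step; the sample timestamps are illustrative. The date processor parses openshift.timestamp and overwrites @timestamp directly:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "@timestamp": "2025-01-01T00:00:00.000Z",
        "openshift": {
          "timestamp": "2025-06-01 12:34:56"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "date": {
          "field": "openshift.timestamp",
          "formats": [
            "yyyy-MM-dd HH:mm:ss",
            "ISO8601"
          ],
          "timezone": "Europe/Vienna",
          "if": "ctx.openshift?.timestamp != null"
        }
      }
    ]
  }
}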

Mustache templates

Mustache is a simple templating language used in Elasticsearch ingest pipelines to dynamically insert field values into strings. You can use double curly braces ({{ }}) to reference fields from your document, enabling flexible and dynamic value assignment in processors like set, rename, and others.

For example, {{host.hostname}} will be replaced with the value of the host.hostname field at runtime. Mustache supports accessing nested fields, arrays, and even provides some basic logic for conditional rendering.
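
For example, a minimal sketch that copies host.hostname into another field; the host.name target field is illustrative:

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "host": {
          "hostname": "abc"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "host.name",
          "value": "{{host.hostname}}"
        }
      }
    ]
  }
}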

When you need to reference a specific element in an array using Mustache templates, you can use dot notation with the zero-based index. For example, to access the first value in the tags array, use .0 after the array field name.

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "host": {
          "hostname": "abc"
        },
        "tags": [
          "cool-host"
        ]
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "host.alias",
          "value": "{{tags.0}}"
        }
      }
    ]
  }
}

In this example, {{tags.0}} retrieves the first element of the tags array ("cool-host") and assigns it to the host.alias field. This approach is necessary when you want to extract a specific value from an array for use elsewhere in your document. Using the correct index ensures you get the intended value, and this pattern works for any array field in your source data.

Whenever you need to store the original _source in the event.original field, use the Mustache function {{#toJson}}<field>{{/toJson}}.

POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "foo": "bar",
        "key": 123
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "event.original",
          "value": "{{#toJson}}_source{{/toJson}}"
        }
      }
    ]
  }
}