ElasticSearch Sample Data

By | July 25, 2016

This ElasticSearch sample data is meant for learning purposes only. It is randomly generated, but care has been taken to make it look like real-world data.
UPDATE: A follow-up to this post has been published, and I highly recommend that you have a look at it. It uses Elasticsearch 7.2.0 and, for a change, the data is real. It also has GeoPoints!

WHY

Many people who are on the path to learning ElasticSearch get stumped on this.

Yes, indeed: where is the large ElasticSearch sample data they can use to hone their ElasticSearch kung fu? Well, you are at the right place.

HOW

Here is the ElasticSearch sample data, in the form of two formatted JSON data files that I created for my own learning.

Employees100K
Employees50K

One has the records of 50,000 employees while the other has 100,000.
Feel free to use this ElasticSearch sample data. However, I assume no responsibility for any damage that might/can/will/should result from that. 🙂

For newbies, here are the steps to load the data into your ElasticSearch cluster:
1–Install curl if you do not have it. I am using Linux, which usually ships with curl.
2–Download and extract the data files.
3–Run these commands to load the data. The first command creates an index with the right mapping. The second one loads the data and might take some time.

curl -XPUT 'localhost:9200/companydatabase?pretty' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "employees": {
      "properties": {
        "FirstName": { "type": "text" },
        "LastName": { "type": "text" },
        "Designation": { "type": "text" },
        "Salary": { "type": "integer" },
        "DateOfJoining": { "type": "date", "format": "yyyy-MM-dd" },
        "Address": { "type": "text" },
        "Gender": { "type": "text" },
        "Age": { "type": "integer" },
        "MaritalStatus": { "type": "text" },
        "Interests": { "type": "text" }
      }
    }
  }
}'
curl -XPOST 'localhost:9200/companydatabase/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @Employees50K.json

4–Open the URL http://localhost:9200/companydatabase/_count?pretty=1 to check whether the data is there.

{
  "count" : 50000,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
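
If the count that comes back is not what you expected, it may help to count the documents in the data file itself before blaming the cluster. The following is only a minimal Python sketch. It assumes the file is in the usual bulk layout, where every document sits on its own line right after an "index" (or "create") action line. The function name count_bulk_documents and the second sample record are made up for illustration.

```python
import json


def count_bulk_documents(lines):
    """Count documents in bulk-format lines (one action line, then one source line)."""
    count = 0
    expect_source = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines, e.g. the trailing newline
        obj = json.loads(line)  # raises ValueError if a line is not valid JSON
        if expect_source:
            count += 1
            expect_source = False
        elif "index" in obj or "create" in obj:
            expect_source = True
        else:
            raise ValueError("expected an index/create action line, got: " + line[:60])
    if expect_source:
        raise ValueError("the last action line has no document after it")
    return count


# Tiny in-memory sample; the second employee is invented for this example.
sample = [
    '{"index":{"_index":"companydatabase","_type":"employees"}}',
    '{"FirstName":"JOYE","LastName":"WIATR","Designation":"CEO"}',
    '{"index":{"_index":"companydatabase","_type":"employees"}}',
    '{"FirstName":"SAMPLE","LastName":"PERSON","Designation":"Trainee"}',
]
print(count_bulk_documents(sample))

# For the real file, something like this should work:
# with open("Employees50K.json", encoding="utf-8") as f:
#     print(count_bulk_documents(f))
```

The number printed for the real file should match the "count" value returned by the _count URL once the load has finished.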

Some words about the data
Although it is randomly generated, I have tried to keep a lot of structure in it.
There is one CEO.
The President, Vice President, Delivery managers, Managers, Architects, HR Managers, Team lead, Senior Software Engineers, Software Engineers and Trainees all follow a secret ratio (an approximate one, that is). Do post in the comments section if you find it, along with any other interesting tidbits, like male managers who have pole dancing as a hobby. 🙂
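
If you would like to hunt for that ratio without loading anything into a cluster, a rough Python sketch like the one below might be a starting point. It assumes the data file has one JSON object per line and that employee records carry a "Designation" field as in the mapping; lines without that field (such as bulk action lines) are skipped. The function name designation_counts and every sample record other than the CEO are invented for illustration.

```python
import json
from collections import Counter


def designation_counts(lines):
    """Tally how many employee records carry each Designation value."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        obj = json.loads(line)
        # Bulk action lines have no Designation field, so they are skipped here.
        if "Designation" in obj:
            counts[obj["Designation"]] += 1
    return counts


# Invented sample; the real files hold 50,000 or 100,000 records.
sample = [
    '{"index":{"_index":"companydatabase","_type":"employees"}}',
    '{"FirstName":"JOYE","LastName":"WIATR","Designation":"CEO"}',
    '{"index":{"_index":"companydatabase","_type":"employees"}}',
    '{"FirstName":"A","LastName":"B","Designation":"Trainee"}',
    '{"index":{"_index":"companydatabase","_type":"employees"}}',
    '{"FirstName":"C","LastName":"D","Designation":"Trainee"}',
]
counts = designation_counts(sample)
for designation, n in counts.most_common():
    print(designation, n)
```

Dividing each count by the smallest non-CEO count is one simple way to start looking for a pattern.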

The JSON format of the data is like this:

{
    "FirstName": "JOYE",
    "LastName": "WIATR",
    "Designation": "CEO",
    "Salary": 144000,
    "DateOfJoining": "25/05/2009",
    "Address": "9068 SW. Grove St. Waynesboro, PA 17268",
    "Gender": "Female",
    "Age": 58,
    "MaritalStatus": "Unmarried",
    "Interests": "Renting movies,Scuba Diving,Snowboarding,Butterfly Watching,Dumpster Diving,Badminton,Church/church activities"
}
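
Two details of this record are worth a second look when you process it in your own code. The date here is written day/month/year, while the mapping in the curl command declares yyyy-MM-dd, and the interests are packed into one comma-separated string. The Python sketch below shows one possible way to handle both. Whether your copy of the files needs the date conversion at all is an assumption to check against the data, and the helper names normalize_joining_date and interests_list are made up for this example.

```python
from datetime import datetime


def normalize_joining_date(value):
    """Return the date as yyyy-MM-dd, accepting either yyyy-MM-dd or dd/MM/yyyy input."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError("unrecognised date: " + repr(value))


def interests_list(record):
    """Split the comma-separated Interests string into a list of trimmed names."""
    return [item.strip() for item in record["Interests"].split(",") if item.strip()]


record = {
    "FirstName": "JOYE",
    "LastName": "WIATR",
    "DateOfJoining": "25/05/2009",
    "Interests": "Renting movies,Scuba Diving,Snowboarding,Butterfly Watching,"
                 "Dumpster Diving,Badminton,Church/church activities",
}
print(normalize_joining_date(record["DateOfJoining"]))
print(interests_list(record))
```

Dates that are already in yyyy-MM-dd form pass through unchanged, so the helper should be safe to run on either variant.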

Giving credit where it is due
I used this website to generate the random addresses.
I used this website to generate the list of hobbies.
Source of Male first names is this.
Source of Female first names is this.
Source of Surnames is this.

I will post the program I wrote to generate the random data once I have cleaned it up and it is presentable.
Meanwhile, if you want smaller ElasticSearch sample data sets, give me a shout. I will generate them and put them up.

I have hardcoded the index name to companydatabase and the type name to employees, by the way. Sorry about that. You can change them in any editor.
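
For files of this size a script may be more comfortable than an editor. Here is a minimal Python sketch of the idea. It assumes every odd line of the file is an action line shaped like {"index":{"_index":"companydatabase","_type":"employees"}} and every even line is the document itself; I have not reproduced the exact action lines here, so treat that shape as an assumption. The function name retarget_bulk_lines and the new names "staff" and "people" are hypothetical.

```python
import json


def retarget_bulk_lines(lines, new_index, new_type=None):
    """Yield bulk lines with _index (and optionally _type) replaced in the action lines."""
    is_action = True  # assumed layout: action line, source line, action line, ...
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if is_action:
            obj = json.loads(line)
            op, meta = next(iter(obj.items()))  # e.g. ("index", {...})
            meta["_index"] = new_index
            if new_type is not None:
                meta["_type"] = new_type
            yield json.dumps({op: meta}, separators=(",", ":"))
        else:
            yield line  # document lines are passed through untouched
        is_action = not is_action


sample = [
    '{"index":{"_index":"companydatabase","_type":"employees"}}\n',
    '{"FirstName":"JOYE","LastName":"WIATR","Designation":"CEO"}\n',
]
renamed = list(retarget_bulk_lines(sample, "staff", "people"))
print("\n".join(renamed))

# For a real file, write each yielded line plus "\n" to a new file,
# keeping the final newline that the bulk API expects.
```

Remember to use the same new index name in the command that creates the mapping.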

10 thoughts on “ElasticSearch Sample Data”

  1. Greg

    Maybe add a StartDate date field (yyyy-mm-dd); could open demo possibilities for timeline style queries like “what is the most popular hiring month?” etc.

    1. Pankaj K Post author

      Sorry for the late response. It is not that I did not read the comment, just that I was too busy to work on it. That being said, I have updated the sample data and you can use it if you still need it.

      As a bonus, I have made the month of joining of most of the people coincide with the most popular months for switching jobs. See if your Kibana visualisations can pick that up. 🙂

  2. Miko

    Very helpful – I was just looking for exactly this type of data. Thanks a lot! PS. time based fields and timestamps would be a good extension

  3. toni

    Hey, great article.
    I tried to run exactly these commands but I’m getting errors like:
    curl: (6) Could not resolve host: ‘localhost
    curl: (6) Could not resolve host: application
    curl: (3) [globbing] unmatched brace in column 1

    Can you please help me?
    Thanks

    1. Pankaj K Post author

      My machine was Linux. Are you on Windows? If so, the single quotes around the command arguments should be replaced by double quotes.

  4. chandu

    I am getting this error, can you please help?

    [root@ip-#####~]curl -H 'Content-Type: application/x-ndjson' -XPOST 'http://ipaddr:9200/_bulk?pretty' --data-binary @offers.json

    O/P

    {
    "error" : {
    "root_cause" : [
    {
    "type" : "json_e_o_f_exception",
    "reason" : "Unexpected end-of-input: expected close marker for Object (start marker at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7c53f523; line: 1, column: 1])\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7c53f523; line: 1, column: 3]"
    }
    ],
    "type" : "json_e_o_f_exception",
    "reason" : "Unexpected end-of-input: expected close marker for Object (start marker at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7c53f523; line: 1, column: 1])\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7c53f523; line: 1, column: 3]"
    },
    "status" : 500

    1. Pankaj K Post author

      I am a bit short on time actually. I will try to have a look but can’t promise anything.
      This article needs to be updated for Elasticsearch 7.5. I think there are some changes that might trip users up. I will try to update it for the latest version of Elasticsearch.
