This dataset is derived from a dump of StackOverflow posts downloaded on June 10th 2016 from https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Posts.7z
Each question and answer have formatted into a JSON document with the following fields:
questionId: a unique ID for a question
answerId: a unique ID for an answer
acceptedAnswerId: the unique ID of the answer accepted for question
title: a free-text field with the question title
creationDate: The date the questions was asked
user: The user's unique ID
tags: An array of tags describing the technologies.
body: Field contsaining the text of the question or answer.
type: Type of post. Either 'question' or 'answer'
Fields that do not have values have been left out. The body has had text extracted and been formatted to fit into JSON documents.
Data preparation process:
- Question and answer entries in the original posts.XML were converted to slimmed-down JSON documents.
- No enrichment was performed. These scripts are available in the raw_data_prep_script.zip file.
{
"user": "45",
"tags": ["c#", "linq", ".net-3.5"],
"questionId": "59",
"creationDate": "2008-08-01T13:14:33.797",
"title": "How do I get a distinct, ordered list of names from a DataTable using LINQ?",
"acceptedAnswerId": "43110",
"type": "question",
"body": "Let's say I have a DataTable with a Name column. I want to have a collection of the unique names ordered alphabetically. The following query ignores the order by clause. var names = (from DataRow dr in dataTable.Rows orderby (string)dr[\"Name\"] select (string)dr[\"Name\"]).Distinct(); Why does the orderby not get enforced? "
}
This workload allows the following parameters to be specified using --workload-params
:
bulk_size
(default: 5000)bulk_indexing_clients
(default: 4): Number of clients that issue bulk indexing requests.ingest_percentage
(default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.number_of_replicas
(default: 0)number_of_shards
(default: 5)query_cache_enabled
(default: false)requests_cache_enabled
(default: false)source_enabled
(default: true): A boolean defining whether the_source
field is stored in the index.index_settings
: A list of index settings. Index settings defined elsewhere (e.g.number_of_replicas
) need to be overridden explicitly.cluster_health
(default: "green"): The minimum required cluster health.error_level
(default: "non-fatal"): Available for bulk operations only to specify ignore-response-error-level.
We use the same license for the data as the original data: CC-SA-3.0