sql: v24.3.0: index out of range in processSketchRow when collecting table stats #137386

Closed
cockroach-sentry opened this issue Dec 13, 2024 · 6 comments · Fixed by #138358
Labels
branch-release-25.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@cockroach-sentry
Collaborator

cockroach-sentry commented Dec 13, 2024

This issue was auto filed by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry Link: https://cockroach-labs.sentry.io/issues/6137369904/?referrer=webhooks_plugin

Panic Message:

panic.go:770: runtime error: index out of range [8192] with length 8192
(1) attached stack trace
  -- stack trace:
  | runtime.gopanic
  | 	GOROOT/src/runtime/panic.go:770
  | github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Wait
  | 	pkg/sql/flowinfra/flow.go:609
  | runtime.gopanic
  | 	GOROOT/src/runtime/panic.go:770
  | runtime.goPanicIndex
  | 	GOROOT/src/runtime/panic.go:114
  | github.com/axiomhq/hyperloglog.(*registers).get
  | 	external/com_github_axiomhq_hyperloglog/registers.go:80
  | github.com/axiomhq/hyperloglog.(*Sketch).Merge
  | 	external/com_github_axiomhq_hyperloglog/hyperloglog.go:157
  | github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*sampleAggregator).processSketchRow
  | 	pkg/sql/rowexec/sample_aggregator.go:398
  | github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*sampleAggregator).mainLoop
  | 	pkg/sql/rowexec/sample_aggregator.go:350
  | github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*sampleAggregator).Run
  | 	pkg/sql/rowexec/sample_aggregator.go:197
  | github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run
  | 	pkg/sql/flowinfra/flow.go:574
  | github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
  | 	pkg/sql/distsql_running.go:924
  | github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).planAndRunCreateStats
  | 	pkg/sql/distsql_plan_stats.go:778
  | github.com/cockroachdb/cockroach/pkg/sql.(*createStatsResumer).Resume.func1
  | 	pkg/sql/create_stats.go:760
  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).Txn.func1
  | 	pkg/sql/internal.go:1937
  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn.func4
  | 	pkg/sql/internal.go:2024
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  | 	pkg/kv/txn.go:1052
  | github.com/cockroachdb/cockroach/pkg/kv.runTxn
  | 	pkg/kv/db.go:1098
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl
  | 	pkg/kv/db.go:1061
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  | 	pkg/kv/db.go:1036
  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn
  | 	pkg/sql/internal.go:2011
  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).Txn
  | 	pkg/sql/internal.go:1938
  | github.com/cockroachdb/cockroach/pkg/sql.(*createStatsResumer).Resume
  | 	pkg/sql/create_stats.go:710
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
  | 	pkg/jobs/registry.go:1639
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
  | 	pkg/jobs/registry.go:1640
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
  | 	pkg/jobs/adopt.go:446
  | github.com/cockroachdb/cockroach/pkg/jobs.(*StartableJob).Start.func2
  | 	pkg/jobs/jobs.go:832
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  | 	pkg/util/stop/stopper.go:498
  | runtime.goexit
  | 	src/runtime/asm_amd64.s:1695
Wraps: (2) runtime error: index out of range [8192] with length 8192
Error types: (1) *withstack.withStack (2) runtime.boundsError
-- report composition:
runtime.boundsError
panic.go:770: *withstack.withStack (top exception)
Stacktrace (expand for inline code snippets):

src/runtime/asm_amd64.s#L1694-L1696
pkg/util/stop/stopper.go#L497-L499
pkg/jobs/jobs.go#L831-L833
pkg/jobs/adopt.go#L445-L447
pkg/jobs/registry.go#L1639-L1641
pkg/jobs/registry.go#L1638-L1640
pkg/sql/create_stats.go#L709-L711
pkg/sql/internal.go#L1937-L1939
pkg/sql/internal.go#L2010-L2012
pkg/kv/db.go#L1035-L1037
pkg/kv/db.go#L1060-L1062
pkg/kv/db.go#L1097-L1099
pkg/kv/txn.go#L1051-L1053
pkg/sql/internal.go#L2023-L2025
pkg/sql/internal.go#L1936-L1938
pkg/sql/create_stats.go#L759-L761
pkg/sql/distsql_plan_stats.go#L777-L779
pkg/sql/distsql_running.go#L923-L925
pkg/sql/flowinfra/flow.go#L573-L575
pkg/sql/rowexec/sample_aggregator.go#L196-L198
pkg/sql/rowexec/sample_aggregator.go#L349-L351
pkg/sql/rowexec/sample_aggregator.go#L397-L399
external/com_github_axiomhq_hyperloglog/hyperloglog.go#L156-L158
external/com_github_axiomhq_hyperloglog/registers.go#L79-L81
GOROOT/src/runtime/panic.go#L113-L115
GOROOT/src/runtime/panic.go#L769-L771
pkg/sql/flowinfra/flow.go#L608-L610
GOROOT/src/runtime/panic.go#L769-L771

src/runtime/asm_amd64.s in runtime.goexit at line 1695
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 498
pkg/jobs/jobs.go in pkg/jobs.(*StartableJob).Start.func2 at line 832
pkg/jobs/adopt.go in pkg/jobs.(*Registry).runJob at line 446
pkg/jobs/registry.go in pkg/jobs.(*Registry).stepThroughStateMachine at line 1640
pkg/jobs/registry.go in pkg/jobs.(*Registry).stepThroughStateMachine.func2 at line 1639
pkg/sql/create_stats.go in pkg/sql.(*createStatsResumer).Resume at line 710
pkg/sql/internal.go in pkg/sql.(*InternalDB).Txn at line 1938
pkg/sql/internal.go in pkg/sql.(*InternalDB).txn at line 2011
pkg/kv/db.go in pkg/kv.(*DB).Txn at line 1036
pkg/kv/db.go in pkg/kv.(*DB).TxnWithAdmissionControl at line 1061
pkg/kv/db.go in pkg/kv.runTxn at line 1098
pkg/kv/txn.go in pkg/kv.(*Txn).exec at line 1052
pkg/sql/internal.go in pkg/sql.(*InternalDB).txn.func4 at line 2024
pkg/sql/internal.go in pkg/sql.(*InternalDB).Txn.func1 at line 1937
pkg/sql/create_stats.go in pkg/sql.(*createStatsResumer).Resume.func1 at line 760
pkg/sql/distsql_plan_stats.go in pkg/sql.(*DistSQLPlanner).planAndRunCreateStats at line 778
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).Run at line 924
pkg/sql/flowinfra/flow.go in pkg/sql/flowinfra.(*FlowBase).Run at line 574
pkg/sql/rowexec/sample_aggregator.go in pkg/sql/rowexec.(*sampleAggregator).Run at line 197
pkg/sql/rowexec/sample_aggregator.go in pkg/sql/rowexec.(*sampleAggregator).mainLoop at line 350
pkg/sql/rowexec/sample_aggregator.go in pkg/sql/rowexec.(*sampleAggregator).processSketchRow at line 398
external/com_github_axiomhq_hyperloglog/hyperloglog.go in github.com/axiomhq/hyperloglog.(*Sketch).Merge at line 157
external/com_github_axiomhq_hyperloglog/registers.go in github.com/axiomhq/hyperloglog.(*registers).get at line 80
GOROOT/src/runtime/panic.go in runtime.goPanicIndex at line 114
GOROOT/src/runtime/panic.go in runtime.gopanic at line 770
pkg/sql/flowinfra/flow.go in pkg/sql/flowinfra.(*FlowBase).Wait at line 609
GOROOT/src/runtime/panic.go in runtime.gopanic at line 770

Tags

Tag Value
Command server
Environment v24.3.0
Go Version go1.22.8 X:nocoverageredesign
Platform linux amd64
Distribution CCL
Cockroach Release v24.3.0
Cockroach SHA ef2ebe9
# of CPUs 4
# of Goroutines 714

Jira issue: CRDB-45552

@cockroach-sentry cockroach-sentry added branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Dec 13, 2024

blathers-crl bot commented Dec 13, 2024

CC'ing via the CODEOWNERS-based sentry heuristic:

  • @cockroachdb/sql-queries

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner labels Dec 13, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Dec 13, 2024
@mgartner mgartner self-assigned this Dec 13, 2024
@mgartner
Collaborator

I'm looking into this now—temporarily assigning myself.

@mgartner
Collaborator

I spent a while trying to repro this out-of-bounds error in our hyperloglog library, without success. It seems like the length of cpOther.regs.tailcuts here can exceed 8192, but I don't see how that is possible: https://github.com/axiomhq/hyperloglog/blob/4b99d0c2c99ec77eb3a42344d206a88997957495/hyperloglog.go#L157

I thought there may be a bug in the marshalling/unmarshalling of the hyperloglog.Sketch, but it seems mostly sane.

I think there are a few potential action items for us:

  1. Consider upgrading hyperloglog. We're on a very old version. Newer versions seem to have eliminated the tailcuts concept entirely, so the bug we're hitting here might be fixed.
  2. Add a panic-catcher for the sampler so this doesn't crash a node.
  3. Add some additional logging or assertions that may catch a bad hyperloglog sketch before this out-of-bounds occurs.
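
A minimal sketch of the panic-catcher idea from item 2, assuming the merge can be wrapped in a function that converts a runtime panic into an error. `mergeRegisters` and `safeMerge` are hypothetical names made up for illustration; they are not functions from CockroachDB or the hyperloglog library, and the register layout here is a plain byte slice rather than the library's real representation.

```go
package main

import "fmt"

// mergeRegisters is a hypothetical stand-in for a sketch merge. It keeps the
// per-register maximum and indexes dst with positions taken from src, so it
// panics with an index-out-of-range error when src is longer than dst.
func mergeRegisters(dst, src []uint8) {
	for i, v := range src {
		if v > dst[i] {
			dst[i] = v
		}
	}
}

// safeMerge wraps mergeRegisters and turns any panic into an error, so the
// caller can fail the stats job instead of taking the whole process down.
func safeMerge(dst, src []uint8) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("merging sketch: recovered panic: %v", r)
		}
	}()
	mergeRegisters(dst, src)
	return nil
}

func main() {
	dst := make([]uint8, 8192)
	src := make([]uint8, 8193) // one register too many, as a malformed sketch might have
	if err := safeMerge(dst, src); err != nil {
		fmt.Println("stats collection failed:", err)
	}
}
```

Note that recovering like this only contains the damage. If the panic fires partway through the loop, `dst` may already be partially updated, so the caller should discard it rather than keep using it.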

@mgartner mgartner removed their assignment Dec 13, 2024
@yuzefovich yuzefovich changed the title Sentry: panic.go:770: runtime error: index out of range [8192] with length 8192 (1) attached stack trace -- stack trace: | runtime.gopanic | GOROOT/src/runtime/panic.go:770 | github.com/cockr... sql: v24.3.0: index out of range in processSketchRow when collecting table stats Dec 14, 2024
@mgartner mgartner moved this from Triage to 25.2 Release in SQL Queries Dec 26, 2024
@mgartner mgartner moved this from 25.2 Release to 25.1 Release in SQL Queries Dec 26, 2024
@yuzefovich
Member

This appears to be a regression in version 24.3: in #137749 we have 4 occurrences of this problem and all were running 24.3.x. Perhaps a RESTORE of a backup from an earlier version is required for reproducing.

@yuzefovich
Member

yuzefovich commented Jan 7, 2025

I'm able to reproduce this problem with the following steps:

  • start n1 using v24.3.2, do cockroach init
  • start n2 using master with COCKROACH_TESTING_FORCE_RELEASE_BRANCH=true override
  • connect to n1
create table t (k int primary key);
insert into t select generate_series(1, 100000);
alter table t split at values (50000);
alter table t experimental_relocate values (array[1], 0), (array[2], 50000);
analyze t;

and boom. At a quick glance, the coordinator node must be running the older version (i.e. doing these steps when connected to n2 doesn't trigger the problem).

I believe the issue is the following: in 2c036cf (which is only present on master, i.e. 25.1 version) we upgraded the hyperloglog library. We marshal sketches via the binary representation, but nodes running the older version do not know how to unmarshal the newer representation.

The only remaining question for me is why we're seeing this problem in sentry (the library bump wasn't backported) - perhaps they are the result of the same backup roachtest failures? Or someone is doing their testing with master version and mixed version state?

@yuzefovich yuzefovich added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-25.1 and removed branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 labels Jan 7, 2025
@yuzefovich yuzefovich self-assigned this Jan 7, 2025
@yuzefovich yuzefovich moved this from 25.1 Release to Active in SQL Queries Jan 7, 2025
@mgartner
Collaborator

mgartner commented Jan 8, 2025

@yuzefovich Great job tracking this down! I incorrectly assumed that the cluster was on 24.3 based on the version of the Sentry report. I did not think about the mixed-version case.

@craig craig bot closed this as completed in e207d7e Jan 9, 2025