Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings time zones) #6840

Open
revans2 opened this issue Oct 18, 2022 · 5 comments
Assignees
Labels
feature request New feature or request

Comments

@revans2
Copy link
Collaborator

revans2 commented Oct 18, 2022

Is your feature request related to a problem? Please describe.
This is a follow on to #6831. In order to support all time zones we will need a way to be able to deal with repeating rules. These are the instances of ZoneOffsetTransitionRule. The idea is to that we would cache these on the CPU similar to what we do for ZoneOffsetTransition. However, we would need a lot more columns, because we would need to include the majority of the information included in the ZoneOffsetTransitionRule. Then we would need to recreate the same kind of logic that exists in ZoneRules.getOffset. But we have to be very careful. JDK code is under the a GPL license that makes it so you cannot copy it. Instead we are going to have to try and do a clean room implementation or we might be able to look at what joda time does, which is an Apache licensed library that also supports time zone conversions.

https://github.com/JodaOrg/joda-time

Specifically https://github.com/JodaOrg/joda-time/blob/main/src/main/java/org/joda/time/tz/DateTimeZoneBuilder.java and https://github.com/JodaOrg/global-tz

But I am not sure everything lines up exactly the same so it could take a bit of work.

After this we need to update the API that restricts which time zones we support so it includes all of them and deprecate it. Then we need to update all of the tests that have been written to check that we do not fall back and test the new time zones.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 18, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 18, 2022
@NVnavkumar NVnavkumar self-assigned this Dec 1, 2023
@NVnavkumar NVnavkumar changed the title [FEA] Support timestamp transitions to and from UTC for repeating rules [FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings) Dec 8, 2023
@NVnavkumar
Copy link
Collaborator

After doing some research into the JDK's API, here are some interesting findings:

  1. Java stores the transitions for the timezones with daylight savings after they have already happened. So for past dates, the transition list has all the previous daylight savings transitions in the transition list as fixed transitions (which makes sense). This means that the transition list will actually grow each year for these timezones. Also, the implication is that for dates in the past, we would adopt the same approach we take with non-repeating rules.

  2. Transition rules exist for dates in the future. The idea is for each current and future year, you can compute the ZoneOffsetTransition for that future year.

So we have 2 approaches we can try here:

  1. We can use ZoneOffsetTransitionRule.createTransitition(...) to create transitions for a set of future years and append these to fixed transition list for that timezone. This would incur a longer initialization time and more memory required to store the extra transitions, but computing the conversion to and from UTC would simply use the existing kernel as if the transitions were fixed and non-repeating. This is definitely the simplest to implement and is parallel to the implementation that cuDF currently uses in parsing non-UTC timestamps from ORC files.

  2. We would store the rules in another column vector, and then implement an addition to the kernel that would use rules if we would fall out of the existing transition list (ie use a transition that doesn't exist yet). We would have to process the timestamp to figure out which rule would apply and then apply a created transition from that rule. The logic is more complicated in this case so it might not perform as well as the simpler fixed transition kernel since we would need some additional conditional logic in this case. But initialization time would only increase a nominal amount to store the rules, and the storage cost would only increase on the yearly basis that Java uses and for the additional columns to store rules.

For now, I'm inclined to try out (2) first since I think (1) feels more like too much additional cost for what looks like premature optimization. However, we could keep in mind to implement both approaches (or a combination) since it could be something that is tuned to the data involved. For example if we are working with a lot of future timestamps, (1) could be better since that algorithm is a lot cleaner than doing a lot of processing and extraction from each timestamp to determine if it falls under a specific rule. Also, if we can partially tune an implementation of (1), maybe we can cover more than enough with less overhead.

@NVnavkumar
Copy link
Collaborator

Should also add, at minimum for approach (2), we should ensure the transitions for the current year are already computed since that logic is potentially more complicated given the overlap between fixed transitions and choosing rules for a daylight savings cycle that hasn't yet completed. It might be much easier to start the following year with fresh rule choice at that point.

@NVnavkumar NVnavkumar changed the title [FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings) [FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings time zones) Dec 13, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Dec 27, 2023

I did some initial work looking at approach 1 when I filed the issue. The size of the database is not likely to be too horrible on a per time-zone basis. It would probably be about 11 MiB per time zone because it would need to go out to the year 294247 or so to fully support everything that Spark can also support.

294247 years * 2 transitions per year * (8 + 8 + 4 bytes) => 11.22 MiB

There are about 206 time zones that have transition rules

scala> java.time.ZoneId.getAvailableZoneIds.asScala.map(f => java.time.ZoneId.of(f).normalized).filter(f => !f.getRules().getTransitionRules.isEmpty()).size
res18: Int = 206

Which means we would need about 2.25 GiB to store all of the rules for those timezones. That is small enough That I don't think we would need to worry about the Column size limits in Spark, but it is large enough that I don't think we want to try and keep the entire thing cached in either CPU or GPU memory. So option 2 is the ideal solution if we can make it work. If we need to we could try and play some games where we load timezones on demand as they are needed, as we probably need only one or two timezone ever. We could also look for the maximum timestamp and load up to that point, as it should be very very uncommon to need all of the data. But that would open up a number of corner cases which I would not be super happy about. So if we can make the dynamic version work, then it would be awesome.

Also the dynamic version would set us up to be able to support things like fixed offset rules that don't actually have a name.

@res-life
Copy link
Collaborator

Not planing work on this for release 24.04
@sameerz can we move it to next release?

@revans2
Copy link
Collaborator Author

revans2 commented Jan 31, 2025

I do want to add a bit more details about this after thinking about it more.

I believe that we can write a GPU kernel that can do transitions to and from UTC using the data stored in the java ZoneOffsetTransitionRule.

That said I think the vast majority of the time any date/timestamp that we process will be within the next 100 or so years and from a performance perspective I think it is preferable to at least expand out rules for that range.

There are 206 distinct time zones in the java 11 database with rules that we cannot currently support. To support these rules for the next 127 years would only require 1 MiB of data. For a single time zone that is only about 5 KiB which should fit in the L1 cache for each warp even in a T4. CUDF assumes that all timestamps will be within 1970 + 400 years and just blindly loads transitions to that point.

That said if we want to just get something working we probably can add in some checks and do the conversion on the GPU for any timestamps that are within the range of rules that we support, and have cached. For any others we could do the conversion on the CPU and copy the result back to the GPU. This is making the assumption that these should be very rare. I'm not sure in practice if that is actually true.

In either case as a follow on we do want to be able to run these rules on the GPU directly. I asked ChatGPT and deepseek r1 to collaborate on some code. I have no idea how good it actually is, and I have not tested it, or even tried to compile it, but I hope it will give us a head start and avoid any GPL issues with the java standard library. Note that they don't really know how to deal with CUDF so it is not using table or column like they should, but the core parts are

// Data structure for transition rules
struct ZoneRule {
    int8_t month;             // 1-12
    int8_t day_of_week;       // 1=Monday, 7=Sunday
    int8_t day_type;          // -1=last, 0+=fixed day of month
    int32_t local_time;       // microseconds since midnight
    int32_t standard_offset;  // microseconds
    int32_t dst_offset;       // microseconds
};

// Expanded transition struct with merged rules and transitions
struct ZoneData {
    int64_t* transitions;      // Sorted transitions (microseconds)
    int32_t* offsets_before;    // Offset before transition
    int32_t* offsets_after;     // Offset after transition
    int transition_count;
    ZoneRule* rules;            // Only for out-of-bounds timestamps
    int rule_count;
};

__device__ cuda::std::chrono::sys_days last_weekday_of_month(
    int year, int month, cuda::std::chrono::weekday wd
) {
    using namespace cuda::std::chrono;
    year_month_day_last ymdl = year{year}/month{month}/last;
    sys_days last_day = sys_days{ymdl};
    weekday last_wd = weekday{last_day};
    days adjustment = (last_wd - wd).count() >= 0 ? 
        days{(last_wd - wd).count()} : 
        days{(last_wd - wd).count() + 7};
    return last_day - adjustment;
}

__device__ int64_t calculate_rule_transition(
    int year, const ZoneRule& rule, int32_t standard_offset
) {
    using namespace cuda::std::chrono;
    sys_days transition_day;

    if (rule.day_type == -1) {
        // Last weekday of month (e.g., last Sunday)
        transition_day = last_weekday_of_month(
            year, 
            rule.month, 
            weekday{(rule.day_of_week + 6) % 7} // Convert 1=Mon to 0=Sun
        );
    } else {
        // Fixed day of month
        transition_day = sys_days(
            year{year}/month{rule.month}/day{rule.day_type}
        );
    }

    // Apply local time and adjust to UTC
    auto transition_time = transition_day + 
                          microseconds{rule.local_time} - 
                          microseconds{standard_offset};
    return transition_time.time_since_epoch().count();
}

__global__ void convert_timezone_kernel(
    const int64_t* input_timestamps,
    int64_t* output_timestamps,
    const ZoneData src_zone,
    const ZoneData dst_zone,
    int num_rows
) {
    using namespace cuda::std::chrono;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_rows) return;

    // 1. Convert source local to UTC
    const int64_t src_local = input_timestamps[idx];
    int32_t src_offset = 0;

    // Binary search over precomputed transitions (99% case)
    auto trans_it = thrust::upper_bound(
        thrust::seq, 
        src_zone.transitions, 
        src_zone.transitions + src_zone.transition_count, 
        src_local
    );
    if (trans_it != src_zone.transitions) {
        --trans_it;
        const int trans_idx = trans_it - src_zone.transitions;
        src_offset = (src_local >= src_zone.transitions[trans_idx]) ?
            src_zone.offsets_after[trans_idx] :
            src_zone.offsets_before[trans_idx];
    } else {
        // Fallback to rules (rare case)
        const system_clock::time_point tp{microseconds{src_local}};
        const auto ymd = year_month_day(floor<days>(tp));
        const int year = static_cast<int>(ymd.year());
        
        for (int i = 0; i < src_zone.rule_count; ++i) {
            const int64_t transition = calculate_rule_transition(
                year, src_zone.rules[i], src_zone.rules[i].standard_offset
            );
            if (src_local >= transition) {
                src_offset = src_zone.rules[i].dst_offset;
            }
        }
    }

    const int64_t utc_micro = src_local - src_offset;

    // 2. Convert UTC to target local (same logic as source)
    // ... [omitted for brevity; identical to source logic] ...
}

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants