Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add LastModified to FtpFile #147

Closed
wants to merge 1 commit into from

Conversation

PeterRimmer71
Copy link
Contributor

For FtpWorkers the worker the FtpFile currently has,

public class FtpFile<TPayLoad>
{
    public FtpMetadata Metadata { get; set; } details of the file on the ftp server
    public Instant RetrievdAt { get; set; } instant the file is retrieved
    public string CheckSum { get; set; } computed based on the file contents
    public TPayLoad ParsedData { get; set; } defined in the worker
}

public class FtpMetadata
{
    internal FtpMetadata(FtpEntry entry)
    {
        Entry = entry;
        ResourceId = entry.FullPath;
        Modified = LocalDateTime.FromDateTime(entry.Modified);
        ModifiedSources = null;
    }

    public FtpEntry Entry { get; }
    public LocalDateTime Modified { get; }
    public Dictionary<string, LocalDateTime>? ModifiedSources { get; }
    public string ResourceId { get; }
    public object Extensions => new { Entry.Name, Entry.Size };
}

The idea is to add LastModified to the FtpFile class where LastModified = laststate.Modified.

EntsoE SFTP files have a column which is the UpdateTime for each row in the file. Adding LastModified enables filtering of the data to write. The data to write would be rows with UpdateTime greater than LastModified.

Copy link
Contributor

@AndreaCuneo AndreaCuneo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR describes a use case related to Contents of the file being watched.

The change instead expose the previous state.Modified which has nothing to do with contents of the File, as part of the Retrieved file.

Can you please better explain why the GetResource() that is only responsible of Downloading the File should be extended to add a single property of the previous known state for this file?

If I guess correctly, what you're aiming for is to make the previous state available to Application code when Handling the file.

Also, I don't think what you are aiming for works. You have no guarantee that UpdateTime and the file Modified have any relationship. FTPs can modify files for reasons independent from the contents and in different moments.

Lastly, the Modified is not guaranteed to be sync with file Download: between when File is listed from FTP and is Downloaded, it may get written/update.
The only guarantee is that in this case, at next check, the library detects a new change as Modified changed and downloads the file again. But the checksum detects that is identical to last, and so is skipped after download. This to say that Modified and contents are unrelated.

I strongly advise to base on the Modified any logic relate to data contents.

@PeterRimmer71
Copy link
Contributor Author

The aim is to expose details of the previous state which can then be used by the application code to filter what is to be written.

I was thinking that the UpdateTime does have a relationship with file modified date but if that's not true then this approach is doomed.

The only additional thought is that maybe RetrievedAt from the last state could be used. That is the datetime when the file was last processed.

@AndreaCuneo
Copy link
Contributor

You're still doomed, as file could be generated 10min before and then written.

What you need to Track is the max UpdateTime and store as part of the State Extensions to reuse during Parsing or Handling the next run.

But the Writer doesn't receive the Prev state, only the current as part of the Metadata. Adding this requires revisiting the Interfaces passing read-only to Consumers.

But this change is not 'FTP' related, is Core: pass State tracking to Processors in addition to the Resource.

That is a major revision.

@PeterRimmer71
Copy link
Contributor Author

OK. So, there are a couple of things needed to enable this:

  • FTP worker we needs to be able to add custom items to metadata's extensions.
  • include tracked state for processors

I'm not sure what the estimated effort would be to implement this and also what would be needed to justify the effort.

@PeterRimmer71
Copy link
Contributor Author

I added some logs to the worker to collected some information about the modified timestamp and max(updatetime) from the Entso-e SFTP...

filename: 2024_03_AggregatedGenerationPerType_16.1.B_C.csv
RetrevedAt              Modified                Max(UpdateTime)
2024-03-22 15:48:45.82  03/22/2024 03:12:06 PM  2024-03-22 16:11:51
2024-03-22 14:47:52.95  03/22/2024 02:11:15 PM  2024-03-22 15:10:49
2024-03-22 13:45:04.64  03/22/2024 01:12:13 PM  2024-03-22 14:11:34
2024-03-22 12:44:52.99  03/22/2024 12:13:14 PM  2024-03-22 13:12:16

@AndreaCuneo AndreaCuneo closed this Apr 2, 2024
@AndreaCuneo
Copy link
Contributor

closed. The approach is not correct as is a narrow interface leak from a library perspective.
additionally doesn't really support in achieving the result which requires a tracking of an additional (inner) checkpoint which is not supported.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants