I independantly wrote a program to match roads sections from different traffic databases using geographic information.
My supervisor at the University of Toronto Transportation Research Institute was involved in the calibration of a traffic simulation model covering the GTHA to real world conditions. This means modifying simulation parameters until conditions and traffic flows match those given by third party sources. Given observed data from HERE techologies, my supervisor wanted to know which sections in the simulation model were being described.
- over 100,000 road sections per database (strong time complexity requirements)
- naming is inconsistent
- location of road section splits are inconsistent between databases
- close proximity does not gaurantee correct match
- geographic coordinates carry up to 10 metres of uncertainty
Data is given as 2 GeoPandas dataframes, with each row describing a section. Information includes road name, connected sections, and geometric characteristics given as shapely objects. In total there are 160,000 and 300,000 sections per database.
My program consists of the following stages.
- Data is reorganized into a new data structure that allows fast queries based on geographic location.
- Algorithm matches roads between databases.
- Visualizations and statistics regarding matching process integrity are generated.
In the original Pandas dataframe, searching for a road section near coordinates is O(N) time. With N=100,000 queries this gives us O(N^2) time, or roughly 10^10 steps. I restructured key data into a stack of ordered lists, which allows for binary search (figure below). Each query takes approximately O(log(N/nl)) time, where nl = number of lists = 8000. N=100,000 queries runs in roughly 5*10^5 steps.
Searching for multiple roads near a location proceeds as follows. First we identify the relevant latitude lists, which never exceeds 3 given the user's specified matching distance. Next we binary search the longitudinal endpoints on each list to cut out the appropriate range. In reality we are searching a square, with the boundaries of the square being determined in no more than 6 binary searches.
The matching algorithm works primarily with geo-spatial information. Because separate roads commonly overlap or are close, we need a better algorithm than a simple distance heuristic. My solution combines a distance heuristic, with connected section information to filter out most incorrect matches.
Step 1: Find sections near each endpoint of the query road, and store them in 2 sets.
Step 2: We run a search algorithm to find a path connecting the 2 sets. If a path is found, and all sections along the path also lie close to the query road, then this path should be the query road itself.
Finally, the overlap distance and direction of each matched section to the query road is calculated via geometric manipulations of shapely objects. Optional visualizations are produced on every process to test program integrity.
94% per query when excluding freeways and sections shorter than 10 metres. 85% otherwise. Accuracy is measured as having all matches of the query road being correct. Note that accuracy is highly dynamic as overlap information and other parameters can be used as an adjustable threshold to tradeoff quantity vs quality of matches.
Instead of creating my own datastructure to allow for faster spatial based search queries, I would explore libraries that organize data according to 2 dimensions for me. This will make my code simpler and more reproducible.
I would explore utilising secondary information (ex. overlap distance, road names, road direction) as a second threshold to improve accuracy. Since the program would now have 2 distinct filters, accuracy is expected to be very high.
Skills Learned: GeoPandas, GIS, shapefiles, shapely visualizations, matrix/array operation time dependancies