Friday, March 15, 2013

Fast Dijkstra’s Algorithm Inside MS SQL CLR Stored Procedure

Few days ago I had to write an algorithm for finding fastest route between two points on an Open Street Map.

One of the most popular routing algorithms is Dijkstra’s Algorithm. To perform route search we need a data representation of Vertices and Edges connecting those vertices. It is also important to have Cost assigned to edges. If Edge is bidirectional we also need to have Reverse Cost – so we know how expensive is to return back using that same edge.

The algorithm itself uses that information for finding shortest path between provided vertices.

Here is description from Wikipedia

Dijkstra's algorithm. It picks the unvisited vertex with the lowest-distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor's distance if smaller. Mark visited (set to red) when done with neighbors.

Here is another

Illustration of Dijkstra's algorithm search for finding path from a start node (lower left, red) to a goal node (upper right, green) in a robot motion planning problem. Open nodes represent the "tentative" set. Filled nodes are visited ones, with color representing the distance: the greener, the farther. Nodes in all the different directions are explored uniformly, appearing as a more-or-less circular wavefront as Dijkstra's algorithm uses a heuristic identically equal to 0.

So basically what we need to do is:

  • Go from each vertex (starting with source vertex) to their neighbours. Important is we always must pick vertex with smallest distance first on each iteration.
  • Update each vertex with smallest distance (each vertex may be visited through many other vertices so there are many distances per vertex – but we are interested in shortest one)
  • Also as we update vertex with smallest distance  we must store the source vertex from which we came here so later we can reconstruct the path
  • Once we reach target vertex – that means we have found shortest path.

Lets start by creating main classes for our algorithm

Vertex:

Each vertex has reference to Previous vertex – the source from where during path search current vertex was visited and the cost was shortest

Also each vertex has reference to all neighbour vertices – this is pre-populated before search starts so the search process is as efficient as possible.

Visited flag is set to true when we marked with distances all neighbours of a vertex. If vertex is already visited then it’s neighbours will not be iterated anymore.

Cost – is total distance to the vertex from source vertex.

As you already might noticed there are two types of properties on the class – one is pre-populated and reused between many searches, and the other needs initialization before each search.

Here ID, Neighbours are pre-populated properties – once we initialize the all vertices we can reuse those values between searches.

Other properties need initialization for each vertex in the graph before each search

  • Visited needs to be set to false for all vertices
  • Cost needs to be set to max value
  • PreviousVertex must be set to null
  • PreviousEdge must be set to null

Here is our Edge class:

SourceVertex and TargetVertex are references to Vertex objects that represent start and end point of an edge

Here Cost does not change from search to search – because this Cost differs from vertex Cost. Vertex Cost updates during search and is distance to the vertex from source, while Cost in edge is distance between start vertex to end vertex.

IsReverse also does not change – this property indicates whether this edge was created using ReverseCost of original edge from database (To avoid confusion I will explain here that database will contain bidirectional edges with values cost and reverse cost showing how expensive is to move from start vertex to end vertex and vice-versa. During search however we need to double edges – the search graph must have 2 edges for each bidirectional DB edge – one with unchanged Source and Target Vertex properties and other with swapped vertices.)

I can say the Edge class properties does not change during search. Once initialized the Edge objects can be reused between many searches

To pass data to graph we need some POCO object representing minimum amount of properties enough for path search.

For simplicity I use the same POCO object to return result from graph search. So there some properties specific to result.

The GraphDataItem class:

Properties used to pass data inside search are:

EdgeID, SourceVertexID, TargetVertexID, Cost, ReverseCost

As you see Clone methods used exactly that set of properties. when returning result from graph I clone the item excluding result-specific properties and I set those in result method without modifying reusable edge objects.

To separate result construction from search algorithm I created dedicated class returning IEnumerable<GraphDataItem> and accepting found target vertex, and some informational properties (time spent to search and tries made before finding target).

Here is GraphSearchResult class:

As you see the result is constructed right in the constructor. the passed vertex is walked to the source using the PreviousVertex reference. PreviousEdge is used to retrieve the original data item and return it too.

So what we have – we can populate GraphDataItem objects collection and pass it to graph, graph will create cache for fast search based on the data we passed in and reuse that cache between many searches. graph will then need to find path is we provide path between which vertices we need and return is using GraphSearchResult helper.

Here is the Graph class:

Graph class has class level variables for in memory representation of all vertices and edges and their interconnection. mEdges and mVertices variables store that schema (keys are IDs of the items stored in values).

Also Graph operates with two types of Vertex ID – original ones provided and in-memory ones – which are sequential for algorithm realization simplicity. mVertexDictionary variable stores the mapping between original and algorithm-related IDs.

The algorithm you see here actually experienced some evolution during development.

To explain the changes and importance of the changes I will return again to very important (critical I would say) point in this algorithm.

I test this algorithm using real data which I obtained by downloading whole country (Georgia – Caucasus) OSM map and processing it using osm2pgrouting which created the SQL script with vertices and edges info. I then imported the script into pgSQL server and after few data conversions was able to import into MS SQL Server, which is desired server for our purpose.

In essence the data required for routing is inside one table named Routes. It looks like this:

image

I think I will need to write a tool which generates routing table from OSM and inserts it directly to SQL Server. This should be perfect subject for next blog post.

Here we can use source, target, cost, reverse_cost, id as source vertex id, target vertex id, cost, reverse cost, and edge id properties. we can pass this inside our graph class and it will do the job.

To return to the evolution of the algorithm I want to emphasize the size of this routing table.

It has 180 000 + rows representing smallest road segments in the country.

With first version of algorithm I was not able to find path at all – the system was doing very slow iterations and just hang

Second version of algorithm needed about 2-5 seconds to find path between two farthest points in the country.

After optimizations third version needs about 0.1 second on my PC  for path between two farthest points in the country.

Please note that we have 180 000 bidirectional edges and after creating in – memory graph we actually have more connections between vertices that in DB table.

Basically the search of nodes is iteration over vertices and their neighbours. If we also remember that one vertex is visited only once we can assume that not much problem should appear if we iterate even over hundred million records.

BUT the problem IS – in each iteration we must choose next neighbour to visit. And this is crucial part of algorithm. The next neighbour must have the smallest current distance from the source. And this unfortunately means we need to perform SORT once per iteration. This is extremely slow operation if we consider it will be performed hundred thousand times.

First version

In first version I used mVertices.Values.SortBy(vertex => vertex.Cost).FirstOrDefault() to get next vertex to visit.

as you know mVertices has about 180 000 items so the application just died.

Second version

In second version I logically excluded from sort vertices that has max value distances.

This is done my maintaining enumerated (estimated) neighbour vertices having non max value distance in separate collection (mNextVertices)

then I just used this code to get next vertex to visit: mNextVertices.Values.SortBy(vertex => vertex.Cost).FirstOrDefault()

This was a huge step forward and I was able to search large paths in 2-5 seconds

Third version

The final version is what you see in this blog post now.

I decided that inserting into SortedList should be less expensive operation than sorting unsorted collection.

In second version I was inserting items into unsorted dictionary – and that operation was fast, but during the sort a lot of CPU time was lost.

In this version I insert into SortedList each iteration but I dont need to sort because SortedList is always sorted – I just can take first item)

The result was huge – I need just a fraction of a second to go through very large collection – it is about 0.1-0.2 seconds per largest searches in a country.

 

Creating CLR Stored Procedure for routing.

After I finished testing the algorithm I decided it will be nice to put this functionality into SQL Server SP. We have all data there – no need to move huge data collections between servers.

There are two potential ways to use the algorithm in procedure

  • Mark it as UNSAFE and let is cache initialized in-memory graph in static variable. Each search will be fast and reuse the graph. If you go that path please use lock to ensure only one thread search at one given moment. Graph objects are modified during search and one graph object cannot search in parallel.
  • Mark is as SAFE and don’t use static variable – just initialize graph once per search.This way we don’t need to lock access – for each call we create new graph.

I used first approach in my case, but for blog post I implemented SAFE SP which is more convenient for demonstration purposes. It is easier to deploy as well.

I created database project and added CLR SP

image

The SP code itself is simple

The SP will accept:

  • select statement – string which is the select SQL statement returning the below columns for whole graph
    • EdgeID
    • SourceVertexID
    • TargetVertexID
    • Cost
    • ReverseCost
  • source vertex ID
  • target vertex ID

For our table the SQL statement will be

image

But the main point here is – we can change SQL later without rebuilding the SP – we can change for example cost criteria and return distance to find shortest path but not fastest etc.

I modified Graph class a little so it can use the SQL statement to fetch the data

Here is added factory method

And that is all actually.

After that I created helper function which returns vertex ID from our routes table by provided Lat/Long pair.

When this is done the SQL wrapper SP for search looks like this:

To test is we simply can run this query:

image

 

To demonstrate the usage I created simple maps host application:

image

After you run it you will see this:

image

You can make very large path search inside the country:

image

I will provide a short description of how to run the sample.

First – download it here

Second – open solution file in Visual Studio 2012

Third (important) – set database project as startup one

image

Run

image

Important:

The database project is set up to create database named “Routing” to localhost – default instance of SQL server. Please make sure you don’t have already database with this name – please read this carefully as I am not responsible for potential damage you can make on your machine.

Next - if deployment succeeded make sample project as startup one

Now run the application.

When you first run it – the routing data will be imported into DB (localhost/Routing again)

This can take a minute or two but this is one-time operation – next time you run the app this will not need to be done again.

That’s all – you can now use app and see Dijkstra’s algorithm implementation in action :)

Happy coding :)

24 comments:

Anonymous said...

Thank you, Kirill! Great article!

Anonymous said...

Wonderful!

Anonymous said...

Great Job Kiril!

2 questions about the c# version tho -
How would you instantiate your graph and call it(usage sample)?
What changes would you make if the graph is undirected?

Michael

Kirill Chilingarashvili said...

Hi Michael

CLR stored procedure code is an example how to use this
If you don't want to experiment with DB CLR, just use that code in console app for example, but you will need to populate the graph differently


Unidirected graph - you will need to pass same value for Cost and ReverseCost for GraphDataItem

Ryan Bradbury said...

Hi Kirill,

Great blog! I was able to run your project and it worked great, very fast! I was hoping to recreate your project for a different map. However, I currently have all of the vertices and edges in a script but cannot convert it into the data format you have in your Routes table.

Any help would be greatly appreciated.

Thanks!

Kirill Chilingarashvili said...

Hi Ryan
Thanks

For this sample I went a little long preparation path
I installed pgsql, imported data using osm2pgrouting free tool , and then exported it to sql server

Kirill

Anonymous said...

Hi Kirill, My format in pgsqlserver was Geometry(LineString,4326).. How can I convert this so that it works in SqlServer?

Kirill Chilingarashvili said...

You can export the shape as string and then import the string straight into geography column

Anonymous said...

Hi Kirill,

Can you give me more details on how you convert the pglsql Geometry(LineString,4326) to Geometry in sql server? Your column is geom of type geometry

Kirill Chilingarashvili said...

You can use STGeomFromText to convert string into geometry instance

Anonymous said...

Hi Kiril,

When I exported from Pgsql, my string was in a format like this
0102000020E6100000020000003E20D099B4C152C0D828EB3713F94340C539EAE8B8C152C08466D7BD15F94340

I noticed this is different from your values because yours begin with 0xE.... and are much longer. I think this is what is giving me errors because sql server will not let me do any conversions with this string. Do you know why my string is different than yours?

Kirill Chilingarashvili said...

I think you cannot import that kind of string
you need to export geography data from pgSql in simila form to the following:
'LINESTRING(2323.2323,3434.34)'

Anonymous said...

Hi Kirill,

I was able to export the data into sql server but now i am trying to mimic your table structure and I do not have a cost. I have reverse_cost and to_cost but all of my to_cost have null values. Did you also experience this?

Anonymous said...

Hi Kirill

When I run this with my own OSM i keep getting return 0 on this code.

IF @s IS NULL OR @s = 0 OR @e IS NULl OR @e = 0

What can I do to fix it?


Kirill Chilingarashvili said...

Look into original table in pgsql
if there the cost is null too than you have bad OSM file or the area just does not have roads informaction..
I mean I had both cost and reverse in my file

about null-s
it seems that by some reason GetClosestVertex returns no vertex for your coordinates

if look into that function - you see that it searches only within small area
SET @Distance = 0.005
you can modify that or just provide coordinates which are near to road vertex

Anonymous said...

Hi Kirill,

Thank you for all your help so far, I think I am very close to a solution.

I decided to use length as my cost. I think this should work because it will give shortest distance.

For the GetClosestVertex, I changed it to 1.000, which is a value much higher than yours and it still is not finding a vertex. Could this be because I do not have a osm_source_id or osm_target_id column in my data? I am not sure how you got those columns.

I am trying to simulate your exact solution with the city of Philadelphia.osm

Kirill Chilingarashvili said...

No problem

Well I think the GetClosestVertex needs to be debugged than

You can debug with Management Studio on DB machine

Actually you already must have working solution - just you need to provide vertex IDs yourself
GetClosestVertex just returns the ID by Long/Lat

Can you post sample data - few lines and structure of your routes table?

I will look into sp and see what can be the issue

Anonymous said...

Kirill,

Your solution works for me just fine but when I try to make for myself with Philadelphia.osm, I have problems. This is what my table looks like

http://imgur.com/gwV9L6d

As you can see i'm missing Osm source & target ids, which i'm assuming are for the nodes.

Kirill Chilingarashvili said...

This is how simple the getClosestVertext is in my latest version

ALTER FUNCTION [dbo].[GetClosestVertex]
(
@Point geometry
)
RETURNS int
AS
BEGIN
SET @Point.STSrid = 4326

DECLARE @Result int
DECLARE @Distance real

SET @Distance = 0.005

DECLARE @X float, @Y float

SET @X = @Point.STX
SET @Y = @Point.STY

SELECT top 1
@Result = R.[source]
FROM
dbo.Routes R
WHERE
@Point.STDistance(R.geom) < @Distance
ORDER BY
@Point.STDistance(R.geom)

RETURN @Result
GO

so it is clear that geom objects cannot be found in the table by some reason

make sure this select returns geom closest to the provided point

also make sure you correctly imported geom instances - did you correctly use long/lat (my mistake which I keep repeating again and again is - I often need to swap lon/lat)
also make sure you pass point instance with correct lon lat

Erkin said...

Hi Kirill,

I exported the geom as linestring from pgsql and then imported it directly into SqlServer. When i look at spatial map i can see all of the routes and they look like the city of Philadelphia. I tried swapping lat/long and both returned 0 for me. This is what my locals look like when I debug.

http://imgur.com/7HG2Zoj





Anonymous said...

Hi Kirill,

When I do this with Philadelphia.osm I get the following errors:

Reference nd=2572374823 has no corresponding Node Entry (Maybe Node entry after Reference?)

When I try with Georgia.osm, I don't get these errors

Gadi Rosen said...

Hi Kiril

My name is Gadi, and your article is exactly what i need. I'm trying to build some Routes by giving the id's of 2 Junctions (not Vertices) and for now i don't need to build yhe shortest way. For that purpos i have to customize your SP and write it on Sql Server 2008. I have all the Data including the Segments, Junctions and the Length of each Segment in one Table. I need your help .

Gadi Rosen said...

Hi Kiril.

This is Gadi again.
To complete the Customization of your SP i need a Sample of the Table name: GetShortestPathForStatement (Content & Field's Type), and also the Function Code of fnGetDistance.
I'll be glad if you could send me those to my Email: Rosem.Gadi@gmail.com.

And again, you did a great Job by writing this Blog.

Thank you in advance, Gadi.

Kirill Chilingarashvili said...

Hi Gadi

I dont have the sample installed atm
you can install the sample- it creates all scripts needed