Parsing Locations

Grab the OSM database from https://planet.openstreetmap.org/. Right now it's about 159 GB of compressed data, so let’s narrow this down. I’m going to just pull down Austin, TX, which is 153 MB of gzipped data and expands into 2.1 GB of XML.

Convert that to PBF, which brings the data size down to 67 MB.

osmium cat austin.osm.xml -o austin.pbf

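Now extract the named highways, export them as GeoJSON, and flatten the name tags into a TSV. Note that osmium tags-filter ORs its expressions, so passing w/highway and w/name together would keep every way that has either tag (named buildings, unnamed roads, and so on). Filter in two passes instead:

osmium tags-filter austin.pbf \
  w/highway \
  -o austin.highways.pbf

osmium tags-filter austin.highways.pbf \
  w/name \
  -o austin.named-highways.pbf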

osmium export austin.named-highways.pbf \
  -u type_id \
  --geometry-types=linestring \
  -f geojson \
  -o streets.geojson

jq -r '
  ["id","name","alt_name","official_name","short_name","loc_name","old_name","ref","highway"],
  (.features[] | [
    (.id // ""),
    (.properties.name // ""),
    (.properties.alt_name // ""),
    (.properties.official_name // ""),
    (.properties.short_name // ""),
    (.properties.loc_name // ""),
    (.properties.old_name // ""),
    (.properties.ref // ""),
    (.properties.highway // "")
  ])
  | @tsv
' streets.geojson > street_gazetteer.tsv

OK, with this pipeline we have a nice list of roads. We still need to dedupe the data, among other cleanup. Generally speaking, new roads aren’t added very often, but new slang and nicknames are, and that will be a concern. Still, the basic idea is sound.
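The dedupe step could look something like the sketch below. It is only one way to do it, and it assumes that ways sharing a normalized name inside a single city extract are segments of the same street, which will not always hold. The function names (`normalize`, `dedupe_streets`) are made up for illustration; the column names come from the TSV header above.

```python
import csv


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so spelling variants share one key."""
    return " ".join(name.casefold().split())


def dedupe_streets(tsv_file):
    """Merge gazetteer rows that share a normalized name.

    Assumption: same-named ways in one city extract are one street.
    Returns one dict per street with the first-seen spelling, every OSM id
    that carried the name, and the highway classes seen for it.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    merged = {}
    for row in reader:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # rows without a primary name cannot be looked up by name
        entry = merged.setdefault(
            normalize(name), {"name": name, "osm_ids": [], "highway": []}
        )
        entry["osm_ids"].append(row["id"])
        highway = row.get("highway") or ""
        if highway and highway not in entry["highway"]:
            entry["highway"].append(highway)
    return list(merged.values())
```

Calling `dedupe_streets(open("street_gazetteer.tsv", newline="", encoding="utf-8"))` would collapse the many way segments of a long road into a single entry that keeps all of their ids.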

With this data in hand, we now need to build the gazetteer index to store it.

Index name: gazetteer-street-v1, aliased to gazetteer-street.

Fields:

id: our internal id (a uuid?)
osm_way_id: the OSM id
name: the primary name
aliases[]: alt_name/old_name/official_name/short_name/loc_name, split into an array
highway: the OSM highway class
location: geo_shape linestring or geo_point centroid
bbox: geo_shape envelope or 4 doubles
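As a rough sketch of how those fields might translate into an index mapping and a document, here is one possibility. The field types are my guesses rather than a settled design: `MAPPING`, `ALIAS_COLUMNS`, and `row_to_doc` are hypothetical names, the internal id is assumed to be a random UUID, and the alias columns are assumed to use the OSM convention of `;` between multiple values. Geometry is left out of the document because the TSV above does not carry it.

```python
import uuid

# TSV columns that get folded into the aliases array.
ALIAS_COLUMNS = ("alt_name", "official_name", "short_name", "loc_name", "old_name")

# Hypothetical mapping body for gazetteer-street-v1. Elasticsearch needs no
# special array type, so aliases is an ordinary text field holding many values.
MAPPING = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "osm_way_id": {"type": "keyword"},  # keyword: the exported id may not be a plain number
            "name": {"type": "text"},
            "aliases": {"type": "text"},
            "highway": {"type": "keyword"},
            "location": {"type": "geo_shape"},
            "bbox": {"type": "geo_shape"},
        }
    }
}


def row_to_doc(row: dict) -> dict:
    """Turn one TSV row (as a dict) into a gazetteer document, without geometry."""
    aliases = []
    for column in ALIAS_COLUMNS:
        # Assumption: multi-valued OSM tags are separated by ';'.
        for value in (row.get(column) or "").split(";"):
            value = value.strip()
            if value and value not in aliases:
                aliases.append(value)
    return {
        "id": str(uuid.uuid4()),  # assumed: internal id is a random UUID
        "osm_way_id": row["id"],
        "name": row["name"],
        "aliases": aliases,
        "highway": row.get("highway") or "",
    }
```

Pointing queries at the gazetteer-street alias rather than the versioned index means a later gazetteer-street-v2 can be built alongside and swapped in without touching the query side.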

The Elasticsearch query looks like this:

GET gazetteer-street/_search
{
  "size": 5,
  "query": {
    "multi_match": {
      "query": "south congress",
      "fields": [
        "name^3",
        "aliases^2"
      ],
      "type": "best_fields",
      "operator": "and"
    }
  }
}

Now the trick is to produce the spans. Given a query like this:

“ai engineers in south congress austin tx”

I really want to find “South Congress”.
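One possible way to produce spans is a greedy longest-match scan over word n-grams, sketched below. To keep it self-contained it swaps the Elasticsearch lookup for an in-memory set, and it assumes that set (`known_names`, a hypothetical input) already holds case-folded names and aliases, including short forms like "south congress". `find_spans` and `max_words` are illustrative names, not part of any library.

```python
def find_spans(text, known_names, max_words=4):
    """Greedy longest-match scan over word n-grams.

    Returns a list of (start_word, end_word, matched_text) tuples, where
    end_word is exclusive. At each position the longest candidate found in
    known_names wins, and matched words are not reused.
    """
    words = text.casefold().split()
    spans = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest window first so "south congress" beats "congress".
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in known_names:
                match = (i, i + n, candidate)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans
```

With `known_names = {"south congress", "congress"}`, the example query yields a single span covering words 3 to 5. In a real setup each candidate window would presumably be checked against the gazetteer index instead of a set, which costs one lookup per window, so `max_words` is worth keeping small.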