What Components Are Necessary When Geocoding a Batch of Addresses?

There is no historical record of when written addresses first came into use, but it was presumably long ago. Today, addresses, together with their latitude and longitude coordinates, are an essential part of our lives. Location-driven searches account for at least 20% of all traffic on the internet. Millions of addresses are geocoded daily, and the majority of those are processed with batch geocoding tools. However, only very experienced geocoders appreciate how address components are used during geocoding. Today, we will go over how best to choose the address components needed for well-executed batch geocoding.

There are many address standards around the world. The International postal components and template language (link) is one of them. There are also several academic guides, like this one (link to Frank’s Compulsive Guide), which define guidelines to help people and organizations see through the jungle of standardization. Ivan Assenov, founder of Scale Campaign, presented a paper called “Will Geocoding Decide the War Between Humans and AI” at BIG DATA IGNITE, which estimates that 40K to 50K address components worldwide are “in play”. That is a huge number of address components for humans to deal with. Geocoding projects therefore need to simplify these component combinations.

The approach we have taken within CSV2GEO is quite simple.

We allow users to input a virtually unlimited number of columns into our system when batch geocoding, for many good reasons. The primary reason is to let users process a single file containing addresses from multiple postal regions (countries, languages, etc.) at the same time, without worrying about how the map engine will identify them.

As an example, let us use three addresses to illustrate the best way to handle components when geocoding. For this example, we will target three art museums from three different regions:

  1. Museum of Art of São Paulo Assis Chateaubriand, with address:
    Av. Paulista, 1578 Bela Vista, São Paulo - SP, 01310-200, Brazil

  2. Louvre Museum, with address:
    Rue de Rivoli, 75001 Paris, France

  3. The Art Institute of Chicago, with address:
    111 S Michigan Ave, Chicago, IL 60603

Each of these postal regions includes its own unique set of traditional (required) components.

There are several approaches to this problem, but before we show them, the simplest answer to the main question is: use all of the components you have. Do not remove an address component because it does not fit with the rest of the data. A poor fit may mean that your data is inaccurate, and that address is what reveals the issue.

Note that the first address groups its parts into 6 tokens, while the second and third addresses have 3 tokens each. It would be most straightforward if all museum addresses were constructed from the same number of tokens, and that may be the case within a given project. However, that is not always the case, nor is it necessary, as we will see.
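
The token counts above can be reproduced with a short Python sketch. It assumes that tokens are separated by commas or by a hyphen with spaces on both sides; this rule is our illustration and is not necessarily how CSV2GEO splits input. The name `split_address` is hypothetical.

```python
import re

# Assumption: a token boundary is a comma or a spaced hyphen (" - ").
# A hyphen without surrounding spaces, as in the postal code 01310-200, is kept.
TOKEN_SEPARATOR = re.compile(r"\s*,\s*|\s+-\s+")


def split_address(address: str) -> list[str]:
    """Split a one-line address into trimmed, non-empty tokens."""
    return [t.strip() for t in TOKEN_SEPARATOR.split(address) if t.strip()]


ADDRESSES = [
    "Av. Paulista, 1578 Bela Vista, São Paulo - SP, 01310-200, Brazil",
    "Rue de Rivoli, 75001 Paris, France",
    "111 S Michigan Ave, Chicago, IL 60603",
]

# Number of tokens per address, in the same order as ADDRESSES.
TOKEN_COUNTS = [len(split_address(a)) for a in ADDRESSES]

if __name__ == "__main__":
    for address in ADDRESSES:
        tokens = split_address(address)
        print(len(tokens), tokens)
```

Under this rule the Brazilian address yields 6 tokens and the other two yield 3 each, which shows why a batch file may contain rows of different widths.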

Example #1: Let’s look at how to convert an address to latitude and longitude.
Go to csv2geo.com, then copy and paste the first address (the one from Brazil) into the active data input window.

Once you have pasted the data into the input box, click “Go”.

The results show the address split into columns, one token per column.

Next, designate which portion of the address each token contains. The system simplifies all parts into 6 categories: (street number), (street), (city or jurisdiction), (state or province), (zip code or postal code), (country or region). To make a selection, click the header of each column and select one or more items from the drop-down menu.

If a column of data does not belong to this standard formation of the address, just ignore it.
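
The designation step can be pictured as a mapping from columns to the six categories. The Python sketch below is only an illustration of that idea and is not CSV2GEO's internal code: the `Component` enum, the `assign_components` function, and the column letters are all hypothetical. It assumes a column may carry more than one category and that any column left unassigned is ignored.

```python
from enum import Enum


class Component(Enum):
    """The six categories named in the text."""
    STREET_NUMBER = "street number"
    STREET = "street"
    CITY = "city or jurisdiction"
    STATE = "state or province"
    POSTAL_CODE = "zip code or postal code"
    COUNTRY = "country or region"


def assign_components(row: dict[str, str],
                      column_roles: dict[str, list[Component]]) -> dict[Component, str]:
    """Group a row's values by category; columns without a role are ignored."""
    grouped: dict[Component, list[str]] = {}
    for column, value in row.items():
        if not value:
            continue  # skip empty cells
        for role in column_roles.get(column, []):
            grouped.setdefault(role, []).append(value)
    return {role: " ".join(values) for role, values in grouped.items()}


# The Chicago address, plus a name column that is not part of the address.
ROW = {
    "A": "111 S Michigan Ave",
    "B": "Chicago",
    "C": "IL 60603",
    "D": "The Art Institute of Chicago",
}

# Hypothetical header selections: columns A and C each hold two categories.
COLUMN_ROLES = {
    "A": [Component.STREET_NUMBER, Component.STREET],
    "B": [Component.CITY],
    "C": [Component.STATE, Component.POSTAL_CODE],
}

RESULT = assign_components(ROW, COLUMN_ROLES)
```

Column D has no entry in `COLUMN_ROLES`, so its value never reaches the result, which mirrors ignoring a column in the header menu.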

Next, click “Process”.

Your results should appear as a map.

Example #2: Now, let us try a batch file. Create a .csv file (for example, in Excel) with multiple (in this case, three) addresses. We then need to upload this file of multiple addresses as a “batch” to process.
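
If you prefer to build the batch file with a script rather than in Excel, a minimal Python sketch follows. It assumes one address per row and one token per column, with shorter rows padded with empty cells so that every row spans the same columns; the function name `to_csv_text` and the padding choice are ours, not a CSV2GEO requirement.

```python
import csv
import io

# One address per row, one token per column (no header row in this sketch).
ROWS = [
    ["Av. Paulista", "1578 Bela Vista", "São Paulo", "SP", "01310-200", "Brazil"],
    ["Rue de Rivoli", "75001 Paris", "France"],
    ["111 S Michigan Ave", "Chicago", "IL 60603"],
]


def to_csv_text(rows: list[list[str]]) -> str:
    """Serialize rows as CSV text, padding short rows to the widest row."""
    width = max(len(row) for row in rows)
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row + [""] * (width - len(row)))
    return buffer.getvalue()


CSV_TEXT = to_csv_text(ROWS)
```

Saving `CSV_TEXT` to a file with a `.csv` extension gives a three-row, six-column file corresponding to columns A through F.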

Once the .csv file is complete and saved on your local hard drive, upload it using the box marked “Browse or Drag & Drop .csv File Here” (shown as an orange box on the home page) and repeat the assignment of tokens in column headers A through F.

Next, click “Process” again to generate your results.

A preview of the first few addresses comes up, with a map with markers and a downloadable preliminary output file (preview).

For columns A through F, select the appropriate component type.

Once you have identified each component type (column header menu), click to process the data as we did in Example #1.

If the preview data looks reasonable and accurate (as expected), click the “Next Step -Get All Data” button to generate the complete batch-processed output file. In our example we used only three museums, but if you had used 300, the result would be similar. However, you would see only about 10 in the preview.

As you can see from the results, nothing got lost when geocoding the addresses.
 

In summary, when asked which components are necessary to geocode a batch of addresses, the answer should be: use all of them, as there is a way to fit all of them into the geocoder using geocoding tools like CSV2GEO.
