How to Geocode a Large File?
From time to time, you may need to geocode an exceptionally large file. This isn’t as complicated as you may think, but it still requires careful work. Processing a huge batch file of addresses, for example hundreds of thousands or even millions of them, can be a challenge. In this case, file size and the clean, efficient management of the data input and output matter a great deal.
Different geocoding vendors take different approaches when it comes to geocoding or reverse geocoding large files. Some take an easy approach that basically lowers the bar on accuracy to produce some type of answer quickly. CSV2GEO takes a different approach. We believe quality must be consistently high regardless of the quantity of data processed, and we always strive to deliver exceptionally accurate results.
In this tutorial we will be using our CSV2GEO tool (app). If you are using a different app that you are more familiar with, that’s not a problem. The following processing steps need to be performed and checked when preparing your .csv input data file regardless of the tool. When you use an input file with a large number of rows, the resulting output format will look similar. CSV2GEO supports the postal address formats of almost all countries from most regions around the world (rare exceptions include North Korea, South Korea, Japan, etc.).
Step 1.) Construct a UTF-8 formatted input file*. Make sure your data input file uses 8-bit Unicode Transformation Format (UTF-8) encoding. A very detailed explanation of how to prepare your UTF-8 file for geocoding is included in this link.
*Note: this is a critically important processing step. Many users assume ASCII format is fine, and it will deliver some results, but they don’t realize the limitations of ASCII: it cannot represent accented letters or non-Latin scripts, both of which are common in international addresses. That is where UTF-8 is critical. UTF-8 is well suited to batch geocoding, and map engines handle it well.
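If you are unsure whether your file is already UTF-8, a short script can check it and re-encode it if needed. The following is a minimal Python sketch and not part of CSV2GEO. The function names are our own, and the default source encoding (Windows-1252, a common default for files exported on Western-language Windows systems) is an assumption that you should replace with the real encoding of your file.

```python
def is_valid_utf8(path):
    """Return True if the whole file decodes cleanly as UTF-8."""
    try:
        with open(path, "r", encoding="utf-8", newline="") as handle:
            for _ in handle:  # stream line by line so large files fit in memory
                pass
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode src (assumed to be in source_encoding) into a new UTF-8 file dst."""
    with open(src, "r", encoding=source_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

A typical use would be to call is_valid_utf8("addresses.csv") first and to run convert_to_utf8 only when it returns False. Writing to a new file keeps the original untouched in case the assumed source encoding turns out to be wrong.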
Step 2.) Know the price up front. This step can be critical depending on your budget, which is usually a function of the value of the final results (relevance and accuracy). CSV2GEO is an accurate, high-quality batch geocoder app that we have made available to provide extremely relevant output. We aspire to provide the best quality results for the cost. We are so certain you will be happy with the results of CSV2GEO that we offer free introductory trials as well as affordable subscription fees with very flexible payment options:
Pay as you go plan. With this option the user can calculate the cost in advance from the number of data rows to be processed. Registration and payment can be done on the fly during a batch geocoding session, or the user can choose to register in advance. CSV2GEO does not store credit card data; we rely only on secure payment services maintained by PayPal and/or Stripe, etc. We support payment using Visa, MasterCard, and American Express, in addition to PayPal.
Monthly basic subscription plan. If the customer or organization anticipates an ongoing, regular need to process geocoded data, they may choose to subscribe to a monthly plan. We also provide a price calculator that can be used to help estimate your best price plan. If your monthly subscription runs out of credits, do not worry. Next month’s user credits can automatically kick in.
Multiple users designated by the subscriber account can use the same monthly subscription. This is handy for large organizations where different departments and multiple users can use the CSV2GEO app on their own time. Interactive maps can also be created automatically from the end results and manipulated/filtered by multiple users. We can also arrange to send monthly invoices automatically to multiple designated users based on usage.
Subscribe with Application Programming Interface (API). With this option users can purchase the right to use the API. This option has all the features of the monthly subscription and also includes interactive maps built from the data output of the API batch request.
Purchase CSV2GEO user credits. Sometimes users/organizations do not want to buy monthly subscriptions and prefer to purchase volume credits instead. That approach works well if you know that your usage is inconsistent, with high and low usage periods. If you only occasionally need to process an exceptionally large batch file with many rows of address data, it may be more beneficial to purchase credits in advance instead of subscribing monthly. If this option sounds reasonable for your needs, you can simply purchase an estimated total for the quantity of rows you will need and use those credits when appropriate. Multiple users will still be able to share credits per the customer’s needs.
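Because the plans above are sized by the number of rows, it helps to count the rows in your file before choosing one. The sketch below is an illustration only: the function names are our own, and the flat per-row rate is a simplifying assumption that may not match actual pricing, so take the rate from the price calculator or price list you are using.

```python
import csv


def count_data_rows(path, has_header=True):
    """Count CSV records; the csv module counts a quoted multi-line field as one row."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        rows = sum(1 for _ in csv.reader(handle))
    if has_header and rows > 0:
        rows -= 1  # do not count the header line as an address
    return rows


def estimate_cost(row_count, price_per_row):
    """Multiply the row count by a per-row rate supplied by the caller (assumed flat)."""
    return row_count * price_per_row
```

Counting with the csv module rather than counting raw lines matters when an address field contains a line break inside quotes, since a plain line count would overstate the number of rows.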
Step 3.) Prepare a pilot file to test. We do not recommend geocoding a large file immediately without first testing on a small sample, especially if you are doing this type of processing for the first time. To proceed, copy part of the file into a separate “.csv” file that preserves the structure and format of the original. For example, include only the first 50 rows and the last 50 rows from your large file. This way, you preserve the file structure and have a somewhat randomized sample for a batch geocode test. You can refer to this tutorial on how to process a batch of addresses into latitude and longitude for free.
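The pilot file described above can be produced with a few lines of Python. This is a minimal sketch under two assumptions: the input file is UTF-8 and its first line is a header row. The function name write_pilot_file is our own and not part of any tool.

```python
import csv
from collections import deque


def write_pilot_file(src, dst, head=50, tail=50):
    """Copy the header plus the first `head` and last `tail` data rows of src into dst."""
    with open(src, "r", encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return 0  # empty input, nothing to copy
        writer.writerow(header)
        written = 0
        last_rows = deque(maxlen=tail)  # keeps only the most recent `tail` rows
        for index, row in enumerate(reader):
            if index < head:
                writer.writerow(row)
                written += 1
            else:
                last_rows.append(row)
        writer.writerows(last_rows)
        return written + len(last_rows)
```

The file is streamed row by row, so even a file with millions of rows never has to fit in memory. If the input has fewer than 100 data rows, every row is copied exactly once.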
After the initial test file run is complete, examine the output file carefully for the following:
a.) Do all rows return results? If each row had a proper address as input, then each row should generate a proper result in return. If for some reason one or more rows do not deliver a legitimate latitude and longitude coordinate, you must examine those rows more closely. Check that no address has shifted into a non-address column and that each part of the address ended up in the appropriate address column. Also check that the input addresses were not simply entered incorrectly. Often Customer Relationship Management (CRM) data is imported/exported with addresses from an old system, and errors can (and do) happen.
b.) Check the relevance of the output data. Is the relevance at or near 1.00? Relevance is a major factor indicating how reliably the result matches the original input. For example, if an address is 3220 Moon Street, but no 3220 Moon Street exists while 3218 Moon Street and 3222 Moon Street do, the mapping engine will likely report a relevance of less than 1.00 for this address line. In this case, it could be something like 0.90 or 0.95, depending on how inconsistent the other addresses on Moon Street appear to be. As you can imagine, there are some very peculiar addresses found in certain neighborhoods (sometimes known only to the local mailman and the people living there).
c.) Examine the structure of the output file. Does the output file look right/reasonable/expected? At CSV2GEO, we do not alter the actual input data file, but we DO include an output data clone of the original input with each address matched as close to a relevance of 1.00 as exists. That way the user can simply import this new data into their system as corrected, legitimized data.
In addition, the latitude and longitude that come out of CSV2GEO are not projected. They are geographic coordinates that describe positions on a round globe rather than on a flat map. If, for example, you import the data into Quantum Geographic Information System (QGIS), then World Geodetic System 1984 (WGS 84) is the coordinate reference system to select for the latitude and longitude.
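Checks a.) and b.) above, together with a basic sanity check on the coordinates, can be automated for the pilot output. The sketch below is hedged: the column names latitude, longitude and relevance are hypothetical placeholders, so adjust them to the headers in your actual output file. The function name audit_output is also our own.

```python
import csv


def audit_output(path, lat_col="latitude", lon_col="longitude",
                 relevance_col="relevance", min_relevance=1.0):
    """Return data-row numbers that need manual review, grouped by problem type."""
    problems = {"missing": [], "out_of_range": [], "low_relevance": []}
    with open(path, "r", encoding="utf-8", newline="") as handle:
        for number, row in enumerate(csv.DictReader(handle), start=1):
            try:
                lat = float(row[lat_col])
                lon = float(row[lon_col])
            except (KeyError, TypeError, ValueError):
                problems["missing"].append(number)  # no usable coordinate
                continue
            # Unprojected latitude/longitude must fall inside these bounds.
            if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
                problems["out_of_range"].append(number)
            try:
                relevance = float(row[relevance_col])
            except (KeyError, TypeError, ValueError):
                relevance = 0.0  # treat an unreadable score as needing review
            if relevance < min_relevance:
                problems["low_relevance"].append(number)
    return problems
```

Row numbers start at 1 for the first data row after the header. Lowering min_relevance to, say, 0.9 narrows the review list to the rows that deviate most from the input address.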
Step 4.) Once the test run has completed successfully, you are ready for the full run! At this point, if you are confident processing the data file, go for it. If not, that is understandable. Perhaps the technology requires time and concentration you just can’t provide, or you feel you just don’t have time to wrestle with and learn this for a “seldom-used” occasion. Please feel free to contact us so we can help you take full advantage of the power of the CSV2GEO app. We can either run the service for you or train you and your staff at reasonable and affordable rates. We would love to see our customers become efficient self-users. But if the reality is that you are only going to need this data service every so often, then getting our help may make more sense for you than committing time and resources to learning it. We still feel it is valuable that you understand how the system functions, even if you choose to have us do it for you. We believe well-informed customers make good customers, and we are committed to helping you become confident in your choice.
Step 5.) Let’s geocode that actual large file. Remember, it takes time to load an exceptionally large file into any geocoding system, and CSV2GEO is no different. We run many validations to prevent issues and errors during the data transformation run. You will still get a preview of the first ten rows as part of the app’s workflow. Take time again to examine these carefully. If you have any doubt at all, please contact us through this link, or you can try to reach us on live chat so we can guide you through the process.
Eventually, the system will display a progress bar showing the geocoding progress as a percentage complete. The time needed to process thousands or millions of rows with CSV2GEO will vary based on the complexity of the input addresses and the available worldwide data sources. Under normal circumstances (in countries with well-established postal systems) you can anticipate approximately 120 minutes for an input data file of 1,000,000 rows.
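For planning purposes, the figure above can be turned into a rough estimate for other file sizes. The sketch assumes that processing time scales linearly with the number of rows, which is our simplification; as noted, real run times vary with address complexity and data sources.

```python
def estimate_minutes(row_count, minutes_per_million=120):
    """Rough run time in minutes, assuming time grows linearly with row count."""
    return row_count / 1_000_000 * minutes_per_million
```

Under that assumption, a file of 250,000 rows would take roughly 30 minutes.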
When done, you can simply click download and a large geocoded file in the form of a .csv file will be generated for you. You can return at a later time and download your file again, as we keep access to it (within a year of run time). In case you are worried about confidential information, we are GDPR compliant and always work in compliance with the USA’s HIPAA guidelines. At your request and with your consent, we will delete your data task from our system based on your data life management requirements. To make that request, just go into your work history and click settings to make the appropriate selection.
Scroll to the delete tasks section, enable it and delete any or all tasks you desire. If all tasks are deleted, you can even permanently delete your entire account. Please be sure you understand this completely before selecting that option, since there is no turning back: your data will then be permanently deleted at your request, with no method of recovery from our secured system servers.
Relevant articles that may help you when you try to geocode a large file: