The Trials and Tribulations of Open Data – How We Captured It and How You Can Too

Many states have recently passed legislation requiring its department(s) to maintain and refresh open datasets for public consumption in a business and not for profit capacity. This is a great opportunity for businesses to understand the local landscape of competitors and consumers within their local jurisdictions. For instance, if you were a growing business and wanted to target new territories to market your business, wouldn’t it be helpful to know the geographic regions with the highest density of your target market/consumers? Having open access to leverage and analyze these public datasets can make these otherwise challenging decisions much more efficient and effective.

While it is becoming easier to access open data, there is still the challenge of obtaining public data and incorporating it into your organization’s internal, proprietary data. SC&H recently leveraged Maryland’s Department of Information Technology Open Data Portal [(] to transform Maryland’s State Department of Taxation & Assessment (SDAT) raw data into a dashboard, allowing viewers to see a geospatial market map by county with links directly to Maryland’s Real Estate Property Data Search.  This made it WAY easier to consume the large SDAT dataset of over 3 million records and glean trends and meaning from it.

SC&H utilized a ‘Data Warehouse’-style Pipeline to process external data using the following steps:

  1. Ingestion – Collecting public data using comma-delimited file formats from Maryland’s Open Data Portal, combining & normalizing data into one central location, Microsoft SQL Server.
  2. Data Warehouse Transformation – Raw data is messy; it must be transformed from a large flat file structure, then stage-like objects in SQL Server, to finally a clean presentation reporting object.
  3. Classification – Data without hierarchical categories is hard to interpret; we provide its true value by analyzing & classifying data into groups for easier understanding. Classifications include geospatial hierarchies, land use types, and property structures.
  4. Visualization – Data in its tabular format is meaningless; we provide even more meaning by analyzing and visualizing it for consumers to gain insights.

How did we do this? At a high level, we performed the following steps:

  1. Public Data Extraction: Extracted large comma-delimited files directly from Maryland’s Open Data Portal, supported by Socrata. We saved these files locally for further transformation.
  2. SQL Server Data Ingestion: Imported comma-delimited files locally using data import feature in SQL Server, using a large variable string data type to ensure successful loading of all records.
  3. Stage Layer: To ensure we maintain complete data integrity and no historical records are removed, used a stage object layer to cleanse data prior to consumer view; applied error handling in this step using SQL Server stored procedures and triggered functions.
  4. Transformation (Store Layer): Assigned appropriate data types and transformed data in this step before we push data to a final presentation layer.
  5. Presentation Layer: This is the cleanest and most optimal layer for consumers to use data for reporting & analytics; the final layer is using a view object within SQL Server.
  6. Visualize in BI Tool: Prepared advanced visualizations in Tableau

Want to see the end visualization we built as a result?  Here’s a sample dashboard SC&H created, demonstrating a map containing county details with a drill down feature to specific property details:

*You can see above the complete map of Maryland with a density of total property value displayed as a color; when you click on the county, it will take you to a detailed look at cities with a property map drill down*

In conclusion, with the right strategy and thoughtful cleansing technique, you can leverage large, messy, & disparate datasets in conjunction with your existing internal datasets, enhancing your organization’s look into the local and state level landscape.

While it may seem challenging to get your organization leveraging all this data, bringing in a partner like SC&H to simplify the path to data-driven decisions can make the process easier, and ultimately deliver valuable results for your business stakeholders. Get in touch with our team and learn more.