Summary
The Cascadia Clean Technology research project was undertaken to create a more holistic picture of the region's strengths and expertise in the green technology sector, to better understand sector concentrations, and to identify some of the many producers of goods/technology, services, and accelerators operating here. The representation evolved into a digital map for visual impact and ease of sorting. The map structure is intended to allow revisions, updates, and additions, reflecting the evolving nature of this sector and recognizing that the current representation is not exhaustive. The project was undertaken by a Virtual Student Federal Service intern team, guided by the U.S. Commercial Service.
Overview
The climate technology industry is new, complex, and growing. To support American exports from this growing industry, this team's goal was to create a map that makes it easier to find and learn about important companies in the Washington and Oregon climate technology space. We focused especially on companies that could be potential exporters from the United States.
To create an accurate map, we needed to determine what types of companies fell under the umbrella of green technology and how we could most usefully sort them. At first we considered using a classification system already created by a consulting group, database, and/or research paper. However, we soon determined that the few existing systems were either not comprehensive enough or not detailed enough for our purposes. Therefore, we created our own, based on research into the industry and with inspiration from other systems.
Our system is also intended to be useful for those searching for specific company types on the map, and it can be applied to green technology sectors across the United States. Our categories are:
- Energy
- Energy - Energy Storage
- Energy - Clean energy generation
- Energy - Clean energy generation - Solar
- Energy - Clean energy generation - Hydro
- Energy - Clean energy generation - Geothermal
- Energy - Clean energy generation - Nuclear
- Energy - Clean energy generation - Wind
- Energy - Clean energy generation - Tidal
- Energy - Clean energy generation - Biomass
- Financial Services
- Consulting Services
- Software/IT
- Transportation
- Pollution Management
- Pollution Monitoring (non-GHG)
- Low Carbon/Low Greenhouse Gas Emissions
- Manufacturing and Production Efficiency
- Clean Construction/Building
Process
To begin our research into the climate technology industry, we searched for information online. We found various definitions of climate technology, some more restrictive and others very general. We then searched for climate tech companies, research efforts, and industry organizations in Washington and Oregon, which gave us valuable information and leads. Online research gave us a good feel for the industry so we could further develop the map, but we wanted to dig deeper.
To further understand the climate tech industry, we sought to interview experts and leaders in Washington and Oregon, including thought leaders and speakers from climate technology focused webinars and events. We also used LinkedIn and other sources to find more industry experts to assist in our research. To set up informational interviews, we reached out to these individuals through cold emails.
From the responses to our emails, we set up informational interviews via Zoom, asking general questions about the industry as well as questions tailored to each interviewee's expertise. Examples include “What do you think is changing the most about climate tech currently?” and “What are the biggest challenges in the climate tech industry right now?” In addition to industry questions, we shared our plans for the map to learn how it might be used and what features interviewees would find useful.
The interviews revealed a wealth of information about climate tech that greatly improved our knowledge of the industry, specifically in Washington and Oregon. We learned about the next big things in climate tech, how companies are adapting to environmental sustainability, and what challenges the industry currently faces. Energy storage, offshore wind, zero-emission heavy-duty transportation, and sustainable construction were mentioned as some of the fastest-growing sectors in the Pacific Northwest. Buildings are responsible for about 40% of carbon emissions, and standards for clean construction keep getting stricter; combined with the unique construction needs of different building types, this makes clean building construction difficult to streamline. Additional industry challenges include a lack of proper government support and the risks associated with making large investments in the industry.
Hearing various opinions on our map was also a huge help for our project. The interviewees responded positively to the idea of the map and said it would be a valuable resource. They also proposed different uses for the map that helped us understand who our users are likely to be. One interviewee liked the idea of using the map to find funding agencies to aid climate technology research and the development of new products. Another said it would be useful for finding manufacturers to support supply chain efforts, as finding sustainable manufacturers and contractors can be difficult.
Our research through internet resources and interviews advanced the project and improved our knowledge of the climate tech industry. By understanding the different forces and players within the industry, we could better tailor our map to potential users. We now understand that climate technology covers more sectors than we thought and that our map will have many practical uses.
Program Sets and Data Sources
Data sources:
A variety of company databases were used to construct our lists, including, but not limited to, Orbis, Hoovers, Apollo, USASpending, and Crunchbase, supplemented by simple web searching.
Filtering:
The larger datasets pulled from Orbis, USASpending, etc., which contained a high percentage of entities that did not meet our criteria, were initially filtered using scoring/validation programs to identify the entities most likely to be good fits. Lists of entities that had received federal funding in the form of grants and contracts (pulled from USASpending) and Salesforce lists that were already pre-filtered as cleantech-oriented needed minimal manual filtering. However, the lists pulled from Orbis came with minimal company descriptions and only very broad initial filtering by industry codes, so extensive manual checking was needed on top of the automated scoring.
Attribute-Finding:
The vast majority of attributes (contact information, address, etc.) could not be gathered automatically, due both to the absence of a single reference for such information and to the often outdated state of available sources. Finding information on entities was therefore done mainly by hand; details are captured in the following write-up.
Program Sets:
Program Set 1: Link 1 Link 2
Purpose: Extract keywords from company pages, given an Excel file of company URLs pulled from company databases, and attempt to use those keywords to narrow our potential “wanted” list.
Language: Python.
Notable Libraries: pandas (read in Excel sheet), requests (to query the API), xlwt (to write output to another Excel workbook).
Process: Used the TextRazor Natural Language Processing API to classify each company’s homepage by a predefined set of entities and topics.
Result: Not used. Database-provided classification (SIC) was used to narrow company lists.
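As a rough illustration of this kind of classification pass, a minimal sketch follows; the file names, the "URL" column, the API key placeholder, and the choice to keep the top five topic labels are assumptions rather than the project's exact code.

```python
# Minimal sketch (not the project's exact code): classify each company's
# homepage with the TextRazor REST API and write the top topics back out.
# API_KEY, file names, and the "URL" column name are placeholders/assumptions.
import pandas as pd
import requests

API_KEY = "YOUR_TEXTRAZOR_KEY"                 # placeholder
companies = pd.read_excel("companies.xlsx")    # assumed input with a "URL" column

rows = []
for url in companies["URL"].dropna():
    # Ask TextRazor to fetch the page itself and return entities and topics.
    resp = requests.post(
        "https://api.textrazor.com",
        headers={"x-textrazor-key": API_KEY},
        data={"url": url, "extractors": "entities,topics"},
        timeout=30,
    )
    if resp.status_code != 200:
        rows.append({"URL": url, "topics": ""})
        continue
    topics = resp.json().get("response", {}).get("topics", [])
    # Keep the five highest-scoring topic labels as a rough keyword summary.
    top = sorted(topics, key=lambda t: t.get("score", 0), reverse=True)[:5]
    rows.append({"URL": url, "topics": "; ".join(t.get("label", "") for t in top)})

pd.DataFrame(rows).to_excel("company_topics.xlsx", index=False)
```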
Program Set 2: Link 1 Link 2
Purpose: Help to narrow down company lists by assigning each company a “greenness score.”
Language: Python.
Notable Libraries:
- Selenium. This library gives full programmatic control of a typical browser when paired with a “webdriver” file. Firefox was chosen since it comes pre-installed on most Linux distributions. Execution times were dramatically cut by disabling unnecessary browser features, JavaScript, and images, installing ad-blocker extensions, and setting a timeout for unresponsive/nonexistent websites so that time would not be wasted waiting for an error.
- BeautifulSoup (used in older versions of the program). A raw HTML parser library that allows easy extraction of content embedded within HTML/XML files. Although much faster than Selenium, as there is no browser overhead, it cannot handle dynamic content.
- pandas and xlwt, as previously mentioned.
- PyAutoGUI. Allows programmatic control of the keyboard and mouse; used to highlight and copy text on pages.
- re. Provides regex matching of keywords.
Result: The program was run on both the Oregon and Washington lists. A headless (no GUI required) version was made for easy deployment on a personal server. Processing capability was estimated at ~20k companies per day, spread across virtual machines, physical machines, and servers.
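A minimal sketch of the scoring approach is below; the keyword pattern, column names, and simple count-based score are illustrative assumptions, not the project's actual terms or weights.

```python
# Minimal sketch: fetch each company's homepage with headless Selenium/Firefox
# and score it by counting cleantech keyword matches. The keyword list, the
# "Website" column, and file names are assumptions for illustration.
import re
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

GREEN_TERMS = re.compile(
    r"\b(solar|wind|geothermal|hydro\w*|battery|energy storage|carbon|"
    r"emission\w*|renewable|recycl\w*|sustainab\w*)\b",
    re.IGNORECASE,
)

options = Options()
options.add_argument("--headless")        # headless mode for server deployment
driver = webdriver.Firefox(options=options)
driver.set_page_load_timeout(20)          # don't wait forever on dead sites

companies = pd.read_excel("wa_companies.xlsx")   # assumed "Website" column
scores = []
for url in companies["Website"].fillna(""):
    score = 0
    if url:
        try:
            driver.get(url)
            text = driver.find_element("tag name", "body").text
            score = len(GREEN_TERMS.findall(text))   # crude "greenness score"
        except Exception:
            score = 0      # unreachable/invalid site counts as no evidence
    scores.append(score)

companies["greenness_score"] = scores
companies.to_excel("wa_companies_scored.xlsx", index=False)
driver.quit()
```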
In addition to the company databases, at this point I also pulled lists of Oregon and Washington entities that had received federal grants/contracts from the USASpending website. These lists were enormous, running into the millions of rows and dwarfing even the Washington company lists previously pulled from the databases (~113k).
Program Set 3: Link
Purpose: Help narrow down USASpending lists by applying regex terms found through Grace’s research.
Language: Python.
Notable Libraries: re, again for regex matching.
Process: Assign greenness score using the same method as in Program Set 2, but this time based on award descriptions provided by USASpending for each grant/contract awarded to each entity.
Result: Much faster than Program Set 2, since it did not require any web functionality. All lists were successfully processed shortly after the fourth meeting.
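A minimal sketch of this offline scoring pass follows; the column names ("Recipient Name", "Award Description") and keyword terms are illustrative assumptions rather than the actual regex list from Grace's research.

```python
# Minimal sketch: score USASpending award descriptions offline with a regex
# and aggregate the score per recipient. Column names and terms are assumed.
import re
import pandas as pd

GREEN_TERMS = re.compile(r"\b(solar|wind|battery|carbon|renewable|emission\w*)\b", re.IGNORECASE)

awards = pd.read_csv("usaspending_wa.csv")       # assumed USASpending export
awards["greenness"] = awards["Award Description"].fillna("").map(
    lambda desc: len(GREEN_TERMS.findall(desc))
)

# One row per entity: total score across all of its grants/contracts.
per_entity = awards.groupby("Recipient Name", as_index=False)["greenness"].sum()
per_entity.sort_values("greenness", ascending=False).to_excel(
    "usaspending_scored.xlsx", index=False
)
```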
Program Set 4: Link
Purpose: Validate email addresses from a Salesforce data pull using MX records.
Language: Python, Excel VBA.
Process: Email addresses were read in one by one, the domain was stripped from each address, and a DNS lookup was run on that domain. If there was no matching DNS record and/or no MX record (the type of DNS record that designates a domain's mail servers), the listing was marked as invalid.
Result: Used successfully. An Excel VBA version was created (~Meeting 16) to better conform to Department IT policy in case it was deployed internally.
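A minimal sketch of MX-based validation using the dnspython package follows (the project may have used a different DNS library; the file name and "Email" column are placeholders).

```python
# Minimal sketch: an email address is kept only if its domain resolves and
# has at least one MX record. Uses dnspython; column/file names are assumed.
import dns.exception
import dns.resolver
import pandas as pd

def has_mx(domain: str) -> bool:
    """True if the domain has at least one MX record."""
    try:
        dns.resolver.resolve(domain, "MX")
        return True
    except dns.exception.DNSException:   # NXDOMAIN, NoAnswer, timeouts, etc.
        return False

contacts = pd.read_excel("salesforce_pull.xlsx")   # assumed "Email" column
contacts["email_valid"] = contacts["Email"].fillna("").map(
    lambda e: "@" in e and has_mx(e.rsplit("@", 1)[1])
)
contacts.to_excel("salesforce_validated.xlsx", index=False)
```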
As Tableau and other solutions were proving to be too complex or too feature-limited to display geographic data with extensive filtering capabilities, I began exploring other types of maps, like cluster and tree maps.
Program Set 5: Link
Purpose: Automate scraping of NAICS codes from MailingLists NAICS Lookup given company name, city, and state.
Language: Python.
Notable Libraries: Selenium, pandas (now used for both reading in the Excel file and writing the output), PyAutoGUI (again, to copy all text on the results page), re (to pare down the results-page text and extract the NAICS code).
Process: Selenium drives Firefox to the MailingLists NAICS Lookup page; the company name and city are read from the Excel sheet and used to fill out the website form. The results page is scraped, and the NAICS code is extracted using re and written to the output Excel file.
Result: MailingLists changed their website, making this program obsolete.
Program Set 6: Link 1 Link 2
Purpose: Query Nominatim and Google Maps geocoding APIs for coordinates from addresses present in company lists.
Language: Python.
Notable Libraries: geopy (library that drastically simplifies the use of multiple geocoding APIs).
Process: Addresses were read from the Excel file and the Nominatim API was queried first. If Nominatim could not find a valid coordinate (usually ~30% of the time), the Google Maps geocoding API (v3) was used for the remainder. Google Maps was not used for the entire list because of its monthly quota (40k queries), while Nominatim, though less robust, is entirely free.
Result: Run along with Program Set 5 just prior to final mapping.
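A minimal sketch of the Nominatim-first, Google-fallback geocoding flow using geopy follows; the file names, the "Address" column, the user-agent string, and the API key are placeholders.

```python
# Minimal sketch: geocode each address with Nominatim, falling back to the
# Google Maps geocoder only when Nominatim fails. Names/keys are placeholders.
import time
import pandas as pd
from geopy.geocoders import GoogleV3, Nominatim

nominatim = Nominatim(user_agent="cascadia-cleantech-map")   # placeholder user agent
google = GoogleV3(api_key="YOUR_GOOGLE_MAPS_KEY")            # placeholder key

def geocode(address: str):
    """Return (lat, lon) or (None, None); Google is only queried when Nominatim fails."""
    try:
        loc = nominatim.geocode(address, timeout=10)
    except Exception:
        loc = None
    if loc is None:
        try:
            loc = google.geocode(address, timeout=10)
        except Exception:
            loc = None
    time.sleep(1)   # stay within Nominatim's ~1 request/second usage policy
    return (loc.latitude, loc.longitude) if loc else (None, None)

companies = pd.read_excel("final_company_list.xlsx")         # assumed "Address" column
companies[["lat", "lon"]] = companies["Address"].apply(lambda a: pd.Series(geocode(a)))
companies.to_excel("final_company_list_geocoded.xlsx", index=False)
```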