The Grouparoo Blog
The data source is the location of the data that a processing function will consume. This can be the point of origin of the data, the place of its creation. Alternatively, it can be data generated by another process and then made available for subsequent processing. The source data may therefore be raw, unfiltered, and unrefined, or polished and fully formed. In this article, we'll look at the definition of data sources and their general types.
The foundation of data processing is the collection and transformation of data. Data represents information in its unprocessed form: a group of unorganized facts and figures available for processing for a specific purpose. Data sources can be internal or external, primary or secondary. The critical factor for any data source is providing the data consumer with a standardized access method.
- Internal Data Sources are those within and under the control of the organization performing the processing. External Data Sources are those outside the organization performing the processing. Service agreements may cover an external source's data quality, latency, and availability, but the source itself remains outside the organization's control.
- Primary Data Sources are those where data is collected at its point of creation, before any processing. Conversely, Secondary Data Sources are those where data is collected after some form of processing has already occurred, so the quality and validity of the data depend directly on that upstream processing.
Data source examples include sales figures from retail outlets, stock levels from warehousing facilities, sales forecast figures from the sales team, and cash flow data from the finance team. These examples span internal and external sources, both primary and secondary.
There is a range of distinct types of data sources; the more commonly used types are:
Databases are now a ubiquitous part of business processes for managing data. A database source is a structured pool of data to which a database management system grants access on request, either in whole or in part, to the data processor. The structure of a database tends to depend on each vendor's proprietary implementation, though for data processing purposes, the database's internal structure typically has limited impact on processing functions.
The benefit of databases over other data storage techniques is their strong access controls, which protect data integrity where multiple data processors access the data concurrently.
A database can hold a copy of the data, fulfilling the role of data source itself, or it can instead maintain a pointer to data held in another location. This is an important distinction, as it determines whether or not the data source is under the control of the database function.
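As a minimal sketch of a database acting as a data source, the following uses Python's built-in `sqlite3` module; the table and column names are illustrative, not taken from any real system. The DBMS grants the data processor access to just the part of the data it requests:

```python
import sqlite3

# In-memory SQLite database standing in for a managed data source.
# Table and column names are hypothetical, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (outlet TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 40.0)],
)

# The DBMS grants access to part of the data on request:
rows = db.execute(
    "SELECT outlet, SUM(amount) FROM sales WHERE outlet = ? GROUP BY outlet",
    ("north",),
).fetchall()
print(rows)  # [('north', 160.0)]
```

The data processor never touches the underlying storage directly; all access goes through the database management system's query interface.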
Extract, transform, load (ETL) processes typically provide the means to generate databases of internal secondary data from an internal or external primary source of data to facilitate data processing.
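A minimal illustration of such an ETL flow, with hypothetical raw point-of-sale records standing in for the primary source and an in-memory SQLite database as the resulting internal secondary source:

```python
import sqlite3

# Hypothetical raw records from a primary data source (e.g. a point-of-sale
# export); the fields and values are invented for illustration.
raw = [
    {"outlet": "north", "amount": "120.0"},
    {"outlet": "south", "amount": "bad-value"},  # unrefined data may need filtering
    {"outlet": "north", "amount": "40.0"},
]

def transform(record):
    """Validate and normalise one raw record; return None to discard it."""
    try:
        return record["outlet"], float(record["amount"])
    except (KeyError, ValueError):
        return None

# Load the cleaned rows into a database that becomes a secondary data source.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (outlet TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [row for r in raw if (row := transform(r))],
)
count = db.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # valid rows loaded
```

The resulting database is exactly the kind of secondary source described above: its quality depends on the transform step, not on the original point of creation.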
A data stream is a sequence of information received from a data provider, delivered at a time and rate that the provider determines. When accessing a data stream, the data processor has no control over the arrival of data.
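A small sketch of that relationship, with a Python generator standing in for the data provider; the timing values are arbitrary and only illustrate that arrival is outside the consumer's control:

```python
import random
import time

def sensor_stream(n):
    """Stand-in data provider: emits readings at a rate it alone determines."""
    for i in range(n):
        time.sleep(random.uniform(0, 0.01))  # arrival timing is the provider's choice
        yield {"seq": i, "value": random.random()}

# The data processor can only react as items arrive; it cannot request them early.
received = []
for reading in sensor_stream(5):
    received.append(reading["seq"])

print(received)  # [0, 1, 2, 3, 4]
```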
The use of flat files, such as CSV exports or Microsoft Excel spreadsheets, is commonplace for data with limited structural requirements, offering an accessible and manually interpretable secondary data source for businesses. Data contained within a file is available by accessing the shared file stored in a repository or other accessible location. For concurrent access to modifiable data, access control mechanisms are essential to prevent data integrity issues.
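As a sketch of a flat file serving as a data source, using Python's standard `csv` module with a temporary file in place of a shared repository location (the column names are made up for illustration):

```python
import csv
import os
import tempfile

# Create a flat file standing in for a shared spreadsheet export.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["outlet", "stock_level"])  # header row
    writer.writerows([["north", "12"], ["south", "7"]])

# The data processor reads the shared file from its accessible location.
with open(path, newline="") as f:
    stock = {row["outlet"]: int(row["stock_level"]) for row in csv.DictReader(f)}

os.remove(path)
print(stock)  # {'north': 12, 'south': 7}
```

Note that nothing here enforces concurrent-access safety; in a shared repository, file locking or a check-out mechanism would be needed on top of this.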
The most basic form of a data source is information manually entered by a user, be that event-based information, data typed into a field, or a completed web form. Typically, such information may transition through a database or other data store for access as secondary data by the data processor.
Equipment ranging from simple sensors to complex operational technology may generate information as a data source. This can range from raw data extracted directly from the equipment to an intermediary data source such as a data stream.
There are a variety of techniques available to access a data source, governed by the nature and type of the data source.
Standardized network protocols, including the Hypertext Transfer Protocol (HTTP) and its secure variant HTTPS, provide the mechanism to access and collect data across external networks, including the Internet. Other, more specialized data transfer protocols are also available, such as the File Transfer Protocol (FTP), the SSH File Transfer Protocol (SFTP), and the Simple Mail Transfer Protocol (SMTP).
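A minimal example of collecting data over HTTP using only Python's standard library; a throwaway local server stands in for the remote data source, and the endpoint and payload are invented for illustration:

```python
import http.server
import json
import threading
import urllib.request

# A minimal local HTTP server standing in for an external data source.
class DataHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"stock_level": 42}).encode()  # hypothetical payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), DataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The data processor collects the data over the standard protocol.
url = f"http://127.0.0.1:{server.server_address[1]}/stock"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())

server.shutdown()
print(data["stock_level"])  # 42
```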
Application Programming Interfaces (APIs) provide bespoke methods of accessing a data source. The API defines the communications requirements that the data processor must comply with to access and collect data. Both internal and external data sources can use APIs to provide direct access to data or to provide transformational processing if necessary. APIs are popular because they enable the enforcement of security controls, such as authentication and authorization checks, to protect data confidentiality.
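To sketch the authentication and authorization idea, the following hypothetical API (again a local stand-in server, with an invented bearer token) refuses requests without credentials and releases data only when a valid token is supplied:

```python
import http.server
import json
import threading
import urllib.error
import urllib.request

TOKEN = "example-token"  # hypothetical credential, not a real secret

class ApiHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # The API enforces an authorization check before releasing data.
        if self.headers.get("Authorization") != f"Bearer {TOKEN}":
            self.send_response(401)
            self.end_headers()
            return
        body = json.dumps({"customers": 3}).encode()  # hypothetical payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), ApiHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/customers"

# Without credentials, the API refuses access...
try:
    urllib.request.urlopen(url)
    status_without = 200
except urllib.error.HTTPError as err:
    status_without = err.code

# ...and with a valid token, it returns the data.
req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

server.shutdown()
print(status_without, data)  # 401 {'customers': 3}
```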
Big data is the specialist field that deals with the systematic collection and processing of information from data sets that are too large or complex for traditional data-processing techniques to manage. Big data allows advanced analytical and behavioral processes, though the data's volume, variety, and velocity add challenges to the data processor.
- Traditional data sources such as relational databases and data streams cannot manage big data, so big data relies on alternative kinds of data sources.
- Examples of big data sources include customer analytical data that creates a 360-degree view of behavior, demographics, and transactional information. Such information can help evolve and refine the customer experience to improve service levels and drive marketing activities.
- Other examples include industrial machinery performance data supporting production workflow optimization, preventative maintenance scheduling, and support forecasting.
- Cloud-based storage is the most common form of big data source, thanks to its scalability. However, distributed storage frameworks are also available for decomposing big data into manageable chunks.
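The idea of decomposing a large data set into manageable chunks can be sketched in a few lines of Python; the stand-in data and chunk size here are arbitrary:

```python
from itertools import islice

def chunks(iterable, size):
    """Yield fixed-size batches from an iterable without loading it all at once."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

readings = range(1_000_000)  # stand-in for a data set too large to hold in memory
total = 0
for batch in chunks(readings, 10_000):
    total += sum(batch)  # each batch is processed independently

print(total)  # 499999500000
```

Distributed frameworks apply the same principle at scale, farming each chunk out to a separate storage or compute node rather than processing them sequentially.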
The answer to “what are data sources” is straightforward: data sources are the locations of data that processing functions consume for defined purposes. The most common form of data source is the database, accessed through its associated database management system.
There are a wide variety of data sources; which ones are most appropriate depends on the purpose of the data processing function. For example, multiple diverse data sources may provide the data used by the processing function for complex processing.
The data source may provide live data, a copy of the data, or even a pointer to data stored in a different location. It may hold raw data, validated data, or big data. It may be the creation point for data, or it may be a repository of processed data. The data source may be internal or external to the organization performing the processing. The critical point is that, for the data processor, the data source is the location where it can access the data it requires in an agreed format.
When doing Reverse ETL with Grouparoo, data sources are most commonly data warehouses. Grouparoo is a modern reverse ETL data pipeline tool that enables you to leverage the data you already have in your data warehouse to make better-informed business decisions. It’s easy to set up and use and integrates with a wide selection of CRMs, data warehouses, databases, ad platforms, and SaaS marketing tools.
Stephen is a UK-based freelance technology writer with a background in system development and assurance, primarily focused on high-integrity applications.
Learn more about Stephen @ https://www.linkedin.com/in/steve-mash-exosure