Please be advised that this Google Cloud Storage - Storage Insights LookML code is currently under development and subject to change. As customer needs evolve, the LookML code may be modified accordingly.
Please be aware that this dashboard may consume significant BigQuery resources, potentially incurring additional costs.
This dashboard is exclusively compatible with your Storage Insights linked dataset within BigQuery. Any attempt to utilize this dashboard with an alternative database or data warehouse will result in query errors.
This dashboard utilizes two Persistent Derived Tables (PDTs). To ensure successful operation, it is imperative to configure your connection to enable PDT functionality within Looker.
If you need help with the connection, please refer to this Looker documentation: BigQuery
For questions/issues and feedback, reach out to insights-customer-support@google.com.
- We are excited for customers to use the Datasets feature and share feedback.
- If you are experiencing any issues related to permissions configuration or the timeliness of report availability, the team is available at insights-customer-support@google.com to help troubleshoot and address any concerns.
- Enabling these Datasets for your production environment (GCP Projects) will not have any operational impact on the GCS bucket storage/serving workloads. However, you will be applying new permissions to production resources, so please plan accordingly.
- Datasets generated can be configured to capture data from different GCP Regions and consolidate into a single Linked Dataset in BigQuery. Please take that into consideration if your metadata has regional/sovereignty requirements. Configuring one Dataset per sovereign region is one option available to help address these requirements.
This Looker block empowers GCS users to gain comprehensive insights from large datasets by seamlessly integrating their own data with Looker. Through regular audits, it enables the identification of potential vulnerabilities, opportunities, and the execution of advanced analytics tasks.
To build these high-level dashboards, this block creates:
- Two persistent derived tables (PDTs), Object Attributes and Bucket Attributes, that filter on snapshot time. The PDTs are created inside your defined looker_scratch_schema and are regenerated whenever the events view inside BigQuery reports a different row count.
- One derived table that is joined to all the available views. This derived table contains GCP region information and is used to derive geolocation from GCP regions and country positions.
- Different parameters to determine the row and column format.
To successfully integrate GCS Storage Insights data into BigQuery, please refer to the detailed instructions provided in the Storage Insights User Guide.
The tables are currently divided into five views:
- bucket_attributes_view
- error_attributes_view
- events_view
- object_attributes_view
- project_attributes_view
This view contains metadata about the buckets available inside your projects or organization.
This table has the following columns:
- snapshotTime: The snapshot time of the bucket metadata in RFC 3339 format.
- name: The name of the source bucket.
- location: Object data for objects in the bucket resides in physical storage within this region or multi-region. Defaults to "US". See Cloud Storage bucket locations for the authoritative list.
- project: The project number of the project the bucket belongs to.
- storageClass: The bucket's default storage class, used whenever no storageClass is specified for a newly-created object. If storageClass is not specified when the bucket is created, it defaults to "STANDARD". For available storage classes, see Storage classes.
- versioning: The bucket's versioning configuration. For more information, see Object Versioning.
- lifecycle: The bucket's lifecycle configuration status. See lifecycle management for more information.
- metageneration: The metadata generation of this bucket.
- timeCreated: The creation time of the bucket in RFC 3339 format.
- public: The bucket's IAM configuration.
- public.bucketPolicyOnly
- public.publicAccessPrevention: The bucket's public access prevention status, which is either "inherited" or "enforced". If "inherited", the bucket uses public access prevention only if the bucket is subject to the public access prevention organization policy constraint. Defaults to "inherited".
- autoclass: The bucket's Autoclass configuration, which, when enabled, controls the storage class of objects based on how and when the objects are accessed.
- autoclass.enabled: Whether or not Autoclass is enabled. By default, this boolean is not set, and Autoclass is disabled.
- autoclass.toggleTime: The time at which Autoclass was last enabled or disabled for this bucket, in RFC 3339 format.
- softDeletePolicy: The bucket’s soft delete policy, which defines the period of time that soft-deleted objects will be retained, and cannot be permanently deleted.
- softDeletePolicy.effectiveTime: The time from which the policy, or one with a greater retention duration, was effective. This value is in RFC 3339 format.
Note that bucket metadata is cached for 10 seconds, so for 10 seconds after the effective time, soft-deleted objects may be subject to either the old or the new soft delete policy.
- softDeletePolicy.retentionDurationSeconds: The period of time, in seconds, that soft-deleted objects in the bucket will be retained and cannot be permanently deleted. The value must be greater than or equal to 604,800 seconds (7 days) and less than 7,776,000 seconds (90 days). This value follows the retention period time unit conventions.
- tags: Tags are key-value pairs you can apply to your resources for fine-grained access control. See Bucket Tags for more details.
- tags.lastUpdatedTime
- tags.tagMap
- tags.tagMap.key
- tags.tagMap.value
- labels: User-provided bucket labels, in key/value pairs.
- labels.key: An individual label entry key.
- labels.value: An individual label entry value.
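As a quick usage sketch (assuming your Storage Insights linked dataset is reachable through the @{project_id} and @{bigquery_dataset} constants used throughout this block), the following query counts buckets per storage class in the most recent snapshot:
SELECT
  storageClass,
  COUNT(*) AS bucket_count
FROM
  `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
WHERE
  snapshotTime = (
    SELECT MAX(snapshotTime)
    FROM `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`)
GROUP BY storageClass
ORDER BY bucket_count DESC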
This view displays the errors related to GCS.
This table has the following columns:
- errorCode: The error code associated with this entry. See below for currently supported error codes.
- errorSource: The source of the error (currently only CONFIGURATION_PREPROCESSING).
- errorTime: The time the error happened.
- sourceGcsLocation: The source GCS location of the error. For projects, this field is null because projects are locationless.
- bucketErrorRecord: Record with information needed to debug a bucket error.
- bucketErrorRecord.bucketName: The name of the bucket involved in the error.
- bucketErrorRecord.serviceAccount: The service account that needs permission to ingest objects from the bucket.
- projectErrorRecord: Record with information needed to debug a project error.
- projectErrorRecord.projectNumber: The number of the project involved in the error.
- projectErrorRecord.organizationName: The name of the organization the project must belong to in order to be processed.
| Error Code | Error Case | Error Message | Troubleshooting |
|---|---|---|---|
| 1 | Source project not in organization error | "Source project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName." | "Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName." |
| 2 | Bucket authorization error | "Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName." | "Give service account bucketErrorRecord.serviceAccount IAM permissions to allow ingestion of objects for bucket bucketErrorRecord.bucketName." |
| 3 | Destination project not in organization error | "Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName." | "Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName." |
| 4 | Source project not Management Hub entitled | "Source project projectErrorRecord.projectNumber is not Management Hub entitled." | "Configure Management Hub for source project projectErrorRecord.projectNumber." |
| 5 | Bucket not Management Hub entitled | "Bucket bucketErrorRecord.bucketName is not Management Hub entitled." | "Configure Management Hub for bucket bucketErrorRecord.bucketName." |
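For troubleshooting, a query along these lines (a sketch based on the columns above) surfaces the most recent errors together with the affected bucket or project:
SELECT
  errorCode,
  errorSource,
  errorTime,
  bucketErrorRecord.bucketName AS bucket_name,
  projectErrorRecord.projectNumber AS project_number
FROM
  `@{project_id}.@{bigquery_dataset}.error_attributes_view`
ORDER BY errorTime DESC
LIMIT 100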
This view will let you know the exact timestamp that your data was appended to the Object Attributes or Bucket Attributes views.
This table has the following columns:
- manifest
- location: GCS source location.
- snapshotTime: The snapshot time of the manifest row.
- viewName: Table category associated with the row. Currently it supports 'bucket_attributes_view' and 'object_attributes_view'.
- eventTime: Time of completion of the event. The standard currently is the BigQuery job end time (precision to milliseconds level).
- eventCode: Unique code given to each type of event.
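For example, the following query checks when each view last received data. This is a sketch: it assumes eventTime is nested under the manifest record, as the datagroup trigger SQL later in this document suggests for the other fields.
SELECT
  manifest.viewName AS view_name,
  MAX(manifest.snapshotTime) AS latest_snapshot,
  MAX(manifest.eventTime) AS latest_event_time
FROM
  `@{project_id}.@{bigquery_dataset}.events_view`
GROUP BY view_name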
This view contains metadata about the objects available inside your projects or organization.
This table has the following columns:
- snapshotTime: The snapshot time of the object metadata in RFC 3339 format.
- bucket: The name of the bucket containing this object.
- location: The location of the source bucket.
- componentCount: Returned for composite objects only. Number of non-composite objects in the composite object. componentCount includes non-composite objects that were part of any composite objects used to compose the current object. Note: Composite objects do not have an MD5 hash metadata field.
- contentDisposition: Content-Disposition of the object data.
- contentEncoding: Content-Encoding of the object data.
- contentLanguage: Content-Language of the object data.
- contentType: Content-Type of the object data. If an object is stored without a Content-Type, it is served as application/octet-stream.
- crc32c: CRC32c checksum, as described in RFC 4960, Appendix B; encoded using base64 in big-endian byte order. For more information about using the CRC32c checksum, see Hashes and eTags: best practices.
- customTime: A user-specified timestamp for the object in RFC 3339 format. Once set on an object, customTime cannot be removed and cannot be set to an earlier datetime. For more information, see custom time metadata.
- etag: HTTP 1.1 Entity tag for the object.
- eventBasedHold: Whether or not the object is subject to an event-based hold.
- generation: The content generation of this object. Used for object versioning.
- md5Hash: MD5 hash of the data, encoded using base64. This field is not present for composite objects. For more information about using the MD5 hash, see Hashes and ETags: Best Practices.
- mediaLink: A URL for downloading the object's data. You should generally use one of the other JSON API endpoints instead.
- metageneration: The version of the metadata for this object at this generation. Used for preconditions and for detecting changes in metadata. A metageneration number is only meaningful in the context of a particular generation of a particular object.
- name: The name of the object. Required if not specified by URL parameter.
- selfLink: A URL for this object. You should generally use one of the other JSON API endpoints instead.
- size: Content-Length of the data in bytes.
- storageClass: Storage class of the object. To change an object's storage class, use objects rewrite.
- temporaryHold: Whether or not the object is subject to a temporary hold.
- timeCreated: The creation time of the object in RFC 3339 format.
- timeDeleted: The deletion time of the object in RFC 3339 format. Returned if and only if this version of the object is no longer a live version, but remains in the bucket as a noncurrent version.
- updated: The modification time of the object metadata in RFC 3339 format. Set initially to object creation time and then updated whenever any metadata of the object changes. This includes changes made by a requester, such as modifying custom metadata, as well as changes made by Cloud Storage on behalf of a requester, such as changing the storage class based on an Object Lifecycle Configuration.
- timeStorageClassUpdated: The time at which the object's storage class was last changed. When the object is initially created, it will be set to timeCreated.
- retentionExpirationTime: The earliest time that the object can be deleted, which depends on any retention configuration set for the object and any retention policy set for the bucket that contains the object. The value for retentionExpirationTime is given in RFC 3339 format.
- softDeleteTime: If this object has been soft-deleted, this is the time at which it became soft-deleted.
- hardDeleteTime: This is the time (in the future) when the object will no longer be restorable. It is equal to the soft delete time plus the soft delete retention duration of the bucket. Note that the hard delete time will not change in response to future changes to the bucket soft delete retention duration. This property is only set for soft-deleted objects.
- metadata: User-provided metadata, in key/value pairs.
- metadata.key: An individual metadata entry key.
- metadata.value: An individual metadata entry value.
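As an illustration, the following query (a sketch using the columns above) aggregates object count and total bytes per bucket and storage class for the latest snapshot:
SELECT
  bucket,
  storageClass,
  COUNT(*) AS object_count,
  SUM(size) AS total_bytes
FROM
  `@{project_id}.@{bigquery_dataset}.object_attributes_view`
WHERE
  snapshotTime = (
    SELECT MAX(snapshotTime)
    FROM `@{project_id}.@{bigquery_dataset}.object_attributes_view`)
GROUP BY bucket, storageClass
ORDER BY total_bytes DESC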
This view contains metadata about the projects available in your organization.
This table has the following columns:
- snapshotTime: The snapshot time of the project metadata in RFC 3339 format.
- project: The project number.
- projectName: The project name.
- projectID: The project ID.
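A typical use is joining this view to the bucket view to count buckets per project. The sketch below assumes the project column has the same type in both views; in practice you would also align both views to the same snapshotTime:
SELECT
  p.projectName,
  p.projectID,
  COUNT(b.name) AS bucket_count
FROM
  `@{project_id}.@{bigquery_dataset}.project_attributes_view` AS p
LEFT JOIN
  `@{project_id}.@{bigquery_dataset}.bucket_attributes_view` AS b
  ON b.project = p.project
GROUP BY p.projectName, p.projectID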
The GCP regions derived table has the following columns:
- location: The location of the source bucket.
- geographic_area: The geographic area is determined by the bucket's location, as specified in the official documentation for bucket locations. The available zones are categorized into these regions: North America, South America, Europe, Asia, India, Indonesia, the Middle East, Australia, and Africa.
- location_type: The location type is contingent upon the region configuration. It may be Single-Region, Dual-Region, or Multi-Region.
- countries: The combination of countries covered by the selected region.
- state: The combination of states covered by the selected region.
- city: The combination of cities covered by the selected region.
- latitude: The latitude of the selected region.
- longitude: The longitude of the selected region.
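To illustrate how this derived table enriches the other views, the sketch below joins bucket locations to their geographic areas. Here gcp_regions is a hypothetical name standing in for the block's regions derived table:
SELECT
  b.name AS bucket_name,
  b.location,
  r.geographic_area,
  r.location_type
FROM
  `@{project_id}.@{bigquery_dataset}.bucket_attributes_view` AS b
LEFT JOIN
  gcp_regions AS r  -- hypothetical name for the block's regions derived table
  ON LOWER(b.location) = LOWER(r.location)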
To explain the CTEs that are loaded into a Persistent Derived Table, we should start with the following SQL statement:
WITH
distinct_snapshots AS (
SELECT
DISTINCT snapshotTime
FROM
`@{project_id}.@{bigquery_dataset}.object_attributes_view`
WHERE
snapshotTime IS NOT NULL
INTERSECT DISTINCT
SELECT
DISTINCT snapshotTime
FROM
`@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
WHERE
snapshotTime IS NOT NULL)
Subsequently, continuing the WITH clause, we retrieve the latest bucket and object attributes by filtering for MAX(snapshotTime):
bucket_attributes_latest AS (
  SELECT
    *
  FROM
    `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
  WHERE
    snapshotTime = (
      SELECT
        MAX(snapshotTime)
      FROM
        distinct_snapshots
    )
),
object_attributes_latest AS (
  SELECT
    *
  FROM
    `@{project_id}.@{bigquery_dataset}.object_attributes_view`
  WHERE
    snapshotTime = (
      SELECT
        MAX(snapshotTime)
      FROM
        distinct_snapshots
    )
)
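The CTEs above only define the filtered inputs; the statement still needs a final SELECT to materialize the PDT. A minimal sketch (the block's actual final SELECT may project specific columns or add joins):
SELECT
  *
FROM
  object_attributes_latest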
To understand the datagroup value used by Looker, it is essential to understand the datagroup's SQL trigger:
sql_trigger:
WITH st_total AS (
SELECT
DATE(manifest.snapshotTime) AS snapshotTime,
COUNT(*) AS total
FROM
`@{project_id}.@{bigquery_dataset}.events_view`
WHERE
manifest.viewName = 'object_attributes_view' AND
manifest.snapshotTime = (
SELECT MAX(manifest.snapshotTime)
FROM `@{project_id}.@{bigquery_dataset}.events_view`
)
GROUP BY 1)
SELECT CONCAT(CAST(snapshotTime AS STRING), ' | ', CAST(total AS STRING)) AS combined_value
FROM st_total;;
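When the value of combined_value changes (a new snapshot date or a different row count), Looker rebuilds every PDT tied to the datagroup. The sketch below shows how a PDT can reference such a datagroup; the datagroup name, view name, and SQL body are illustrative, not the block's actual definitions:
view: object_attributes_latest_pdt {
  derived_table: {
    # Rebuild whenever the datagroup's sql_trigger value changes
    datagroup_trigger: storage_insights_datagroup
    sql:
      SELECT *
      FROM `@{project_id}.@{bigquery_dataset}.object_attributes_view`
      WHERE snapshotTime IS NOT NULL ;;
  }
}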
Inside this section, we are going to explain some general steps to connect your Google Cloud Storage Insights dataset with Looker.
- Enable the Google Cloud Storage - Storage Insights Dataset by following the detailed instructions provided in the Storage Insights User Guide.
- Create the connection with Looker:
- On your BigQuery database, configure the authentication that Looker will use to access your BigQuery database. Looker supports the following authentication options for BigQuery:
- Service account: See the Authentication with BigQuery service accounts section on this page for information.
- OAuth: See the Authentication with OAuth section on this page for information.
- On your BigQuery database create a temporary dataset that Looker can use to create persistent derived tables (PDTs) on your database. See the section Creating a temporary dataset for persistent derived tables on this page for the procedure.
- Note: If you are using OAuth and you want to enable PDTs on the connection, you need to create a BigQuery service account specifically for Looker to use for PDT processes on your BigQuery database. See the section Enabling PDTs for Looker connections to BigQuery with OAuth on this page for the procedure.
- In Looker, set up the Looker connection to your BigQuery database. See the section Connecting Looker to BigQuery on this page for the procedure.
- In Looker, test the connection between Looker and your BigQuery database. See the section Testing the connection on this page for the procedure.
- IF YOU ARE USING A CUSTOMER-HOSTED INSTANCE: follow steps 3 and 4 of this guide. IF YOU HAVE ACCESS TO THE MARKETPLACE: follow step 5. Download the GCS - Storage Insights LookML code from the Looker Open Source Git repository:
- Inside the GCS - Storage Insights Looker Open Source Git repository, click the 'CODE' button and download the ZIP with all the LookML files.
- Once the files are available, you can upload them to your own enterprise Git repository. At the end of this document you will find a Bash script that uploads the LookML files to your own GitHub repository.
- Create a new LookML project:
- Verify that you are in Development Mode.
- Select Projects from the Develop section of the navigation panel.
- On the LookML Projects page, select New LookML Project to select the options for your project.
- Name the project gcs_storage_insights and select "Create Blank Project".
- Upload all the files and folders to the created project.
- Inside the manifest you will see the following code snippet. Please update these values with your Looker connection name, GCP project ID, and BigQuery dataset ID:
constant: connection_name {
  value: "your_connection_name"
  export: override_optional
}
constant: project_id {
  value: "your_project_id"
  export: override_optional
}
constant: bigquery_dataset {
  value: "your_bigquery_dataset"
  export: override_optional
}
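These constants are what the @{project_id} and @{bigquery_dataset} substitutions elsewhere in the model resolve to. For instance, a view's table reference resolves along these lines (an illustrative sketch, not the block's actual view definition):
view: bucket_attributes {
  # Illustrative only: shows how the constants are substituted
  sql_table_name: `@{project_id}.@{bigquery_dataset}.bucket_attributes_view` ;;
}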
- To install the GCS - Storage Insights Dashboard, navigate to your Looker instance's marketplace. Locate the dashboard, click 'Install,' and provide the connection details when prompted.
- Visualize the dashboards under the LookML dashboard folder.