looker-open-source/gcs-storage-intelligence-block

Google Cloud Storage - Storage Insights - Looker Dashboard

Please be advised that this Google Cloud Storage - Storage Insights LookML code is currently under development and subject to change. As customer needs evolve, the LookML code may be modified accordingly.

Please be aware that this dashboard may consume significant BigQuery resources, potentially incurring additional costs.

This dashboard is exclusively compatible with your Storage Insights linked dataset within BigQuery. Any attempt to utilize this dashboard with an alternative database or data warehouse will result in query errors.

This dashboard utilizes two Persistent Derived Tables (PDTs). To ensure successful operation, it is imperative to configure your connection to enable PDT functionality within Looker.

If you need help with the connection, please refer to the Looker documentation: BigQuery

For questions/issues and feedback, reach out to insights-customer-support@google.com.

Key considerations

  1. We are excited for customers to use the Datasets feature and share feedback.
  2. If you are experiencing any issues related to permissions configuration or the timeliness of report availability, the team is available at insights-customer-support@google.com to help troubleshoot and address any concerns.
  3. Enabling these Datasets for your production environment (GCP Projects) will not have any operational impact on GCS bucket storage/serving workloads. However, you will be applying new permissions to production resources, so please plan accordingly.
  4. Datasets generated can be configured to capture data from different GCP Regions and consolidate into a single Linked Dataset in BigQuery. Please take that into consideration if your metadata has regional/sovereignty requirements. Configuring one Dataset per sovereign region is one option available to help address these requirements.
# Google Cloud Storage Insights

## What does this Looker Block do for me?

This Looker block empowers GCS users to gain comprehensive insights from large datasets by seamlessly integrating their own data with Looker. Through regular audits, it enables the identification of potential vulnerabilities, opportunities, and the execution of advanced analytics tasks.

To build these high-level dashboards, this block creates:

  • Two persistent derived tables (PDTs), Bucket Attributes and Object Attributes, used to filter by snapshot time. The PDTs are created inside your defined looker_scratch_schema and are regenerated once the events_view table inside BigQuery has a different row count.
  • One derived table that is joined to all the available views. This derived table contains GCP region information and provides geolocation based on GCP regions and country positions.
  • Several parameters that determine the row and column formats.

To successfully integrate GCS Storage Insights data into BigQuery, please refer to the detailed instructions provided in the Storage Insights User Guide.

GCS Storage Insights Data Structure

The tables are currently divided into five views:

  • bucket_attributes_view
  • error_attributes_view
  • events_view
  • object_attributes_view
  • project_attributes_view

Bucket Attributes

Inside this view we can see metadata about the available buckets inside your projects or organization.


This table has the following columns:


  • snapshotTime: The snapshot time of the object metadata in RFC 3339.
  • name: The name of the source bucket.
  • location: Object data for objects in the bucket resides in physical storage within this region or multi-region. Defaults to "US". See Cloud Storage bucket locations for the authoritative list.
  • project: The project number of the project the bucket belongs to.
  • storageClass: The bucket's default storage class, used whenever no storageClass is specified for a newly-created object. If storageClass is not specified when the bucket is created, it defaults to "STANDARD". For available storage classes, see Storage classes.
  • versioning: The bucket's versioning configuration. For more information, see Object Versioning.
  • lifecycle: The bucket's lifecycle configuration status. See lifecycle management for more information.
  • metageneration: The metadata generation of this bucket.
  • timeCreated: The creation time of the bucket in RFC 3339 format.
  • public: The bucket's IAM configuration.
  • autoclass: The bucket's Autoclass configuration, which, when enabled, controls the storage class of objects based on how and when the objects are accessed.
    • autoclass.enabled: Whether or not Autoclass is enabled. By default, this boolean is not set, and Autoclass is disabled.
    • autoclass.toggleTime: The time at which Autoclass was last enabled or disabled for this bucket, in RFC 3339 format.
  • softDeletePolicy: The bucket’s soft delete policy, which defines the period of time that soft-deleted objects will be retained, and cannot be permanently deleted.
    • softDeletePolicy.effectiveTime: The time from which the policy, or one with a greater retention duration, was effective. This value is in RFC 3339 format.
      Note that the bucket metadata is cached for 10 seconds. This means that for 10 seconds after the effective time, soft-deleted objects may be subject to the old soft delete policy or the new soft delete policy.
    • softDeletePolicy.retentionDurationSeconds: The period of time in seconds that soft-deleted objects in the bucket will be retained and cannot be permanently deleted.
      The value must be greater than or equal to 604,800 seconds (7 days) and less than 7,776,000 seconds (90 days). This value follows the retention period time unit conventions.
  • tags: Tags are key-value pairs you can apply to your resources for fine-grained access control. See Bucket Tags for more details.
    • tags.lastUpdatedTime
    • tags.tagMap
      • tags.tagMap.key
      • tags.tagMap.value
  • labels: User-provided bucket labels, in key/value pairs.
    • labels.key: An individual label entry key.
    • labels.value: An individual label entry value.
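For orientation, here is a minimal ad-hoc query sketch (not part of the block) that counts buckets per storage class and location at the latest snapshot. It reuses the `@{project_id}.@{bigquery_dataset}` LookML placeholders used throughout this block; substitute your real project and dataset IDs when running it directly in BigQuery:

    -- Sketch: count buckets per storage class and location at the
    -- latest snapshot of bucket_attributes_view.
    SELECT
      storageClass,
      location,
      COUNT(*) AS bucket_count
    FROM
      `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
    WHERE
      snapshotTime = (
        SELECT MAX(snapshotTime)
        FROM `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
      )
    GROUP BY storageClass, location
    ORDER BY bucket_count DESC;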

Error Attributes

This view displays the errors related to GCS Storage Insights.


This table has the following columns:


  • errorCode: The error code associated with this entry. See below for currently supported error codes.
  • errorSource: The source of the error (currently only CONFIGURATION_PREPROCESSING).
  • errorTime: The time the error happened.
  • sourceGcsLocation: The source GCS location of the error. For projects this field is null given they are locationless.
  • bucketErrorRecord: Record with information needed to debug a bucket error.
    • bucketErrorRecord.bucketName: The name of the bucket involved in the error.
    • bucketErrorRecord.serviceAccount: The service account that needs permission to ingest objects from the bucket.
  • projectErrorRecord: Record with information needed to debug a project error.
    • projectErrorRecord.projectNumber: The number of the project involved in the error.
    • projectErrorRecord.organizationName: The name of the organization the project must belong to in order to be processed.
| Error Code | Error Case | Error Message | Troubleshooting |
| --- | --- | --- | --- |
| 1 | Source project not in organization error | "Source project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName." | "Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName." |
| 2 | Bucket authorization error | "Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName." | "Give service account bucketErrorRecord.serviceAccount IAM permissions to allow ingestion of objects for bucket bucketErrorRecord.bucketName." |
| 3 | Destination project not in organization error | "Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName." | "Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName." |
| 4 | Source project not Management Hub entitled | "Source project projectErrorRecord.projectNumber is not Management Hub entitled." | "Configure Management Hub for source project projectErrorRecord.projectNumber." |
| 5 | Bucket not Management Hub entitled | "Bucket bucketErrorRecord.bucketName is not Management Hub entitled." | "Configure Management Hub for bucket bucketErrorRecord.bucketName." |
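To investigate ingestion problems directly, a query along these lines (a sketch, using the same placeholder conventions as above) lists the most recent errors together with the debug records described in this section:

    -- Sketch: list the most recent ingestion errors with their
    -- bucket- and project-level debug records.
    SELECT
      errorCode,
      errorSource,
      errorTime,
      bucketErrorRecord.bucketName AS bucket_name,
      bucketErrorRecord.serviceAccount AS service_account,
      projectErrorRecord.projectNumber AS project_number
    FROM
      `@{project_id}.@{bigquery_dataset}.error_attributes_view`
    ORDER BY
      errorTime DESC
    LIMIT 100;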

Events

This view will let you know the exact timestamp that your data was appended to the Object Attributes or Bucket Attributes views.

This table has the following columns:

  • manifest
    • location: GCS source location.
    • snapshotTime: The snapshotTime of the manifest row.
    • viewName: Table category associated with the row. Currently it supports 'bucket_attributes_view' and 'object_attributes_view'.
  • eventTime: Time of completion of the event. The standard currently is the BigQuery job end time (millisecond precision).
  • eventCode: Unique code given to each type of event.
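Because the PDT triggers described later in this document depend on this view, it can be useful to check the most recent snapshot ingested for each view. A minimal sketch, again using the block's placeholders:

    -- Sketch: latest snapshot and event time ingested for each view.
    SELECT
      manifest.viewName AS view_name,
      MAX(manifest.snapshotTime) AS latest_snapshot,
      MAX(eventTime) AS latest_event_time
    FROM
      `@{project_id}.@{bigquery_dataset}.events_view`
    GROUP BY
      view_name;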

Object Attributes

Inside this view we can see metadata about the available objects inside your projects or organization.
This table has the following columns:

  • snapshotTime: The snapshot time of the object metadata in RFC 3339 format.
  • bucket: The name of the bucket containing this object.
  • location: The location of the source bucket.
  • componentCount: Returned for composite objects only. Number of non-composite objects in the composite object. componentCount includes non-composite objects that were part of any composite objects used to compose the current object. Note: Composite objects do not have an MD5 hash metadata field.
  • contentDisposition: Content-Disposition of the object data.
  • contentEncoding: Content-Encoding of the object data.
  • contentLanguage: Content-Language of the object data.
  • contentType: Content-Type of the object data. If an object is stored without a Content-Type, it is served as application/octet-stream.
  • crc32c: CRC32c checksum, as described in RFC 4960, Appendix B; encoded using base64 in big-endian byte order. For more information about using the CRC32c checksum, see Hashes and eTags: best practices.
  • customTime: A user-specified timestamp for the object in RFC 3339 format. Once set on an object, customTime cannot be removed and cannot be set to an earlier datetime. For more information, see custom time metadata.
  • etag: HTTP 1.1 Entity tag for the object.
  • eventBasedHold: Whether or not the object is subject to an event-based hold.
  • generation: The content generation of this object. Used for object versioning.
  • md5Hash: MD5 hash of the data, encoded using base64. This field is not present for composite objects. For more information about using the MD5 hash, see Hashes and ETags: Best Practices.
  • mediaLink: A URL for downloading the object's data. You should generally use one of the other JSON API endpoints instead.
  • metageneration: The version of the metadata for this object at this generation. Used for preconditions and for detecting changes in metadata. A metageneration number is only meaningful in the context of a particular generation of a particular object.
  • name: The name of the object. Required if not specified by URL parameter.
  • selfLink: A URL for this object. You should generally use one of the other JSON API endpoints instead.
  • size: Content-Length of the data in bytes.
  • storageClass: Storage class of the object. To change an object's storage class, use objects rewrite.
  • temporaryHold: Whether or not the object is subject to a temporary hold.
  • timeCreated: The creation time of the object in RFC 3339 format.
  • timeDeleted: The deletion time of the object in RFC 3339 format. Returned if and only if this version of the object is no longer a live version, but remains in the bucket as a noncurrent version.
  • updated: The modification time of the object metadata in RFC 3339 format. Set initially to object creation time and then updated whenever any metadata of the object changes. This includes changes made by a requester, such as modifying custom metadata, as well as changes made by Cloud Storage on behalf of a requester, such as changing the storage class based on an Object Lifecycle Configuration.
  • timeStorageClassUpdated: The time at which the object's storage class was last changed. When the object is initially created, it will be set to timeCreated.
  • retentionExpirationTime: The earliest time that the object can be deleted, which depends on any retention configuration set for the object and any retention policy set for the bucket that contains the object. The value for retentionExpirationTime is given in RFC 3339 format.
  • softDeleteTime: If this object has been soft-deleted, this is the time at which it became soft-deleted.
  • hardDeleteTime: This is the time (in the future) when the object will no longer be restorable. It is equal to the soft delete time plus the soft delete retention duration of the bucket. Note that the hard delete time will not change in response to future changes to the bucket soft delete retention duration.
    This property is only set for soft-deleted objects.
  • metadata: User-provided metadata, in key/value pairs.
    • metadata.key: An individual metadata entry key.
    • metadata.value: An individual metadata entry value.
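A common first question against this view is how storage breaks down by class. The following sketch (assuming size is stored as an integer number of bytes, and using the placeholder conventions above) sums object sizes per storage class at the latest snapshot:

    -- Sketch: object count and total logical bytes per storage class
    -- at the latest object snapshot.
    SELECT
      storageClass,
      COUNT(*) AS object_count,
      SUM(size) AS total_bytes
    FROM
      `@{project_id}.@{bigquery_dataset}.object_attributes_view`
    WHERE
      snapshotTime = (
        SELECT MAX(snapshotTime)
        FROM `@{project_id}.@{bigquery_dataset}.object_attributes_view`
      )
    GROUP BY
      storageClass
    ORDER BY
      total_bytes DESC;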

Project Attributes

Inside this view we can see metadata about the available projects inside your organization.


This table has the following columns:

  • snapshotTime: The snapshot time of the object metadata in RFC 3339 format.
  • project: The project number.
  • projectName: The project name.
  • projectID: The project ID.
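Since the other views identify projects only by number, this view is typically joined in to attach readable names. A sketch (both sides filtered to their own latest snapshot, which may differ; column names follow the lists above):

    -- Sketch: attach human-readable project names to bucket metadata
    -- by joining on the project number.
    SELECT
      p.projectName,
      p.projectID,
      b.name AS bucket_name,
      b.storageClass
    FROM
      `@{project_id}.@{bigquery_dataset}.bucket_attributes_view` AS b
    JOIN
      `@{project_id}.@{bigquery_dataset}.project_attributes_view` AS p
      ON b.project = p.project
    WHERE
      b.snapshotTime = (
        SELECT MAX(snapshotTime)
        FROM `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
      )
      AND p.snapshotTime = (
        SELECT MAX(snapshotTime)
        FROM `@{project_id}.@{bigquery_dataset}.project_attributes_view`
      );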

Regions Information CTE

The Regions Information Common Table Expression (CTE) provides foundational geographical information for all analyses. It joins the location column to approximate latitude and longitude for each region, identifies associated countries, and facilitates map generation using a predefined GeoJSON file.
This table has the following columns:
  • location: The location of the source bucket.
  • geographic_area: The geographic area is determined by the bucket's location, as specified in the official documentation for bucket locations. The available zones are categorized into these areas: North America, South America, Europe, Asia, India, Indonesia, the Middle East, Australia, and Africa.
  • location_type: The location type is contingent upon the region configuration. It may be Single-Region, Dual-Region, or Multi-Region.
  • countries: The different countries' combinations that are based on the selected region.
  • state: The different state combinations that are based on the selected region.
  • city: The different city combinations that are based on the selected region.
  • latitude: The latitude of the selected region.
  • longitude: The longitude of the selected region.
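To illustrate how this lookup is used, the sketch below joins bucket locations to approximate region coordinates for map visualizations. Here regions_information is a stand-in name for the CTE the block defines, populated with one illustrative row; the column names follow the list above:

    WITH regions_information AS (
      -- Placeholder standing in for the regions lookup CTE the block
      -- defines in LookML; the coordinates are illustrative.
      SELECT
        'us-central1' AS location,
        'North America' AS geographic_area,
        'Single-Region' AS location_type,
        41.2619 AS latitude,
        -95.8608 AS longitude
    )
    SELECT
      b.name AS bucket_name,
      r.geographic_area,
      r.location_type,
      r.latitude,
      r.longitude
    FROM
      `@{project_id}.@{bigquery_dataset}.bucket_attributes_view` AS b
    JOIN
      regions_information AS r
      ON LOWER(b.location) = LOWER(r.location);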
# Bucket and Object Attributes Persistent Derived Tables
To optimize performance for large Bucket Attributes and Object Attributes tables, we implemented a method to retrieve the latest information without querying the original tables. This data is stored in a looker-scratch-schema within a separate dataset, preconfigured during the gcs-storage-insights dataset setup in Looker.
To explain the CTEs that are going to be loaded inside a Persistent Derived Table, we should start with the following SQL statement:

WITH
    distinct_snapshots AS (
        SELECT
          DISTINCT snapshotTime
        FROM
          `@{project_id}.@{bigquery_dataset}.object_attributes_view`
        WHERE
          snapshotTime IS NOT NULL
        INTERSECT DISTINCT
        SELECT
          DISTINCT snapshotTime
        FROM
          `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
        WHERE
          snapshotTime IS NOT NULL)

The distinct_snapshots CTE lets us obtain all the snapshot times that appear in both the Bucket and Object tables. This helps us obtain the current (latest) snapshot in subsequent queries.
Subsequently, we retrieve the latest bucket and object attributes by filtering for the MAX(snapshotTime):
bucket_attributes_latest AS (
        SELECT
          *
        FROM
          `@{project_id}.@{bigquery_dataset}.bucket_attributes_view`
        WHERE
          snapshotTime = (
            SELECT
              MAX(snapshotTime)
            FROM
              distinct_snapshots
          )
    ),

    object_attributes_latest AS (
        SELECT
          *
        FROM
          `@{project_id}.@{bigquery_dataset}.object_attributes_view`
        WHERE
          snapshotTime = (
            SELECT
              MAX(snapshotTime)
            FROM
              distinct_snapshots
          )
    )
These Derived Tables will be materialized as actual tables within your Looker scratch schema dataset. To control the table creation process, both table types will utilize a datagroup parameter: gcs_storage_bucket_attributes_datagroup for the Bucket Attributes and gcs_storage_object_attributes_datagroup for the Object Attributes tables.
To understand the datagroup value used by Looker, it is essential to understand the datagroup's SQL trigger:
sql_trigger:
    WITH st_total AS (
      SELECT
        DATE(manifest.snapshotTime) AS snapshotTime,
        COUNT(*) AS total
      FROM
        `@{project_id}.@{bigquery_dataset}.events_view`
      WHERE
        manifest.viewName = 'object_attributes_view' AND
        manifest.snapshotTime = (
          SELECT MAX(manifest.snapshotTime)
          FROM `@{project_id}.@{bigquery_dataset}.events_view`
        )
      GROUP BY 1)
    SELECT CONCAT(CAST(snapshotTime AS STRING), ' | ', CAST(total AS STRING)) AS combined_value
    FROM st_total;;
The SQL statement defines a CTE, st_total, which pairs the latest manifest.snapshotTime with the aggregated count of events from object_attributes_view and concatenates them into a single combined_value string. If either the snapshot time or the event count changes, the datagroup's sql_trigger signals Looker to regenerate the table with the latest data.

Connection

Inside this section, we are going to explain some general steps to connect your Google Cloud Storage Insights dataset with Looker.

  1. Enable the Google Cloud Storage - Storage Insights Dataset by following the detailed instructions provided in the Storage Insights User Guide.
  2. Create the connection with Looker:
    1. On your BigQuery database, configure the authentication that Looker will use to access your BigQuery database. Looker supports several authentication options for BigQuery; see the Looker BigQuery documentation for the available options.
    2. On your BigQuery database create a temporary dataset that Looker can use to create persistent derived tables (PDTs) on your database. See the section Creating a temporary dataset for persistent derived tables on this page for the procedure.
      • Note: If you are using OAuth and you want to enable PDTs on the connection, you need to create a BigQuery service account specifically for Looker to use for PDT processes on your BigQuery database. See the section Enabling PDTs for Looker connections to BigQuery with OAuth on this page for the procedure.
    3. In Looker, set up the Looker connection to your BigQuery database. See the section Connecting Looker to BigQuery on this page for the procedure.
    4. In Looker, test the connection between Looker and your BigQuery database. See the section Testing the connection on this page for the procedure.
  3. IF YOU ARE USING A CUSTOMER-HOSTED INSTANCE, follow steps 3 and 4 of this guide. IF YOU HAVE ACCESS TO THE MARKETPLACE, follow step 5. Download the GCS - Storage Insights LookML code from the Looker Open Source Git Repository:
    1. Inside the GCS - Storage Insights - Looker Open Source Git Repository, click on the 'CODE' button and download the ZIP with all the LookML files
    2. As soon as the files are available, you can upload those documents to your own enterprise Git repository. At the end of this document you will find a Bash Script to upload the LookML files to your own GitHub repository.
  4. Create a new LookML project:
    1. Verify that you are in Development Mode.
    2. Select Projects from the Develop section of the navigation panel.
    3. On the LookML Projects page, select New LookML Project to select the options for your project.
    4. Name the project: gcs_storage_insights and select: "Create Blank Project"
    5. Upload all the files and folders to the created project.
    6. Inside the manifest you will see the following code snippet. Please update these values with your Looker connection name, GCP project ID, and BigQuery dataset ID:
                  
      constant: connection_name {
        value: "your_connection_name"
        export: override_optional
      }
      constant: project_id {
        value: "your_project_id"
        export: override_optional
      }
      constant: bigquery_dataset {
        value: "your_bigquery_dataset"
        export: override_optional
      }
                  
                  
  5. To install the GCS - Storage Insights Dashboard, navigate to your Looker instance's marketplace. Locate the dashboard, click 'Install,' and provide the connection details when prompted.
  6. Visualize the dashboards under the LookML dashboard folder.

Coming Soon

This section will detail the projects that will be integrated into Storage Insights. We are currently developing additional features that will provide complementary insights to your Google Cloud Storage analysis.
