Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Error while parsing queries from json file #57

Closed
siva-mudiyanur opened this issue Jul 8, 2021 · 42 comments
Closed

Error while parsing queries from json file #57

siva-mudiyanur opened this issue Jul 8, 2021 · 42 comments
Labels
Support Support for running data-lineage

Comments

@siva-mudiyanur
Copy link

I was able to successfully load catalog using dbcat but I'm geting the following error when I tried to parse queries using a file in json format(I also tried the given test file)

File "~/Python/3.8/lib/python/site-packages/data_lineage/parser/init.py", line 124, in parse
name = str(hash(sql))
TypeError: unhashable type: 'dict'

Here's line 124:

name = str(hash(sql))

Code executed:

from dbcat import catalog_connection
from data_lineage.parser import parse_queries, visit_dml_query
import json

with open("queries2.json", "r") as file:
    queries = json.load(file)

catalog_conf = """
catalog:
  user:test
  password: t@st
  host: 127.0.0.1
  port: 5432
  database: postgres
"""
catalog = catalog_connection(catalog_conf)

parsed = parse_queries(queries)

visited = visit_dml_query(catalog, parsed)
@siva-mudiyanur
Copy link
Author

I think changing from parse(query) to parse(query['query'],query['name']) might work?

parsed.append(parse(query))

@vrajat
Copy link
Member

vrajat commented Jul 8, 2021

Looks like you are using the older version of the notebook but a newer version of the json file. Your change should work though.

@siva-mudiyanur
Copy link
Author

the new version of the notebook seems to be taking credentials and building catalog on the fly, I wanted to utilize the catalog generated by dbcat..please correct me if I'm wrong.

@vrajat
Copy link
Member

vrajat commented Jul 8, 2021

catalog.scan_source is the one that pulls the data.

I haven't removed any of the APIs. You can continue to use the example you are comfortable with. I intend to keep them unchanged and support their usage.

Refer to

For official APIs without using a server, REST and sdk.

@siva-mudiyanur
Copy link
Author

I tried using the latest docker file..when I tried to execute the sample notebook, it gave me the following error:

Traceback (most recent call last):
  File "~/Packages/User/lin_test.py", line 27, in <module>
    source = catalog.add_source(name="dev", source_type="redshift", **wikimedia_db)
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 379, in add_source
    payload = self._post(path="sources", data=data, type="sources")
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 202, in _post
    response.raise_for_status()
  File "/Library/Python/3.8/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: http://127.0.0.1:8000/api/v1/catalog/sources
  1. When I look at the log for data-lineage-visualizer, I see multiple lines with the same error:

2021/07/09 16:45:09 [error] 25#25: *12 connect() failed (113: Host is unreachable) while connecting to upstream, client:

  1. The data lineage app has been continuously restarting with the following error:
import psycopg2

File "/opt/pysetup/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 51, in <module>

from psycopg2._psycopg import ( # noqa

ImportError: libpq.so.5: cannot open shared object file: No such file or directory

On a side note, I'm planning on doing the server implementation without using docker, I thought I'll try to see if I can achieve using the docker file first.

@vrajat
Copy link
Member

vrajat commented Jul 10, 2021

Ugh! psycopg2 packaging is giving me problems. Can you please download a previous version of the container and try?

@siva-mudiyanur
Copy link
Author

siva-mudiyanur commented Jul 11, 2021

I tried the updated docker file, receiving the following error when I execute the sample notebook:

requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST for url: http://127.0.0.1:8000/api/v1/catalog/sources

Looks like one of the sql statements is looking for a missing column called source_id to compare against parameters:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedColumn) column jobs.source_id does not exist

LINE 1: ....name AS jobs_name, jobs.context AS jobs_context, jobs.sourc...

[SQL: SELECT jobs.id AS jobs_id, jobs.name AS jobs_name, jobs.context AS jobs_context, jobs.source_id AS jobs_source_id

FROM jobs

WHERE %(param_1)s = jobs.source_id]

[parameters: {'param_1': 8}]

@vrajat
Copy link
Member

vrajat commented Jul 11, 2021

Can you recreate the database and recreate it with the latest dbcat? Looks like the database has an older version of the schema.

@siva-mudiyanur
Copy link
Author

I recreated the db within the container through CLI..was able to catalog but query parsing seems to be failing:

Traceback (most recent call last):
  File "~/Library/Application Support/Sublime Text 3/Packages/User/lin_test.py", line 47, in <module>
    parser.parse(**query, source=source)    
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 548, in parse
    response.raise_for_status()
  File "/Library/Python/3.8/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 441 Client Error: UNKNOWN for url: http://127.0.0.1:8000/api/v1/parse?query=INSERT+INTO+test_schema.test_table+Select+%2A+from+test_staging_schema.test_staging_table&source_id=13

Code Executed:

from data_lineage import Catalog
import json
from data_lineage import Parse

docker_address = "http://127.0.0.1:8000"
wikimedia_db = {
  "username": "test",
  "password": "t@st!",
  "uri": "",
  "port": "",
  "database": "test"
}

catalog = Catalog(docker_address)

source = catalog.add_source(name="tst", source_type="redshift", **wikimedia_db)

catalog.scan_source(source)

with open("queries.json", "r") as file:
    queries = json.load(file)

parser = Parse(docker_address)

for query in queries:
    print(query)
    parser.parse(**query, source=source)    

@vrajat
Copy link
Member

vrajat commented Jul 12, 2021

Can you paste the output of docker container logs <container id of tokern/data-lineage>.

441 Error code is TableNotFound exception.

Most probably the catalog does not have information about test_staging_schema.test_staging_table. Can you also paste the output of :

catalog.get_table(source_name='tst', schema_name='test_staging_schema', table_name='test_staging_table')`

@siva-mudiyanur
Copy link
Author

siva-mudiyanur commented Jul 12, 2021

Thanks for all your support.

You're right..looks like the catalog is not being loaded through docker(even though catalog.scan_source executes successfully), the catalog tables were empty and only the connection details were added into source table. Any suggestions on how to point Catalog to local host or have the catalog loaded in docker? I tried multiple solutions for the former(by editing docker compose file/data lineage engine file) but nothing worked.

To be specific, to connect to local catalog: I tried adding localhost, host.docker.internal to CATALOG_HOST, also modified postgres_conf.sql to listen to all IP addresses and modified pg_hba.conf to include all IP addresses.

Container log:

Traceback (most recent call last):

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect

return fn()

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 304, in unique_connection

return _ConnectionFairy._checkout(self)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout

fairy = _ConnectionRecord.checkout(pool)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout

rec = pool._do_get()

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 140, in _do_get

self._dec_overflow()

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__

compat.raise_(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_

raise exception

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 137, in _do_get

return self._create_connection()

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection

return _ConnectionRecord(self)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__

self.__connect(first_connect_check=True)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect

pool.logger.debug("Error on connect(): %s", e)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__

compat.raise_(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_

raise exception

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect

connection = pool._invoke_creator(self)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect

return dialect.connect(*cargs, **cparams)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect

return self.dbapi.connect(*cargs, **cparams)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect

conn = _connect(dsn, connection_factory=connection_factory, **kwasync)

psycopg2.OperationalError: could not translate host name "<uri>" to address: Temporary failure in name resolution

Output of catalog.get_table:

Traceback (most recent call last):
  File "/Users/Library/Application Support/Sublime Text 3/Packages/User/lin_test.py", line 34, in <module>
    catalog.get_table(source_name='tst', schema_name='test_staging_schema', table_name='test_staging_table')
  File "/Users/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 332, in get_table
    schema = self.get_schema(source_name, schema_name)
  File "/Users/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 319, in get_schema
    raise SchemaNotFound(
data_lineage.SchemaNotFound: Schema not found, (source_name=tst, schema_name= test_staging_schema)

@siva-mudiyanur
Copy link
Author

On a lighter note..I did a google search for 441, I should have looked through the code :)

@vrajat
Copy link
Member

vrajat commented Jul 13, 2021

If you want to use an external Postgres database, replace the following parameters in tokern-lineage-engine.yml:

  • CATALOG_HOST
  • CATALOG_USER
  • CATALOG_PASSWORD
  • CATALOG_DB

@vrajat
Copy link
Member

vrajat commented Jul 13, 2021

@siva-mudiyanur
Copy link
Author

If you want to use an external Postgres database, replace the following parameters in tokern-lineage-engine.yml:

  • CATALOG_HOST
  • CATALOG_USER
  • CATALOG_PASSWORD
  • CATALOG_DB

This was my first approach but it wasn't working..Here are my observations:

  1. catalog.add_source() was adding source values into demo catalog despite of values given for external host in tokern-lineage-engine.yml file.
  2. catalog.scan_source() was erroring out with the same error shown above: psycopg2.OperationalError: could not translate host name "" to address: Temporary failure in name resolution

Values provided for External Catalog:

  CATALOG_PASSWORD: t@st_passw0rd
  CATALOG_USER: catalog_test
  CATALOG_DB: tokern
  CATALOG_HOST: "127.0.0.1"

@vrajat
Copy link
Member

vrajat commented Jul 13, 2021

Ref: https://dev.to/natterstefan/docker-tip-how-to-get-host-s-ip-address-inside-a-docker-container-5anh

Can you change to:

CATALOG_HOST: host.docker.internal # For mac or windows

CATALOG_HOST: 172.17.0.1 # Linux

@siva-mudiyanur
Copy link
Author

Thanks...I use Mac primarily and I have already tried using host.docker.internal..I also tried to playaround with networks listed in compose file..tokern-net and tokern-internal, nothing helped.

Also changing the catalog_host value in tokern_lineage_engine.yml file doesn't seem to cause any effect, looks like only parameters from compose file(docker-compose.yml) are being considered.

I'll keep trying and post an update if I was able to get any success.

@vrajat
Copy link
Member

vrajat commented Jul 14, 2021

There were bugs in both the docker-compose files and in command line parsing. I've fixed them and there should be a new release published in pypi and docker hub in a few minutes.

@siva-mudiyanur
Copy link
Author

Thanks @vrajat ..I just tried the latest file, app was failing after pulling the latest dock file with additional parameters as given in the instructions..It worked when I replaced the catalog_host parameter manually on tokern_lineage_engine.yml file and added "tokern-net" to list of networks for tokern-api

Couple of issues:

  1. The given example notebook - json file is accepted with only "query" in it("name" should be removed)
  2. Looks like query is being parsed but there seems to be communication error between visualizer and local host:
2021/07/14 04:43:52 [error] 25#25: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.10.0.1, server: , request: "POST /api/v1/catalog/sources HTTP/1.1", upstream: "http://10.11.0.4:4142/api/v1/catalog/sources", host: "127.0.0.1:8000"

2021/07/14 04:45:32 [error] 25#25: *3 connect() failed (113: Host is unreachable) while connecting to upstream, client: 10.10.0.1, server: , request: "GET /api/main HTTP/1.1", upstream: "http://10.11.0.4:4142/api/main", host: "127.0.0.1:8000", referrer: "http://127.0.0.1:8000/"

@vrajat
Copy link
Member

vrajat commented Jul 14, 2021

Can you paste the output of 'docker container logs ' ? The visualizer is not able to connect to the api. So it must be shutting down due to errors

@siva-mudiyanur
Copy link
Author

I restarted the the container..this time, it gave me no errors but I do not see any updated visuals at 127.0.0.1:8000

Visualizer log:

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration

/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/

/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh

10-listen-on-ipv6-by-default.sh: info: IPv6 listen already enabled

/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh

/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh

/docker-entrypoint.sh: Configuration complete; ready for start up

10.10.0.1 - - [14/Jul/2021:05:38:42 +0000] "GET /api/v1/catalog/sources?filter%5Bobjects%5D=%5B%7B%22name%22%3A+%22name%22%2C+%22op%22%3A+%22eq%22%2C+%22val%22%3A+%22rs%22%7D%5D HTTP/1.1" 200 2370 "-" "python-requests/2.25.1"

10.10.0.1 - - [14/Jul/2021:05:38:42 +0000] "POST /api/v1/parse?query=INSERT+INTO+test_agg.test+Select+%2A+from+test_staging.test&source_id=1 HTTP/1.1" 200 422 "-" "python-requests/2.25.1"

Lineage log:

[2021-07-14 05:37:46 +0000] [7] [INFO] Starting gunicorn 20.1.0

[2021-07-14 05:37:46 +0000] [7] [INFO] Listening at: http://0.0.0.0:4142 (7)

[2021-07-14 05:37:46 +0000] [7] [INFO] Using worker: sync

[2021-07-14 05:37:46 +0000] [53] [INFO] Booting worker with pid: 53

[2021-07-14 05:38:08 +0000] [9] [INFO] Starting gunicorn 20.1.0

[2021-07-14 05:38:08 +0000] [9] [INFO] Listening at: http://0.0.0.0:4142 (9)

[2021-07-14 05:38:08 +0000] [9] [INFO] Using worker: sync

[2021-07-14 05:38:08 +0000] [55] [INFO] Booting worker with pid: 55

@siva-mudiyanur
Copy link
Author

I see that this error log got appended on lineage log after I posted the previous comment( would have took about ~3-5 mins)

ERROR:data_lineage.server:Exception on /api/main [GET]

Traceback (most recent call last):

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context

self.dialect.do_execute(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute

cursor.execute(statement, parameters)

psycopg2.OperationalError: server closed the connection unexpectedly

This probably means the server terminated abnormally

before or while processing the request.



The above exception was the direct cause of the following exception:


Traceback (most recent call last):

File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request

rv = self.dispatch_request()

File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request

return self.view_functions[rule.endpoint](**req.view_args)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask_restful/__init__.py", line 467, in wrapper

resp = resource(*args, **kwargs)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask/views.py", line 89, in view

return self.dispatch_request(*args, **kwargs)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/flask_restful/__init__.py", line 582, in dispatch_request

resp = meth(*args, **kwargs)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/server.py", line 69, in get

column_edges = self._catalog.get_column_lineages(args["job_ids"])

File "/opt/pysetup/.venv/lib/python3.8/site-packages/dbcat/catalog/catalog.py", line 299, in get_column_lineages

return query.all()

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3373, in all

return list(self)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__

return self._execute_and_instances(context)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3560, in _execute_and_instances

result = conn.execute(querycontext.statement, self._params)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1011, in execute

return meth(self, multiparams, params)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection

return connection._execute_clauseelement(self, multiparams, params)

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement

ret = self._execute_context(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context

self._handle_dbapi_exception(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception

util.raise_(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_

raise exception

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context

self.dialect.do_execute(

File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute

cursor.execute(statement, parameters)

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly

This probably means the server terminated abnormally

before or while processing the request.



[SQL: SELECT column_lineage.id AS column_lineage_id, column_lineage.context AS column_lineage_context, column_lineage.source_id AS column_lineage_source_id, column_lineage.target_id AS column_lineage_target_id, column_lineage.job_execution_id AS column_lineage_job_execution_id, sources_1.id AS sources_1_id, sources_1.source_type AS sources_1_source_type, sources_1.name AS sources_1_name, sources_1.dialect AS sources_1_dialect, sources_1.uri AS sources_1_uri, sources_1.port AS sources_1_port, sources_1.username AS sources_1_username, sources_1.password AS sources_1_password, sources_1.database AS sources_1_database, sources_1.instance AS sources_1_instance, sources_1.cluster AS sources_1_cluster, sources_1.project_id AS sources_1_project_id, sources_1.project_credentials AS sources_1_project_credentials, sources_1.page_size AS sources_1_page_size, sources_1.filter_key AS sources_1_filter_key, sources_1.included_tables_regex AS sources_1_included_tables_regex, sources_1.key_path AS sources_1_key_path, sources_1.account AS sources_1_account, sources_1.role AS sources_1_role, sources_1.warehouse AS sources_1_warehouse, schemata_1.id AS schemata_1_id, schemata_1.name AS schemata_1_name, schemata_1.source_id AS schemata_1_source_id, tables_1.id AS tables_1_id, tables_1.name AS tables_1_name, tables_1.schema_id AS tables_1_schema_id, columns_1.id AS columns_1_id, columns_1.name AS columns_1_name, columns_1.data_type AS columns_1_data_type, columns_1.sort_order AS columns_1_sort_order, columns_1.table_id AS columns_1_table_id, sources_2.id AS sources_2_id, sources_2.source_type AS sources_2_source_type, sources_2.name AS sources_2_name, sources_2.dialect AS sources_2_dialect, sources_2.uri AS sources_2_uri, sources_2.port AS sources_2_port, sources_2.username AS sources_2_username, sources_2.password AS sources_2_password, sources_2.database AS sources_2_database, sources_2.instance AS sources_2_instance, sources_2.cluster AS sources_2_cluster, sources_2.project_id AS sources_2_project_id, sources_2.project_credentials AS sources_2_project_credentials, sources_2.page_size AS sources_2_page_size, sources_2.filter_key AS sources_2_filter_key, sources_2.included_tables_regex AS sources_2_included_tables_regex, sources_2.key_path AS sources_2_key_path, sources_2.account AS sources_2_account, sources_2.role AS sources_2_role, sources_2.warehouse AS sources_2_warehouse, schemata_2.id AS schemata_2_id, schemata_2.name AS schemata_2_name, schemata_2.source_id AS schemata_2_source_id, tables_2.id AS tables_2_id, tables_2.name AS tables_2_name, tables_2.schema_id AS tables_2_schema_id, columns_2.id AS columns_2_id, columns_2.name AS columns_2_name, columns_2.data_type AS columns_2_data_type, columns_2.sort_order AS columns_2_sort_order, columns_2.table_id AS columns_2_table_id, jobs_1.id AS jobs_1_id, jobs_1.name AS jobs_1_name, jobs_1.context AS jobs_1_context, jobs_1.source_id AS jobs_1_source_id, job_executions_1.id AS job_executions_1_id, job_executions_1.job_id AS job_executions_1_job_id, job_executions_1.started_at AS job_executions_1_started_at, job_executions_1.ended_at AS job_executions_1_ended_at, job_executions_1.status AS job_executions_1_status

FROM column_lineage LEFT OUTER JOIN columns AS columns_1 ON columns_1.id = column_lineage.source_id LEFT OUTER JOIN tables AS tables_1 ON tables_1.id = columns_1.table_id LEFT OUTER JOIN schemata AS schemata_1 ON schemata_1.id = tables_1.schema_id LEFT OUTER JOIN sources AS sources_1 ON sources_1.id = schemata_1.source_id LEFT OUTER JOIN columns AS columns_2 ON columns_2.id = column_lineage.target_id LEFT OUTER JOIN tables AS tables_2 ON tables_2.id = columns_2.table_id LEFT OUTER JOIN schemata AS schemata_2 ON schemata_2.id = tables_2.schema_id LEFT OUTER JOIN sources AS sources_2 ON sources_2.id = schemata_2.source_id LEFT OUTER JOIN job_executions AS job_executions_1 ON job_executions_1.id = column_lineage.job_execution_id LEFT OUTER JOIN jobs AS jobs_1 ON jobs_1.id = job_executions_1.job_id]

(Background on this error at: http://sqlalche.me/e/13/e3q8)

@siva-mudiyanur
Copy link
Author

@vrajat
Copy link
Member

vrajat commented Jul 14, 2021

Interesting. I did not hit these issues. Seems specific to Mac. I'll see how I can add make these changes appropriately.

@vrajat
Copy link
Member

vrajat commented Jul 14, 2021

I've uploaded a potential fix by setting pool_pre_ping=True. I have no way to test it. Is it possible for you to test?

Instructions:

git clone --single-branch --branch lost_connection https://github.com/vrajat/data-lineage.git
cd data-lineage
./docker/build-image.sh lost_conn

Then replace image: tokern/data-lineage:latest with image: tokern/data-lineage:lost_conn in tokern-lineage-engine.yml.

@siva-mudiyanur
Copy link
Author

I'm new to docker, build_image.sh wasn't working..so I executed the command that I interpreted from the shell script..ended up with the following error:

failed to compute cache key: "/docker/docker-entrypoint.sh" not found: not found

Wasn't able to find anything from google..please let me know if you have any suggestions.

Log:

~/data-lineage/Docker > docker build -t "tokern/data-lineage:lost_conn" -f Dockerfile . 
[+] Building 1.6s (18/19)                                                       
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 84B                                        0.0s
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [internal] load metadata for docker.io/library/python:3.8.1-slim       1.0s
 => [auth] library/python:pull token for registry-1.docker.io              0.0s
 => [internal] load build context                                          0.0s
 => => transferring context: 247B                                          0.0s
 => CACHED [python-base 1/1] FROM docker.io/library/python:3.8.1-slim@sha  0.0s
 => CANCELED [builder-base 1/9] RUN apt-get update     && apt-get install  0.5s
 => CANCELED [production 1/4] RUN apt-get update     && apt-get install -  0.4s
 => CACHED [builder-base 2/9] RUN curl -sSL https://raw.githubusercontent  0.0s
 => CACHED [builder-base 3/9] WORKDIR /opt/pysetup                         0.0s
 => ERROR [builder-base 4/9] COPY ./poetry.lock ./pyproject.toml ./        0.0s
 => CACHED [builder-base 5/9] RUN poetry install --no-dev  # respects      0.0s
 => CACHED [builder-base 6/9] WORKDIR /src                                 0.0s
 => CACHED [builder-base 7/9] COPY . .                                     0.0s
 => CACHED [builder-base 8/9] RUN poetry build                             0.0s
 => CACHED [builder-base 9/9] RUN pip install dist/data_lineage-*.whl      0.0s
 => CACHED [production 2/4] COPY --from=builder-base /opt/pysetup/.venv /  0.0s
 => ERROR [production 3/4] COPY ./docker/docker-entrypoint.sh /docker-ent  0.0s
------
 > [builder-base 4/9] COPY ./poetry.lock ./pyproject.toml ./:
------
------
 > [production 3/4] COPY ./docker/docker-entrypoint.sh /docker-entrypoint.sh:
------
failed to compute cache key: "/docker/docker-entrypoint.sh" not found: not found

@vrajat
Copy link
Member

vrajat commented Jul 14, 2021

You should run the command from the parent directory.

docker build -t "tokern/data-lineage:lost_conn" -f docker/Dockerfile .

@siva-mudiyanur
Copy link
Author

Thanks..I was able to successfully install and I think I figured out whats going on:

  1. When I tried to parse a query, it first fired a query against catalog db to look for schemas on the table 'default_schema' using 'source_id' column. I had the catalog loaded through dbcat which loaded the schemas into "schemata" table..not sure why 'default_schema' isn't loaded(probably due to restricted access on my side). Since, I do not have any schemas listed on default_schema table..I think data-lineage engine hasn't found any rows and issued a ROLLBACK on catalog db ( I looked at pg_stat_activity table on catalog db to figure out).
  2. Since I wasn't seeing any progress after quite sometime, I tried to rerun the process and it caused the error 'sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly'

Next steps:

I'll try to use the scan_source() to see if it helps.

@vrajat
Copy link
Member

vrajat commented Jul 15, 2021

You need to call an API to set default schema:

catalog.update_schema(source_id, schema_id)

This is a new feature to find tables without schema mentioned. I'll document it today. Sorry about that. Also it shouldn't fail if there is no default schema set. I'll look into that as well.

Did the fix for operational error help? If yes, I'll commit it.

@siva-mudiyanur
Copy link
Author

yes..the fix definitely helped the app to be resilient.

May I know what would be schema_id for default schema or the update_schema()? I'll go through the code as well to figure out

@siva-mudiyanur
Copy link
Author

siva-mudiyanur commented Jul 16, 2021

Couple of issues:

  1. update_schema() is adding a default schema but the return seems to be failing:
Traceback (most recent call last):
  File "~/Library/Application Support/Sublime Text 3/Packages/User/lin_test.py", line 46, in <module>
    catalog.update_source(source,schema)
  File "~/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 489, in update_source
    attributes=payload["attributes"],
KeyError: 'attributes'

Code executed:

source = catalog.get_source("rs")
schema = catalog.get_schema("rs", "test")
catalog.update_source(source,schema)
  1. Query is sent to the server for parsing but no update in UI and also no logs generated on lineage app as well.

Last line from visualizer log:
10.10.0.1 - - [16/Jul/2021:15:53:08 +0000] "POST /api/v1/parse?query=<query>&source_id=7 HTTP/1.1" 200 181 "-" "python-requests/2.25.1"

*removed the query as its pretty long

@vrajat
Copy link
Member

vrajat commented Jul 17, 2021

I've pushed a new release with fixes for the connection error as well as update_source. Can you please move to the latest 0.7.8 docker?

I am also adding tests for the docker image before deploying as well as update documentation. Sorry for all the trouble.

On on no lineage being visualized, can you paste logs from tokern-api?

@siva-mudiyanur
Copy link
Author

siva-mudiyanur commented Jul 18, 2021

No probs..Here's the log, there's no sign of errors:

[2021-07-18 09:09:11 +0000] [8] [INFO] Starting gunicorn 20.1.0
[2021-07-18 09:09:11 +0000] [8] [INFO] Listening at: http://0.0.0.0:4142 (8)
[2021-07-18 09:09:11 +0000] [8] [INFO] Using worker: sync
[2021-07-18 09:09:11 +0000] [54] [INFO] Booting worker with pid: 54

When I try to execute the same code sometime later, I started to receive the same error which was fixed for in the current version:

ERROR:data_lineage.server:Exception on /api/v1/catalog/sources [GET]

Traceback (most recent call last):
File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
self.dialect.do_execute(
File "/opt/pysetup/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. 

@vrajat
Copy link
Member

vrajat commented Jul 19, 2021

I'll try to run a stress test on my machine and reproduce the OperationalError. In the meantime, I can give you instructions to run the API server and visualizer. Do you want to give that a try?

@siva-mudiyanur
Copy link
Author

sure..that would help, thanks!

@vrajat
Copy link
Member

vrajat commented Jul 25, 2021

I am using this opportunity to improve docs. I have added instructions here: https://tokern.io/docs/data-lineage/installation Can you please check if it helps you run in native mode?

I'll also be grateful if you have any feedback on docs in general. I hope to add more info in the next few days.

@siva-mudiyanur
Copy link
Author

Thank you! I'll try out this week and give you an update

@siva-mudiyanur
Copy link
Author

Hi..Sorry for the delay, I tried the native mode and I had to make the following updates:

  1. Update SQLAlchemy for data-lineage to successfully run (Lost the history in terminal, I wish I could be more specific)
  2. Before using npm start, I had to install node(we can install using either brew or download through website) and required node modules(react-script, babel, run-all).

I have ended up with the following error after making the setup:

Traceback (most recent call last):
  File "/Users/siva/Library/Application Support/Sublime Text 3/Packages/User/lineage_test.py", line 48, in <module>
    source = catalog.get_source(name='poc')
  File "/Users/siva/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 311, in get_source
    payload = self._search_one("sources", filters)
  File "/Users/siva/Library/Python/3.8/lib/python/site-packages/data_lineage/__init__.py", line 226, in _search_one
    response.raise_for_status()
  File "/Library/Python/3.8/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://127.0.0.1:4141/api/v1/catalog/sources?filter%5Bobjects%5D=%5B%7B%22name%22%3A+%22name%22%2C+%22op%22%3A+%22eq%22%2C+%22val%22%3A+%22poc%22%7D%5D

Code executed:

from data_lineage import Catalog
import json
from data_lineage import Parse
from dbcat import catalog_connection
from dbcat import pull

docker_address = "http://127.0.0.1:4141"
catalog = Catalog(docker_address)
source = catalog.get_source(name='poc')

with open("queries5.json", "r") as file:
    queries = json.load(file)
parser = Parse(docker_address)
for query in queries:
    print(query)
    parser.parse(**query, source=source)    

@siva-mudiyanur
Copy link
Author

siva-mudiyanur commented Aug 16, 2021

Also, I got the following error on the UI..I'm assuming it will be resolved once I'm able to successfully use data lineage:

Edit: Hitting on Esc returned to visualization UI
image

@vrajat
Copy link
Member

vrajat commented Aug 17, 2021

Looks like the server has an error. Can you make sure you are running the latest version ? v0.8.x ?

Are there any logs from the server? It should print logs to stderr.

@vrajat vrajat added the Support Support for running data-lineage label Oct 22, 2021
@vrajat
Copy link
Member

vrajat commented Oct 22, 2021

Closing as there is no activity.

@vrajat vrajat closed this as completed Oct 22, 2021
@siva-mudiyanur
Copy link
Author

Sorry..haven't got a chance to work on this, I will definitely take it up later and give some feedback

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Support Support for running data-lineage
Projects
None yet
Development

No branches or pull requests

2 participants