Discussion:
Parsing and Iterating large JSON data sets
b***@gmail.com
2017-08-21 17:52:34 UTC
Hello,

I've been an avid TDI/SDI user for quite some time now, but I'm running up against a wall on this particular endeavor. I have a REST API service that allows me to pull all users of a given environment. The response is a JSON string.

Now, normally, I'd just use a standard iterator in this case and augment it to what I need, but REST seems to make my eyes glaze over from time to time. In this case, the REST API requires more header information than is available for an HTTPClient connector in iterator mode.

I have tried two options, building my custom connector in iterator mode (which "works") or copying the output JSON response to a file and then attempting to iterate the file. Note that the size of this JSON is 20MB when stored in a flat file.

My own connector initializes and performs entry selection. The API returns data in chunked encoding, so I have to wait for all entries to be stored in memory before I can begin iteration. However, TDI chokes a few seconds into iterating the finalized response. I guess that makes sense, as it is trying to store a massive JSON string in memory and then parse it.

Alternatively, if I run the flat file connector, I still chew up a fair amount of memory. Still working on the iteration piece to see if it's viable.

I guess my question boils down to whether either way is viable given the size of the JSON string returned. Should one change one's approach as the size of the returned data grows?

Any help is greatly appreciated.

-Brandon

P.S. To add a final twist to all of this, I'm attempting to get this into ISIM. I understand all the complexities that that entails; no questions pertaining to that.
b***@gmail.com
2017-08-22 17:27:01 UTC
I might have been making this more complicated than it should be.

My initial goal was to be able to stream hierarchical data sent from a REST resource into flattened TDI entries. However, I'm not sure if I can "see" the hierarchical data as it's being sent to me because of the chunked nature of the stream (i.e. constantly incomplete until the full data set has been returned).

So instead, I can do a callReply and pull the response down to a file, then initialize and run a formEntry connector with a BufferedReader on that file, and parse what I need to parse. I think that gets me away from the performance problems. I think parsing data supplied by the buffered reader with the formEntry iterator will give me the cyclical nature I need (for ISIM) while keeping things clean across the processing of the AL.

I still am interested in developing a custom connector for this, but time is short for what I need.

Always an adventure :)

-Brandon
Eddie Hartman
2017-08-22 19:12:11 UTC
I'd do this in script myself, but that's just me and my love of JavaScript, Brandon :)

http = system.getConnector("ibmdi.HTTPClient");

All the headers you need to set, and the url and method, can be done by setting up attributes in the Entry you pass to callReply (or queryReply as the method is called).

e = system.newEntry();
e.url = "http://.....";
e.method = "GET"; // this is default, so unnecessary
e["http.accept"] = "application/json";
e["http.cookie"] = ....

Then you can have a loop that makes the call to get the chunked return until you get the last of it, and simply append it to, for example, a StringBuffer.

strbuf = new java.lang.StringBuffer();

do {
    retE = http.queryReply(e);
    if (retE != null) {
        strbuf.append(retE.getString("http.bodyAsString"));
    }
} while (retE != null);

jobj = fromJson(strbuf.toString());

And voilà, you have your full payload. If you do this in a scripted connector's selectEntries(), then in getNextEntry() you can return one object at a time from the payload, flattened to the 'entry' object (which is 'conn' for the Connector). Here's a vid on scripting a connector if you're unfamiliar with this:

http://youtu.be/McrCQtQvwlY

And here are the docs on Script Connectors:

https://www.ibm.com/support/knowledgecenter/en/SSCQGF_7.1.0/com.ibm.IBMDI.doc_7.1/referenceguide54.htm
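
To make the "flattened" step above a bit more concrete, here is a minimal sketch in plain JavaScript. It is not TDI-specific: flatten and isScalarArray are hypothetical helper names, and the dotted attribute naming is just one possible convention, not something the product prescribes. Inside getNextEntry() the resulting name/value pairs would still have to be copied onto the entry by hand.

```javascript
// Hypothetical helper: flatten a nested object into one level of dotted names.
// Arrays of scalars are kept as arrays (think: multi-valued attribute);
// arrays of objects are indexed, e.g. "emails.0.value".
function flatten(obj, prefix, out) {
    out = out || {};
    prefix = prefix || "";
    for (var key in obj) {
        if (!Object.prototype.hasOwnProperty.call(obj, key)) { continue; }
        var value = obj[key];
        var name = prefix ? prefix + "." + key : key;
        if (value !== null && typeof value === "object" && !isScalarArray(value)) {
            flatten(value, name, out);   // descend into nested object or array of objects
        } else {
            out[name] = value;           // scalar, null, or array of scalars
        }
    }
    return out;
}

// True only for arrays whose elements are all scalars (or null).
function isScalarArray(value) {
    if (!Array.isArray(value)) { return false; }
    for (var i = 0; i < value.length; i++) {
        if (value[i] !== null && typeof value[i] === "object") { return false; }
    }
    return true;
}
```

For example, flatten({name: {first: "Ann"}}) gives back an object with the single key "name.first".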

Break a leg!
-Eddie
b***@gmail.com
2017-08-23 12:53:22 UTC
Thank You, Eddie!

I'll try this as well. I am all for removing extra/unnecessary steps!

-Brandon
b***@gmail.com
2017-08-29 18:24:20 UTC
A few things I'd like to add as I explored way more paths to the desired end than I intended to in the past few days.

1. Tangentially related - If you are developing an adapter for ISIM and using .setParam() to change the configuration of the connector, set its initialization state to something other than "at startup." This sent me down a spiral of code hell, as the REST resource was returning failure messages from the authentication mechanism (it's a more complex auth mechanism than username/password, one that relies on matching data points across multiple headers and the URL). I noticed that the URL the connector was initializing with was different from the one I was attempting to override with .setParam() (even though a subsequent .getParam("url") said the override was in place). Either I have grossly overlooked how the Prolog hooks interact with the initialization setting, or something is amiss in general. Setting it to "only when used" resolved that problem.
2. Took me a bit (and some debugging statements) to understand that the username and password configured on the HTTP Client Connector are not what is passed to the resource. What is passed is http.remote_user/http.remote_pass. Wasn't expecting that to be explicitly stated in your post, Eddie, just figured I'd state it because it's misleading in the connector parameters.
3. Parsers attached to the HTTPClient Connector do very weird things to your request. This is especially true on callReply where the headers need to be of a certain format. If you are handling the format explicitly, avoid setting the parser on this connector. Some probably know this well, but for those starting out, start by handling things explicitly.
4. #3 has the downside that now you have to parse the return on a callReply. You'll get the string from http.bodyAsString and, if you know the type, you can parse it as needed, but you'll have to do it yourself.
5. I tried your approach, Eddie, but found that it trashed my heap space (even after I expanded it in the JVM parameters). I'm curious whether you chose StringBuffer on purpose. StringBuffer is thread-safe at the cost of performance; StringBuilder would be better here because we're only dealing with one thread. That said, I didn't have much luck with StringBuilder either; I ran out of heap space with that too. However, BufferedReader works just fine. I can't tell whether that's down to the inherent mechanisms of BufferedReader versus StringBuilder and StringBuffer, or just the way I'm structuring this. Nevertheless, my procedure works like this:
- Initialize Iterator and call a passive callReply connector
- callReply connector will query REST resource for all accounts and copy its contents to outbody path
- Iterator initializes BufferedReader() and reads the line (REST resource doesn't split the JSON data into lines - it's just one 20+million character line :| )
- Flatten JSON data into the attributes that I need and map appropriately.
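
For the last two steps, one possible way to avoid holding the whole payload at once is to feed the file through in fixed-size chunks and cut out one array element at a time. The sketch below is plain JavaScript and rests on assumptions: the thread does not show the response shape, so it assumes a top-level JSON array of objects; makeArraySplitter and onObject are hypothetical names; and JSON.parse stands in for whatever per-object parser is at hand. Each chunk could come from a fixed-size BufferedReader.read() into a char buffer rather than readLine(), which on a single 20-million-character line pulls everything into memory anyway.

```javascript
// Hypothetical helper: split a top-level JSON array into its elements
// while the text arrives in arbitrary chunks. Only the element currently
// being collected is held in memory.
function makeArraySplitter(onObject) {
    var depth = 0;         // nesting depth of { and [ outside of strings
    var inString = false;  // inside a JSON string literal
    var escaped = false;   // previous character was a backslash in a string
    var current = "";      // text of the element being collected

    // Keep a character only while inside an element (depth 2 or deeper).
    function keep(ch) { if (depth >= 2) { current += ch; } }

    return {
        push: function (chunk) {
            for (var i = 0; i < chunk.length; i++) {
                var ch = chunk.charAt(i);
                if (inString) {
                    keep(ch);
                    if (escaped) { escaped = false; }
                    else if (ch === "\\") { escaped = true; }
                    else if (ch === '"') { inString = false; }
                } else if (ch === '"') {
                    inString = true;
                    keep(ch);
                } else if (ch === "{" || ch === "[") {
                    depth++;
                    keep(ch);
                } else if (ch === "}" || ch === "]") {
                    keep(ch);
                    depth--;
                    if (depth === 1 && current !== "") {
                        onObject(JSON.parse(current));  // one complete element
                        current = "";
                    }
                } else {
                    keep(ch);
                }
            }
        }
    };
}
```

Braces and brackets inside string values are ignored for depth counting, so a value like "a}b" does not end an element early. If the real payload wraps the array in an outer object, the depth at which elements are cut would need adjusting.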

There are other things that I played around with - ScriptParser, ScriptConnector, ParserFunctionComponent (which, btw, does not work appropriately in Adapter Development Tool - it's completely useless on its own because you can't specify the parser you need in its configuration - you have to script it). Custom connectors, custom parsers, custom functions and methods. I think I finally have what I need.

It's been a journey :)

Thanks again!
Eddie Hartman
2017-08-30 19:45:45 UTC
You are absolutely right that you don't want to tie Parsers directly to your HTTP Client Connector. Do it after the callReply. As for massive return payloads, I have to defer to your experience. I've not had problems with StringBuffer, but then I'm no Java expert; I'm just copying what I've seen others do in Java. The BufferedReader seems like a good approach. Of course, you'll still have to accumulate the payload in order to fromJson() it.

And in my limited experience, there's a lot that does not work under the rule of the Dispatcher. Thanks for sharing your learning!

-Eddie
